
How Much of Data Science Can be Automated?

Article

Justin Dale Collins

September 28, 2020 6 minute read

Introduction:

Why should we hire a data scientist when we can just use one of the many AutoML tools on the market? What if we could have our own personalized AutoML platform instead of an off-the-shelf one? These are among the questions businesses face when considering an investment in an AutoML tool with a view to cutting resources and turnaround time. To answer them, let us first look at the usual data science workflow.

The workflow:

Data access and prep:

Data access is largely automatable, provided you have the right infrastructure (here’s looking at you, data engineering) to store data in consistent formats for analysis. Almost all data visualization tools, AutoML tools, and a variety of other products also have automatic ingestion processes built around standardized data storage schemas. If you work in a field where none of these fit your needs, or you are generating a new type of data object, you will likely be out of luck on the automated side. Missing values are harder to generalize: no single method of handling them is appropriate across a broad set of use cases. If your situation is self-contained enough, however, you can handle them as part of the ETL process (or ELT if you are working with massive data sets).
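As a rough illustration of handling missing values inside an ETL step, here is a minimal sketch. It assumes a self-contained case where the column median is an acceptable fill for numeric fields and a placeholder label is acceptable for everything else. The function name `impute_missing`, the field names, and the sample rows are hypothetical, and the code uses only the Python standard library so that it runs anywhere.

```python
from statistics import median


def impute_missing(rows, numeric_fields, fill_text="unknown"):
    """Return a copy of rows with missing values (None) filled in.

    Numeric fields get the column median; every other field gets fill_text.
    """
    # Compute each numeric column's median from the observed values only.
    medians = {}
    for field in numeric_fields:
        observed = [row[field] for row in rows if row.get(field) is not None]
        medians[field] = median(observed) if observed else None

    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the raw data is left untouched
        for field, value in row.items():
            if value is None:
                new_row[field] = medians[field] if field in medians else fill_text
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample data with one missing number and one missing label.
raw_rows = [
    {"store": "A", "units": 10, "region": "north"},
    {"store": "B", "units": None, "region": "south"},
    {"store": "C", "units": 30, "region": None},
]
clean_rows = impute_missing(raw_rows, numeric_fields={"units"})
```

Whether a median fill is appropriate is exactly the judgment call that is hard to generalize; the point of the sketch is only that once the rule is chosen, applying it can be automated.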

EDA:

Though EDA technically requires an understanding of the problem area and goal, for a list of predetermined use cases you could automate the usual metrics you want to check in the data. This would include the usual visualizations for checking the consistency of the data, as well as anything you may have missed in the data ingestion and cleaning stage. Interpreting the EDA requires human intervention (at least the first time) to set the boundaries and cutoffs for acceptable data.
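The split between automated checks and human-set cutoffs can be sketched as follows. This is an illustrative example, not a full EDA: the function names, the `limits` dictionary, and the sample values are all assumptions, and the limits stand in for the boundaries a person would set the first time around.

```python
from statistics import mean, stdev


def profile_column(values):
    """Compute basic summary metrics for one numeric column."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_rate": 1 - len(observed) / len(values),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
        "stdev": stdev(observed),
    }


def check_column(profile, limits):
    """Return a list of warnings for every human-set limit that is breached."""
    warnings = []
    if profile["missing_rate"] > limits["max_missing_rate"]:
        warnings.append("too many missing values")
    if profile["min"] < limits["min_value"]:
        warnings.append("minimum below allowed range")
    if profile["max"] > limits["max_value"]:
        warnings.append("maximum above allowed range")
    return warnings


# Hypothetical daily sales counts with two gaps and one suspicious spike.
daily_units = [12, 15, None, 14, 13, 400, 16, None, 15, 14]

# Cutoffs chosen by a person who knows the domain (assumed values).
limits = {"max_missing_rate": 0.1, "min_value": 0, "max_value": 100}

report = check_column(profile_column(daily_units), limits)
```

Once the limits are agreed, the same checks can run unattended on every new batch of data.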

Feature engineering:

People often argue that this is the step where AutoML cannot beat a human data scientist. While that is technically true, many AutoML platforms encode the best practices of many different data scientists for specific types of problems in order to automate feature engineering. This of course presupposes that people have worked on your type of problem before. Because AutoML can be set up for specific use cases, though, it can compete by applying standard feature engineering for a preselected use case (calculating lags for time series forecasting, for example).

What if your use case is a little different from the average, or is not a common one in the business world? You are likely out of luck. For some organizations, though, AutoML platforms solve their usual problems right out of the box.
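To make the lag example concrete, here is a minimal sketch of the kind of standard feature engineering that could be preselected for time series forecasting. The function name `add_lag_features`, the column naming scheme, and the sample series are hypothetical.

```python
def add_lag_features(series, lags):
    """Build one row per time step with the target and its lagged values.

    Time steps that would need values from before the start of the
    series are dropped.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        row = {"y": series[t]}
        for lag in lags:
            row[f"lag_{lag}"] = series[t - lag]
        rows.append(row)
    return rows


# Hypothetical weekly sales figures.
weekly_sales = [100, 120, 130, 125, 140, 150]
features = add_lag_features(weekly_sales, lags=[1, 2])
# The first usable row is week 3: {"y": 130, "lag_1": 120, "lag_2": 100}
```

Which lags matter (last week, the same week last year, and so on) still depends on the use case, which is why a preselected use case is what makes this step automatable.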

Model selection and tuning:

For a given problem there is generally a finite number of algorithm types and, for each algorithm, a finite number of hyperparameters (you see where this is going). Given an objective function and a range of parameters to search, the models can be optimized automatically without specifying the algorithm type or hyperparameter values a priori.

Practically, a range for testing needs to be given to implement hyperparameter tuning (to use grid search, for example), but that range doesn’t have to be particularly narrow. After running all the different algorithms with all the different combinations of hyperparameters and scoring them against your chosen evaluation criteria, you can see the top-performing models. While a data scientist might be able to save you some time and computational power by narrowing the number of algorithms to try and the ranges of the hyperparameters, ultimately this is the place where AutoML shines brightest. Beyond narrowing the scope, a data scientist’s usual workflow will already include code to run this kind of analysis.
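The search described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular AutoML product works: the two one-feature models (`fit_knn`, `fit_ridge`), the `SEARCH_SPACE` ranges, and the synthetic data are all made up for the example, and it uses only the standard library. A library such as scikit-learn offers a production-grade equivalent.

```python
from itertools import product


def fit_knn(train, k):
    """k-nearest-neighbours regressor on a single feature."""
    def predict(x):
        nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def fit_ridge(train, alpha):
    """One-feature linear regression with an L2 penalty on the slope."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, data):
    """Objective function: lower is better."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Algorithm types and the hyperparameter ranges to try (assumed values).
SEARCH_SPACE = {
    "knn": (fit_knn, {"k": [1, 3, 5]}),
    "ridge": (fit_ridge, {"alpha": [0.0, 1.0, 10.0]}),
}


def grid_search(train, valid, search_space):
    """Fit every algorithm with every hyperparameter combination and rank them."""
    results = []
    for name, (fit, grid) in search_space.items():
        keys = list(grid)
        for combo in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, combo))
            model = fit(train, **params)
            results.append((mean_squared_error(model, valid), name, params))
    return sorted(results, key=lambda result: result[0])


# Synthetic data following y = 2x + 1, split into training and validation sets.
data = [(x, 2 * x + 1) for x in range(20)]
train, valid = data[::2], data[1::2]

leaderboard = grid_search(train, valid, SEARCH_SPACE)
best_score, best_name, best_params = leaderboard[0]
```

Nothing in the loop needs to know in advance which algorithm will win; the leaderboard falls out of the objective function. A data scientist’s contribution here would be shrinking `SEARCH_SPACE` so that fewer fits are needed.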

Model assessment and deployment:

Beyond deciding where to save the model, how to access it, and what accuracy testing to perform on top of any validation done during model building, there is nothing here that cannot be automated.
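As a small sketch of how those three decisions become parameters of an otherwise automatic step: the function names, the JSON file format, the error threshold, and the holdout data below are all assumptions chosen for illustration, and the model is a simple linear one stored as its two coefficients.

```python
import json
import tempfile
from pathlib import Path


def evaluate(params, holdout):
    """Mean absolute error of a linear model y = slope * x + intercept."""
    errors = [
        abs(params["slope"] * x + params["intercept"] - y) for x, y in holdout
    ]
    return sum(errors) / len(errors)


def deploy_if_good_enough(params, holdout, max_error, model_path):
    """Save the model only when its holdout error is within max_error."""
    error = evaluate(params, holdout)
    if error > max_error:
        return False
    model_path.write_text(json.dumps({"params": params, "holdout_mae": error}))
    return True


def load_model(model_path):
    """How downstream code accesses the saved model."""
    return json.loads(model_path.read_text())["params"]


# Hypothetical fitted model and holdout data never seen during model building.
params = {"slope": 2.0, "intercept": 1.0}
holdout = [(1, 3.1), (2, 4.9), (3, 7.2)]

# Where to save the model: a temporary file for the sake of the example.
model_path = Path(tempfile.gettempdir()) / "demo_model.json"

deployed = deploy_if_good_enough(params, holdout, max_error=0.5, model_path=model_path)
```

The save location, the loader, and the acceptance threshold are the human decisions; everything else runs without intervention.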

What’s the verdict?

The answer to the original question is clearly divided. To simplify, AutoML might be for you if:

1. You have an immediate need for technical solutions across many well-researched, well-defined problem areas where your data already exists in a form that can be leveraged to answer these questions.

2. You have semi-technical domain experts who are working on well-researched, well-defined problem areas and who need to generate workable ML models from your existing data quickly and at scale.

Note that both situations assume you have the data infrastructure in place to supply these tools with the correct form and amount of data required to create the solutions.

The “Other” option:

A data scientist (or department) and a data engineer (or department) could together create the infrastructure and solution you need in your organization, automate it, and deploy it at scale for your specific use case, while taking into consideration the nuances unique to your business challenges, processes, and data. Then they could do that again for another use case, and so on. This allows for the creation of an AutoML platform customized to your company as the use cases grow.

This is the advantage of hiring data scientists and data engineers, either as consultants such as those from TheMathCompany or internally in your organization. Not only can they contextualize the optimal solution to your specific use case (how your data looks for time series forecasting, for instance), but they can also automate the personalized solution at scale for your company.

AutoML solutions have their place in business, but the code behind them was written by data scientists and data engineers (with the help of your friendly neighborhood software developers, of course!). Packaging that code as a general-purpose product increases its ease of use but decreases its flexibility. Instead, have your data science consultants solve your problem and then automate it for you.

Justin Dale Collins

Principal

Justin Collins is a visionary leader, consultant, author, and researcher specializing in oncology, retail, and data science. Currently, he leads the Client Value and Strategic Partnerships function for MathCo. Justin has been with MathCo for nearly half of the company's existence and has excelled in various leadership capacities. His strategic vision and innovative approach have been instrumental in propelling the company's expansion and establishing it as a leader in the industry. Justin's work in oncology uncovered unique genetic mutations and profiles, offering new avenues for targeted therapies to improve patient outcomes. With over 17 years of experience, Justin's contributions continue to drive innovation and help his clients create self-sufficiency via customized capabilities that drive enterprise value.

