How Much of Data Science Can Be Automated?

By Justin Dale Collins
September 28, 2020

Introduction:

Why should we hire a data scientist when we can just use one of the many AutoML tools on the market? What if we could have our own personalized AutoML platform instead of an off-the-shelf one? These are among the questions businesses face when considering an investment in an AutoML tool to cut resource needs and turnaround time. To answer them, let us first look at the usual data science workflow.

The workflow:

Data access and prep:

Data access is largely automatable, provided you have the right infrastructure (here’s looking at you, data engineering) to store data in consistent formats for analysis. Additionally, almost all data visualization tools, AutoML tools, and a variety of other products ingest data automatically from standardized storage schemas. Should you be in a field where none of these fit your needs, or should you be generating a new type of data object, you will likely be out of luck on the automated side. Handling missing values is harder to automate, because no single method is appropriate across a broad series of general use cases. If your situation is self-contained enough, however, you could handle missing values as part of the ETL process (or ELT if you are working with massive data sets).
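As a rough illustration of handling missing values inside a self-contained ETL step, here is a minimal sketch in plain Python. The function name `impute_missing`, the column names, and the two strategies ("median" and "zero") are all hypothetical choices made for this example; a real pipeline would more likely lean on a data-frame library and on rules agreed with the business.

```python
from statistics import median


def impute_missing(rows, strategy_by_column):
    """Fill None values column by column using a preconfigured strategy."""
    filled = [dict(row) for row in rows]  # copy so the input is not mutated
    for column, strategy in strategy_by_column.items():
        observed = [row[column] for row in rows if row[column] is not None]
        if strategy == "median":
            fill_value = median(observed)
        elif strategy == "zero":
            fill_value = 0
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for row in filled:
            if row[column] is None:
                row[column] = fill_value
    return filled


# Hypothetical sales records with gaps in both columns.
rows = [
    {"units_sold": 10, "discount": None},
    {"units_sold": None, "discount": 0.2},
    {"units_sold": 30, "discount": None},
]
cleaned = impute_missing(rows, {"units_sold": "median", "discount": "zero"})
```

The point of the sketch is that the strategy per column is a decision someone has to make up front; once it is written down, applying it is mechanical.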

EDA:

Technically, EDA requires understanding the problem area and goal, but for a list of predetermined use cases you could automate the usual metrics you want to check in the data. This would include the usual visualizations for checking the consistency of the data, as well as for catching anything you may have missed in the data ingestion and cleaning stage. Interpreting the EDA still requires human intervention (at least the first time) to set the boundaries and cutoffs for acceptable data.
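To make the split between automated checks and human-set cutoffs concrete, here is a small sketch in plain Python. The function `run_eda_checks`, the `price` column, and the threshold values are assumptions for illustration only; the cutoffs are exactly the part a person would have to choose the first time.

```python
def run_eda_checks(rows, column, max_missing_share, valid_range):
    """Compute simple data-quality metrics and compare them to preset cutoffs."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)
    low, high = valid_range
    out_of_range = [v for v in present if not low <= v <= high]
    return {
        "missing_share": missing_share,
        "missing_ok": missing_share <= max_missing_share,
        "out_of_range": out_of_range,
        "range_ok": not out_of_range,
    }


# Hypothetical data: one missing price and one impossible negative price.
sample = [{"price": 9.5}, {"price": None}, {"price": 12.0}, {"price": -3.0}]
report = run_eda_checks(sample, "price", max_missing_share=0.3, valid_range=(0, 100))
```

Here the missing-value check passes and the range check fails, which is the kind of flag an automated EDA step could raise for a human to interpret.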

Feature engineering:

People often argue that this is the step where AutoML cannot beat a human data scientist. While that is technically true, many AutoML platforms encode the best practices of many different data scientists for specific types of problems and apply them to optimize feature engineering automatically. This of course presupposes that people have worked on your type of problem before. Given that AutoML can be set up for specific use cases, though, it can compete by doing the standard feature engineering for a preselected use case (calculating lags for time series forecasting, for example).

What if your use case is a little different from the average, or is not a common one in the business world? You are likely out of luck. For others, though, AutoML platforms solve their usual problems right out of the box.
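The lag example mentioned above can be sketched in a few lines of plain Python. The helper `add_lag_features` and the `weekly_sales` series are hypothetical; the sketch only shows how mechanical this kind of standard feature engineering is once the use case has been chosen.

```python
def add_lag_features(series, lags):
    """Turn a univariate series into rows of (target, lagged values)."""
    rows = []
    max_lag = max(lags)
    # Start at max_lag so every requested lag has a value to look back to.
    for t in range(max_lag, len(series)):
        row = {"target": series[t]}
        for lag in lags:
            row[f"lag_{lag}"] = series[t - lag]
        rows.append(row)
    return rows


# Hypothetical weekly sales figures.
weekly_sales = [100, 120, 90, 130, 150]
features = add_lag_features(weekly_sales, lags=[1, 2])
```

Each resulting row pairs a target value with the values one and two steps before it, which is the shape a forecasting model would typically train on.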

Model selection and tuning:

For a given problem there are generally a finite number of algorithm types and a finite number of hyperparameters for each algorithm; you see where this is going. Given an objective function and a range of parameters to use, the models can be optimized automatically without specifying the algorithm type or hyperparameter values a priori.

Practically, hyperparameter tuning needs a range to test (to use grid search, for example), but that range doesn’t have to be particularly narrow. After running all the different algorithms with all the different combinations of hyperparameters and scoring them against your chosen evaluation criteria, you can see the top-performing models. A data scientist might be able to save you some time and computational power by narrowing the number of algorithms to try and the ranges of the hyperparameters, and the usual data science workflow will already have code available for this kind of analysis. Ultimately, though, this is the place where AutoML shines brightest.
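The exhaustive search described here can be illustrated with a toy example in plain Python. Everything in it is an assumption made for the sketch: the two simple forecasters stand in for "algorithm types", the `SEARCH_SPACE` grids are arbitrary, and mean absolute error on one-step-ahead forecasts is just one possible evaluation criterion. A real project would normally use a library's search utilities rather than hand-rolled loops.

```python
from itertools import product


def moving_average(history, window):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing(history, alpha):
    """Forecast the next value as an exponentially weighted level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def mean_absolute_error(series, forecaster, params, start=3):
    """Score one-step-ahead forecasts made from each prefix of the series."""
    errors = [
        abs(series[t] - forecaster(series[:t], **params))
        for t in range(start, len(series))
    ]
    return sum(errors) / len(errors)


# Hypothetical search space: each algorithm with its hyperparameter grid.
SEARCH_SPACE = {
    moving_average: {"window": [1, 2, 3]},
    exponential_smoothing: {"alpha": [0.2, 0.5, 0.8]},
}


def grid_search(series, search_space):
    """Try every algorithm with every hyperparameter combination."""
    results = []
    for forecaster, grid in search_space.items():
        names = list(grid)
        for combo in product(*(grid[name] for name in names)):
            params = dict(zip(names, combo))
            score = mean_absolute_error(series, forecaster, params)
            results.append((score, forecaster.__name__, params))
    # Lower error is better, so the best candidate comes first.
    return sorted(results, key=lambda result: result[0])


series = [10, 12, 11, 13, 12, 14, 13, 15]
leaderboard = grid_search(series, SEARCH_SPACE)
best_score, best_model, best_params = leaderboard[0]
```

Narrowing `SEARCH_SPACE` is where a data scientist's judgment saves compute; the loop itself needs no judgment at all.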

Model assessment and deployment:

Other than deciding where to save the model, how to access it, and what sort of accuracy testing you want to perform in addition to any validation that the model building process underwent, there is nothing here that cannot be automated.
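As a sketch of how little remains once those decisions are made, the snippet below gates deployment on a holdout error threshold and writes the model parameters to a JSON file. The function `deploy_if_good_enough`, the file name `model.json`, the threshold, and the use of a temporary directory as a stand-in "registry" are all hypothetical choices for illustration.

```python
import json
import tempfile
from pathlib import Path


def deploy_if_good_enough(model_params, holdout_error, max_error, registry_dir):
    """Save the model only if it meets the agreed accuracy threshold."""
    if holdout_error > max_error:
        return None  # model rejected, nothing is written
    path = Path(registry_dir) / "model.json"
    payload = {"params": model_params, "holdout_error": holdout_error}
    path.write_text(json.dumps(payload))
    return path


# A temporary directory stands in for wherever models would really be stored.
with tempfile.TemporaryDirectory() as registry:
    saved = deploy_if_good_enough(
        {"alpha": 0.5}, holdout_error=1.2, max_error=2.0, registry_dir=registry
    )
    restored = json.loads(saved.read_text())
    rejected = deploy_if_good_enough(
        {"alpha": 0.9}, holdout_error=3.5, max_error=2.0, registry_dir=registry
    )
```

The human decisions here are the storage location and the value of `max_error`; the check and the save are routine.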

What’s the verdict?

Evidently, the answer to the original question depends heavily on your situation. To simplify, AutoML might be for you if:

1. You have an immediate need for technical solutions across many well-researched, well-defined problem areas, and your data already exists in a form that can be leveraged to answer these questions.

2. You have semi-technical domain experts working on well-researched, well-defined problem areas who need to generate workable ML models from your existing data quickly and at scale.

Note that both situations assume you have the data infrastructure in place to supply these tools with data in the correct form and amount.

The “Other” option:

A data scientist (or department) and a data engineer (or department) could together create the infrastructure and solution you need in your organization, automate it, and deploy it at scale for your specific use case while taking into consideration the nuances unique to your business challenges, processes, and data. Then, they could do that again for another use case, and so on. This allows for the creation of an AutoML platform customized to your company as the use cases grow.

This is the advantage of hiring data scientists and data engineers, either as consultants (from TheMathCompany, for example) or internally in your organization. Not only can they tailor the optimal solution to your specific use case (how your data looks for time series forecasting, for instance), but they can also automate that personalized solution at scale for your company.

AutoML solutions have their place in business, but the code behind these automated solutions was written by data scientists and data engineers (with the help of your friendly neighborhood software developers, of course!). Packaging that code as a general-purpose product increases its ease of use but decreases its flexibility. Instead, have your data science consultants solve your problem and then automate it for you.

Leader
Justin Dale Collins
Principal

Justin Collins is a visionary leader, consultant, author, and researcher specializing in oncology, retail, and data science. Currently, he leads the Client Value and Strategic Partnerships function for MathCo. Justin has been with MathCo for nearly half of the company's existence and has excelled in various leadership capacities. His strategic vision and innovative approach have been instrumental in propelling the company's expansion and establishing it as a leader in the industry. Justin's work in oncology uncovered unique genetic mutations and profiles, offering new avenues for targeted therapies to improve patient outcomes. With over 17 years of experience, Justin's contributions continue to drive innovation and help his clients create self-sufficiency via customized capabilities that drive enterprise value.