Automated Machine Learning

How AutoML gradually becomes a new analytics productivity tool

10 June 2020 · 5 minute read

As the global data quantity already follows an exponential trend, machine learning has become present in every application, creating a great demand for general know-how, be it data scientists or computer scientists with related knowledge. Currently, the demand for work to be done surpasses the offer of such professionals, thus automatic solutions have to be found.

⬤

The classical machine learning process involves a few default steps that have become default nowadays, namely:

data engineering

model selection

hyperparameter tuning

the actual model training

Due to the highly repetitive nature of trial and error of these tasks, automation can play a big role in optimizing time spent on them. Automated Machine Learning comes to help the process by adding different optimization techniques that determine data scientists be more productive and achieve similar or better results in a shorter period of time.

Since every new product or proposed algorithm has to solve a real problem in order to be successful, Automated Machine Learning has the main goal of saving time and lowering the error rate of humans. In the classical machine learning flow, it all starts with raw data. As the data producing sources are not perfect and often not optimized for analytics, the datasets they produce are far from ready to be fed to a learning algorithm. This creates the need for data cleaning and engineering, which usually takes more time than expected.

Taking into consideration that the power of any learning algorithm (talking about better scores) relies mostly on the data it receives and less on later optimizations, the step of feature engineering has to be consistent and well planned in order to achieve top results.

Right after the data engineering phase comes the model selection step, which is again, subject to trial and error. The fact that some tasks have well known model types that work well only reduces the search space, but there still leaves oportunity for choice between different approaches. The last step, after a model is chosen, is hyperparameter optimization, which has some degrees of freedom regarding the hyperparameters before the model is started to learn automatically.

After listing only three of the major steps of a classical machine learning pipeline (raw data to trained model), there are clearly places where automation can help and automatically iterating over the search space can yield better and unexplored configurations of the data engineering algorithm, while also producing surprising model configurations which work better than expected. As the statistics show, the time spent on data preparation and hyperparameter selection tasks is sometimes as high as 80% of the total project time. The rest of time is spent on almost automatic training of the model.

Since there is so much time to be optimized, it is worth researching better alternatives to the whole machine learning flow rather than just continuing with the classical approach. Not only are good data scientists scarce, but they are also financially intensive nowadays, so finding a secondary optimal alternative is a priority.

⬤

Automated Machine Learning - the trial of automating automation

Automated Machine Learning is clearly an optimisation challenge. There is no well known algorithm or a one size fits all approach to solve all learning problems. Various configurations have to be searched in order to decide which one is performing best. Taking into account that data scientists are not dealing with a finite and continuous search space, but rather an infinite and non convex one, heuristics have to be developed in order to find a configuration that is near the best solution.

Since the landscape or the theoretically named search space is infinite, heuristics have to be developed with the goal of finding a reasonably good optimum in the least amount of time. These algorithms come in different forms, from conventional search methods, like grid or random search to more advanced methods inspired by nature, namely Bayesian Optimizations, Reinforcement Learning and Evolutionary Algorithms. Since AutoML is a relatively new research field, the first two techniques have been used with decent performances. Since the latter is a more natural way of converging to general global optimums, makes it a worth exploring alternative.

Evolutionary Algorithms are inspired by the Darwinian theory that evolution is subject to a set of primal operations as the survival of the fittest, genome crossover and random mutation as a result of adapting to the environment. They also rely on the fact that small improvements in the genome of individuals can, over time, using the technique of survival of the fittest, lead to greater advancements of the species. Being fit in such a population is highly subjective and is a function of the environment of the specific species. Generally a fitter individual is better in the environment in which he lives and at the tasks he is supposed to carry out. Gradually eliminating less fit members from the population and allowing, with a greater probability, the fittest to reproduce, will, over a number of iterations (also known as epochs), lead to an overall fitter population.

Using the concepts of evolutionary algorithms in Automated Machine Learning, one can search through the configuration space more efficiently, finding gradually better methods. Genomes can be considered as different configurations and their fitness some metric on the dataset, combined with other metrics regarding the training phase, such as convergence rates or the derivative of loss drop over time. Data engineering is also subject to evolutionary optimisations, since the techniques of generating features can be combined in various ways in order to extract the most valuable information from the dataset. Both feature engineering and model training using evolutionary algorithms can lead to interesting results, subject to further analysis by data scientists or artificial intelligence enthusiasts.

⬤

In order to prove the aforementioned concept, I have developed an Automated Machine Learning Pipeline that is able to automatically iterate over the data engineering steps, find a good neural network architecture, set its hyper-parameters optimally and finally, yield the trained model.

The whole idea was to create a framework that could receive any .csv file, regardless of the feature types and yield a trained model, which could eventually be reused in further predictions.

Not only do Evolutionary Algorithms applied in AutoML yield peculiar, yet well performing neural network configurations, but also prove to be capable of directly comparing themselves to already existing frameworks like

AutoKeras

MLBox

TPOT

, as shown in the figures below. On the left are the train set accuracies while on the right are the test set ones. All the models where tested in their default configuration, just to prove the off the box nature of Automated Machine Learning frameworks.

Also, the framework handles regression tasks with the same ease as it does with classification, but the latter was used since accuracies are a better metric for presenting results and testing on benchmark datasets.

⬤

Developing sustainable Automated Machine Learning frameworks drives the whole field towards the big goal of achieving general Artificial Intelligence. The proposed solution in this article is far from that goal, and so are almost all the other available tools, since achieving general Artificial Intelligence in an automated fashion is still far away. Further directions would be towards adding Evolutionary Algorithms to the whole flow, including the data engineering part, to not only discover interesting neural network combinations, but also various ways of blending together feature engineering methods for achieving better scores.

Summing up, the aim of this post is to provide a brief intoroduction into the field of AutoML, emphasizing the importance of it for analysis and time saving purposes. Furthermore, it presented results that confirm the viability, generality and performance of the proposed Evolutionary Algorithms based AutoML solution.

Check the product at aiflow.ltd