Jennifer Prendki, a Machine Learning specialist, CEO of Alectio and Expert at the International Institute for Analytics, tells us about building Artificial Intelligence models that are more efficient and consume less energy. Here’s a summary of the latest Alectio MasterClass, held on Tuesday, December 8!
What is Machine Learning?
Let's start at the beginning! Machine Learning goes hand in hand with the Training Data Set.
Put simply, a Training Data Set is a set of data that can be processed by a Machine Learning model and used for predictive purposes.
A Training Data Set is made of three data types:
- Harmful data: poor-quality data that degrades the model's performance. Removing it makes the model better;
- Useful data: relatively clear data that the model can understand. It boosts the model, letting it learn new things and perform better;
- Useless data: data that is too similar to other data, irrelevant, or not applicable to the task at hand. Removing it both improves the model's efficiency and saves time.
Thus, Machine Learning can be defined as an Artificial Intelligence technology that allows a computer to learn from a data set and make predictions automatically.
Alectio is the first startup that focuses on automatic data curation and data collection optimization. Jennifer Prendki and her team are dedicated to helping Machine Learning teams build more efficient models while using less data, and to reducing the cost and time associated with model training. To that end, Alectio uses its own model to predict which data is harmful, useful or useless.
Is it possible to create better predictions with less data?
Of course! And Alectio proves it on a daily basis. According to this startup, not all data is created equal: not every data point has the same value. Some are more useful than others, so identifying and selecting them must be done precisely.
Alectio uses a semi-supervised technique called "Active Learning". It selects data actively and incrementally in order to identify the sample that contains the largest amount of relevant information.
How does Active Learning work?
Active Learning works as a loop that adjusts and improves itself using the data it collects and analyzes. It consists of four successive phases, which are repeated in each cycle:
- SELECT: Selection of a subset of the data, following what is called a querying strategy. It is better to avoid choosing data points that are too similar;
- LABEL: Labeling of the selected data, either in-house or by external annotators;
- TRAIN: Training of the model on the selected and labeled data;
- INFER: Predictions made by the trained model on the data that has not been labeled yet, which helps guide the next selection.
If the analysts are satisfied with the predictions, the process can be stopped. If not, they select another part of the collected data, add it to the data that has already been labeled and used for training, and go through each Active Learning phase again until they reach a sufficient result.
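The four phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Alectio's implementation: the two-cluster toy data, the nearest-centroid model, the margin-based selection (a common form of uncertainty sampling) and all function names (`make_pool`, `train`, `predict`, `active_learning`) are hypothetical choices made for the example. The sketch also assumes that inference runs on the pool that has not been labeled yet, and that a held-out test set decides when to stop.

```python
import math
import random


def make_pool(n, rng):
    """Toy data: two overlapping 2-D clusters. Labels are stored but only
    'revealed' to the model during the LABEL phase."""
    points = []
    for _ in range(n):
        label = rng.randint(0, 1)
        cx, cy = (0.0, 0.0) if label == 0 else (3.0, 3.0)
        points.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return points


def train(labeled):
    """TRAIN: fit a nearest-centroid model (one mean point per class seen so far)."""
    sums = {}
    for (x, y), label in labeled:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    """Return (predicted label, margin). A small margin means the model is unsure."""
    dists = sorted((math.dist(point, c), label) for label, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else math.inf
    return dists[0][1], margin


def active_learning(pool, test, batch_size=10, target_accuracy=0.95,
                    max_rounds=20, seed=0):
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    labeled, curve = [], []
    selected = unlabeled[:batch_size]  # SELECT (first round): random, no model yet
    for _ in range(max_rounds):
        chosen = set(selected)
        unlabeled = [i for i in unlabeled if i not in chosen]
        labeled += [pool[i] for i in selected]  # LABEL: reveal the stored labels
        model = train(labeled)                  # TRAIN
        accuracy = sum(predict(model, p)[0] == y for p, y in test) / len(test)
        curve.append((len(labeled), accuracy))  # one point of the learning curve
        if accuracy >= target_accuracy or not unlabeled:
            break                               # analysts are satisfied: stop
        # INFER on the unlabeled pool, then SELECT the least certain points
        unlabeled.sort(key=lambda i: predict(model, pool[i][0])[1])
        selected = unlabeled[:batch_size]
    return model, curve


if __name__ == "__main__":
    rng = random.Random(42)
    pool, test = make_pool(300, rng), make_pool(200, rng)
    _, curve = active_learning(pool, test)
    for n_labeled, accuracy in curve:
        print(f"{n_labeled:4d} labeled examples -> accuracy {accuracy:.3f}")
```

Each pass through the loop labels only `batch_size` new points, so the labeling budget grows only as long as the stopping criterion is not met.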
Active Learning demonstrates that a model can progressively learn from itself thanks to the data it analyzes through the whole process. This system also builds a learning curve, which represents the relationship between the model's performance and the amount of data used.
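One way to read such a learning curve is to look for the smallest amount of data beyond which performance barely improves. The snippet below is a small sketch of that idea; the function name `smallest_sufficient_size`, the `tolerance` parameter and the numbers in `toy_curve` are made up purely for illustration and are not results from Alectio.

```python
def smallest_sufficient_size(curve, tolerance=0.01):
    """Return the smallest training-set size whose score is within
    `tolerance` of the best score observed on the curve."""
    best = max(score for _, score in curve)
    return min(size for size, score in curve if score >= best - tolerance)


# Illustrative, made-up points: (number of training examples, model score)
toy_curve = [(100, 0.71), (200, 0.80), (400, 0.86), (800, 0.88), (1600, 0.885)]
print(smallest_sufficient_size(toy_curve))  # prints 800
```

In this toy curve, doubling the data from 800 to 1,600 examples gains less than the chosen tolerance, which is the kind of diminishing return a learning curve makes visible.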
Some companies rely on conventional supervised methods, i.e. they train on all the collected data without sorting it first. Active Learning is an essential alternative: only 25% of the data collected is relevant and applicable, and in the most extreme cases this can drop below 1%!
What are the benefits and drawbacks of Active Learning?
Benefits:
- Savings on labeling costs;
- More resistant to bias than other supervised learning methods;
- More accurate and relevant data selection.
Drawbacks:
- Longer training time due to progressive data selection;
- Higher risk of making mistakes;
- Higher computational costs.
The good news is that the disadvantages can be avoided by combining Active Learning with other techniques such as Reinforcement Learning and Meta Learning. More and more research is being done in this area, promising significant progress in the coming years.