How I integrated machine learning into my work

Introduction

First, I want to start this article by saying that machine learning is not a « fad », and neither is deep learning. These methods were created before the 2000s and are based on mathematical models. If they have become well known, to the point of turning into « buzzwords », it’s simply because recent innovations brought more powerful and practical tools (computers, software, libraries), which make these methods easier and easier to apply. Plus, with the internet, it’s also easier to learn, download, and share resources!

These methods, and those who implement them, give meaning to all the data we accumulate (internally or externally): better KPIs, better knowledge of a subject, better accuracy in the results, and savings in time and money. It’s clear that this brings value to companies.

Note: To give you a complete idea of my background, I studied applied mathematics and digital projects. I have worked with internal company data and web data, and I also have entrepreneurial experience. Today (2020), I work at the ArcelorMittal headquarters in Luxembourg as a data analyst.

Requirements and opinion

To take my own case as an example, I’ve always worked with data, but I never really had enough of it to build more complex processes. And even if I had wanted to collect as much data as possible to start a machine learning model, it would have been selfish: for the company, it would have been useless (professional conscience is important).

So, while waiting for a good opportunity, I was doing machine learning on Kaggle datasets and on a Bitcoin price prediction project, which is fine at the beginning. I hope we’ll agree on this, though: the true goal is to use machine learning in your daily work, to be paid for it and to do it every day. That’s why I’m writing this article.

Speaking of opportunities, do you know the difference between a data analyst and a data scientist? A data analyst analyzes past data and ‘makes it talk’. A data scientist predicts results and, consequently, future data. In practice, as many of us agree, the distinction can sometimes be blurred.

So what you have to remember is that it doesn’t matter whether the job title is data analyst or data scientist. To integrate machine learning into your work, the important thing is that your work meets these conditions:

  1. Your work must involve thousands of data points (if not millions)
  2. Your work must involve the need to make a prediction (and there are several ways to do this)
  3. You must be able to collect thousands of supervised (labeled) data points (if you use supervised learning)

If you have a customer database with purchase history, perhaps you could look at clustering methods (unsupervised learning) and then predict purchases for a particular marketing campaign. If you work with financial data, you could integrate prediction (such as a regression model) with seasonality, and add external economic data collected by web scraping. If, like me, you are in a situation where there is no prediction process that you can see immediately, maybe you could include machine learning by breaking down the steps. And that’s what I’m going to talk about now.
 
Note: If you’re not familiar with Python or machine learning processes, this article will still give you practical feedback. But here is a list of Python courses I recommend; most of the courses should still be up to date.

Okay then, let’s start!

Machine learning integration

Basically, once your work meets the requirements for implementing machine learning, there are 3 main steps left.

  1. Business acceptance (or similar)
  2. Definition of results and expectations
  3. Analysis of the results and optimizations

If, in your case, machine learning has already been developed in your project, perhaps the rest of the article will not be useful for you.

Note: The purpose of this article is simply to describe the main steps of integrating machine learning into your work. The more technical aspects will not be developed here. If you are interested in them, you’ll find the details, work strategy, and steps at this link: Deduplication with Python and machine learning.

1. Business acceptance

In my case, the project was to find duplicates in technical, multilingual data written in free text (so very noisy). My managers had already heard the term ‘machine learning’ but did not have a clear understanding of the processes. Moreover, historically in the project, no data analyst had clearly implemented a machine learning approach. So at this stage, it was strictly R&D, and until the value of machine learning was proven, we had to focus on the current data analysis methods. But I had an idea in mind… a step-by-step integration.

I knew that I needed binary features (yes/no) to build a model that is simple to understand. By exploring the data visually with these new features, I was able to get an approximate idea of which ones had the most impact. In machine learning terms, these are the notions of coefficients and feature importances. Then we grouped the propositions by confidence level (propositions with the maximum number of ‘good’ features were classified in a confidence level 1 group, the ones with fewer ‘good’ features in group 1.5, then 2, etc.). In machine learning terms, this plays the same role as a probabilistic prediction with a confidence score. So, more or less, I created a very understandable manual machine learning model that did not diverge too much from the current data analysis methods. Following that, I introduced the term ‘supervised data’, using verified predictions (= results) and doing simple statistical analysis on the binary features.
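To make the ‘manual model’ idea more concrete, here is a minimal sketch of how propositions could be grouped into confidence levels by counting their ‘good’ binary features. The feature names and the 0.5 step between levels are assumptions chosen for illustration; they are not the actual features of the project.

```python
# Hypothetical binary features for a duplicate proposition (illustrative names).
FEATURES = ["same_manufacturer", "same_reference", "same_dimensions"]


def confidence_level(proposition: dict) -> float:
    """Map the number of 'good' (True) features to a confidence level.

    All features good -> level 1. Each missing feature adds 0.5, an assumed
    step size that mirrors the 1, 1.5, 2, ... groups described above.
    """
    good = sum(1 for name in FEATURES if proposition.get(name, False))
    missing = len(FEATURES) - good
    return 1 + 0.5 * missing


propositions = [
    {"id": "A", "same_manufacturer": True, "same_reference": True, "same_dimensions": True},
    {"id": "B", "same_manufacturer": True, "same_reference": True, "same_dimensions": False},
    {"id": "C", "same_manufacturer": False, "same_reference": True, "same_dimensions": False},
]

# Lower level = more 'good' features = more confidence in the proposition.
levels = {p["id"]: confidence_level(p) for p in propositions}
print(levels)
```

A trained classifier replaces this hand-written rule with learned weights, but the reading is the same: the more ‘good’ features a proposition has, the more confident we are.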

A few weeks later, on ‘my hours’ (outside job hours, i.e. outside weekdays from 8am to 6pm), I developed the machine learning model. In our case, 70% accuracy was expected from our work. Since my machine learning model was already above 60%, I decided to present it. In my PowerPoint presentation, I simply linked the ‘known terms’, such as ‘binary features creation’, to the machine learning vocabulary, here ‘feature engineering’. I’m sure this ‘step-by-step’ approach helped a lot with the acceptance of machine learning. They agreed to spend a bit more time on it, and a few days later I was able to finalize the first version, going from 64% accuracy to 81% (and from 57% precision to 85%). Today, machine learning is fully used in the deduplication processes.

Note: I wrote an article on ‘how to find a good project to work on in your free time’. Even if I still have projects here and there, I try as much as I can to focus on a 2-in-1 project, meaning a job project that is also interesting for me, instead of working on an ordinary dataset (finance, environment, sport, etc.). Rest assured that I don’t work on weekends if it’s not interesting for me.

2. Definition of results and expectations

Of course, it’s necessary to ‘start at the beginning’ and answer the basic questions. What kind of prediction is needed: classification, regression, clustering, etc.? What are the KPIs: recall, accuracy, precision, etc.? With what expectations: 60%, 70%, 95%? How many hours should be allocated to machine learning? Is supervised data available and, if not, how can you get it? Are there any outliers in the data, or any other problems to take into account? Once the features are created, are any of them correlated? If they are not binary, is there any scaling to apply, or are there dummy variables to create? The answers to all of these questions should define the action plan for the development.

I also want to tell you about my own experience. In my project, it was hard to figure out what to predict. To propose duplicates, it was necessary to 1. extract the manufacturer, 2. extract the product reference and dimensions, 3. find matches in the database, and 4. send propositions to the business for validation. The problem is that some of the proposed duplicates are not true duplicates. Steps 1, 2 and 3 are simply text analysis with Python (NLP, regex, etc.), and step 4 is human validation. So, in order to integrate machine learning (and, as a first demo, the binary features), I inserted a step 3.5. This step works as a ‘validator’, in order to propose as few ‘false positives’ as possible (= the KPI here).

Initially, the idea was a model able to predict everything from step 1 to step 4, because at first these sub-tasks had not been identified. Once I had broken the work down into smaller tasks, I was able to insert the prediction process as step 3.5, and that’s how I found out how to integrate machine learning into my work.
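As a rough illustration of this ‘step 3.5’, the sketch below filters the matches found in step 3 before they are sent to the business in step 4. Everything in it is hypothetical: `validate_propositions`, `toy_score` and the 0.6 threshold are stand-ins, and in a real project the score would come from a trained classifier (for example its predicted probability) rather than from a simple average.

```python
from typing import Callable, Dict, Iterable, List


def validate_propositions(
    propositions: Iterable[Dict],
    score: Callable[[Dict], float],
    threshold: float = 0.5,
) -> List[Dict]:
    """Step 3.5: keep only the propositions that score at or above the threshold.

    Raising the threshold sends fewer false positives to human validation,
    at the cost of possibly dropping some true duplicates.
    """
    return [p for p in propositions if score(p) >= threshold]


def toy_score(proposition: Dict) -> float:
    """Stand-in for a model score: the share of binary features equal to 1."""
    features = proposition["features"]
    return sum(features) / len(features)


# Output of step 3 (matches found in the database), with made-up binary features.
matches = [
    {"pair": ("item_1", "item_2"), "features": [1, 1, 1]},
    {"pair": ("item_3", "item_4"), "features": [1, 0, 0]},
    {"pair": ("item_5", "item_6"), "features": [1, 1, 0]},
]

# Input of step 4 (human validation): only the propositions that passed step 3.5.
to_business = validate_propositions(matches, toy_score, threshold=0.6)
print([p["pair"] for p in to_business])
```

The useful part of this design is that steps 1 to 3 and step 4 stay untouched: the model only sits between them.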

3. Analysis of results

This is surely one of the most exciting parts: discovering the first predictions, or analyzing the verified predictions to see what worked well and what did not. If the model has been well trained (cross-validation, good generalization) and uses the right features, the results should be more or less close to the training results.

Of course, you have to choose the algorithms carefully and set their parameters sensibly. If you have done that and the results are still not good enough, you have to review the features. Even with the best algorithms and parameters (GridSearchCV and sensible parameter ranges), if the features are not good enough, you will never get good results.
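For readers who have not used it yet, here is a small sketch of a parameter search with scikit-learn’s GridSearchCV. The dataset is synthetic (generated with `make_classification` and binarized to mimic yes/no features), and the choice of logistic regression, the grid on `C` and the precision scoring are assumptions for the example, not the setup of my project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for real supervised data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X = (X > 0).astype(int)  # binarize to mimic yes/no features

# A small, sensible grid: the regularization strength of the logistic regression.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each candidate
    scoring="precision",  # precision matters when false positives are the KPI
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated precision of the best candidate
print(best_params, round(best_score, 3))
```

If the best cross-validated score stays low across the whole grid, that is usually a sign to go back to the features rather than to widen the grid.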

With the results, it is possible to analyze the features one by one, and to compute the confusion matrix and KPI scores. It’s also possible to create a nice dashboard with some visualizations of the results (Tableau, another data viz tool, or a simple dataframe, depending on the time you can give to it). This analysis allows a better understanding of the model and of how it can be optimized. Remember that feature engineering is the most important part of a successful machine learning model.
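To show what the confusion matrix and the KPIs actually measure, here is a minimal sketch in plain Python. The labels are invented for the example (1 = true duplicate, 0 = not a duplicate); in practice, libraries such as scikit-learn provide these metrics directly.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up verified results: what the business confirmed vs. what the model predicted.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)  # share of correct predictions
precision = tp / (tp + fp)          # share of proposed duplicates that are true
recall = tp / (tp + fn)             # share of true duplicates that were proposed

print(tp, fp, fn, tn)               # 4 1 1 4
print(accuracy, precision, recall)  # 0.8 0.8 0.8
```

When the KPI is ‘as few false positives as possible’, precision is the number to watch first.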

One last note. Even if some algorithms are simple to explain, the notion of a ‘black box’ can quickly arise. To counter this ‘mistrust’, I created a ‘string combination’ to explain, in simplified terms, how the algorithms read the data. For example, if you have 3 features, then in your data you can find combinations like 001, 101, etc., where ‘0’ is ‘no’ and ‘1’ is ‘yes’. So you can say, for example, that 20% of the supervised data has the feature combination ‘011’ and that more than 90% of it is true. All predictions with the feature combination ‘011’ should therefore also have a very high validation rate. This helped to give more direct statistics than a prediction score, where we cannot see what led to it. (Even some feature importance graphs don’t give a clear understanding.)
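Here is a small sketch of how such ‘string combination’ statistics could be computed. The function name `combination_stats` and the sample rows are made up for the example; the idea is only to show the share of each combination in the supervised data and the rate of true duplicates within it.

```python
from collections import defaultdict


def combination_stats(rows):
    """Compute, for each feature combination, its share of the data and its true rate.

    rows: iterable of (features, label) where features is a tuple of 0/1 values
    and label is 1 if the proposition was verified as true, else 0.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [n_rows, n_true]
    for features, label in rows:
        combo = "".join(str(int(f)) for f in features)  # e.g. (0, 1, 1) -> "011"
        counts[combo][0] += 1
        counts[combo][1] += int(label)

    total = sum(n for n, _ in counts.values())
    return {
        combo: {"share": n / total, "true_rate": n_true / n}
        for combo, (n, n_true) in counts.items()
    }


# Made-up supervised data: (binary features, verified label).
supervised = [
    ((0, 1, 1), 1),
    ((0, 1, 1), 1),
    ((1, 0, 1), 0),
    ((0, 0, 1), 0),
    ((1, 0, 1), 1),
]

stats = combination_stats(supervised)
for combo, values in sorted(stats.items()):
    print(combo, values)
```

With such a table, a sentence like ‘40% of the verified data has the combination 011, and all of it was true’ is much easier to discuss with the business than a raw prediction score.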

 

– – –

Okay! I hope you found some useful information here. I tried to compile as much as I could of what I learned about integrating machine learning into my work. If you have any questions, please don’t hesitate to contact me.