Here is my kernel on Kaggle: Intuitive Data Exploration & Easy Machine Learning.
Well. Titanic is unquestionably the first machine learning exercise that beginners go through. This tragedy has the advantage of being known to everyone and having a workable data set.
The goal is to predict the survivors of the Titanic. And despite the sadness of the event, it's a fun exercise. With only 12 columns (variables) and 891 rows (passengers), the train dataset is easy to explore. We quickly understand the nature of each column, and there is no need to be an expert in a particular field. It is very accessible!
So how did I choose to proceed?
0. Importing & exploring
First, we need to import the libraries for data analysis (numpy, pandas) and data visualization (matplotlib, seaborn), then import our data.
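The typical opening of a Titanic kernel might look like the sketch below. Since the real `train.csv` lives on Kaggle, a tiny stand-in DataFrame (with made-up values, but the usual Titanic column names) illustrates the same structure here:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# On Kaggle you would load the real data, e.g.:
# train = pd.read_csv('../input/train.csv')
# Here, a tiny illustrative stand-in with a few of the usual columns:
train = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived':    [0, 1, 1, 0],
    'Pclass':      [3, 1, 3, 1],
    'Sex':         ['male', 'female', 'female', 'male'],
    'Age':         [22.0, 38.0, 26.0, np.nan],
})

print(train.shape)   # (rows, columns)
print(train.dtypes)  # quick look at the column types
```

A first `shape`/`dtypes`/`head()` pass like this is usually how the exploration starts.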
From here, I spent almost 3 hours exploring the data (and improving my Python skills). This is called EDA, exploratory data analysis. It consists of becoming familiar with the data, understanding the relationships that exist, and discovering trends, in order to form hypotheses. Which variables are we keeping (= feature selection)? Can we group some data? How will we proceed for each of the columns? This is my very favorite part!
In our example, our EDA allowed us to see that children, women, and people who have a first-class ticket are more likely to survive. We also want to go further with a 'Family Size' variable, to find out whether the size of a family had an influence on the chances of survival.
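Observations like these typically come out of simple group-by aggregations. A minimal sketch, using an illustrative stand-in DataFrame rather than the real Titanic data:

```python
import pandas as pd

# Tiny stand-in for the Titanic train set (values are illustrative,
# not the real data).
train = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 1, 0],
    'Sex':      ['male', 'female', 'female', 'female',
                 'male', 'male', 'female', 'male'],
    'Pclass':   [3, 1, 3, 1, 3, 2, 2, 1],
})

# Survival rate by sex and by ticket class
by_sex = train.groupby('Sex')['Survived'].mean()
by_class = train.groupby('Pclass')['Survived'].mean()
print(by_sex)
print(by_class)
```

On the real data, the same two lines surface the "women and first class survived more" pattern; seaborn's `barplot` or `countplot` makes the same comparison visual.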
1. Missing values & feature engineering
The next big step is to process the missing values. For example, cabin numbers are not filled in for every passenger. Depending on the number of missing values and the importance of the variable, there are different strategies: delete the variable, or fill in the data (with the median, the mean, a mean adjusted by grouping, etc.).
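Both strategies are one-liners in pandas. A minimal sketch, again on an illustrative frame with gaps like the real 'Age' and 'Cabin' columns:

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps similar to the real 'Age' and 'Cabin'
train = pd.DataFrame({
    'Age':    [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin':  ['C85', None, None, 'E46', None],
    'Pclass': [1, 3, 3, 1, 2],
})

# Count missing values per column to choose a strategy
print(train.isnull().sum())

# Strategy 1: drop a mostly-empty column such as 'Cabin'
train = train.drop(columns=['Cabin'])

# Strategy 2: fill 'Age' with the median (a median computed per
# group, e.g. per Pclass, would be the "adjusted by grouping" variant)
train['Age'] = train['Age'].fillna(train['Age'].median())
```

After this, `train.isnull().sum()` should show zero gaps in the columns you kept.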
At this stage comes the "feature engineering" step. Machine learning algorithms need understandable and optimized data, so we will have to group some of it. For example, for 'Family Size', we will create 5 groups according to the number of individuals: 0 = single persons (0), 1 = couples (1), 2 = small families (2, 3), 3 = large families (4, 5), 4 = very large families (6 and more).
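This grouping can be sketched with `pd.cut`. I'm assuming here, as most Titanic kernels do, that family size is derived from the `SibSp` and `Parch` columns (relatives aboard):

```python
import pandas as pd

# Assumption: 'Family Size' = SibSp + Parch (relatives aboard)
train = pd.DataFrame({'SibSp': [0, 1, 2, 4, 6],
                      'Parch': [0, 0, 1, 1, 2]})
train['FamilySize'] = train['SibSp'] + train['Parch']

# Bin the raw sizes into the 5 groups described above:
# (-1, 0] -> 0 single, (0, 1] -> 1 couple, (1, 3] -> 2 small,
# (3, 5] -> 3 large, (5, inf] -> 4 very large
bins = [-1, 0, 1, 3, 5, float('inf')]
labels = [0, 1, 2, 3, 4]
train['FamilyGroup'] = pd.cut(train['FamilySize'],
                              bins=bins, labels=labels).astype(int)
print(train[['FamilySize', 'FamilyGroup']])
```

The resulting small integer codes are exactly the kind of "understandable and optimized" input the algorithms expect.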
Once we have processed all the data, the data set is clean and ready for our algorithms. Some people apply the same operations to the train set and the test set step after step. Personally, I prefer to completely prepare my train set, and when everything is OK, apply the same operations to my test set in one block.
2. Predicting values & submission file
Now, it's finished, or almost. Even with some knowledge of statistics, machine learning is new to me. Moreover, in the Titanic example, we can get a satisfactory result by simply applying several algorithms, without having to play with the hyperparameters. For this part, I chose to reuse a piece of code from a member of the Kaggle community.
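The borrowed code isn't reproduced here, but "applying several algorithms with default hyperparameters" usually looks like the following scikit-learn sketch; the features and target below are synthetic stand-ins for the cleaned Titanic columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features (think: encoded Sex, Pclass, Age...)
# and a binary target, in place of the real cleaned train set.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Try several models with default-ish settings and compare CV scores
for model in [LogisticRegression(),
              RandomForestClassifier(n_estimators=100, random_state=0)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))
```

Cross-validation gives a more honest score than a single train/test split, which matters when you compare models this way.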
You will see on Kaggle a lot of scores close to, or at, 100% (= the percentage of predictions that are accurate). In reality, in a business context, this doesn't happen. First, because there is a time/money constraint, and also because a percentage that is too high is suspect (= overfitting). The data are naturally flawed. In most cases, a score above 80% is good, and close to 90% is great!
When your algorithms have done their job, you have to prepare the submission file, which combines the passenger IDs of your test set with the predicted values of your target variable 'Survived'. You upload this file to Kaggle, and then you have your rank.
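The submission file itself is just a two-column CSV. A minimal sketch, with hypothetical IDs and predictions standing in for the real test-set output:

```python
import pandas as pd

# Hypothetical predictions for a tiny test set (892+ is where the
# Titanic test-set PassengerIds start on Kaggle)
test_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': test_ids,
                           'Survived': predictions})
submission.to_csv('submission.csv', index=False)
print(submission)
```

The `index=False` matters: Kaggle expects exactly the two named columns, with no extra index column.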
Today, I am in the top 40%, with a score close to 80%. The goal is not to over-optimize, but to discover the universe and the steps of machine learning. Once this exercise is done, it's time to discover new, more complex datasets that fascinate you, or that could interest the type of company you want to work for!
3. Publishing the kernel on Kaggle
I chose to publish my kernel, entitled "Intuitive Data Exploration & Easy Machine Learning", and wrote it as a blog post, with all the information I would have liked when I started data science. The Kaggle community brings together experts and passionate people who share their code and their experiences; take advantage of that and visit as many kernels as you can! 🙂
So, that was my feedback on my first Kaggle submission. Thank you for reading it.
If you have any questions, feel free to ask.
– – –