The more we do, the more we learn

Python data science, analysis, tools,
and more

We're the data generation

Most posts on this blog will deal with data science and digital innovation. It is an exciting field, and much more than just "the sexiest job of the 21st century".

Whether in health, sport, finance, the web, or agriculture, humans have always analyzed data. Only the tools and statistical models have evolved. Let’s explore that!

The Blog

Data scientist is a job, and most topics are written with a professional context in mind. But because data science can also be a great hobby, I may cover a variety of other subjects too.

Readings & Tips

Statistics refreshers, useful extensions for Jupyter Notebooks, or data scientists' work experiences: you will find relevant posts here to improve your skills.

Tool Box

I will share some helpful functions I use in my data exploration and machine learning work. Feel free to share your own ideas!

Ready to get into it?

Let me introduce a first big picture of the data scientist's work. This overview will help you avoid getting lost in this vast, wonderful ocean.

Python, R, SQL

I chose to focus on Python because I find it much more intuitive than R. You may also find some SQL resources here.

Machine learning is divided into four main types of learning: supervised, unsupervised, semi-supervised, and reinforcement.

  • Supervised: regression or classification algorithms
  • Unsupervised: clustering, association, autoencoders, anomaly detection
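To make the supervised case concrete, here is a minimal sketch of a regression: a straight line fitted to labeled examples with ordinary least squares, using only plain Python. The function name `fit_line` and the tiny dataset are made up for illustration; in practice you would more likely reach for a library such as scikit-learn.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical labeled data: each input (hours) comes with a known output (score)
hours = [1, 2, 3, 4]
scores = [3, 5, 7, 9]

slope, intercept = fit_line(hours, scores)

# Use the learned line to predict the output for an unseen input
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```

The key point is that the algorithm learns from examples where the answer is already known. An unsupervised method would receive only `hours`, with no `scores` to learn from.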

You will also have to deal with topics such as:

  • Regularization: Ridge, Lasso, Elastic Net
  • Cross-validation: K-fold
  • Evaluation metrics: ROC, CAP, AUC, accuracy, sensitivity (recall), specificity (selectivity), precision
  • Dimensionality reduction: Principal Component Analysis, Linear Discriminant Analysis, Kernel PCA
  • Boosting: XGBoost
  • Scaling: standardization, normalization
  • Encoding: TargetEncoder
  • Problems: overfitting, dirty and missing data, dummy variable trap, outliers, imbalanced datasets
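As an illustration of the evaluation metrics listed above, here is a small sketch that computes accuracy, sensitivity, specificity, and precision for a binary classifier in plain Python. The function name `classification_metrics` and the label lists are invented for the example, and the sketch assumes both classes appear among the true and predicted labels (otherwise a division by zero would occur).

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Looking at several metrics together matters most with imbalanced datasets, where accuracy alone can look good even when the model misses most of the rare class.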

With Python, you can easily create visualizations with Matplotlib and Seaborn.

If you need more options and interactive visualizations, I invite you to check out what is possible with Bokeh and Dash.

You can also use Tableau, a very interesting piece of software for data visualization, as well as Plotly.

And ggplot2 for R.

Coursera, Udacity, Udemy, edX, and others: there are a lot of free and inexpensive courses (less than $20 if you wait for offers).

Choose the most recent ones, and make sure they use Python 3. Data science is full of novelties, and the tools keep evolving.

Here’s the article with Python courses for data science.