After several data analyses, I noticed that I kept repeating the same operations. I once read a rule of thumb that said "if you repeat yourself more than 3 times, write a function".
New datasets can take several hours to get used to, especially if we are not experts in the field. And once we are finally comfortable with the data, we have to trim it down and keep only the variables with "potential". Depending on the documentation provided, we may also have to decode the names of columns and variables, which sometimes requires more in-depth research on the subject.
To facilitate this work, I needed two functions:
- one allowing me to study and measure the relationships, if any, between the variables of my dataset and my target variable (the one to predict). A data visualization would be perfect for that!
- the other allowing me to familiarize myself quickly with all the values in the table. What are the different values in each column? How many times does each one appear? A table will do the trick.
Two hours later, I had coded something satisfying. For the example, I use the House Prices dataset provided on Kaggle, which is the Ames Housing dataset compiled by Dean De Cock.
Function 1: Overview plot
This first function displays the correlation of each variable in the dataset with your target variable. In my example, the target variable is SalePrice, the sale price of each house in the Kaggle House Prices dataset.
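To give an idea of how such a function can be put together, here is a minimal sketch, assuming pandas and matplotlib. The names `target_correlations` and `overview_plot` are hypothetical, and the choice of Pearson correlation on numeric columns only is an assumption, so this may differ from the function used for the figure in this article.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every numeric column with `target`, strongest first.

    Assumption: Pearson correlation, numeric columns only; `target` must be numeric.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def overview_plot(df: pd.DataFrame, target: str):
    """Horizontal bar chart of the correlations computed above."""
    # Imported here so the computation above works without matplotlib installed.
    import matplotlib.pyplot as plt

    corr = target_correlations(df, target)
    fig, ax = plt.subplots(figsize=(8, max(2, 0.3 * len(corr))))
    # Reverse so the strongest correlation ends up at the top of the chart.
    corr.iloc[::-1].plot.barh(ax=ax)
    ax.axvline(0, color="black", linewidth=0.8)
    ax.set_xlabel(f"Correlation with {target}")
    fig.tight_layout()
    return ax
```

With the Kaggle training file loaded into a DataFrame (for instance `train = pd.read_csv("train.csv")`, where the file name is an assumption), the call would be `overview_plot(train, "SalePrice")`. Categorical columns are ignored in this sketch; they would need to be encoded first to appear in the chart.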
Function 2: Overview table
This second function displays the different values found in each column. For example, the third row of the first column reads RL [1151 = 78.84%], where RL is one of the values of the MSZoning column, 1151 is its number of occurrences and 78.84% is its share of the rows.
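The same idea can be sketched in a few lines of pandas. This is only an illustration: the name `overview_table`, the `max_values` parameter and the exact layout (one output column per dataset column, most frequent values first) are assumptions, not necessarily what produced the table in this article.

```python
import pandas as pd


def overview_table(df: pd.DataFrame, max_values: int = 10) -> pd.DataFrame:
    """For each column, list its most frequent values as 'value [count = percent]'."""
    total = len(df)
    columns = {}
    for name in df.columns:
        # Count every distinct value (missing values included), most frequent first.
        counts = df[name].value_counts(dropna=False).head(max_values)
        cells = [
            f"{value} [{count} = {count / total:.2%}]"
            for value, count in counts.items()
        ]
        columns[name] = pd.Series(cells, dtype="object")
    # Columns with fewer distinct values are padded with empty strings.
    return pd.DataFrame(columns).fillna("")
```

Calling `overview_table(train)` on the housing data would then give one column of formatted cells per original column, which is convenient to scroll through in a notebook. Limiting the output with `max_values` keeps columns with many distinct values, such as identifiers or prices, from making the table unreadable.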
Thank you for reading this article. I hope these functions can help you in your own data analysis. If you also have tools like the ones presented here, do not hesitate to share them in the comments. I will be delighted to discover your portfolio! 🙂