Atanu Maity
Too Lazy to Do the Data Exploration? Wait, There Is a Way Out!!
Those of us who spend most of the hours in a day playing with dirty data sometimes feel that if somebody could explore the data well and tell us its ins and outs, our workload would shrink by roughly 70%. But most of the time we simply get lost in the exploration of the DIRTY DATA. We often have no idea where to start and where to stop, sometimes because of a lack of domain knowledge and sometimes because of limited experience in that domain.
But there is hope, a small one, yet real. Python users have a solution, or rather a BIG HELPER, for their daily EDA job. Just install a package and that's it. The package API keeps the basic data profiling ready in front of you, so you don't have to dig your fingers into the dirty data to put the basic data report together. The package does that for you. HOLLA!
Let's stop talking and get our hands into the code.
Package Name: 'pandas_profiling'
Installation: pip install pandas-profiling
Note: If anyone prefers to install using Conda, the package is available on the conda-forge channel:
conda install -c conda-forge pandas-profiling
How to use it:
I guess it's the easiest part of the whole story. Just import the package and pass your data through its reporting function; it's that simple.
How to import: import pandas_profiling
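A minimal sketch of the workflow (the file name 'weather.csv' is just a placeholder for your own dataset):

import pandas as pd
import pandas_profiling

# Load any tabular dataset into a DataFrame (replace the file name with your own)
df = pd.read_csv("weather.csv")

# One call builds the whole profiling report
report = pandas_profiling.ProfileReport(df)

# In a Jupyter notebook, leaving 'report' as the last expression of a cell renders the report inline
report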
For your reference, I am adding one ipynb here, where I try to show how the reporting function actually works. I have used a sample weather dataset for the experiment.
Link: https://github.com/ytiam/my_personal/blob/master/pandas_profiling_test.ipynb
Note: The images in the ipynb may look distorted due to your screen resolution. Try running the methods described in the ipynb on your own dataset and experiment with its functionalities.
A single command, 'ProfileReport', produces a basic EDA report (a short usage sketch follows the list below) that includes:
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables, Spearman and Pearson matrices
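If you want to share the report outside the notebook, the same ProfileReport object can also be dumped to a standalone HTML page; a small sketch, assuming the same placeholder 'weather.csv' file:

import pandas as pd
import pandas_profiling

df = pd.read_csv("weather.csv")  # placeholder file name

# Build the report and write it out as a self-contained HTML file you can open in a browser
report = pandas_profiling.ProfileReport(df)
report.to_file("weather_profile_report.html")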
In addition to the above features, it has a special report section, 'Warnings', which gives you a glimpse one step ahead of basic EDA. It can show you which columns of the data are highly correlated with others and can therefore be rejected. It also surfaces other pointers (for example, columns with a large share of missing or zero values), which add to your understanding of the data and can be useful for further analysis.
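For instance, in the older 1.x releases of pandas_profiling the report object exposed a get_rejected_variables helper that returns the columns flagged as highly correlated; newer releases may organise this differently, so treat the sketch below (and the 0.9 threshold) as an assumption to check against your installed version:

import pandas as pd
import pandas_profiling

df = pd.read_csv("weather.csv")  # placeholder file name
report = pandas_profiling.ProfileReport(df)

# Columns whose correlation with another column crosses the threshold;
# these are the ones the 'Warnings' section suggests dropping
rejected = report.get_rejected_variables(threshold=0.9)
print(rejected)

# Optionally drop them before further analysis
df_reduced = df.drop(columns=rejected)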
For more details visit the official GitHub page of 'pandas_profiling': https://github.com/pandas-profiling/pandas-profiling
Thanks for reading. Let me know if there are any questions or doubts regarding the application of this package by logging in and commenting below. Happy Learning.