By Atanu Maity

Too Lazy to Do the Data Exploration? Wait, There Is a Way Out!!

Those of us who spend most of our day playing with dirty data sometimes feel that if somebody could explore the data well and tell us its ins and outs, our work would shrink by almost 70%. But most of the time we get lost in the exploration of the DIRTY DATA. We often have no idea where to start and where to end while exploring it, sometimes because of a lack of domain knowledge and sometimes because of limited experience in the domain.


But there is hope, even if only a little. Python users have a solution, or rather a BIG HELPER, for their daily EDA job. Just install a package and that's it. The package API keeps the data profiling (the basic profiling, obviously) ready for you, so you don't have to dip your fingers into the dirty data to prepare the basic data report. The package will do that for you, HOLLA!


Let's stop talking and get our hands into the code.


Package Name: 'pandas_profiling'


Installation: pip install pandas-profiling

Note: If you are installing with Conda instead, the command (via the conda-forge channel) is,

conda install -c conda-forge pandas-profiling


How to use it:

I guess it's the easiest part of the whole story. Just import the package and pass your data through its reporting function. That simple.


How to import: import pandas_profiling
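
To make the flow concrete, here is a minimal sketch of the end-to-end usage. The file name weather.csv is just a placeholder for your own dataset:

import pandas as pd
import pandas_profiling

# Load your dataset into a pandas DataFrame ('weather.csv' is a placeholder)
df = pd.read_csv("weather.csv")

# Pass the DataFrame through the reporting function
profile = pandas_profiling.ProfileReport(df)

# In a Jupyter notebook, 'profile' renders inline; from a script,
# export the full report as a standalone HTML file instead
profile.to_file("weather_report.html")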


For your reference, I am adding an ipynb here, where I have tried to show how the reporting function actually works. I used a sample weather dataset for the experiment.


Link: https://github.com/ytiam/my_personal/blob/master/pandas_profiling_test.ipynb


Note: The images in the ipynb may appear distorted depending on your screen resolution. Try running the methods described in the ipynb on your own dataset and experiment with their functionality.


A single command, 'ProfileReport', produces a basic EDA report that includes (a plain-pandas comparison follows this list):

  • Essentials: type, unique values, missing values

  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

  • Most frequent values

  • Histogram

  • Correlations: highlighting of highly correlated variables, Spearman and Pearson matrices
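
Just for comparison, here is roughly what assembling a few of these statistics by hand with plain pandas looks like (assuming mostly numeric columns); ProfileReport bundles all of them, plus the histograms, into a single report:

import pandas as pd

df = pd.read_csv("weather.csv")  # placeholder dataset

# Essentials: types, unique values, missing values
print(df.dtypes)
print(df.nunique())
print(df.isnull().sum())

# Quantile and descriptive statistics (min, Q1, median, Q3, max, mean, std)
print(df.describe())

# Skewness and kurtosis of the numeric columns
print(df.skew())
print(df.kurtosis())

# Pearson and Spearman correlation matrices
print(df.corr(method="pearson"))
print(df.corr(method="spearman"))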


In addition to the above features, it has a special report section, 'Warnings', which gives you a glimpse one step ahead of basic EDA. This feature can give you an idea of which columns of the data are highly correlated with one another and so which of them can be rejected. In the same way, it surfaces other information that can be an add-on towards understanding the data and can also be useful for further analysis.
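
As a rough sketch of how this can feed into further analysis: older (v1-era) versions of the package also exposed this rejection logic programmatically through get_rejected_variables; the 0.9 threshold below is an illustrative value, not a recommendation:

import pandas as pd
import pandas_profiling

df = pd.read_csv("weather.csv")  # placeholder dataset

profile = pandas_profiling.ProfileReport(df)

# Columns flagged as rejected because they are highly correlated with
# another column (older pandas_profiling API; threshold is illustrative)
rejected = profile.get_rejected_variables(threshold=0.9)
print(rejected)

# Drop the rejected columns before further analysis
df_reduced = df.drop(columns=rejected)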


For more details, visit the official GitHub page of 'pandas_profiling': https://github.com/pandas-profiling/pandas-profiling


Thanks for reading. Let me know if you have any questions or doubts about using this package by commenting below. Happy Learning.
