Exploratory Data Analysis: Tools, Types & Functions

Exploratory Data Analysis (EDA) tools provide a better understanding of the data variables and their relationships. This, in turn, helps in deciding how to prepare the data for more complex analysis, such as choosing a machine learning algorithm, selecting features, and creating new features using business knowledge. It also helps stakeholders confirm whether they are asking the right questions and whether the data is suitable for the business problem or opportunity.

As an example, let’s say that we are analyzing marketing data for campaign effectiveness using exploratory data analysis tools. During the EDA process, we observe that the number of website purchases in the last month has decreased even though website visits have increased, i.e., a negative correlation is seen.

Campaign Effectiveness EDA

So, we analyze further and see a positive correlation between website visits and the number of deals purchased. This suggests that deals are effective in increasing web purchases.

The objectives of using exploratory data analysis tools are to summarize the data, identify patterns, uncover data quality issues, spot correlations, and highlight outliers so that the later analysis is accurate.

EDA with Techcanvass

We will also explore different examples for an in-depth understanding of the concepts. Knowledge of basic descriptive statistics will be helpful. To refresh the concepts, you can check out our post related to Descriptive Statistics.

We will be working with marketing data from Kaggle. This data has to be analyzed because the recent marketing campaigns have not been as effective as expected. We start with EDA before the actual analysis. The field descriptions of the variables are shown below.

Year_Birth: Customer’s birth year
Dt_Customer: Date of the customer’s enrollment with the company
Recency: Number of days since the last purchase
MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds: Amount spent on wines, fruits, meat, fish, sweets, and gold, respectively, in the last two years
NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases: Number of purchases made using deals, the website, the catalog, and stores, respectively
NumWebVisitsMonth: Number of visits to the company’s website in the last month
AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2, AcceptedLastCam: 1 if the customer accepted the offer in the 3rd, 4th, 5th, 1st, 2nd, and last campaign, respectively, else 0
Complain: 1 if the customer complained in the last two years, else 0
Field description of the variables

Types of Exploratory Data Analysis

EDA is generally cross-classified. It can be done non-graphically or graphically and is further divided into either univariate or multivariate. There are four types of EDA:

Univariate Non-Graphical

This looks at single variables like age, categories, state, salary, etc.

For categorical variables like states or categories, a simple tabulation of the frequency for each category is helpful. For example, the percentage of sales in five categories gives us an idea of which category is not performing well.

Categorical Univariate Non-graphical
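The tabulation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `education` values are made up for the example and are not taken from the Kaggle dataset, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical category values, invented for illustration only.
education = ["Graduation", "PhD", "Graduation", "Master",
             "PhD", "Graduation", "Basic", "Master"]

def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, round(100 * n / total, 1))
            for cat, n in counts.most_common()]

table = frequency_table(education)
for category, count, percent in table:
    print(f"{category:<12}{count:>3}{percent:>7}%")
```

If the data is already in a pandas DataFrame, `Series.value_counts()` gives the same counts, and passing `normalize=True` gives proportions.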

Quantitative (numerical) data is best understood by describing its center, spread, shape, and outliers. Summary statistics help here: the mean, median, and mode (central tendency), the maximum and minimum, the interquartile range (IQR = Q3 – Q1; the larger the IQR, the more spread out the data), the standard deviation (spread), the total count, the skewness (asymmetry), and so on.

Quantitative Univariate Non-graphical
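As a rough sketch of these summary statistics, the snippet below uses Python's standard `statistics` module. The `visits` list is invented sample data, `summarize` is a hypothetical helper, and the "inclusive" quartile method is just one of several common conventions, so quartile values may differ slightly from those reported by other tools.

```python
import statistics

# Hypothetical monthly web-visit counts, invented for illustration only.
visits = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]

def summarize(values):
    """Collect common summary statistics for one numerical variable."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

summary = summarize(visits)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

In this sample the mean sits above the median because of the single large value, which is a quick hint of right skew. With pandas, `DataFrame.describe()` reports most of these figures in one call.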

Univariate Graphical

For categorical data, bar charts work very well.

A histogram is a bar plot in which each bar represents the frequency (count) of cases within a range of values. Histograms are suited to continuous (numerical) data, whereas bar charts suit categorical data. They help in understanding the central tendency, spread, shape, and outliers.

Univariate Graphical – Histogram
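To make the binning idea concrete, here is a minimal sketch that counts values in equal-width bins and prints a text histogram. The `ages` list is invented for illustration, `histogram` is a hypothetical helper, and the sketch assumes integer values and an integer bin width.

```python
# Hypothetical customer ages, invented for illustration only.
ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 58, 79]

def histogram(values, bin_width):
    """Return (bin_lower_edge, count) pairs for equal-width bins."""
    lowest = (min(values) // bin_width) * bin_width
    n_bins = (max(values) - lowest) // bin_width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - lowest) // bin_width] += 1
    # Empty bins are kept so that gaps in the data stay visible.
    return [(lowest + i * bin_width, c) for i, c in enumerate(counts)]

bin_width = 10
bins = histogram(ages, bin_width)
for lower, count in bins:
    print(f"{lower}-{lower + bin_width - 1}: {'#' * count}")
```

The empty 60-69 bin followed by a single value in 70-79 is the kind of gap that flags a possible outlier. Plotting libraries draw the same counts as bars.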

Box plots, which graphically describe the five-number summary of minimum, first quartile, median, third quartile, and maximum, are another way of viewing a single variable. A box is drawn from Q1 to Q3, and the whiskers extend from each quartile to the minimum or maximum value. Hence, they give information about the central tendency, spread, symmetry, and outliers using robust statistics like the median and interquartile range.

Univariate Graphical – Box Plots
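The numbers behind a box plot can be computed directly. The sketch below returns the five-number summary and, as an assumption that goes beyond the description above, flags outliers with the widely used 1.5 × IQR rule (many plotting tools stop the whiskers at these fences instead of at the minimum and maximum). The data and the helper name `box_plot_stats` are hypothetical.

```python
import statistics

# Hypothetical monthly web-visit counts, invented for illustration only.
visits = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]

def box_plot_stats(values, k=1.5):
    """Five-number summary plus outliers beyond k * IQR from the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "outliers": outliers,
    }

stats = box_plot_stats(visits)
print(stats)
```

Here the value 20 lies above the upper fence (Q3 + 1.5 × IQR), so it is reported as an outlier rather than treated as part of the typical range.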

Multivariate Non-Graphical

Cross-tabulated reports of two or more variables fall into this category.

A cross-tabulated report works well for two categorical variables (for example, education and marital status), giving an idea of the counts or percentages for each combination.

Categorical Multivariate Non-Graphical – Cross Tab
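A cross tab is essentially a count of each pair of categories. The following is a minimal sketch with the standard library; the `records` pairs are invented sample data and `crosstab` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (education, marital status) pairs, for illustration only.
records = [
    ("Graduation", "Married"), ("PhD", "Single"),
    ("Graduation", "Single"), ("Master", "Married"),
    ("Graduation", "Married"), ("PhD", "Married"),
]

def crosstab(pairs):
    """Return row labels, column labels, and a matrix of pair counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix

rows, cols, matrix = crosstab(records)
print(f"{'':<12}" + "".join(f"{c:>9}" for c in cols))
for label, row in zip(rows, matrix):
    print(f"{label:<12}" + "".join(f"{n:>9}" for n in row))
```

With pandas, `pandas.crosstab` builds the same table from two columns, and its `normalize` parameter switches from counts to proportions.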

For two quantitative variables, their correlation tells us how strongly, and in which direction, one variable tends to change when the other changes. It is a value between -1 and +1: values near -1 indicate a strong negative relationship, values near +1 a strong positive relationship, and values near 0 little or no linear relationship. Correlation is a handy measure for understanding relationships between variables, especially during feature selection for machine learning algorithms.

Quantitative Multivariate Non-Graphical – Correlation
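As a small worked example, the sketch below computes the Pearson correlation coefficient, the most common correlation measure, from its definition. The three lists are invented numbers shaped to mimic the campaign example from the introduction; they are not values from the Kaggle dataset, and `pearson` is a hypothetical helper.

```python
from math import sqrt

# Hypothetical per-customer values, invented for illustration only.
web_visits = [1, 2, 3, 4, 5, 6]
deals_purchased = [1, 1, 2, 3, 3, 5]
web_purchases = [6, 5, 5, 3, 2, 1]

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

r_deals = pearson(web_visits, deals_purchased)
r_purchases = pearson(web_visits, web_purchases)
print(f"visits vs deals:     {r_deals:+.2f}")
print(f"visits vs purchases: {r_purchases:+.2f}")
```

In this made-up sample, visits and deals move together (coefficient close to +1), while visits and web purchases move in opposite directions (close to -1). With pandas, `DataFrame.corr()` returns the full matrix of pairwise coefficients.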

Multivariate Graphical

Side-by-side box plots are a valuable method of visualizing the relationship between a categorical variable (Gender) and a quantitative variable (Unemployed), as seen in this figure.

Side-by-side bars are another way of visualizing categorical and quantitative variables.

Categorical Multivariate Graphical – Side-by-side box plot
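The figures that a side-by-side box plot draws can be sketched by splitting the quantitative variable by category and summarizing each group. The snippet below is illustrative only: the (education, amount spent) pairs are invented, `grouped_quartiles` is a hypothetical helper, and the "inclusive" quartile method is one convention among several.

```python
import statistics
from collections import defaultdict

# Hypothetical (category, amount spent) pairs, for illustration only.
rows = [
    ("Graduation", 10), ("Graduation", 20), ("Graduation", 30),
    ("Graduation", 40), ("Graduation", 50),
    ("PhD", 30), ("PhD", 50), ("PhD", 70), ("PhD", 90), ("PhD", 110),
]

def grouped_quartiles(pairs):
    """Return count, Q1, median, and Q3 of the values in each category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    result = {}
    for category, values in groups.items():
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        result[category] = {"n": len(values), "q1": q1,
                            "median": median, "q3": q3}
    return result

by_group = grouped_quartiles(rows)
for category, stats in by_group.items():
    print(category, stats)
```

Comparing the medians and the Q1 to Q3 ranges across groups is the same comparison the eye makes between adjacent boxes in the plot.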

Scatter plots, which help visualize two quantitative variables, have one variable on the x-axis and another on the y-axis. A point is plotted for each observation in the dataset. More than two variables can be shown in the same scatter plot by varying the size, shape, or color of the points.

Quantitative Multivariate Graphical – Scatter Plots

If you are interested in learning more about Exploratory Data Analysis, check out our blog post on the Objectives of Exploratory Data Analysis, which defines the milestones in your data science journey.

Popular EDA Tools

Python, R, and Excel are some of the popular EDA tools.

Python offers many built-in functions and libraries for data cleaning and data analysis. R is also an open-source programming language and is widely used by statisticians and data scientists for analysis. Excel is the simplest tool with which to start your data exploration. With its many built-in functions and add-on tools, we can perform in-depth analysis.

Final Remarks

There is no strictly defined path for EDA. Many other visualizations and techniques can be used; we have listed only a few here. EDA is also about understanding your data from the business and analysis perspectives, improving data quality, and uncovering patterns and relationships. So, without biases, go ahead and talk to your data!

If you are interested in learning more about Data Analysis, you can opt for CBDA-Training from Techcanvass to pursue a career in Business Analysis.
