Exploratory Data Analysis (EDA) tools provide a better understanding of the data variables and their relationships. This, in turn, helps in deciding how to manipulate data for complex analysis, such as choosing the machine learning algorithm, selecting features, and creating new features using business knowledge. It also helps stakeholders confirm whether they are asking the right questions and whether the data is suitable for the business problem or opportunity.
As an example, let’s say that we are analyzing marketing data for campaign effectiveness using exploratory data analysis tools. During the EDA process, we observe that the number of website purchases in the last month has decreased even though website visits increased, i.e., a negative correlation is seen.
Analyzing further, we see a positive correlation between website visits and the number of deals purchased. This suggests that deals are effective in increasing web purchases.
The objectives of using exploratory data analysis tools are to gain insights from a summary of the data, identify patterns, uncover data quality issues, spot correlations, and highlight outliers for accurate analysis.
EDA with Techcanvass
We will also explore different examples for an in-depth understanding of the concepts. Knowledge of basic descriptive statistics will be helpful. To refresh the concepts, you can check out our post related to Descriptive Statistics.
We will be working with marketing data from Kaggle. This data has to be analyzed because the recent marketing campaigns have not been as effective as expected. We start with EDA before the actual analysis. The field descriptions of the variables are shown below.
Year_Birth | Customer’s birth year |
Dt_Customer | Date of customer’s enrollment with the company |
Recency | Number of days since the last purchase |
MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds | Amount spent on wines, fruits, meat, fish, sweets, and gold, respectively, in the last two years |
NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases | The number of purchases made using deals, website, catalog, and stores. |
NumWebVisitsMonth | Number of visits to company’s website in the last month |
AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedLastCam | 1 if the customer accepted the offer in the 1st, 2nd, 3rd, 4th, 5th, and last campaign, respectively, else 0 |
Complain | 1 if the customer complained in the last two years, else 0 |
Types of Exploratory Data Analysis
EDA is generally cross-classified. It can be done non-graphically or graphically and is further divided into either univariate or multivariate. There are four types of EDA:
Univariate Non-Graphical
This looks at single variables like age, categories, state, salary, etc.
For categorical variables like states or categories, a simple tabulation of the frequency for each category is helpful. For example, the percentage of sales in five categories gives us an idea of which category is not performing well.
Quantitative (numerical) data is best understood by describing its center, spread, shape, and outliers. Summary statistics help here: the mean, median, and mode (central tendency); the maximum, minimum, interquartile range (Q3 – Q1; the larger the IQR, the more spread out the data), and standard deviation (spread); the total count; skewness (asymmetry); and so on.
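The tabulation and summary statistics described above can be sketched with Python's standard library. This is a minimal illustration, not the workflow used for the Kaggle data: the sample values and the helper names `frequency_table` and `summarize` are assumptions made up for this example.

```python
from collections import Counter
import statistics

# Hypothetical sample values, not taken from the Kaggle dataset.
education = ["Graduation", "PhD", "Graduation", "Master", "Graduation", "PhD"]
recency = [10, 25, 25, 40, 58, 94]


def frequency_table(values):
    """Return {category: percentage of rows} for a categorical column."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * n / total for category, n in counts.items()}


def summarize(values):
    """Return basic measures of center and spread for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


education_pct = frequency_table(education)
recency_summary = summarize(recency)
print(education_pct)
print(recency_summary)
```

In practice a data-frame library would usually do this work; the standard library is used here only to keep the sketch self-contained.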
Univariate Graphical
For categorical data, bar charts work very well.
A histogram is a bar plot in which each bar represents the frequency (count) of the cases for a range of values. Histograms are used for continuous or discrete numerical data. They help in understanding the central tendency, spread, shape, and outliers.
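To make the idea of "frequency per range of values" concrete, here is a small sketch that counts values in equal-width bins and prints a text histogram. The birth years are hypothetical, and `histogram_counts` is an illustrative helper, not part of any library.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each [low, low + bin_width) range.

    Only bins that contain at least one value appear in the result.
    """
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        low = start + index * bin_width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical birth years, not taken from the Kaggle dataset.
year_birth = [1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 1974, 1950]

bins = histogram_counts(year_birth, bin_width=10, start=1950)
for low, count in bins.items():
    # One '#' per customer whose birth year falls in this decade.
    print(f"{low}-{low + 9}: {'#' * count}")
```

The choice of bin width changes the shape you see, so it is worth trying a few widths before drawing conclusions.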
Box plots, which graphically describe the five-number summary of minimum, first quartile, median, third quartile, and maximum, are another way of viewing univariate variables. A box is drawn from Q1 to Q3, and the whiskers go from each quartile to the minimum or maximum values. Hence, they give information about the central tendency, spread, symmetry, and outliers using robust statistics like the median and interquartile range.
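The five-number summary behind a box plot can be computed as in the sketch below. The sample data is hypothetical. The 1.5 × IQR rule used to flag outliers is a common convention assumed here for illustration; it is not stated in the text above, and quartile values can differ slightly between tools depending on the interpolation method.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, Q1, median, Q3, and maximum of a numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def iqr_outliers(values, factor=1.5):
    """Flag values beyond factor * IQR from the quartiles (assumed convention)."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    lower = summary["q1"] - factor * iqr
    upper = summary["q3"] + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical monthly website visits, not taken from the Kaggle dataset.
visits = [2, 4, 4, 5, 7, 8, 9, 11, 40]

summary = five_number_summary(visits)
outliers = iqr_outliers(visits)
print(summary)
print("Possible outliers:", outliers)
```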
Multivariate Non-Graphical
Cross-tabulated reports of two or more variables fall into this category.
A cross-tabulated report works well for two categorical variables (for example, education and marital status), as it gives an idea of the counts or percentages for each combination.
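A cross-tabulation of two categorical variables can be sketched as a count for every combination of values. The rows below are hypothetical, and `cross_tabulate` is an illustrative helper written for this example.

```python
from collections import Counter

# Hypothetical (education, marital status) pairs, not from the Kaggle dataset.
rows = [
    ("Graduation", "Married"),
    ("Graduation", "Single"),
    ("PhD", "Married"),
    ("Master", "Single"),
    ("Graduation", "Married"),
    ("PhD", "Single"),
]


def cross_tabulate(pairs):
    """Return a nested dict: table[row_value][column_value] = count."""
    counts = Counter(pairs)
    row_values = sorted({r for r, _ in pairs})
    col_values = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_values} for r in row_values}


table = cross_tabulate(rows)
for education_level, by_status in table.items():
    print(education_level, by_status)
```

Dividing each count by the total number of rows gives the percentage version of the same report.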
For two quantitative variables, their correlation tells us how strongly and in which direction one variable tends to change when the other changes. It is a value between -1 and +1: values close to -1 indicate a strong negative relationship, values close to +1 indicate a strong positive relationship, and values near 0 indicate little or no linear relationship. Correlation is a handy measure for understanding relationships between variables, especially during feature selection for machine learning algorithms.
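As a sketch of how such a correlation value is obtained, the function below computes the Pearson correlation coefficient, which is assumed here to be the correlation meant above. The numbers are made up to mirror the marketing example (visits against deals and web purchases) and are not taken from the Kaggle dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-customer values.
web_visits = [2, 4, 5, 7, 8]
deals_purchased = [1, 2, 2, 4, 5]
web_purchases = [8, 6, 6, 3, 2]

visits_vs_deals = pearson_correlation(web_visits, deals_purchased)
visits_vs_web = pearson_correlation(web_visits, web_purchases)
print(f"visits vs deals purchased: {visits_vs_deals:.2f}")
print(f"visits vs web purchases:   {visits_vs_web:.2f}")
```

A positive result for the first pair and a negative result for the second would match the pattern described in the marketing example. Correlation alone does not establish that one variable causes the other.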
Multivariate Graphical
Side-by-side box plots are a valuable method of visualizing the relationship between a categorical variable (Gender) and a quantitative variable (Unemployed), as seen in this figure.
Side-by-side bars are another way of visualizing categorical and quantitative variables.
Scatter plots, which help visualize two quantitative variables, have one variable on the x-axis and another on the y-axis. A point is plotted for each observation in the dataset. More than two variables can be accommodated in a scatter plot by using different marker sizes and shapes to distinguish them.
If you are interested in learning more about Exploratory Data Analysis, check out our blog on the Objectives of Exploratory Data Analysis, which defines the milestones in your data science journey.
Popular EDA Tools
Python, R, and Excel are some of the popular EDA tools.
For instance, Python has many built-in functions for data cleaning and data analysis. R is also an open-source programming language and is widely used by statisticians and data scientists for analysis. Excel is the simplest tool to start your data exploration with. With its many built-in functions and add-on tools, we can perform in-depth analysis.
Final Remarks
There is no strictly carved-out path for EDA. Many other visualizations and techniques can be used; we have listed only a few here. EDA is also about understanding your data from the business and analysis perspectives, assessing data quality, and uncovering patterns and relationships. So, without biases, go ahead and talk to your data!
If you are interested in learning more about Data Analysis, you can opt for CBDA-Training from Techcanvass to pursue a career in Business Analysis.