Exploratory Data Analysis (EDA) has been around since the early 1970s. It was defined by John Tukey, a great mathematician and statistician, who explained it this way: “Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.” He offered an apt analogy for EDA: an investigation carried out by a detective. Like a detective, we dig deep into piles of data to find clues that will aid the actual data analysis. In this blog, I share my understanding of the exploratory data analysis steps and try to draw as many insights as possible from a data set using EDA.
EDA is the foundation stone, a vital step before you begin the data analysis. EDA techniques reveal the true nature of the data. The better you know your data (the more clues you have), the better your analysis (the case outcome)!
Why Is EDA Important?
Learning what you can do with the available data will make your final analysis more robust and effective. Open-minded exploration of data provides valuable information. It is all about finding and revealing clues! Anyone working with data, whether a researcher, an analyst, or a business intelligence professional, will spend much of their time on EDA by following the exploratory data analysis steps.
The objective of EDA is to “understand” the data as follows:
- Confirm if the data is making sense in the context of the business problem.
- Get insights into the data summary.
- Detect outliers and anomalies.
- Understand patterns and correlations between data variables.
- Uncover and resolve data quality issues.
- Drop unwanted columns and derive new variables.
How to Perform EDA?
EDA gives you the flexibility to talk to your data. It is not a formal process with strict rules. It is an iterative approach to understanding data, where the data is investigated and explored without any assumption or bias. Broadly, though, there are three main parts to EDA:
- Prepare questions related to the business goal (context/problem you are working with).
- Generate answers by cleaning, transforming, summarizing, and visualizing data.
- Based on your learning, refine/prepare new questions.
Technical, statistical, and mathematical knowledge, combined with domain knowledge, is critical for performing EDA.
EDA with Techcanvass
We will be working with a telecom churn dataset from Kaggle. (Some changes have been made to explain some concepts.) Churn indicates a customer leaving the service to join another service. All businesses want to prevent churn and retain their customers. So, this is an essential metric in all industries. The snippet of the data looks like this:
Our goal is to minimize the churn percentage by identifying the customers who have a high churn probability. Once identified, we want to take steps to retain them.
Univariate: Data summaries for single variables using descriptive statistics are very handy to give you an idea of how the values in the dataset look.
- Looking at the counts of our data summary, we can see that there are missing values.
- Skew indicates whether the values are distributed symmetrically or not. A skew greater than +1 for ‘Account length’ points to a long right tail, i.e., unexpectedly high values that may be outliers.
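The univariate summary described above can be sketched with pandas. This is a minimal illustration, not the code used for this blog: the small data frame is a hypothetical stand-in for the telecom dataset, with one missing value and one unusually high account length planted on purpose.

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Account length": [90, 105, 120, 98, 110, 101, 95, 400],
    "Total day minutes": [180.5, 210.0, None, 150.2, 199.9, 175.0, 160.3, 188.8],
})

summary = df.describe()    # count, mean, std, min, quartiles, max per column
missing = df.isna().sum()  # number of missing values per column
skewness = df.skew()       # sample skewness per numeric column

print(summary)
print(missing)
print(skewness)
```

A count lower than the number of rows in `summary` signals missing values, and a large positive skew suggests a few very high values pulling the tail to the right.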
Multivariate: You can use cross-tabulation (also called a contingency table) to analyze multiple variables. It helps us compare or analyze relationships between them. This table tells us that customers who have an international plan are more likely to churn.
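A cross-tabulation like this can be built with `pandas.crosstab`. The sketch below assumes hypothetical column values ("Yes"/"No" strings) and made-up rows, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "International plan": ["No", "No", "No", "No", "Yes", "Yes", "No", "Yes"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "No"],
})

# Raw counts for every combination of plan and churn status.
counts = pd.crosstab(df["International plan"], df["Churn"])

# Row-wise proportions: churn rate within each plan group.
rates = pd.crosstab(df["International plan"], df["Churn"], normalize="index")

print(counts)
print(rates)
```

Normalizing by row makes the groups comparable even when one group is much larger than the other.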
Getting to Know the Dataset
We now examine the data to check for data quality issues because that is a significant factor that will affect the quality of the data analysis.
- Sanity Checks: Do all the rows contain appropriate information as suggested in the column names? In our data set, adding filters to the columns tells us that the row values are valid. For example, the ‘State’ column has alphabetical short-forms for states and contains no numerals or Boolean values.
- Missing Values: Voice mail plan, Customer service calls, Churn, and almost all the minutes, charge, and call columns have missing values. We can replace missing values in ‘Voice mail plan’ with No, as ‘Number vmail messages’ is 0 for these rows. For all other missing values, we will drop the rows, as their percentages are pretty low. Alternatively, we can replace them with the mean, the mode, or other imputed values chosen using domain knowledge.
- Duplicates: Many times, when data is fetched from multiple sources, duplicate rows are a possibility. Deleting duplicates is crucial for correct calculations of the metrics.
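The missing-value and duplicate handling described in the list above might look like the following in pandas. It is a sketch under stated assumptions: the rows are hypothetical, and dropping the remaining incomplete rows is only reasonable when their share is small.

```python
import pandas as pd

# Hypothetical rows; rows 2 and 3 are exact duplicates.
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "NJ", "TX"],
    "Voice mail plan": ["Yes", None, "No", "No", None],
    "Number vmail messages": [25, 0, 0, 0, 0],
    "Total day minutes": [265.1, 161.6, 243.4, 243.4, None],
})

# Fill 'Voice mail plan' with "No" only where there are no voice mail messages.
no_messages = df["Voice mail plan"].isna() & (df["Number vmail messages"] == 0)
df.loc[no_messages, "Voice mail plan"] = "No"

# Share of missing values per column, to judge whether dropping rows is acceptable.
missing_share = df.isna().mean()

# Drop rows that still have missing values, then exact duplicate rows.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

print(missing_share)
print(clean)
```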
Anomalous Values and Fields
- EDA helps uncover strange or anomalous values and fields. In the ‘Account Length’ field, for example, most values fall in the range of 50-150, but two outliers lie far above this range. We will delete these outliers. Outliers can also be detected using histograms and scatter plots.
- We can also see that the ‘Area code’ field contains only three codes (408, 415, and 510), all of which belong to California. Yet these codes are distributed across all states in the US. This looks suspicious and needs further investigation.
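Both checks above can also be done numerically. The sketch below flags outliers with the 1.5 × IQR rule, a common convention chosen here for illustration and not one this blog prescribes, and counts how many states share each area code. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with two unusually high account lengths.
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "TX", "CA", "NY", "WA", "FL"],
    "Area code": [415, 408, 510, 415, 408, 510, 415, 408],
    "Account length": [90, 105, 120, 98, 110, 101, 400, 450],
})

# 1.5 * IQR rule: values far outside the middle 50% are flagged as outliers.
q1 = df["Account length"].quantile(0.25)
q3 = df["Account length"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = ~df["Account length"].between(lower, upper)
outliers = df[is_outlier]
trimmed = df[~is_outlier]

# An area code that appears in many different states is suspicious.
states_per_code = df.groupby("Area code")["State"].nunique()

print(outliers)
print(states_per_code)
```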
Understand the Data Types
During EDA, we will correct the data types as required using domain knowledge.
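As a rough illustration of correcting data types, the sketch below assumes a few hypothetical problems: an area code read as a number although it is really a label, charges read as text, and yes/no flags stored as strings. Which columns need which type in the real dataset is a domain decision.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a CSV file.
df = pd.DataFrame({
    "Area code": [415, 408, 510],
    "International plan": ["No", "Yes", "No"],
    "Total day charge": ["45.07", "27.47", "41.38"],
    "Churn": ["False", "True", "False"],
})

# Area codes are labels, not quantities, so treat them as categories.
df["Area code"] = df["Area code"].astype("category")

# Yes/No and True/False strings become real booleans.
df["International plan"] = df["International plan"].map({"Yes": True, "No": False})
df["Churn"] = df["Churn"].map({"True": True, "False": False})

# Numeric text becomes floating-point numbers.
df["Total day charge"] = pd.to_numeric(df["Total day charge"])

print(df.dtypes)
```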
Graphs can be used to communicate conclusions or to discover new information. Let us look at some examples.
Exploring Categorical Variables
Categorical variables take values from a limited set of categories that have no specific ordering. Bar charts and pie charts are common ways of visualizing this kind of data.
- This bar chart tells us that Texas and New Jersey have the highest number of customer churns.
- This stacked bar chart displays the above tabular report in percentages.
- This chart tells us that many customers do not have a voice mail plan. Also, improving the current voice mail plans can be an important factor in preventing churn.
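The numbers behind bar charts like these are simple category counts. Here is a minimal sketch with hypothetical rows; if matplotlib is installed, calling `.plot.bar()` on either result draws the corresponding chart.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "State": ["TX", "TX", "NJ", "NJ", "OH", "TX", "NJ", "CA"],
    "Voice mail plan": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"],
    "Churn": [True, True, True, False, False, True, True, False],
})

# Number of churned customers per state, highest first.
churn_by_state = df[df["Churn"]]["State"].value_counts()

# Share of customers with and without a voice mail plan.
plan_share = df["Voice mail plan"].value_counts(normalize=True)

print(churn_by_state)
print(plan_share)
```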
Exploring Univariate Variables
Different graphs, such as histograms, area charts, box plots, line plots, and scatter plots, can be used to explore numerical data.
- This stacked chart shows the percentage of service calls. It looks like many customers churn after the third call. These calls need to be investigated so that we can take corrective action and offer incentives.
- This area chart suggests that day users with between 80 and 125 calls are churning. We need to find out why that is happening. A similar pattern can be found in the evening and night calls.
- This histogram reveals that customers with account lengths between 90 and 180 have a higher tendency to churn. Maybe some offers or loyalty points can be introduced.
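The same questions can be checked in tabular form by computing the churn rate per value or per bin. The sketch below uses hypothetical rows and hypothetical bin edges (0, 90, 180), so its output does not reproduce the charts discussed here.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "Customer service calls": [0, 1, 1, 2, 3, 4, 4, 5],
    "Account length": [40, 75, 95, 130, 60, 150, 170, 110],
    "Churn": [False, False, False, False, False, True, True, True],
})

# Churn rate for each number of customer service calls.
churn_by_calls = df.groupby("Customer service calls")["Churn"].mean()

# Bin account length (as a histogram would) and compute churn rate per bin.
df["Length band"] = pd.cut(df["Account length"], bins=[0, 90, 180])
churn_by_band = df.groupby("Length band", observed=False)["Churn"].mean()

print(churn_by_calls)
print(churn_by_band)
```

The mean of a True/False column is the fraction of True values, which is why `.mean()` gives the churn rate directly.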
Exploring Multivariate Variables
- This scatter plot explores two variables together. We can see a pattern: customers with low day minutes and high service calls have high churn rates. So, now we have a more specific customer set to target.
We can summarize our insights as follows:
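A customer segment like this can be isolated with a boolean filter and compared against everyone else. The thresholds below (fewer than 150 day minutes, more than 3 service calls) and the rows are hypothetical; in practice the cut-offs would be read off the scatter plot.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "Total day minutes": [120.0, 250.0, 110.0, 230.0, 95.0, 210.0, 130.0, 260.0],
    "Customer service calls": [4, 1, 5, 0, 4, 2, 1, 5],
    "Churn": [True, False, True, False, True, False, True, False],
})

LOW_MINUTES = 150.0  # hypothetical threshold for "low day minutes"
HIGH_CALLS = 3       # hypothetical threshold: more than three service calls

# Customers with low day minutes AND many service calls.
segment = (df["Total day minutes"] < LOW_MINUTES) & (
    df["Customer service calls"] > HIGH_CALLS
)

segment_rate = df.loc[segment, "Churn"].mean()  # churn rate inside the segment
other_rate = df.loc[~segment, "Churn"].mean()   # churn rate for everyone else

print(f"Segment size: {segment.sum()}")
print(f"Churn rate in segment: {segment_rate:.2f}")
print(f"Churn rate elsewhere: {other_rate:.2f}")
```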
- Texas and New Jersey have the highest customer churns.
- Customers with an international plan tend to churn more frequently.
- Customers with a voice mail plan churn less frequently.
- Customers with more than three service calls and low day minutes will churn more frequently.
- Customers with higher day, evening, and night minutes are churning.
- Customers with an account length between 90 and 180 churn more frequently.
- Area code field needs corrections and more investigation.
With the right questions and visualizations, we can start our analysis journey in the right direction. Using the exploratory data analysis steps, we now have enough clues about what the data suggests and how we can best use it to solve our business question!
References
Exploratory Data Analysis, by John Tukey
Discovering Knowledge in Data: An Introduction to Data Mining, by Daniel T. Larose