How are data analytics projects executed? In this article, I am going to discuss and explain the data analytics project life cycle.
A widely quoted estimate holds that 90 percent of the world's data was generated over the last two years alone! The sheer volume of data generated every minute across the globe can be mind-boggling, and it would be nearly impossible to extract any useful information from this raw data directly.
But if we follow logical steps sequentially, we can better grasp the data and extract valuable insights from this data mine. Each data analytics project follows standard steps to derive insights from data and make it useful for the business.
Read First - What is Data Analytics?
Let us walk through the various stages involved in the life cycle of any data analytics project.
We will look at a real-life example to see how the analytics life cycle can help businesses uncover insights from massive amounts of data.
A company has recently finished designing a new cell phone and wants to launch it in the market. Before they start production, they decide to use analytics to launch the new cell phone successfully.
Understanding the Business Requirements
The first step of the data analytics project life cycle is understanding the business requirements. The business needs to identify its goal and convert it into questions that can be answered using analytics. It also needs to identify the output expected from the analysis and its impact on the business.
Keeping in mind the above business goal of successfully launching the new cell phone, what business requirements do you think the company will decide?
Let us say they decide on the following four requirements:
- Decide optimal price for the phone.
- Choose the best country in which to launch the phone.
- Analyze which customer segment to target.
- Select the best marketing campaign strategies.
Data Collection
After planning the business requirements, the focus can shift to what data is already available and what data needs to be collected to achieve the goal. Data collection is the process of gathering the inputs and information that can help answer the business's questions. It can help a company learn different people's opinions, find the correlation between customers and regions, gather sales information, form customer clusters based on habits, collect feedback, and much more.
Data collection corresponds to the extract step of the ETL (Extract, Transform, Load) process.
Different varieties of data may be useful for different analyses and goals:
- Data collected from surveys, questionnaires, and reviews to gather facts and opinions and get insight into what users think, feel, and want.
- Unstructured data like audio (songs or human speech), images (photos), and drone or CCTV video footage.
- Everyday structured business data like consumer, sales, product, geographical data, or regular transactional data can be used.
Our company planning to launch the new phone needs to collect:
- Competitor sales data for similar cell phones to decide the price.
- Previous sales data for similar cell phones to decide the price.
- Previous country-wise sales data for similar cell phones to decide the country.
- Consumer details and cell phone sales to decide the customer clusters.
- Region-wise sales data and past campaigning ideas to select the campaign strategies.
Thus, we can see how business requirements can be translated precisely into exact data requirements for analysis.
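As a sketch of the extract step, the snippet below reads one of these hypothetical sources (competitor sales) into a pandas DataFrame. The file contents, column names, and values are all invented for illustration; a real project would read from databases, APIs, or exported files.

```python
import io

import pandas as pd

# In practice this would be a file or a database table; here we use a
# small inline CSV string so the sketch is self-contained. All values
# are hypothetical.
competitor_csv = io.StringIO(
    "model,country,price,units_sold\n"
    "AlphaX,US,699,120000\n"
    "AlphaX,IN,499,340000\n"
    "BetaPro,US,799,80000\n"
)

# Extract: read the raw source into a DataFrame.
competitor_sales = pd.read_csv(competitor_csv)

# Quick sanity check of what was extracted.
print(competitor_sales.shape)           # (3, 4)
print(list(competitor_sales.columns))
```

Each of the data sources listed above would be extracted this way into its own table before cleaning.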
Data Cleaning and Storage
The next step of the data analytics project life cycle is data cleaning.
Data cleaning is one of the most critical and time-consuming steps in the life cycle. Why? If there are errors in the data, the insights provided by the subsequent analysis would be incorrect, leading to wrong conclusions and misleading business decisions!
Data cleaning corresponds to the transform step of the ETL (Extract, Transform, Load) process.
Data cleaning involves identifying and removing inaccurate, irrelevant, and erroneous data; handling null values, outliers, and duplicates; converting columns to the correct data types; and much more.
Let us look at some examples:
- Some rows in a data set may represent females as 'F', 'Females', 'females', or 'fem'. All these values should be standardized to one form, say 'F'.
- Null or missing values can be replaced using any of the following methods, depending on the need: the value in the previous row, the value in the next row, the mean, the mode, data from similar rows, fresh data from the source, or values predicted using algorithms like linear regression. For example, a missing 'fare' value can be predicted using other features like destination, origin, class, and train number.
- Rows containing null values can be dropped, but this may result in a loss of information.
- Dates are often stored as text and need to be converted to the correct date/time format with accurate and consistent precision.
- Remove extra spaces, change to lower/upper case, check spellings, remove any formatting if present, and change data types.
- Remove irrelevant data not beneficial for the analysis.
After cleaning, the data in all data sets should be consistent. Manual and data-transmission errors should be removed, and null values should be handled appropriately.
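A minimal pandas sketch of these cleaning steps, using a small invented consumer table (all column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw consumer data showing the problems described above:
# inconsistent categories, a missing age, text dates, messy city names,
# and an exact duplicate row.
raw = pd.DataFrame({
    "gender": ["F", "Females", "fem", "M", "M"],
    "age":    [25, None, 31, 40, 40],
    "signup": ["2021-01-05", "2021-01-05", "2021-02-10", "2021-03-01", "2021-03-01"],
    "city":   [" delhi", "Delhi ", "Mumbai", "mumbai", "mumbai"],
})

# Standardize categorical values to one form ('F'/'M').
raw["gender"] = raw["gender"].str.lower().str[0].map({"f": "F", "m": "M"})

# Replace the missing numeric value with the column mean.
raw["age"] = raw["age"].fillna(raw["age"].mean())

# Convert text dates to a proper datetime type.
raw["signup"] = pd.to_datetime(raw["signup"])

# Remove extra spaces and normalize case.
raw["city"] = raw["city"].str.strip().str.title()

# Drop exact duplicate rows.
clean = raw.drop_duplicates().reset_index(drop=True)
print(clean)
```

After these steps the table has one consistent gender encoding, no nulls, typed dates, tidy city names, and no duplicates.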
Cleaned data needs to be stored before it can be analyzed together. Structured data can be stored in a data warehouse and unstructured data in a data lake. Databases, flat files, Excel files, JSON files, and shapefiles (for geospatial data) are other storage options; any system can be chosen based on business suitability.
Data storage corresponds to the load step of the ETL (Extract, Transform, Load) process.
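To illustrate the load step, the sketch below writes a small cleaned table into an in-memory SQLite database, standing in for a real warehouse. The table name, columns, and values are invented.

```python
import sqlite3

import pandas as pd

# A small cleaned table (hypothetical values).
clean = pd.DataFrame({
    "model":   ["AlphaX", "BetaPro"],
    "country": ["US", "IN"],
    "price":   [699, 499],
})

# Load: write the cleaned data into a local SQLite database
# (a stand-in for a real data warehouse).
conn = sqlite3.connect(":memory:")
clean.to_sql("competitor_sales", conn, index=False, if_exists="replace")

# Verify the load with a simple query.
rows = conn.execute("SELECT COUNT(*) FROM competitor_sales").fetchone()[0]
print(rows)  # 2
```

Swapping the SQLite connection for a warehouse connection string is typically all that changes in a production pipeline.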
Once the clean data is stored, it is ready for analysis. Exploratory data analysis can be performed on this clean data, using visual analysis to understand the data before applying more formal approaches such as hypothesis testing, machine learning algorithms, or advanced statistical inference. Let us look at these in detail.
Exploratory Data Analysis
This process summarizes the data using statistical methods, graphical analysis, and correlations between variables. Summary statistics give an idea of the overall quality of the data; variables are transformed so that they are useful for further analysis; and graphical representations are created, such as heat maps for correlations, box plots for outliers, and histograms to understand the distribution of individual numerical variables.
In our example, we can identify the consumer age band to target for maximum sales of the cell phone, or explore which campaigning ideas worked best for similar products in the past.
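A quick sketch of exploratory analysis on an invented sample of past sales: summary statistics for a first feel of the data, plus the correlation between a customer's age and their spend.

```python
import pandas as pd

# Hypothetical past purchases of similar phones: customer age and spend.
sales = pd.DataFrame({
    "age":   [18, 22, 25, 31, 35, 42, 50, 58],
    "spend": [900, 850, 800, 700, 650, 500, 400, 300],
})

# Summary statistics: ranges, spread, and a quick quality check.
print(sales.describe())

# Correlation between variables: in this made-up sample, spend
# falls steadily as age rises (a strong negative correlation).
corr = sales["age"].corr(sales["spend"])
print(round(corr, 2))
```

A strong correlation like this would suggest targeting the younger age band, which is exactly the kind of signal EDA surfaces before formal modeling.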
Data Analysis
The actual analysis consists of applying machine learning algorithms or advanced statistical inference to the data to extract insights. Algorithms are chosen based on the desired outcome.
Let us look at a few examples:
- Classification: Deciding whether a customer should be given coupons based on their buying history, or whether an email is spam. Common algorithms: Random Forest, non-linear SVM, gradient-boosted trees, etc.
- Regression: Predicting a continuous value, such as the price of a house given features like its size and number of rooms. Common algorithms: Decision Tree, Random Forest, Linear Regression, Ordinary Least Squares Regression, etc.
- Clustering: Grouping voters by inclination, customer profiling, or clustering Facebook users to find potential customers. Common algorithms: k-means, Gaussian Mixture Models.
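As a sketch of the clustering case, the snippet below uses scikit-learn's k-means to split invented customers into two segments by age and monthly spend. The data is constructed so that the two groups are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers as (age, monthly_spend). Two clear groups:
# young high spenders and older low spenders.
X = np.array([
    [20, 900], [22, 850], [25, 880],   # segment A
    [50, 200], [55, 150], [60, 180],   # segment B
])

# Cluster the customers into two segments with k-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # first three customers share one label, last three the other
```

In a real project the features would come from the cleaned consumer table, and the number of clusters would itself be chosen via EDA (for example, with an elbow plot).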
Insights are the outcomes of the analysis and can be divided into two areas:
- Informational: Informational insights come from descriptive analysis, which explains what is happening, or diagnostic analysis, which explains why it is happening.
In our example, we can use descriptive analysis to answer questions about the past: "What does the previous country-wise sales data for similar cell phones tell us about sales in different countries?" We can use diagnostic analysis to understand why particular campaigning ideas work best in each country and why a specific segment of consumers should be targeted.
- Actionable: Predictive analysis helps answer questions about what will happen in the future, while prescriptive analysis turns those predictions into data-driven decisions.
In our example, predictive models can predict the optimal price for the cell phone to be launched. Prescriptive analysis can help us understand the outcomes and prescribe possible courses of action the business can take to prepare the launch plan.
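A sketch of the predictive step: fitting a linear regression to invented competitor prices and predicting a launch price for a hypothetical spec of our new phone. Features, prices, and the resulting prediction are all illustrative, not real market data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical competitor phones: (storage_gb, camera_mp) -> price.
X = np.array([[64, 12], [128, 12], [128, 48], [256, 48]])
y = np.array([499, 599, 699, 799])

# Fit a simple price model on the competitor data.
model = LinearRegression().fit(X, y)

# Predict a launch price for our new phone's (invented) spec.
new_phone = np.array([[256, 12]])
print(model.predict(new_phone))
```

A real pricing model would use far more features (brand, market, release date) and would be validated on held-out data before the business acts on it.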
Data Visualization
The human mind processes visual information much faster than long sheets of tabular data. Visualization helps us find insights in data quickly through visual graphics. It helps businesses make decisions faster and understand trends in a consolidated form rather than through rows of tabular or textual data. Interactive charts let users change parameters to see how correlations and their impacts change.
Different types of graphs suit different types of analysis.
Multiple charts can together form a story, giving value to the numbers and helping the business plan strategically. Popular visualization tools include Tableau, QlikView, Power BI, Google Charts, and Plotly.
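A minimal matplotlib sketch of one such chart: a bar chart of invented country-wise sales, rendered off-screen and saved to a file. The tools named above offer far richer interactivity; this only shows the basic idea.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical country-wise sales of similar phones.
countries = ["US", "IN", "UK", "DE"]
units = [120, 340, 80, 60]  # thousands of units (invented)

fig, ax = plt.subplots()
ax.bar(countries, units)
ax.set_title("Past sales of similar phones by country")
ax.set_ylabel("Units sold (thousands)")
fig.savefig("sales_by_country.png")
```

A chart like this immediately shows which country dominated past sales, which is much harder to see in a raw table.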
Finally, the data has to be turned into actionable items, meaning the analysis has to be incorporated into business decisions. For our company, all four questions decided during the business-requirements phase must be answered using the insights from the analysis. A final plan has to be formulated with justified reasons, and a report documenting all the actionable items has to be prepared before launching the new cell phone.
By following the life cycle sequentially, a complex business problem can be broken down into more manageable, actionable steps, and raw data can significantly help a business grow.
In this article, we explained the steps of the data analytics project life cycle.
Techcanvass is an IT training and consulting organization. We are an IIBA Canada Endorsed education provider (EEP) and offer business analysis certification courses for professionals.
We offer CBDA certification training to help you gain expertise in Business Analytics and work as a Business Analyst in Data Science projects.