May 8, 2021

What is ETL (Extract, Transform, Load)?

An organization has built various systems over the years, and these systems run on numerous platforms with different architectures. Data in these systems lives in varied sources and formats: SAP, Excel, flat files, databases, web services, cloud applications, unstructured data, and so on.

Information from across the organization’s departments is scattered among these systems. Now suppose management wants to build a new report or application that requires inputs from several of them. How can data from different systems, in different formats, be joined together for seamless access, analysis, and reporting?

This is where ETL comes in. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database. During the ETL process, data is extracted from online transaction processing (OLTP) databases and other sources, transformed to match the data warehouse schema, and loaded into the target database (a data warehouse, data hub, or data lake).

As an example, consider a bank operating internationally with a customer base of over one million. It has data coming from different systems in various formats, as shown below:

System Name | Attribute Name            | Column Name               | Data Type    | Values
Source 1    | Customer Application Date | CUSTOMER_APPLICATION_DATE | Numeric(8,0) | 12122020
Source 2    | Cust App dt               | CUST_APP_DATE             | Datetime     | 12/12/2020 13:45
Source 3    | Application Date          | APPLICATION_DATE          | Date         | 12-Dec-2020

This inconsistency can be avoided by integrating the data into a data warehouse built to consistent standards. The target database then acts as a single, standardized source of data for further analysis, driving essential insights.
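As a concrete illustration, the sketch below (a minimal Python example with hypothetical records and format strings, not a production pipeline) shows how the three date representations above could be parsed and standardized into one ISO format before loading:

```python
from datetime import datetime

# Hypothetical records as they might arrive from the three source systems above.
source_records = [
    {"system": "Source 1", "application_date": "12122020",         "fmt": "%d%m%Y"},
    {"system": "Source 2", "application_date": "12/12/2020 13:45", "fmt": "%d/%m/%Y %H:%M"},
    {"system": "Source 3", "application_date": "12-Dec-2020",      "fmt": "%d-%b-%Y"},
]

def standardize(record):
    """Parse the source-specific date and emit a single ISO 8601 date for the warehouse."""
    parsed = datetime.strptime(record["application_date"], record["fmt"])
    return {"system": record["system"], "application_date": parsed.date().isoformat()}

for record in source_records:
    print(standardize(record))   # every row now carries the date as YYYY-MM-DD
```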

Importance of ETL

  1. Offers deep historical context for the business and enhances decision-making by helping answer complex questions. 
  2. Provides a single source of truth: a central repository of clean, easily accessible data drawn from various systems. 
  3. Allows data to be cleaned, transformed, and aggregated for analytics tasks, primarily descriptive analytics problems.

The basic process of ETL can be visualized as shown below:

[Figure: Basic process of ETL]

Extraction

The ETL extraction element is responsible for extracting data from multiple homogeneous or heterogeneous sources.

Business intelligence teams decide how to extract data from unstructured sources (using report mining tools to convert it into structured formats), how to integrate with legacy systems such as COBOL applications, and which extraction method to use for relational databases. In addition to the extraction method, an extraction strategy is also determined. During extraction, the integrity of the extracted data (conformity of data sizes and formats, redundancy, and data transmission losses) is verified against a universal schema that maps source elements to extracted data elements, typically with the help of ETL tools.
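To make the idea of a universal schema concrete, here is a minimal sketch, assuming the hypothetical column names from the bank example and made-up validation rules, of mapping source columns onto one target element and checking basic conformity during extraction:

```python
# A minimal, illustrative mapping of source columns onto one "universal" target element.
# The column names mirror the bank example; the validation rules are assumptions.
UNIVERSAL_SCHEMA = {
    "application_date": {
        "source_columns": {
            "Source 1": "CUSTOMER_APPLICATION_DATE",
            "Source 2": "CUST_APP_DATE",
            "Source 3": "APPLICATION_DATE",
        },
        "max_length": 20,   # crude size-conformity check
        "required": True,   # completeness check: the value must be present
    }
}

def validate_extract(system, row):
    """Verify that an extracted row conforms to the universal schema."""
    errors = []
    for element, spec in UNIVERSAL_SCHEMA.items():
        column = spec["source_columns"][system]
        value = row.get(column)
        if spec["required"] and value in (None, ""):
            errors.append(f"{element}: missing value in column {column}")
        elif value is not None and len(str(value)) > spec["max_length"]:
            errors.append(f"{element}: value exceeds {spec['max_length']} characters")
    return errors

print(validate_extract("Source 2", {"CUST_APP_DATE": "12/12/2020 13:45"}))  # -> []
```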

The data warehouse design should accommodate both full and incremental extraction. When data is loaded for the first time, a full extraction is needed; after that, data can be loaded incrementally to add new records or update only those records that have been modified. Incremental loads allow faster generation of business intelligence reports, while a full extraction better supports predictive and prescriptive analytics tasks.
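The following sketch contrasts the two approaches. It assumes a hypothetical OLTP table named customer with a last_updated column, and uses an in-memory SQLite database purely for illustration:

```python
import sqlite3

def extract(conn, last_run_timestamp=None):
    if last_run_timestamp is None:
        # Full extraction: the first load pulls every record.
        query, params = "SELECT * FROM customer", ()
    else:
        # Incremental extraction: only records added or modified since the last run.
        query, params = "SELECT * FROM customer WHERE last_updated > ?", (last_run_timestamp,)
    return conn.execute(query, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, last_updated TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Alice', '2021-05-01'), (2, 'Bob', '2021-05-07')")

print(len(extract(conn)))                 # full load -> 2 rows
print(len(extract(conn, "2021-05-05")))   # incremental load -> 1 row
```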

Extracted data is loaded into the data warehouse staging area (a buffer relational database, usually separate from the data warehouse database) for manipulation by the remaining ETL processes. The staging area is used for cleansing and organizing the collected data, which arrives in different formats.

Transformation

The ETL transformation element is responsible for data validation, data accuracy, data type conversion, data profiling, and business rule application based on data mappings. Data mappings specify the rules while taking the target tables’ granularity into account; some fields may require summarized data. Data restructuring and cleaning are done in the transformation stage. Data from multiple source systems is converted to a single standard format, and data quality is improved through data validation and data accuracy checks.

Business rules are implemented to merge fields, derive calculated fields, and standardize, mask, summarize, and decode fields that are represented differently in the source systems.
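The sketch below shows what a few such rules might look like when applied to a single customer record. All field names, the masking rule, and the gender code mapping are illustrative assumptions, not a standard:

```python
# Illustrative business rules applied to a hypothetical customer record during transformation.
# Field names, the masking rule, and the gender code mapping are all assumptions.
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}  # decode differing source encodings

def transform(record):
    return {
        # Merge: combine first and last name into a single field.
        "customer_name": f"{record['first_name']} {record['last_name']}".strip(),
        # Derive: a calculated field from two source amounts.
        "total_balance": record["savings_balance"] + record["checking_balance"],
        # Mask: keep only the last four digits of the account number.
        "account_number": "****" + str(record["account_number"])[-4:],
        # Decode: map source-specific codes to one standard value.
        "gender": GENDER_CODES.get(str(record["gender"]), "Unknown"),
    }

print(transform({
    "first_name": "Jane", "last_name": "Doe",
    "savings_balance": 1500.0, "checking_balance": 320.5,
    "account_number": "9876543210", "gender": "F",
}))
```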

Load

The ETL loading element is responsible for loading transformed data into the target database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load. The data warehouse is often taken offline during update operations so that data can be loaded faster. 

Different types of loads in the data warehouse are:

  1. Dimensions: These tables hold information related to non-transactional data like products or customer details.
  2. Facts: These tables hold information related to transactional data like orders, inventory, etc.
  3. Aggregates: These tables hold fact data, which is aggregated on some dimension.

Dimensions should be loaded first, as all fact data depends on these dimension tables. After the dimension tables are loaded, the fact tables are loaded. Aggregate tables are loaded after the fact tables, as the aggregation depends on the fact data.
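A minimal sketch of this load order, using an in-memory SQLite database and made-up dimension, fact, and aggregate table names, might look like this:

```python
import sqlite3

# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_order  (order_id INTEGER, product_id INTEGER, amount REAL,
                              FOREIGN KEY (product_id) REFERENCES dim_product(product_id));
    CREATE TABLE agg_sales_by_product (product_id INTEGER, total_amount REAL);
""")

# 1. Load dimensions: fact rows will reference these keys.
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Savings"), (2, "Credit Card")])

# 2. Load facts: transactional rows that point at the dimension keys.
conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?)",
                 [(101, 1, 250.0), (102, 1, 80.0), (103, 2, 40.0)])

# 3. Load aggregates: summarized from the fact table, so they must come last.
conn.execute("""
    INSERT INTO agg_sales_by_product
    SELECT product_id, SUM(amount) FROM fact_order GROUP BY product_id
""")

print(conn.execute("SELECT * FROM agg_sales_by_product").fetchall())  # [(1, 330.0), (2, 40.0)]
```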

General Steps for Designing an ETL Process

  1. Define the structure of source data and target data.
  2. Define the rules to map source data elements to target data. 
  3. Define the transformations required.
  4. Schedule the execution of the process. Once the process has run, it can be verified using the generated logs, which show the status, the number of records inserted, and so on (a small sketch follows this list). 
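Here is a minimal sketch of steps 2 to 4, with a hypothetical source-to-target mapping and a log line summarizing the run. The column names and transformations are assumptions for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

# Steps 2 and 3: a hypothetical source-to-target mapping with the transformation to apply.
MAPPINGS = [
    {"source": "CUST_APP_DATE", "target": "application_date", "transform": lambda v: v.split(" ")[0]},
    {"source": "CUST_NAME",     "target": "customer_name",    "transform": str.title},
]

def run_etl(source_rows):
    """Step 4: execute the process and log the status and record count."""
    loaded = []
    for row in source_rows:
        loaded.append({m["target"]: m["transform"](row[m["source"]]) for m in MAPPINGS})
    logging.info("ETL run finished: status=SUCCESS, records_inserted=%d", len(loaded))
    return loaded

run_etl([{"CUST_APP_DATE": "12/12/2020 13:45", "CUST_NAME": "jane doe"}])
```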

Common ETL Challenges

  1. Extracting data from multiple systems and identifying the desired subset of data from the various sources can be tedious.
  2. High-volume data movement through ETL requires a lot of planning and effort to maintain consistency, and validating the incoming data can be tedious and time-consuming.
  3. Defining correct transformation rules so that accurate source data is correctly loaded into the target tables is difficult.
  4. For incremental loads, comparing new data with the existing data poses many challenges (see the change-detection sketch after this list).
  5. ETL is not suitable for real-time decision-making, since data is loaded in periodic batches.
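For challenge 4, one common approach (an assumption here, not the only option) is to hash each incoming row’s attributes and compare the hash with what is already stored, rather than comparing every column. A minimal sketch:

```python
import hashlib

# One way (an assumption, not the only approach) to compare incoming data with
# existing warehouse rows: hash the business attributes and compare hashes.
def row_hash(row):
    return hashlib.md5("|".join(str(row[k]) for k in sorted(row)).encode()).hexdigest()

existing = {1: row_hash({"name": "Alice", "city": "Toronto"})}
incoming = {1: {"name": "Alice", "city": "Vancouver"},   # changed -> update
            2: {"name": "Bob",   "city": "Montreal"}}    # new     -> insert

for key, row in incoming.items():
    if key not in existing:
        print(f"INSERT customer {key}")
    elif existing[key] != row_hash(row):
        print(f"UPDATE customer {key}")
```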

About Techcanvass

Techcanvass is an IT training and consulting organization. We are an IIBA Canada Endorsed Education Provider (EEP) and offer business analysis certification courses for professionals.

We offer CBDA certification training to help you gain expertise in Business Analytics and work as a Business Analyst in Data Science projects.

You can also learn data visualization skills by joining our Data Visualization course with Tableau and our Power BI Certification Course.

