Diving Into Data Pipelines — Foundations of Data Engineering
Well-built data pipelines do more than just extract data from sources and load it into manageable database tables or flat files for analysts to use. They perform several steps on the raw data, including cleaning, structuring, normalizing, combining, and aggregating it. A data pipeline also requires other activities, such as monitoring, maintenance, enhancement, and support of the infrastructure it runs on.
ETL and ELT
There is possibly no pattern as well known as ETL and its more modern sibling, ELT. Both are used extensively in data warehousing and business intelligence, and in the last few years they have also inspired pipeline patterns for data science and machine learning models running in production.
Both patterns are approaches to data processing used to feed data into a data warehouse and make it useful to analysts and reporting tools. The extract step gathers data from various sources in preparation for loading and transforming. The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. The transform step is where the raw data from each source system is combined and formatted to serve analysts, visualization tools, or whichever use case the pipeline supports.
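As a toy illustration of the difference in ordering between the two patterns, the sketch below uses made-up in-memory data and a Python list standing in for the warehouse; none of these names come from the article.

# Toy sketch: the same extract, load, and transform steps, applied in ETL vs. ELT order.
raw_source = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

def extract():
    # Gather records from the source system.
    return list(raw_source)

def transform(rows):
    # Clean and structure the data (cast amounts to floats).
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows, destination):
    # Write records to the destination "warehouse".
    destination.extend(rows)

# ETL: transform before loading.
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: load the raw data first, then transform it inside the destination.
elt_warehouse = []
load(extract(), elt_warehouse)
elt_warehouse[:] = transform(elt_warehouse)

print(etl_warehouse == elt_warehouse)  # True: same result, different ordering

In practice, the ELT transform usually runs as SQL inside the warehouse rather than in application code, which is exactly what the Airflow example later in this article does.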
The combination of the extraction and loading steps is often referred to as data ingestion.
Data Ingestion and Its Interfaces
The term data ingestion refers to extracting data from one source and loading it into another. Data ingestion can occur in real time, in batches, or in a combination of the two (an approach often called a lambda architecture).
When data is ingested in batches, the job is scheduled to run at regular intervals. An ingestion job interacts and communicates with several sources, and therefore requires several different types of interfaces and data structures.
The following are the most common ingestion interfaces and data structures (a minimal batch ingestion sketch follows the list):
- Stream Processing Platforms: RabbitMQ, Kafka.
- Databases and data stores: Postgres, MySQL, HDFS, or HBase.
- Data warehouse or data lake.
- JSON, CSV, REST API.
- Shared network file system or a cloud storage bucket.
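As a small example of batch ingestion through a couple of these interfaces, the sketch below pulls JSON records from a REST API and lands them in a CSV file. The endpoint URL and output path are hypothetical placeholders, not anything defined in the article.

# Batch ingestion sketch: REST API (JSON) in, CSV file out.
import csv
import requests

API_URL = "https://example.com/api/orders"   # hypothetical endpoint
OUTPUT_PATH = "/tmp/orders_extract.csv"      # hypothetical landing file

def extract_to_csv(url, output_path):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    records = response.json()                # assumes a JSON array of objects

    if not records:
        return 0

    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    return len(records)

if __name__ == "__main__":
    print(f"Extracted {extract_to_csv(API_URL, OUTPUT_PATH)} records")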
Extracting Data from MySQL Databases
We can extract data from a MySQL database in a couple of different ways:
- Full or incremental extraction using SQL
- Binary Log (also known as binlog) replication
Full or Incremental Extraction using SQL
Full or incremental extraction using SQL is straightforward to implement but less scalable for large datasets with frequent changes.
When we need to ingest either all or a subset of columns from a MySQL table into a data warehouse or data lake, we can use either full extraction or incremental extraction.
In a full extraction:
- Every record in the table is extracted on each run of the extraction job.
- High-volume tables can take a long time to extract.
A full extraction of a Customers table can be as simple as:
SELECT * FROM Customers
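An incremental extraction, by contrast, pulls only the rows that have changed since the previous run. Here is a minimal sketch, assuming the Customers table has a LastUpdated column and the previous run's high-water mark is available; the column name and connection details are illustrative, not from the article.

# Incremental extraction sketch: only rows changed since the last run are pulled.
import pymysql

last_extraction = "2021-04-01 00:00:00"  # hypothetical high-water mark from the previous run

conn = pymysql.connect(host="localhost", user="etl_user",
                       password="secret", database="shop")  # placeholder credentials
try:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT * FROM Customers WHERE LastUpdated > %s",
            (last_extraction,),
        )
        changed_rows = cursor.fetchall()
finally:
    conn.close()

print(f"Extracted {len(changed_rows)} changed rows")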
Binary Log (binlog) Replication
Binary log (binlog) replication is more complex to implement. It is better suited to cases where the volume of changes in source tables is high or where there is a need for more frequent data ingestion from the MySQL source. It is also a path to creating a streaming data ingestion [2].
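As a rough sketch of what consuming the binlog can look like, the open-source python-mysql-replication library exposes binlog changes as row events. The connection settings below are placeholders, and a real change-data-capture process would also persist its position in the log between runs; treat this as an illustration rather than a production setup.

# Stream row-level changes from the MySQL binlog and print them.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "localhost", "port": 3306,
                         "user": "repl", "passwd": "secret"},  # placeholder credentials
    server_id=100,                 # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,                 # keep waiting for new events (streaming ingestion)
)

for event in stream:
    for row in event.rows:
        # Each row event carries the affected table and the changed values.
        print(event.table, type(event).__name__, row)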
Pipeline Orchestration
Orchestration ensures that the steps in a pipeline run in the correct order and that dependencies between these steps are properly managed. Pipeline steps are always directed, meaning they start with one or more tasks and end with a specific task or tasks. This is required to guarantee a path of execution; in other words, it ensures that a task does not execute before all of the tasks it depends on have completed successfully.
Pipeline graphs must also be acyclic, meaning that a task cannot point back to a previously completed task. In other words, it cannot cycle back. If it could, then a pipeline could run endlessly.
For instance, Apache Airflow is an open-source data pipeline task orchestrator tool that has been built to solve everyday challenges faced by data engineering teams: how to build, manage, and monitor workflows that involve multiple tasks with mutual dependencies.
Airflow offers flexible configuration through its core components, such as:
- Schedulers.
- Executors.
- Operators.
A graph view of the ELT DAG in Airflow:
DAG code implementation:
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.dates import days_ago

# The DAG runs once per day, starting from the previous day.
dag = DAG(
    'elt_pipeline_sample',
    description='A sample ELT pipeline',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
)

# Extract steps: pull data from the source systems.
extract_orders_task = BashOperator(
    task_id='extract_order_data',
    bash_command='python /p/extract_orders.py',
    dag=dag,
)

extract_customers_task = BashOperator(
    task_id='extract_customer_data',
    bash_command='python /p/extract_customers.py',
    dag=dag,
)

# Load steps: bring the raw data into the warehouse.
load_orders_task = BashOperator(
    task_id='load_order_data',
    bash_command='python /p/load_orders.py',
    dag=dag,
)

load_customers_task = BashOperator(
    task_id='load_customer_data',
    bash_command='python /p/load_customers.py',
    dag=dag,
)

# Transform step: build the data model in the warehouse with SQL.
revenue_model_task = PostgresOperator(
    task_id='build_data_model',
    postgres_conn_id='redshift_dw',
    sql='/sql/order_revenue_model.sql',
    dag=dag,
)

# Define the execution order: extracts feed loads, loads feed the model build.
extract_orders_task >> load_orders_task
extract_customers_task >> load_customers_task
load_orders_task >> revenue_model_task
load_customers_task >> revenue_model_task
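The bit-shift operators at the end of the file define the task dependencies: each load task runs only after its corresponding extract task succeeds, and the data model is built only after both load tasks have completed, which is exactly the directed, acyclic ordering described above.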
Data Validation in Pipelines
Even in the best-designed data pipelines, something can go wrong. Many issues can be avoided, or at least mitigated, with well-designed processes, orchestration, and infrastructure. Data validation is also a necessary step to ensure data quality and validity, because untested data is not safe to use in analytics.
Discovering a data quality issue at the end of a pipeline and tracing it back to the beginning is a worst-case scenario. By validating each step in a pipeline, we are more likely to find the root cause in the current step rather than in a previous one.
In addition to quality issues in the source system, there is the possibility of the data ingestion process itself introducing a data quality problem. The following are a couple of common risks in the extract or load steps of an ingestion:
- A logical error in incremental ingestions.
- Parsing issues in an extracted file.
Validating data at each step of the pipeline is critical. Even if the source data passes validation at the beginning of a pipeline, it is always good practice to run validation on the data models built at the end of the pipeline.
The following checks represent a way to validate a data model (a sketch of the first one follows this list):
- Ensuring a metric is within certain lower and upper bounds.
- Reviewing row count growth (or reduction) in the data model.
- Checking whether there is an unexpected inconsistency in the value of a particular metric.
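Here is a minimal sketch of the first check, asserting that a revenue metric in the finished data model stays within expected bounds; the connection parameters, table and column names, and bounds are hypothetical placeholders rather than anything defined in the article.

# Validation sketch: fail the pipeline if yesterday's revenue falls outside expected bounds.
import sys
import psycopg2

LOWER_BOUND, UPPER_BOUND = 10_000, 500_000  # hypothetical expected range

conn = psycopg2.connect(host="warehouse.example.com", dbname="analytics",
                        user="validator", password="secret")  # placeholder connection
try:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT COALESCE(SUM(order_total), 0) FROM order_revenue "
            "WHERE order_date = CURRENT_DATE - 1"
        )
        revenue = cursor.fetchone()[0]
finally:
    conn.close()

if not LOWER_BOUND <= revenue <= UPPER_BOUND:
    # A nonzero exit code lets the orchestrator mark the validation task as failed.
    sys.exit(f"Revenue {revenue} is outside the expected range")
print("Validation passed")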
Python scripts like the sketch above can be written to perform data validation, and there are also several frameworks available:
- dbt: It is a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.