What is Data Pipeline?

Divij Sharma
4 min read · Mar 14, 2021
Engineer creating Data Pipeline :D Photo by SELİM ARDA ERYILMAZ on Unsplash

In today’s world we get data from a variety of sources. Data is generated by internal sources like IoT devices fitted in an automobile, POS machines in a store, or the inventory system of a retail store, and by external sources like https://data.gov.in/ for India, https://www.data.gov/ for the US, etc. (I am not considering data collected through direct means like physical surveys and interviews, only data that is generated or stored digitally in an internal or external system. Eventually, survey and interview data stored in a digital format becomes part of either an internal or an external system.) There is also a variety of data types, both structured and unstructured, such as text, video, audio, images, XML files, records, HTML data from websites, etc.

Producers and Consumers

The agents or processes that generate the data are called producers, and those that use the data are called consumers. The IoT devices fitted in an automobile are producers, and a machine learning program that predicts engine failure is a consumer.

Why is Data Pipeline needed?

In most cases, consumers cannot use the data in the format in which it is produced. For example, a machine learning program that predicts sales (the consumer) does not directly ingest the data generated by a POS machine (the producer). To build the machine learning model, we have to clean the data captured from the POS machine, enrich it with data from other sources like item inventory, and add new features.
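As a rough illustration, such a preparation step could look like the pandas sketch below. The file names and column names (transaction_id, item_id, sale_time, etc.) are assumptions made for the example, not a fixed schema.

```python
import pandas as pd

# Hypothetical raw POS export and inventory extract; names are illustrative.
pos = pd.read_csv("pos_transactions.csv", parse_dates=["sale_time"])
inventory = pd.read_csv("item_inventory.csv")

# Clean: drop duplicate transactions and rows with missing item IDs.
pos = pos.drop_duplicates(subset="transaction_id").dropna(subset=["item_id"])

# Enrich: add item category and stock level from the inventory data.
enriched = pos.merge(inventory[["item_id", "category", "stock_level"]],
                     on="item_id", how="left")

# Feature engineering: derive features the sales model can actually use.
enriched["sale_hour"] = enriched["sale_time"].dt.hour
enriched["is_weekend"] = enriched["sale_time"].dt.dayofweek >= 5
```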

What is Data Pipeline?

The steps required for data cleansing, data enrichment, data governance and data processing are collectively called a Data Pipeline (a minimal sketch of chaining such steps appears after the list below). The Data Pipeline ensures that the data generated by producers reaches the consumers in the correct and desired format. In these intermediate steps between producers and consumers, the data is transformed and stored in multiple forms. The design of a Data Pipeline depends on various factors:

  • Business Problem
  • Type of producer — batch or real time
  • Type of consumer — reports, ML Model or dashboard
  • Intermediate data storage — files, NoSQL or SQL databases
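Conceptually, a pipeline is just an ordered chain of such steps. Here is a minimal, generic sketch; the step functions are placeholders standing in for real cleansing and enrichment logic, not a particular framework:

```python
from typing import Callable, Iterable, List

# A pipeline as an ordered list of transformation steps.
# Each step takes a record (dict) and returns a transformed record.
Step = Callable[[dict], dict]

def run_pipeline(records: Iterable[dict], steps: List[Step]) -> List[dict]:
    """Push every record through each step in order."""
    out = []
    for record in records:
        for step in steps:
            record = step(record)
        out.append(record)
    return out

# Placeholder steps standing in for cleansing and enrichment.
def cleanse(r):
    return {k: v for k, v in r.items() if v is not None}

def enrich(r):
    return {**r, "source": "pos"}

print(run_pipeline([{"item_id": 1, "qty": None}], [cleanse, enrich]))
```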

Let’s discuss this with an example. Suppose we are working for an automobile company that receives part inventory data in the form of an Excel file from every plant at the end of the day. Real-time data is also generated by various IoT devices fitted in the engine, brakes and other parts of the car. In addition, logs related to server performance and usage are generated in real time.

So some data is produced in batch mode (part inventory at the end of the day) and some data is generated in real time (IoT devices and logs).

We want to create a system for users to understand the usage of parts, predict the inventory needs of every plant, display the load on the servers based on logs, and predict engine failures.

What will our data pipeline look like?

Staging Environment

For the part inventory coming in the form of an Excel file from every plant at the end of the day, we can use a traditional ETL design with batch ingestion. This data can be stored in an Operational Data Store (ODS), which can be a relational database like MySQL or PostgreSQL.
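A minimal sketch of such a batch ETL job is shown below, using pandas and SQLAlchemy. The file name, column names, table name and connection string are all assumptions for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical ODS connection; host, database and credentials are placeholders.
engine = create_engine("postgresql+psycopg2://etl_user:password@ods-host/ods_db")

# Extract: read the end-of-day inventory export from one plant.
df = pd.read_excel("plant_01_inventory_2021-03-14.xlsx")

# Transform: normalise column names and drop obviously bad rows.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.dropna(subset=["part_id"])

# Load: append into the Operational Data Store.
df.to_sql("part_inventory", engine, if_exists="append", index=False)
```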

For the real-time data, we will have to implement a distributed event streaming platform like Apache Kafka. The messages from IoT devices can be stored in a message hub, which can be either a NoSQL database like MongoDB or a relational database like MySQL or PostgreSQL.
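As a sketch of that streaming ingestion, assuming a Kafka topic for engine telemetry and MongoDB as the message hub (topic, broker and collection names are made up for the example):

```python
import json

from kafka import KafkaConsumer   # kafka-python client
from pymongo import MongoClient

# Hypothetical topic and broker; adjust to the actual cluster.
consumer = KafkaConsumer(
    "engine-telemetry",
    bootstrap_servers=["kafka-broker:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Hypothetical message hub collection in MongoDB.
collection = MongoClient("mongodb://message-hub:27017")["iot"]["engine_telemetry"]

# Every IoT reading lands in the message hub in near real time.
for message in consumer:
    collection.insert_one(message.value)
```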

MDM

Once the data is stored in the staging environment in raw (or near-raw) form, the MDM layer monitors the data coming into the pipeline. Master Data Management (MDM) is the core process used to manage, centralize, organize, categorize, localize, synchronize and enrich master data according to business rules. It helps create one single master reference source for all critical business data, resulting in less redundancy and fewer errors in business processes. Efficient management of master data in a central repository gives you a single authoritative view of the information and eliminates costly inefficiencies caused by data silos.

The MDM layer will be a combination of various tools and programs that build the central repository. From the MDM layer, the data can flow to either a data lake or a data warehouse, depending on the need.
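To give a flavour of what this consolidation does, here is a toy sketch that merges part records arriving from the ODS and the streaming hub into a single master record per part. The column names and the survivorship rule are assumptions for illustration, not a real MDM tool:

```python
import pandas as pd

# Illustrative part records from two sources.
ods_parts = pd.DataFrame({
    "part_id": ["P-100", "P-200"],
    "part_name": ["brake pad", None],
    "plant": ["Pune", "Chennai"],
})
stream_parts = pd.DataFrame({
    "part_id": ["P-100", "P-200"],
    "part_name": ["Brake Pad", "Oil Filter"],
})

merged = ods_parts.merge(stream_parts, on="part_id", how="outer",
                         suffixes=("_ods", "_stream"))

# Survivorship rule: prefer the ODS value, fall back to the streaming value.
merged["part_name"] = merged["part_name_ods"].fillna(merged["part_name_stream"])
master = merged[["part_id", "part_name", "plant"]]
print(master)
```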

From the data lake and the data warehouse, the data is served to users through various dashboards and to ML programs that build models, which can then use the staging database of the event streaming platform to make predictions.
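For example, an engine-failure model might be trained on historical data from the warehouse and then scored against fresh telemetry pulled from the streaming staging store. The file paths, feature names and label below are assumptions for the sketch:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data exported from the data warehouse.
history = pd.read_parquet("warehouse/engine_history.parquet")
features = ["oil_temp", "rpm", "vibration"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(history[features], history["failed_within_30_days"])

# Score the latest readings staged from the event streaming platform.
latest = pd.read_parquet("staging/engine_latest.parquet")
latest["failure_risk"] = model.predict_proba(latest[features])[:, 1]
```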

This is all depicted in the picture below.

From Producers to Consumers through Data Pipeline
