“An intelligent person learns from their own mistakes, but a genius learns from the mistakes of others.” – Anonymous.
In the last post in this series, we covered what Big Data Pipelines are and the challenges involved in building them. In this installment, we'll discuss some fundamental design principles for overcoming those challenges in the most cost-effective, reliable, and robust way possible.
Key Design Principles
As ever more data is generated and collected, data pipelines need to be built on top of a scalable architecture: one that supports growth in data size, data sources, and data types without any drop in efficiency. In other words, your pipeline needs to scale along with your business.
To achieve this, organizations should take advantage of a cloud platform that can scale automatically without human intervention, such as AWS. Such a platform lets you adjust your computing and storage resources (up and down) according to a predefined set of rules and conditions on selected metrics. For example, you can spin up additional computing resources when there is a data peak and shut them down when the peak is over.
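The rule-based scaling idea can be sketched as a small decision function. This is an illustrative sketch, not an AWS API: the metric name, thresholds, and node bounds are all hypothetical, standing in for the "predefined set of rules and conditions on selected metrics" described above.

```python
# Hypothetical metric-driven scaling rule, similar in spirit to cloud
# auto-scaling policies: capacity changes when a chosen metric crosses
# predefined thresholds. All names and thresholds are illustrative.

def desired_capacity(current: int, cpu_utilization: float,
                     scale_out_at: float = 75.0, scale_in_at: float = 25.0,
                     min_nodes: int = 1, max_nodes: int = 10) -> int:
    """Return the node count the cluster should converge to."""
    if cpu_utilization > scale_out_at:   # data peak: add a node
        return min(current + 1, max_nodes)
    if cpu_utilization < scale_in_at:    # peak is over: remove a node
        return max(current - 1, min_nodes)
    return current                       # within bounds: no change

print(desired_capacity(3, 90.0))  # during a peak -> 4
print(desired_capacity(4, 10.0))  # peak over -> 3
```

In practice the same shape appears in managed services (e.g. target-tracking or step-scaling policies), where the platform evaluates the rule and provisions the resources for you.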
Since data has become such an important part of a business's foundation, organizations need to ensure that their data is usable, accessible, and fresh at all times. Data-related processes and systems must be designed to remain operational even if some of their components fail.
The right approach, therefore, is to build such systems as decoupled as possible: separate storage from compute (a serverless approach), and put failover policies and backups in place so that a standby component automatically starts serving live requests when the primary component fails.
By harnessing the power of the cloud, it is easy to create retry mechanisms and failover policies that keep the data available through any hardware or software malfunction.
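A retry mechanism with failover can be sketched in a few lines. This is a minimal illustration of the pattern, not a specific cloud feature; the endpoint callables and delay values are hypothetical.

```python
# Illustrative retry-with-backoff plus failover: try the primary endpoint a
# few times with exponential backoff, then fall back to a standby so a
# single malfunction does not make the data unavailable.
import time

def call_with_failover(primary, standby, retries=3, base_delay=0.1):
    """Invoke primary with retries; on repeated failure, use standby."""
    for attempt in range(retries):
        try:
            return primary()
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return standby()                                 # failover path

# Usage: a primary that always fails falls through to the standby.
def broken_primary():
    raise ConnectionError("primary store is down")

print(call_with_failover(broken_primary, lambda: "served from standby",
                         base_delay=0.01))
```

Managed load balancers and DNS failover policies apply the same logic at the infrastructure level, routing live requests to a healthy replica automatically.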
Over the last decade, many sectors have undergone sudden and dramatic transitions to support changing trends in business management. As a result, organizations need to adapt to change quickly and rapidly support new business requirements. A Data Lake is a central location that stores massive amounts of data of any type and makes it immediately available to be processed, analyzed, and consumed by any data consumer in the company. The Data Lake architectural approach is highly recommended and supports obtaining quick insights into business operations.
Every data pipeline should be split into four independent layers (a decoupled approach). Splitting into layers means that the way data is ingested does not depend on how it is processed or consumed. This makes it easy to change the tool used to store or consume data without breaking the flow.
The next diagram presents the general idea of the 4-layer concept:
Ingestion
This layer handles the collection of raw data in a wide variety of formats (structured, semi-structured, and unstructured) from various sources (such as transactions, logs, IoT devices, and so on) at any speed (batch or stream) and scale.
Storage
This layer takes care of storing the data in secure, flexible, and efficient storage, for example in a data lake, data warehouse, relational database, graph database, or NoSQL database, for later use by the Consume layer.
Processing
This layer turns the raw data into consumable data sets by sorting, filtering, splitting, enriching, aggregating, joining, and applying any required business logic to produce new, meaningful data sets.
Consume
This layer gives data consumers the ability to use the post-processed data: performing ad-hoc queries, producing views organized into reports and dashboards, or feeding it to machine-learning workloads.
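The four decoupled layers can be sketched as independent stages, each depending only on the previous stage's output. The functions and the toy records below are illustrative; in a real pipeline each stage would be backed by its own service, and any one of them could be swapped without breaking the others.

```python
# A minimal sketch of the four decoupled layers. Each layer sees only the
# previous layer's output, so the tool behind any single layer can change
# without affecting the rest. All names and data here are illustrative.

def ingest():                        # Ingestion: collect raw records
    return [{"user": "a", "amount": 10}, {"user": "a", "amount": 5},
            {"user": "b", "amount": 7}]

def store(records):                  # Storage: persist raw data (a dict
    return {"raw_events": records}   # stands in for a data lake here)

def process(storage):                # Processing: aggregate per user
    totals = {}
    for rec in storage["raw_events"]:
        totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]
    return totals

def consume(processed):              # Consume: ad-hoc query on the results
    return max(processed, key=processed.get)

print(consume(process(store(ingest()))))  # top-spending user -> a
```

Because each stage's contract is just its input and output shape, replacing, say, the storage backend means changing `store` alone while the other three layers keep working unchanged.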
In my previous and current posts, I presented some common challenges and recommended design principles for Big Data Pipelines. These insights come from real-world project experience as a data engineer at AllCloud. If you have any questions or would like to discuss a project, please contact us.