Scaling a Mature Data Pipeline—Managing Overhead

Scaling a Mature Data Pipeline—Managing Overhead

  • October 5, 2019
Table of Contents

Scaling a Mature Data Pipeline—Managing Overhead

Before delving into our specifics, I want to take a moment to discuss the technical stack backing our pipeline. Our platform uses a mixture of Spark and Hive jobs. Our core pipeline is primarily implemented in Scala.

However, we leverage Spark SQL in certain contexts. We leverage YARN for job scheduling and resource management, and execute our jobs on Amazon EMR. We use Airflow as our task orchestration system that takes care of the orchestration logic.

For a data pipeline, we define the orchestration logic as the logic that facilitates the execution of your tasks. It includes the logic that you use to define your dependency graph, your configuration system, your Spark job runner, and so on. In other words, anything required to run your pipeline that is not a map-reduce job or other business logic tends to be orchestration logic.

In total, the our pipelines are made up of a little over a thousand tasks.

Source: medium.com

Tags :
Share :
comments powered by Disqus

Related Posts

Altair: Declarative Visualization in Python

Altair: Declarative Visualization in Python

With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code.

Read More
Matplotlib—Making data visualization interesting

Matplotlib—Making data visualization interesting

Data visualization is a key step to understand the dataset and draw inferences from it. While one can always closely inspect the data row by row, cell by cell, it’s often a tedious task and does not highlight the big picture. Visuals on the other hand, define data in a form that is easy to understand with just a glance and keeps the audience engaged.

Read More