Data Ops at petabyte scale

Tim McNamara

on 8 January 2020

Tags: apache spark , ceph , data ops , Hadoop , hdfs , Juju , jupyter notebook , s3 , Storage , yarn

This article was last updated 3 years ago.

Deploying Apache Spark in production is complex. Should you deploy Kubernetes? Should that Kubernetes cluster be backed by Ceph? Perhaps stick with a traditional Hadoop/HBase stack? Learn how Juju and model-driven operations have enabled one data engineering team to evaluate several options and come to an ideal solution.

This article is an interview between Tim McNamara, Developer Advocate at Canonical and James Beedy of OmniVector Solutions. James has spent years refining his approach for packaging Apache Spark and managing large-scale deployments. With data volumes into the petabyte range and current operations to maintain, he has used Juju to create purpose-built solutions for his team and their customers.

The interview is divided into multiple sections:

Introductions and background
Packaging Spark for repeated and ad-hoc workloads
Moving away from HDFS
Day-N benefits of Juju
Recommendations for others deploying Apache Spark

Introducing the Data Ops problem

Tim: Hi James, how did you become interested in Spark? And especially Spark with Juju?

I’ve had the privilege of creating the ops and workflows for PeopleDataLabs over the past few years. PeopleDataLabs, or PDL for short, runs a myriad of batch style data processing workloads. They’re predominantly based on the Apache Spark framework and its Pyspark extension for Python.

When I started working with PDL they were using CDH—the Cloudera Distribution of Hadoop. As I started digging in, I quickly realized that this implementation presented some major roadblocks.

[Juju’s] modelling approach alleviates the person implementing and maintaining the software from monotonous work. This allows engineers to spend cycles where it counts. Juju can handle the rest.
James Beedy from Omnivector Solutions explains why you should use Juju

First and foremost, as the company added headcount, developers began experiencing contention when accessing the available compute and storage resources. While CDH had been a great way for PDL to get up and running, it was clear to me that a better solution was needed for the company to continue its rapid growth.

Second to that, it was critical that we maintained consistency across all production and development environments. This would allow developers to execute workloads using a standardized workflow regardless of what substrate was being used. It shouldn’t matter whether the processing was being carried out on-prem, in the cloud or some hybrid model.

I had been a member of the Juju community for a number of years at this point. I knew that this technology would allow me to create a robust and scalable solution to the problems presented above. Furthermore, Juju would allow us to easily evolve the technology stack as the company continued to grow.

Packaging Spark for repeated and ad-hoc workloads

Development jobs are heavily interactive. From an ops perspective, they’re much lumpier than production. It’s impossible to know if a developer will create a job that is ridiculously expensive. That’s why we need Juju, actually.
James Beedy, OmniVector Solutions

How did you begin packaging Spark applications?

I was presented with multiple data processing challenges that require highly parallel and distributed solutions. As I started researching these problems, I quickly determined that Apache Spark was a perfect fit for many of them.

Knowing this, I needed to learn more about how Spark applications were deployed and managed. I spent a fair amount of time digging into the Spark documentation and experimenting with different configurations and deployment methodologies.

As someone fairly new to Spark at the time, the top level differentiator that stuck out to me was that there were two primary types of workloads that Spark could facilitate: batch processing and stream processing.

Batch processing is essentially static processing. You have input data in one location, load it, process it, write the output to the output location. Work is batched up and one job is more-or-less independent from the next. Aggregation and summaries are very important here.

Stream processing has less to do with aggregating over large static sources, and more to do with in-flight processing of data that might be generated on-the-fly. Think sensor readings or processing events generated by some backend server.

Another way to distinguish the two models is to think about whether the data comes from disk. As far as Spark is aware of, data for streaming workloads never touches disk. Batch processing, on the other hand, is 100% stored on disk.

Which of those models best describes your use case?

PeopleDataLabs largely does batch, extract transform load (ETL) processing, but with two different modes: production and development.

What is a production job for your team?

Production workloads are ETL-type jobs. Data is stored in some obscure format, loaded into Spark, molded and interrogated, and finally transformed and then dumped back to disk. These production jobs are semi-automated, headless Spark workloads running on Yarn and backed by HDFS.

HDFS?

The Hadoop File System. It spreads files over many machines and makes them available to applications as if they were stored together. It is the workhorse that created the Big Data movement.

Great. How do things look like in development?

Development workloads consist of Jupyter Notebooks running on similar clusters with a similar setup. Apache Spark backed by Yarn and HDFS clusters.

Development jobs are heavily interactive. From an ops perspective, they’re much lumpier than production. It’s impossible to know if a developer will create a job that is ridiculously expensive. That’s why we need Juju, actually.

We want to be able to create and maintain multiple isolated environments that share common infrastructure. Anyway, this interactive development workflow generates our ETL workloads that are later deployed into production.

That’s pretty important background. How did you get started rolling it out?

Right, so after doing some research and diving into our codebase, I developed a checklist of features that I needed to support.

Moving away from HDFS

I was targeting a batch processing architecture. Data movement is critical in that environment. That means a fat pipe connecting big disks.

Here is the list requirements. Some firmer than others.

We knew that were using HDFS for our distributed backend. HDFS is where the input and output data goes. But what was surprising after looking deeper that the only component of upstream Hadoop we were using was HDFS. Almost everything else was purely Spark/Pyspark.

Our codebase was dependent on the Spark 2.2.0 API. Everything we wanted to deploy needed to support that version.

Every node in the cluster needed to have identical dependencies in identical locations on the filesystem in order for a job to run. In particular, we needed a few Hadoop libs to exist in the filesystem where the Spark executors run.

Our method of distributing code changes to nodes in the cluster was to zip up the Jupyter Notebook’s work directory and scp to every node in the cluster.

So those are your requirements. Were there any “nice to haves”?

For sure! Replacing HDFS with a more robust storage solution would completely remove our dependency on Hadoop. This would significantly reduce the amount of effort it would take to maintain the stack. The more we could simplify, the better.

At this point, you probably experimented with a practical evaluation?

Right. At this point I had a clear understanding of the requirements. I started out by looking into swapping out CDH with Charmed Bigtop. The prototype highlighted two huge technical issues that needed to be overcome:

Charmed Bigtop supported the Spark 2.1 API but PeopleDataLabs jobs required version 2.2 of the Spark API.
PeopleDataLabs needed to sustain 1PB of SSD storage across all Hadoop task nodes. Charmed Bigtop only provisions HDFS onto the root directory.

This reaffirmed that a more flexible solution would be needed to meet the job processing requirements .

They sound like fairly critical shortcomings..

Well they meant that I knew right from the start that the Charmed Bigtop stack would not work for us out of the box. I needed to find another solution to provision our Spark workloads.

Following further analysis on moving forward with HDFS, it made a lot of sense to look into decoupling the storage infrastructure from the compute nodes.

What stack did you end up choosing?

We ended up deploying S3A with Ceph in place of Yarn, Hadoop and HDFS.

Interesting. Why?

There were many upsides to this solution. The main differentiators were access and consumability, data lifecycle management, operational simplicity, API consistency and ease of implementation.

Could you please explain what you mean by “access” and “consumability”?

If we wanted to have multiple Hadoop clusters access the data, we needed to have the data on the respective HDFS filesystem for each cluster. This is a large part of the contention our developers were experiencing. Specifically, access to the data, the ability to process it, and the resources needed to house it.

Moving the data to Ceph and accessing it remotely via S3 solved this problem. By allowing many Spark jobs to read and write data simultaneously, consumability and access were no longer an issue.

It was quickly becoming clear that Ceph would provide a much more robust approach to distributed storage. On top of that, the Juju Ceph charms make deployment straight forward and painless.

What’s “data lifecycle management”? What problems were you facing that are solved by your new storage backend?

Migrating the data back and forth from HDFS to long term storage was a huge pain point. We could process data on HDFS just fine, but getting the data in and out of the distributed file system to a proper long term storage system on a frequent basis was creating contention. It limited how often jobs could be run. A user would need to wait for data to finish transferring before the cluster was available to run the next job.

That sounds like a great win for PeopleDataLab’s developers. You mentioned “API consistency and scale” as well. Would you mind explaining how those two are related?

Simply put, we can make Ceph talk to AWS S3. That makes it really easy to point our Spark jobs at wherever the data lives in Ceph.

The Hadoop AWS module makes this very easy. Plugging that into Ceph and Radosgw (HTTP REST gateway for the RADOS object store) meant that remote access via Spark is suddenly compatible with S3.

Decoupling storage from compute by moving to Ceph S3 opened up a whole new world for us. We were already using object storage a fair amount for other purposes, just not for our processing backend.

This change allowed us to run jobs in the public cloud using Spark in the same way jobs are executed on premise.

That’s really cool. I suppose that feeds into your “operational simplicity” point?

That’s right. After decoupling the storage with Ceph and dropping the need for HDFS, we only had to account for a single process: Spark.

Previously we had to account for a whole ocean of applications that needed to run in harmony. That was the reality for us running Pyspark on top of Cloudera Hadoop.

Having the data on separate infrastructure allowed us to manage the compute and storage independent of each other. This enhanced user access, made the data lifecycle simpler, and opened up doorways for us to more easily run our workload in the cloud and on-prem alike.

Day-N benefits of Juju

Juju handles the details. All I needed to do was get Spark running with storage support. Juju completely manages the gritty details of talking to the underlying hosting infrastructure.
James Beedy, Omnivector Solutions

Supporting two hosting environments—cloud and on-prem—doesn’t sound like it matches your “ease of implementation” point.

That’s where Juju comes in. Juju handles the details. All I needed to do was get Spark running with storage support. Juju completely manages the gritty details of talking to the underlying hosting infrastructure.

Knowing that I wasn’t going to need to account for HDFS anymore, I took a closer look at Spark’s internal storage requirements: Spark has a need for an executor workspace and a cache.

From prior experience I knew that Juju’s storage model fit this use case.

Sorry to interrupt, but could you please explain the term “Juju storage model” for anyone that’s not familiar?

Juju allows you to provision persistent storage volumes and attach them to applications. Actually that’s not the whole story. The Juju community would use the phrase, “Juju allows you to model persistent storage.”

Everything managed by Juju is “modelled”. The community calls this “model-driven operations”. You declare a specification of want you, e.g. “2x200GB volumes backed by SSD”. Juju implements that specification.

The term modelling is used because storage—and other things managed by Juju such as virtual machines and networking—is, in some sense, abstract. When we’re talking about an entire deployment, we actually don’t care about a block device’s serial number. And from my point of view, I just care that Spark will have access to sufficient storage to get its job done.

Okay cool, so it’s easy to give Spark access to two storage volumes, one for its executor workspace and the other for its working cache.

That’s right. Juju completely solves the storage challenge. But now we need to package the various Hadoop libs that Spark needs for our specific use case. Spark versions are tied to Hadoop versions, so this is more complicated than it should be.

Provisioning the Spark and Hadoop libs seemed to be a perfect fit for Juju resources. I downloaded the Spark and Hadoop upstream tarballs, attached as charm resources via layer-spark-base and layer-hadoop-base. And it worked perfect!

“Juju resources”?

Ah right, sorry – more jargon. A resource is some data, such as a tarball of source code, that’s useful for the charm. So, our tarballs are considered resources from Juju’s perspective.

A resource can be any binary blob you want to couple with your charm. Resources can be versioned and stored with the charm in the charmstore, or maintained separately and supplied to the charm by the user on a per deployment basis.

And what do you mean by “layer-spark-base” and “layer-hadoop-base”?

Layers are reusable code snippets that can be used within multiple charms. Charm authors use layers to create charms quickly.

A charm is a software package that’s executed by Juju to get things done, like install, scale up, and some other functionality that we’ve touched on such as storage and network modelling.

Our private code dependencies and workflow was accommodated via another layer: layer-conda. This allowed our developers to interface to Juju to get our code dependencies on to the compute nodes.

I wrapped our data processing code up as a Python package. This allowed our developers to use the layer-conda config to pull our code dependencies into the conda environment at will. It also provides a more formal interface to managing dependencies.

Combining layer-conda, layer-spark, layer-hadoop, and layer-jupyter-notebook I was able to create a much more manageable code execution environment that featured support for the things our workload required.

If I’m hearing this correctly, you have the bulk of your implementation within 5 or so different code libraries—called layers—that allowed you to not only to deploy Spark on Ceph/S3A, but also enable developers to iterate on and deploy new workflows directly to production Spark cluster.

More or less. It’s pretty cool. But the solution itself wasn’t entirely optimal.

What’s wrong with what you deployed?

I’ve covered a lot of ground. Perhaps before answering, I’ll review where we got to, implementation-wise.

I swapped out the standard HDFS backend for a Juju-deployed Ceph with object storage gateway. In this new architecture we were running Spark standalone clusters that were reading and writing data to the S3 backend using the S3A to communicate with an S3 endpoint from Spark.

Decoupling the storage and ridding ourselves of HDFS was huge in many ways. The Spark storage bits are handled by Juju storage. This accomodates the storage needs of Spark really well. The code dependencies bit via layer-conda was a huge improvement in how we were managing dependencies.

My Spark project had come a long way from where it started, but was nowhere near finished.

The build and runtime dependency alignment across the entire Bigtop stack is of critical importance if you intend to run any Apache software components in conjunction with any other component of the Apache stack. Thus illuminating the genius and importance of the build system implemented by the original Charmed Bigtop stack. This also shed light on reasons why my slimmed down solution wasn’t really full circle.

I realized that if I wanted to make a Spark charm that allowed for Spark and Hadoop builds to be specified as resources to the charm, that I would need a way of building Spark against Hadoop versions reliably.

Recommendations for other people deploying Apache Spark

So, is your solution something that you recommend for others?

Ha, well like a lot of things in technology – it depends.

The Spark standalone charm solution I created works great if you want to run Spark Standalone. But has its snares when it comes to maintainability and compatibility with components of the greater Apache ecosystem.

Without context, it’s impossible to know which Spark backend and deployment type is right for you. And even once you have an architecture established, you also need to decide where to run it.

I’ve evaluated three alternatives that suit different use cases: Elastic Map Reduce (EMR), Docker and Kubernetes with Ceph.

EMR is great if you want a no-fuss, turnkey solution. Docker provides more flexibility, but still suffers from misaligned dependencies. In some ways, wanting to make use of Docker while keeping everything consistent naturally leads to Kubernetes.

Elastic Map Reduce (EMR)?

AWS provides Elastic Map Reduce, Hadoop as a service. You give it one or more scripts, give them to the spark-submit command to run and voilà. Behind the scenes, custom AMIs launch and install emr-hadoop. Once the instances are up, EMR runs your job.

It’s easy to ask the EMR cluster to terminate after it completes the jobs you’ve defined. This gives you the capability to spin up the resources you need, when you need them, and have them go away when the provided steps have been completed.

In this way EMR is very ephemeral and only lives for as long as it takes to configure itself and run the steps.

Using EMR in this way gave us a no-fuss, sturdy interface to running Spark jobs in the cloud. When combined with S3 as a backend, EMR provides a hard to beat scale factor for the number of jobs you can run and amount storage you can use.

What’s it like to run Spark in Docker containers?

I faced challenges in the likes of mismatched build and runtime versions and mismatched dependencies. It was a pain.

This issue became more prevalent as I tried to package Spark, Hadoop and Jupyter Notebook in hopes of getting remote EMR execution to work with yarn-client.

There were subtle mismatched dependencies all over the place. An emr-5.27.0 cluster is going to run Spark 2.4.4 and Hadoop 2.8.5. This means if you want to execute Jupyter Notebook code against the EMR cluster, the Spark and Hadoop dependencies that exist where the Jupyter Notebook is running need to match those provisioned in the EMR cluster. Keeping everything in sync is not easy.

Dockerizing Spark below version 3.0.0 is tedious as it was built from an unsupported Alpine image. Newer versions of Spark are more viable. Spark now uses a supported OpenJDK base built on debian-slim. This makes building on top of spark images far more streamlined.

On the plus side, once Juju is installed, you can deploy Jupyter + Apache Spark + Hadoop using Docker with in a single line of code:

$ juju deploy cs:~omnivector/jupyter-docker

cs:~omnivector/jupyter-docker is a “Juju Charm”. Once you’ve deployed it, you can change the Docker image that’s running via changing configuration settings:

$ juju config jupyter-docker \
    jupyter-image="omnivector/jupyter-spark-hadoop-base:0.0.1"

Alternatively, you can supply your image at deployment time:

juju deploy cs:~omnivector/jupyter-docker \
    --config jupyter-image="omnivector/jupyterlab-spark-hadoop-base:0.0.1"

Example Docker images compatible with cs:~omnivector/jupyter-docker are available from our open source code repositories.

Yay for things actually working. And Kubernetes?

In the progression of packaging and running workloads via docker, you can imagine how we got here. Running on Kubernetes has provided improvements in multiple areas of running our workload.

It’s possible to build the Jupyter Notebook image and the Spark driver image from the image that your executors run. Building our Spark application images in this way, provided a clean and simple way of using the build system to organically facilitate the requirements of the workload. Remember the dependencies and file system need to be identical where the driver and executors run.

This is the largest come up of all in the packaging of Jupyter/Spark applications; the ability to have the notebook and inherently also Spark driver image built from the same image as the Spark executors.

To facilitate this all happening, the layer-jupyter-k8s applies a role to grant the Jupyter/Spark container the permission needed to provision other containers (Spark workloads) on the Kubernetes cluster. This allows a user to login to the Jupyter web interface and provision Spark workloads on demand on the Kubernetes cluster via running cells in the notebook.

I have a few high level takeaways.

Multi-tenancy is great. Many developers can execute against the k8s cluster simultaneously.
Dockerized dependencies. Package your data dependencies as docker containers. This works really well when you need to lock in Hadoop, Spark, conda, private dependencies, and other software to the version for different workloads.
Increased development workflow. Run workloads with different dependencies without having to re-provision the cluster.
Operational simplification. Spark driver and executor pods can be built from the same image.

Some drawbacks of Kubernetes and Ceph vs HFDS:

Untracked work. Spark workloads provisioned on k8s via Spark drivers are not tracked by Juju.
Resource intensity. It takes far more mature infrastructure to run Ceph and K8s correctly than it does to run Hadoop/HDFS. This comes down to the fact that you can run a 100 node Hadoop/HDFS installation on an L2 10G network by clicking your way through the CDH GUI. For Ceph and K8s to work correctly, you need to implement an L3 or L2/L3 hybrid network topology that facilitates multi-pathing and scaling links in a pragmatic way, you can’t just use L2, 10G if you want to do Spark on K8S + Ceph past a few nodes.

Given what we’ve talked about, you can see how packaging Spark and Hadoop applications can get messy. Using Juju though, developers and ops professionals alike can model their software application deployments. The modelling approach alleviates the person implementing and maintaining the software from monotonous work. This allows engineers to spend cycles where it counts. Juju can handle the rest.

Juju is simple, secure devops tooling built to manage today’s complex applications wherever you run your software. Juju can be used to deploy and manage any networked service, whether that service is delivered from bare metal hardware, containers or virtual machines. Visit the Juju website to learn more.

James Beedy is a dev ops specialist from Omnivector Solutions. Visit the Omnivector Solutions website and chat with them on Twitter at @OV_Solutions.

What is Ceph?

Ceph is a software-defined storage (SDS) solution designed to address the object, block, and file storage needs of both small and large data centers.

It's an optimized and easy-to-integrate solution for companies adopting open source as the new norm for high-growth block storage, object stores and data lakes.

Learn more about Ceph ›

How to optimize your cloud storage costs

Cloud storage is amazing, it's on demand, easy to implement, but is it the most cost effective approach for large, predictable data sets?

Understand the true costs of storing data in a public cloud, and how open source Ceph can provide a cost effective alternative.

Access the whitepaper ›

A guide to software-defined storage for enterprises

Ceph is a software-defined storage (SDS) solution designed to address the object, block, and file storage needs of both small and large data centres.

Explore how Ceph can replace proprietary storage systems in the enterprise.

Access the guide ›

Performant, reliable and cost-effective storage with Ceph

Canonical Ceph simplifies the entire management lifecycle of deployment, configuration, and operation of a Ceph cluster, no matter its size or complexity. Install, monitor, and scale cloud storage with extensive interoperability.

Find out how Ceph scales effortlessly and cost-effectively ›

Data Ops at petabyte scale

Tim McNamara

Introducing the Data Ops problem

Packaging Spark for repeated and ad-hoc workloads

Moving away from HDFS

Day-N benefits of Juju

Recommendations for other people deploying Apache Spark

What is Ceph?

How to optimize your cloud storage costs

A guide to software-defined storage for enterprises

Performant, reliable and cost-effective storage with Ceph

Newsletter signup

Related posts

Predict, compare, and reduce costs with our S3 cost calculator

MicroCeph: why it’s the superior MinIO alternative (and how to use it)

The hitchhiker’s guide to infrastructure modernization