Migrating from Cloudera to a modern data hub architecture

robgibbon

on 22 February 2024

In the early 2010s, Apache Hadoop captured the imagination of the tech community. A free and powerful open source platform, it gave users a way to process unimaginably large quantities of data, and offered a dazzling variety of tooling to suit nearly every use case – MapReduce for odd jobs like processing of text, audio or video; Hive for SQL-based data warehousing; Pig, an unusual language with a similar data warehousing goal; HBase, Oozie, Sqoop, Flume and a whole parade of other tools for processing massive datasets at scale.

With the surge in interest in Hadoop, a number of startups and established software companies jumped to provide commercial offerings of it. Cloudera was one of the first into the game with CDH – a Hadoop distribution. CDH was initially offered as Deb packages and RPMs, but Cloudera quickly introduced Cloudera Manager, a sophisticated, web-based management system to deploy, maintain, and operate Hadoop clusters. With the introduction of Cloudera Manager, Cloudera established itself as the market leader. Over time, it consolidated its position, merging with competitor Hortonworks.

A lot has changed in the nearly two decades since Hadoop’s release

Hadoop was designed and built as a hyperscale system to be deployed on premises in the data centre, well before public cloud computing had established its dominance and become arguably the most popular way to deliver IT services.

Certain design assumptions made in Hadoop, like the paradigm “bring the compute to the data”, made sense in the context of a data centre of the 2000s, with 1GbE networking and a relatively low cost per GB for direct attached storage media. But many of those design assumptions make little sense on the cloud, where local block storage devices are costly compared with remote, highly durable object storage, which is offered at a far lower cost per GB.

Then there were parts of Hadoop’s critical architecture – for instance YARN and Kerberos – which over time proved to be difficult to work with. YARN is a complex cluster job scheduler with a narrow focus on data processing, whilst the Kerberos security protocol used by Hadoop has long been a bugbear for many administrators.

In short: Hadoop was never designed with cloud computing in mind.

Between the ageing architecture and the complexity of the platform, many organisations are looking to move away from Hadoop and from Cloudera, and are seeking a state-of-the-art alternative that is more aligned with modern cloud-native computing principles and optimised for low cost operation in the contemporary cloud context.

Cloudera migration alternatives

When architects are looking for alternative data hub platforms, they tend to seek a solution that’s free, open source, powerful, and flexible. Like Hadoop, it should be capable of processing extremely large quantities of data, and give them flexibility in features to suit a wide array of use cases. But these days, the solution needs to be cloud ready, capable of auto scaling, and must run efficiently with a low operational cost.

Charmed Spark from Canonical is founded on Apache Spark, a mature and sophisticated data processing framework widely used with Hadoop. Spark supports data engineering, data lakehouse, and data science use cases for AI/ML and has been widely adopted by the Big Data user community. Charmed Spark entirely replaces Hadoop YARN with a more general-purpose, extensible cluster resource manager: Kubernetes. Kubernetes has become by far the most popular cluster resource manager on the market today, with a flexible palette of features and plugins.

How to move legacy workloads off Hadoop

You’re leaving 2010 behind, but you aren’t giving up its data – nor its processing logic. You need to make sure that your chosen migration plan makes it relatively straightforward to bring all of it with you. In our experience, this has been a major concern for clients planning a data hub migration to a modern, cloud-native infrastructure, and it’s why Charmed Spark is designed to offer a straightforward path from legacy Hadoop to a state-of-the-art, cloud-native data platform.

Charmed Spark includes a distribution of Apache Spark – the most popular of the Hadoop data processing frameworks – designed and built to run on Kubernetes. It offers an effective replacement for Hadoop, as its architecture entirely supersedes YARN and abstracts away the data storage tier to cloud object storage. Legacy Hadoop workloads such as Hive-based SQL data processing applications can often be readily migrated to the Spark framework, which has a high degree of compatibility with Hive, simplifying the transition from Hadoop. Charmed Spark is also available as a fully integrated offer for the data centre, including the Ceph object storage system and an advanced Kubernetes distribution from Canonical, easing transition from legacy Hadoop still further.
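To make the replacement of YARN with Kubernetes more concrete, here is a minimal sketch that assembles an upstream-style `spark-submit` command targeting a Kubernetes API server and asking Spark SQL to use a Hive-compatible catalog. The API server address, image name, namespace and application path are all hypothetical placeholders, and the command uses stock Apache Spark options; Charmed Spark's own client tooling may wrap or differ from this.

```python
import shlex


def build_spark_submit(api_server, image, app_path, namespace="spark", executors=2):
    """Assemble a spark-submit argument list that targets Kubernetes instead of YARN.

    All argument values are illustrative assumptions, not Charmed Spark defaults.
    """
    conf = {
        # Kubernetes takes over the resource manager role that YARN played.
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        # Ask Spark SQL for a Hive-compatible catalog, so tables defined
        # in an existing Hive metastore can remain visible to Spark jobs.
        "spark.sql.catalogImplementation": "hive",
    }
    args = ["spark-submit", "--master", f"k8s://{api_server}", "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    command = build_spark_submit(
        api_server="https://k8s.example.internal:6443",   # hypothetical endpoint
        image="registry.example.internal/spark:latest",   # hypothetical image
        app_path="local:///opt/jobs/daily_report.py",     # hypothetical job
    )
    print(shlex.join(command))
```

The function only builds the command; running it still requires a reachable cluster, a suitable container image and, for the Hive catalog, a Spark build with Hive support and access to the metastore.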

How to maintain flexibility in your cloud-native data hub design

A modern data hub is nothing if it lacks connectivity to popular object storage systems on the cloud. Whether it’s AWS S3, Azure Data Lake Storage, Google Cloud Storage, or API-compatible clones, a flexible data hub needs to be able to access and use all of these systems. Charmed Spark has been built with this in mind.
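As an illustration of what object storage connectivity can look like in practice, the sketch below builds Spark properties for the Hadoop S3A connector and rewrites a legacy HDFS path into an `s3a://` URI. The endpoint, bucket and paths are hypothetical, and the example covers only S3 and S3-compatible stores; Azure and Google Cloud Storage use their own connectors and URI schemes (`abfs://` and `gs://`).

```python
from urllib.parse import urlparse


def s3a_settings(endpoint=None, path_style=False):
    """Return Spark properties for the Hadoop S3A connector.

    `endpoint` is only needed for S3-compatible stores (for example a
    self-hosted object store); AWS S3 itself works without it.
    """
    settings = {}
    if endpoint:
        settings["spark.hadoop.fs.s3a.endpoint"] = endpoint
    if path_style:
        # Many S3-compatible stores expect bucket names in the URL path.
        settings["spark.hadoop.fs.s3a.path.style.access"] = "true"
    return settings


def to_object_store_uri(hdfs_uri, bucket):
    """Map an hdfs:// location onto the same path inside an S3A bucket."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    return f"s3a://{bucket}{parsed.path}"


if __name__ == "__main__":
    # Hypothetical endpoint and bucket names, for illustration only.
    print(s3a_settings("https://objects.example.internal", path_style=True))
    print(to_object_store_uri("hdfs://namenode:8020/warehouse/sales", "example-bucket"))
```

Keeping the path layout unchanged inside the bucket is one simple design choice: job code that reads `/warehouse/sales` needs only its URI prefix changed.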

Architects are also concerned with workload consolidation and efficiency. To address this, the solution supports the Volcano gang scheduler plugin for Kubernetes, which schedules all the pods of a job together so that cluster resources are not held by partially started jobs.
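For readers who want to see what enabling Volcano involves, here is a minimal sketch of the Spark properties that upstream Apache Spark (3.3 and later) documents for using Volcano as a custom scheduler on Kubernetes. It assumes Volcano is already installed on the cluster and that the Spark build includes Volcano support; the PodGroup template path is hypothetical, and Charmed Spark may expose these settings differently.

```python
# Feature step class shipped with upstream Spark's optional Volcano module.
VOLCANO_STEP = "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"


def volcano_conf(podgroup_template=None):
    """Return Spark properties that hand pod scheduling to Volcano."""
    conf = {
        "spark.kubernetes.scheduler.name": "volcano",
        "spark.kubernetes.driver.pod.featureSteps": VOLCANO_STEP,
        "spark.kubernetes.executor.pod.featureSteps": VOLCANO_STEP,
    }
    if podgroup_template:
        # Optional PodGroup template, e.g. to set a queue or minimum resources.
        conf["spark.kubernetes.scheduler.volcano.podGroupTemplateFile"] = podgroup_template
    return conf


def as_cli_flags(conf):
    """Render properties as repeated `--conf key=value` flags, sorted by key."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Hypothetical template path, for illustration only.
    print(" ".join(as_cli_flags(volcano_conf("/opt/spark/conf/podgroup.yaml"))))
```

The resulting flags would be appended to a `spark-submit` invocation alongside the usual Kubernetes master and image settings.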

Cloud lock-in remains a worry for many, and a key concern that a modern data hub architecture needs to address. Charmed Spark offers platform portability between clouds. The solution can be deployed to many popular cloud Kubernetes platforms, including AWS EKS and Google GKE.

Cloudera migration: how to migrate while keeping costs manageable

One of the biggest pressure points in a migration project is cost, and it’s often the deciding factor in new technology adoption, even beyond functionality. Like Hadoop, Charmed Spark ticks both boxes – cost and capability – as it’s free to deploy and use.

As with any other project, the cost concern is a long-term consideration: how do you keep your data hub secure, up to date, and protected, in a cost-effective manner? Charmed Spark offers long-term support and security maintenance commitments: the Charmed Spark solution is available with up to ten years of support per stable track – which includes security fixes and break/fix support with a choice of 24/7 or weekday SLA.

Users interested in learning more about Charmed Spark can contact our commercial team, access the product page or explore Canonical’s portfolio of data and AI solutions.
