Everyone hates waiting in a queue. On the other hand, when you’re moving gigabytes of data around a cloud environment, message queues are your best friend. Enter Apache Kafka.
Apache Kafka enables organisations to create message queues for large volumes of data. That’s about it: it does one simple but critical job in a cloud-native strategy, and does it really well. Let’s look at the main benefits, challenges and use cases of Apache Kafka, and the easiest way to get it running in production.
Apache Kafka – what is it?
You need to know three things – topics, partitions and replication.
Apache Kafka connects apps that publish data to apps that want to subscribe to that data. It first stores the data in a log called a topic. The topic preserves the order in which data arrives, because the publisher appends each record to the end of the log. Subscribing apps read from the log by asking for the data at a given offset, that is, a position in the log.
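To make the log-and-offset idea concrete, here is a minimal in-memory sketch. It is not a Kafka client: the `TopicLog` class and its `append` and `read` methods are hypothetical names invented for this illustration.

```python
class TopicLog:
    """In-memory stand-in for one ordered log (illustration only, not Kafka)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Add a record to the end of the log and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records (offset, value) pairs starting at offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, value) for i, value in enumerate(chunk)]


log = TopicLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# A subscriber that has already processed offset 0 resumes from offset 1.
print(log.read(offset=1))  # [(1, 'login'), (2, 'purchase')]
```

Because the subscriber, not the log, remembers the offset, several subscribers can read the same data at their own pace.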
To make sure publishing and subscribing can happen at the speed and scale the cloud environment needs, Apache Kafka partitions data. This means splitting a topic into several partitions, each of which is its own ordered log holding a share of the topic’s data.
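One way to picture partitioning is a producer that hashes each message key to pick a partition, so that all messages with the same key land in the same ordered log. The sketch below assumes a three-partition topic and uses `zlib.crc32` purely as a stand-in hash; Kafka’s own clients use a different hash function, and `choose_partition` is a hypothetical helper.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index. crc32 is an illustrative stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3  # assumed value for the example
partitions = [[] for _ in range(NUM_PARTITIONS)]

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "purchase"),
    ("user-3", "login"),
    ("user-1", "logout"),
]

for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# Every event for a given key sits in one partition, in the order it was sent.
for index, records in enumerate(partitions):
    print(index, records)
```

Order is only maintained inside a partition, which is why related messages are usually given the same key.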
Finally, partitions are replicated to ensure high availability and failure tolerance. Replication means multiple copies of partitions are made and the duplicates are stored in different locations, such as various data centres.
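The effect of replication can be sketched with a simplified round-robin placement of replicas across brokers. This is an illustration under assumed broker names, not Kafka’s actual assignment algorithm; `assign_replicas` and `replicas_after_failure` are hypothetical helpers.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def replicas_after_failure(assignment, failed_broker):
    """Return the replicas that remain for each partition when one broker is lost."""
    return {
        p: [b for b in replicas if b != failed_broker]
        for p, replicas in assignment.items()
    }


brokers = ["broker-a", "broker-b", "broker-c"]  # assumed names
assignment = assign_replicas(num_partitions=3, brokers=brokers, replication_factor=2)
print(assignment)
# {0: ['broker-a', 'broker-b'], 1: ['broker-b', 'broker-c'], 2: ['broker-c', 'broker-a']}

remaining = replicas_after_failure(assignment, "broker-b")
print(remaining)
# {0: ['broker-a'], 1: ['broker-c'], 2: ['broker-c', 'broker-a']}
```

With two copies of every partition on different brokers, losing any single broker still leaves at least one copy of each partition.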
Why use Apache Kafka – 2 ways it transforms clouds
Kafka solves scalability challenges because each partition of a topic can independently handle reads and writes from subscribers and producers. Partitions let Kafka serve many reads and writes in parallel, so speed holds up as the number of subscribers and producers increases. Kafka also keeps data in order within each partition when several producers write to the same topic at the same time, so no data is lost.
Scalability achieved with partitions allows organisations to grow their cloud environment and makes it easier to add new subscribers or publishers to existing topics. Apache Kafka lets a cloud environment scale horizontally, with new nodes easily added to existing infrastructure. Throughput on an existing topic can also be raised by adding partitions.
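To see why more partitions allow more parallel readers, consider sharing a topic’s partitions among a group of subscribing apps. The round-robin function below is a simplified sketch with assumed app names; Kafka’s consumer groups perform this kind of assignment with their own strategies, and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions, consumers):
    """Share partitions among consumers round-robin (simplified illustration)."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


six_partitions = list(range(6))

print(assign_partitions(six_partitions, ["app-1", "app-2"]))
# {'app-1': [0, 2, 4], 'app-2': [1, 3, 5]}

# Adding a third subscriber spreads the same partitions more thinly.
print(assign_partitions(six_partitions, ["app-1", "app-2", "app-3"]))
# {'app-1': [0, 3], 'app-2': [1, 4], 'app-3': [2, 5]}
```

Each added subscriber takes over some partitions, so the work per subscriber drops until there are as many subscribers as partitions.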
The reliable choice
Next, Apache Kafka is a robust solution because replicas increase fault tolerance and reliability. Without replication, a Kafka topic would be a single point of failure, and so replicas act as redundancy in emergencies. When a server with a partition goes down, there are separate copies, so the data isn’t lost.
Replicas make Kafka fit for mission-critical workloads. By storing replicas in different availability zones or regions, a cloud environment can reach high availability and improve its uptime.
How is Apache Kafka used?
So far, we’ve learnt the fundamental concepts of Apache Kafka and why it is advantageous to include in cloud environments. Now, let’s focus on practical applications:
- Stream processing: It enables you to create real-time data streams. Subscribing apps can process data and transform it, before publishing the data to other subscribing apps.
- Fast message queue: In a literal sense, it can be used to send and receive messages such as emails and IMs. More generically, it allows message passing in a microservice architecture. Kafka moves messages without knowing the format of the data, and this means it can do so very fast – endpoints decode data with no overhead in the transit process.
- Data aggregation: Multiple producers can write to one common topic. Kafka handles the complexity of having numerous producers append to a time-sensitive log, and has built-in functionality to arbitrate clashes between concurrent writes. This is useful for log data, as logs from multiple sources can be combined.
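The stream processing and message passing patterns above can be sketched together: a subscribing app reads opaque bytes from one topic, decodes and transforms them, and publishes the result to a second topic. Plain Python lists stand in for the topics here, and the sensor payload, the topic names and the `transform` function are assumptions made for the example.

```python
import json


def transform(raw: bytes) -> bytes:
    """Decode a reading, convert Celsius to Fahrenheit, and re-encode it."""
    reading = json.loads(raw)
    reading["fahrenheit"] = reading.pop("celsius") * 9 / 5 + 32
    return json.dumps(reading).encode("utf-8")


# The "topic" only ever holds bytes; it never inspects the payload format.
raw_topic = [
    json.dumps({"sensor": "s1", "celsius": 20.0}).encode("utf-8"),
    json.dumps({"sensor": "s2", "celsius": 25.0}).encode("utf-8"),
]

# The processing app consumes, transforms and publishes to a second topic.
converted_topic = [transform(record) for record in raw_topic]

for record in converted_topic:
    print(json.loads(record))
# {'sensor': 's1', 'fahrenheit': 68.0}
# {'sensor': 's2', 'fahrenheit': 77.0}
```

Only the endpoints know the data is JSON; the layer in between just moves bytes, which is what keeps transit cheap.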
Challenges of using Apache Kafka
We know Apache Kafka has the features for scalability (partitioning) and reliability (replication). However, applying these features to a business use case takes planning and architecture design.
First, there are physical constraints on how scalable and reliable Kafka can be. Users need to make sure clusters have adequate network bandwidth and disk space.
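A rough back-of-the-envelope calculation helps when checking disk space. The sketch below assumes that storage grows with the write rate, the retention period and the number of replicas, plus some headroom. The function name, its parameters and the example figures are all assumptions for illustration, not sizing guidance.

```python
def required_disk_gb(write_mb_per_s, retention_hours, replication_factor, headroom=1.25):
    """Estimate cluster-wide disk needs in GB for one retention window."""
    retained_mb = write_mb_per_s * 3600 * retention_hours  # data kept per copy
    total_mb = retained_mb * replication_factor * headroom  # all copies plus spare room
    return total_mb / 1024


# Assumed workload: 10 MB/s of writes, kept for 24 hours, with 3 replicas.
estimate = required_disk_gb(write_mb_per_s=10, retention_hours=24, replication_factor=3)
print(f"{estimate:.1f} GB")
```

Doubling either the retention period or the replication factor doubles the estimate, which is why both are worth deciding before hardware is provisioned.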
Second, the number of partitions for a topic needs careful consideration. If it is too high, the overhead of managing them will slow the system down. Too low, and publishers or subscribers stall while waiting for access to the topic.
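A commonly cited rule of thumb for a starting point is to divide the target throughput by the throughput a single partition can sustain, once for producers and once for subscribers, and take the larger result. The sketch below assumes you have measured those per-partition figures yourself; `estimate_partitions` and the example numbers are hypothetical.

```python
import math


def estimate_partitions(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Return the smallest partition count that meets the target on both sides."""
    for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(for_producers, for_consumers)


# Assumed measurements: 20 MB/s per partition when writing, 10 MB/s when reading.
print(estimate_partitions(100, 20, 10))  # 10
```

The result is a lower bound to test against, not a final answer, since the slower side of the pipeline sets the number.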
Finally, replication will only lead to high availability and improved reliability if replicas are stored on different servers, and ideally in different availability zones or regions. The physical cloud environment must therefore be designed so that replication can meet business requirements.
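A simple check makes this point tangible: given where each replica lives, list the partitions whose copies all sit in one availability zone. The broker names, zone names and the `partitions_at_risk` helper are assumptions made for this sketch.

```python
def partitions_at_risk(assignment, broker_zone):
    """Return partitions whose replicas are all in a single zone."""
    at_risk = []
    for partition, replicas in assignment.items():
        zones = {broker_zone[broker] for broker in replicas}
        if len(zones) < 2:
            at_risk.append(partition)
    return sorted(at_risk)


# Assumed layout: two brokers share zone-a, one broker is in zone-b.
broker_zone = {"b1": "zone-a", "b2": "zone-a", "b3": "zone-b"}
replica_assignment = {0: ["b1", "b2"], 1: ["b2", "b3"], 2: ["b3", "b1"]}

print(partitions_at_risk(replica_assignment, broker_zone))  # [0]
```

Partition 0 has two replicas, yet an outage of zone-a would take both offline, so the replica count alone does not guarantee availability.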
Canonical’s webinar on Kafka in production discusses these three topics and more. You can watch it on-demand now.
Optimised Apache Kafka, managed for you
In the spring, we released Managed Apps, which includes Apache Kafka. With Canonical managing your app, you get the following benefits:
- We provide the highest quality deployment, optimised to your business needs and with automation wrapped around critical tasks. Automation, achieved by an operator framework, reduces response times and human error during operation.
- Quality deployment leads to fewer day-2 errors, the main driver of TCO, and so we can offer app management at the lowest possible price-point. It also means your teams are free from day-to-day management and can focus on what matters: your business.
- We manage on any conformant Kubernetes and virtually any cloud environment. This means you know who is responsible for the management, with fewer grey areas.
- Fast and lower-risk start. Our team does the heavy lifting so you can focus on using apps and transforming your cloud.
Apache Kafka offers a general-purpose backbone for all your cloud’s data needs. It provides practical solutions to get the reliability and scalability needed in any cloud environment. It is flexible enough to be essential for many use-cases. To optimise your deployment, improve quality and economics, speak to our engineers today.