Real-time data transmission techniques have evolved rapidly over the years because of the challenges they pose, such as latency, reliability, and scalability. There has been extensive research on how to overcome these obstacles, and one such effort led to the development of Apache Kafka at LinkedIn around 2011.
At Kimshuka Technologies, we have built several real-time streaming applications for our clients using both Kafka and RabbitMQ (message queues). Having experience with both technologies, we decided to run a time and memory analysis of how these systems behave when a really high load of data is being streamed.
To start with the technology stack used in both systems: we selected a sensor-based home automation dataset available on Kaggle to feed into our real-time streaming systems, and we added random text to the records to increase the size of the ingested data. We used the Confluent version of Kafka, which provides various serialization and deserialization techniques (we will discuss them as well), RabbitMQ as the message queuing system, and Python as the programming language for our application.
Let us understand the architecture of both RabbitMQ and Kafka, which helps us decode how each tries to resolve the issues that real-time streaming poses.
A message queue is a queue of messages sent between applications, accessible in sequence by the receiver. It contains work objects that are waiting to be processed. In a message queue, each message or task can be processed only once, by only one consumer. It is generally used to decouple systems and services. There are three components in a message queue: the producer, the queue itself, and the consumer.
The issue with message queues is that a single message cannot be assigned to multiple consumers; once consumed, a message cannot be processed again. If the latest message in the queue is intended for an otherwise idle channel, that channel's consumer keeps polling idly until the message works its way to the front of the queue. And if a consumer fails, the message is lost.
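To make this queue flow concrete, here is a minimal sketch using RabbitMQ's Python client, pika. The queue name, localhost connection, and sample payload are assumptions for illustration, not our production setup.

```python
# Minimal RabbitMQ sketch using the pika client (queue name and host are assumed).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="sensor_data")  # hypothetical queue for the sensor dataset

# Producer side: publish one record to the queue.
channel.basic_publish(
    exchange="",
    routing_key="sensor_data",
    body='{"sensor": "temp", "value": 23.5}',
)

# Consumer side: each message is delivered to exactly one consumer and is
# removed from the queue once it has been consumed.
def on_message(ch, method, properties, body):
    print("Received:", body)

channel.basic_consume(queue="sensor_data", on_message_callback=on_message, auto_ack=True)
channel.start_consuming()  # blocks, polling the queue for new messages
```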
Kafka follows the publisher-subscriber architecture pattern. In a Kafka system, there are multiple producers, and producers can produce messages for multiple topics (topics are similar to channels in message queues). The messages are passed to message brokers. Consumers poll for messages on specific topics and read them from the broker, and a consumer can consume messages on different topics multiple times. Brokers therefore offer larger benefits than the queues in message-queue systems. Apart from the producer, broker, and consumer, Kafka has other components. Let us dive deep into the architecture of Kafka and understand it better.
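As a rough sketch of this publisher-subscriber flow with the Confluent Kafka Python client (the topic name, broker address, and consumer group below are assumptions for illustration):

```python
# Sketch of Kafka's publish-subscribe flow with confluent-kafka (assumed topic/broker).
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("home-automation", value=b'{"sensor": "temp", "value": 23.5}')
producer.flush()  # wait until the broker has acknowledged the message

# Consumers in different groups can each read the same topic independently.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-group",    # hypothetical consumer group
    "auto.offset.reset": "earliest",  # read the topic from the beginning
})
consumer.subscribe(["home-automation"])

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print("Consumed:", msg.value())
consumer.close()
```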
Let us discuss one of the important components, serialization and deserialization, and leave the other components for a different blog.
Serialization is the process of converting a data object into a sequence of bytes. A Kafka broker only works with bytes: Kafka stores records as bytes, and when a fetch request comes in from a consumer, Kafka returns records as bytes. This also helps in transmitting data across various systems. Deserialization does exactly the opposite of serialization; it converts the bytes back into the data object that was originally created.
In Confluent Kafka, we have the following serialization and deserialization techniques:
Avro - A widely used serialization technique where we specify the schema in an .avsc file.
JSON - A serialization technique where the JSON schema is specified, and we use Python dictionaries to represent it in our application.
Protobuf - Developed by Google, Protobuf (Protocol Buffers) uses schema definitions that are compiled by the protobuf compiler and used in the application.
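As an example of how one of these serializers might be wired up, here is a hedged sketch using Confluent's Schema Registry client with Avro. The schema, topic name, and registry URL are illustrative assumptions, not the schema of our actual dataset.

```python
# Sketch: producing Avro-serialized records via Confluent's Schema Registry
# (schema, topic, and URLs are illustrative assumptions).
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor", "type": "string"},
    {"name": "value", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
record = {"sensor": "temp", "value": 23.5}

# The serializer turns the Python dict into Avro-encoded bytes; the broker only sees bytes.
payload = avro_serializer(record, SerializationContext("home-automation", MessageField.VALUE))
producer.produce(topic="home-automation", value=payload)
producer.flush()
```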
With this introduction, let's move on to the process we followed to analyse the results:
We start the data ingestion as a real-time stream in the following steps: 100 records, 1,000 records, 10,000 records, 100,000 records, 503,911 records (the complete dataset), and then grow the dataset by appending the same dataset to itself, giving 1,007,822 records, 1,511,733 records, and 2,015,644 records. This effectively means we test with around 2.1 GB of data, roughly 20 lakh (2 million) records, being ingested into each real-time streaming system.
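A rough sketch of how these progressively larger datasets could be prepared with pandas (the file name and slicing are assumptions for illustration):

```python
# Sketch: build progressively larger test datasets by appending the dataset to itself
# (file name and sample sizes are assumptions for illustration).
import pandas as pd

base = pd.read_csv("home_automation_sensor_data.csv")  # complete dataset, ~503,911 records

sizes = [100, 1_000, 10_000, 100_000, len(base)]
datasets = [base.head(n) for n in sizes]

# Append the full dataset to itself to reach roughly 1M, 1.5M, and 2M records.
for multiplier in (2, 3, 4):
    datasets.append(pd.concat([base] * multiplier, ignore_index=True))

for df in datasets:
    print(len(df), "records")
```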
We run each of the datasets into the following real-time streaming applications:
1. Message queue using RabbitMQ
2. Confluent Kafka with JSON Serialization
3. Confluent Kafka with Avro Serialization
4. Confluent Kafka with Protobuf Serialization
During the process of streaming data into Kafka, transmission from the producer is so fast that the memory buffer allocated to the Kafka producer fills up, raising an exception before the messages even reach the consumer. To resolve this, we call the poll(1) method on the producer object, which polls the producer for events and calls the corresponding callbacks (if registered). We can also call the flush() function, which waits for all messages in the producer queue to be delivered; it is a convenience method that calls poll() until len() is zero or the optional timeout elapses. We call flush() at the end of our application so that nothing is left in the buffer before we start the next application.
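A sketch of how this can be handled in the producer loop follows. The retry-once approach and the sample records are assumptions showing one reasonable way to apply poll(1) and flush(), not the only way.

```python
# Sketch: handling a full producer buffer with poll(1) before retrying,
# and flushing at the end of the run (topic and records are assumed).
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
records = [b'{"sensor": "temp", "value": 23.5}'] * 100_000  # placeholder dataset

def send(topic, payload):
    try:
        producer.produce(topic, value=payload)
    except BufferError:
        # The local producer queue is full: serve delivery callbacks for up to
        # 1 second, which frees buffer space, then retry the same record.
        producer.poll(1)
        producer.produce(topic, value=payload)

for record in records:
    send("home-automation", record)

# Drain the buffer so nothing is left before the next application starts.
producer.flush()
```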
We have tracked the following parameters during application execution for the execution-time analysis:
total_data (records): The number of rows of the dataset being streamed.
buffer_clear_count: The number of times the buffer is cleared.
actual_docs_per_second (documents/second): The number of documents streamed per second when the buffer clear time is included.
overall_time_taken (seconds): The total time taken, including the buffer clear time.
buffer_clear_time (seconds): The time taken for the buffer to clear.
transfer_time_taken (seconds): The time taken by the system when the buffer clear time is excluded.
overall_docs_per_second (documents/second): The number of documents transferred per second when the buffer clear time is excluded.
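A minimal sketch of how these timings can be captured around the producer loop (variable names mirror the parameters above; this is illustrative, not our full benchmarking harness, and the topic and sample records are assumptions):

```python
# Sketch: capturing the timing parameters listed above (illustrative only).
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
records = [b'{"sensor": "temp", "value": 23.5}'] * 100_000  # placeholder dataset

total_data = len(records)
buffer_clear_count = 0
buffer_clear_time = 0.0

start = time.perf_counter()
for record in records:
    try:
        producer.produce("home-automation", value=record)
    except BufferError:
        buffer_clear_count += 1
        t0 = time.perf_counter()
        producer.poll(1)  # give the producer time to clear its buffer
        buffer_clear_time += time.perf_counter() - t0
        producer.produce("home-automation", value=record)
producer.flush()

overall_time_taken = time.perf_counter() - start             # includes buffer clears
transfer_time_taken = overall_time_taken - buffer_clear_time  # excludes buffer clears
actual_docs_per_second = total_data / overall_time_taken
overall_docs_per_second = total_data / transfer_time_taken
```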
Interpreting the results, we can see that Kafka with Protobuf serialization takes less time than any of the other systems.
Now, it's time to analyse the memory consumption:
From the memory consumption analysis, we can clearly see that Kafka with Protobuf consumes less memory than all the other systems.
With this analysis, Kafka with the Protobuf serializer is the clear winner on both memory and time. We therefore decided to use Kafka with the Protobuf serializer in our applications, which allowed us to cut down server costs by more than 10%.
The whole analysis has been made open source, which may help many organizations make better decisions about choosing the appropriate technology. You can find the code here.
Please feel free to contact us for any consultation needs; we help you make better technical architecture decisions, improve your product's reliability, and scale it well.