TL;DR A stream is an unbounded sequence of events with producers and consumers. It is a rich source of data for diverse applications such as asynchronous event processing and analytics. Kafka is a prominent stream-processing service.
This post talks about a common component in a company's infrastructure: stream processing.
A stream is an unbounded list of events with producers and consumers: producers append new events to the stream, and consumers read events from it. In most cases, the stream is append-only and the events are immutable.
Each consumer or consumer group is associated with an offset, which moves forward as the consumer acknowledges events in the stream.
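To make the model concrete, here is a minimal sketch of the concepts (not Kafka code): an append-only log of immutable events plus one offset per consumer group. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy in-memory "stream": an append-only list of immutable events
// plus one offset per consumer group. Illustrative only.
class ToyStream<E> {
    private final List<E> log = new ArrayList<>();                // append-only event log
    private final Map<String, Integer> offsets = new HashMap<>(); // consumer group -> next offset to read

    // Producer side: append a new event and return its offset.
    synchronized int append(E event) {
        log.add(event);
        return log.size() - 1;
    }

    // Consumer side: read the next event for a group, or null if caught up.
    synchronized E poll(String group) {
        int offset = offsets.getOrDefault(group, 0);
        return offset < log.size() ? log.get(offset) : null;
    }

    // The consumer acknowledges the event, moving its offset forward by one.
    synchronized void commit(String group) {
        offsets.merge(group, 1, Integer::sum);
    }
}
```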
Stream processing systems accept new events from producers and persist them onto disk.
In several cases, the events of a stream can be deleted: for example, when they age out of a configured retention period, when the log exceeds a size limit, or when the log is compacted so that only the latest event per key is kept.
Partitioning and replication mechanisms are similar to those of a database. To achieve scalability, a stream system can partition the key space with a hash-based or range-based scheme and divide the work among partitions. Each partition is stored on more than one node, and there is usually a leader that accepts new events and replicates them to followers before acknowledging the clients.
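As a rough sketch of the hash-based scheme, a producer (or broker) can pick a partition from the event key as below. The helper is hypothetical, but it mirrors the idea of hashing the key space across a fixed number of partitions (Kafka's default partitioner is similar in spirit, though it uses a different hash function).

```java
// Hypothetical helper: map an event key to one of numPartitions partitions
// using a hash-based scheme.
static int partitionFor(byte[] key, int numPartitions) {
    int hash = java.util.Arrays.hashCode(key);
    // Math.floorMod avoids negative results for negative hash codes.
    return Math.floorMod(hash, numPartitions);
}
```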
A stream processing service may guarantee that an event is processed at most once, at least once, or exactly once. The semantics are mainly determined by how you configure producers and consumers and how you handle message acknowledgment and retries.
Messages may be lost but are never redelivered or duplicated. This is achieved by configuring the producer not to retry failed sends (retries=0) and committing the offset immediately after fetching the messages in the consumer. In this setup, if processing fails or if there are issues while sending the message, the message won't be retried and hence may be lost.
In short, do not retry.
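A minimal sketch of the at-most-once consumer with the Java Kafka client, assuming a topic named "events" and a locally reachable broker; the processRecord helper is hypothetical. On the producer side, you would additionally set retries=0 so failed sends are not retried.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtMostOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "at-most-once-demo");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // Commit the offsets BEFORE processing: if processing crashes,
                // these records are skipped on restart (at most once).
                consumer.commitSync();
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record); // hypothetical business logic
                }
            }
        }
    }

    static void processRecord(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```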
Messages are never lost but may be redelivered and processed multiple times, leading to duplicates. This is achieved by configuring the producer to retry failed sends (retries > 0) and committing the offset only after the message has been processed in the consumer. If the processing fails, the consumer will read the message again on restart, leading to duplicates. You can handle duplicates in the downstream systems or, if possible, make the processing idempotent.
In short, retry until success.
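Only the commit point changes relative to the at-most-once sketch above. A minimal fragment, reusing the same consumer setup and the hypothetical processRecord helper, with retries enabled on the producer side:

```java
// Producer side: enable retries (and strong acks) so failed sends are retried.
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("acks", "all");
producerProps.put("retries", "2147483647");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Consumer side: process first, commit afterwards. If processing fails before
// the commit, the same records are re-read on restart, so duplicates are possible.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // hypothetical processing, ideally idempotent
    }
    consumer.commitSync(); // commit only after all records above were processed
}
```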
Here, each message is delivered and processed exactly once, with no losses or duplicates. This is the most complex to achieve.
Let's first define what it entails.
First, you need to configure the producer for idempotence (enable.idempotence=true).
If a producer has multiple events to produce at the same time (potentially for different topics), the events need to be written atomically. This can be done with a two-phase commit and requires a transactional API.
On the consuming side, you need to commit the consumed offsets in the same transaction as the processing output, so that if a failure occurs during processing, the read offsets are not committed and the same messages can be read again.
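A minimal sketch of this consume-process-produce loop with Kafka's transactional API. The topic names ("input-events", "output-events"), the transactional.id, and the transform helper are placeholders for illustration.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("enable.idempotence", "true");           // idempotent producer
        producerProps.put("transactional.id", "demo-processor-1"); // enables the transactional API
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "exactly-once-demo");
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("isolation.level", "read_committed");    // only see committed transactions
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            producer.initTransactions();
            consumer.subscribe(List.of("input-events"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // Produce the processed result and track the offset to commit.
                        producer.send(new ProducerRecord<>("output-events", record.key(), transform(record.value())));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Commit the consumed offsets and the produced messages atomically.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // offsets not committed; records will be re-read
                }
            }
        }
    }

    static String transform(String value) {
        return value.toUpperCase(); // placeholder processing
    }
}
```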
We are all familiar with databases, and a database also needs to handle an unbounded number of events/requests. How is a stream different from a database?
If we think of each event as a write, then a stream contains more information than a database: a database is only concerned with the most up-to-date value for each key, while a stream stores the entire history of writes. Therefore, a stream can be used to represent a database. Conceptually, a database is a compacted stream where only the latest entry for each key is kept.
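A small sketch of that idea: replaying a stream of key/value write events and keeping only the latest value per key yields a table, which is essentially what log compaction produces. The Event record and the sample keys are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactedView {
    // A hypothetical write event: a key and its new value (null = delete).
    record Event(String key, String value) {}

    // Replaying the full stream and keeping only the latest entry per key
    // turns the stream into a database-like table.
    static Map<String, String> compact(List<Event> stream) {
        Map<String, String> table = new HashMap<>();
        for (Event e : stream) {
            if (e.value() == null) {
                table.remove(e.key());         // tombstone: the key was deleted
            } else {
                table.put(e.key(), e.value()); // later writes overwrite earlier ones
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
            new Event("user:1", "alice"),
            new Event("user:2", "bob"),
            new Event("user:1", "alice2"),  // overwrites the first write
            new Event("user:2", null));     // deletes user:2
        System.out.println(compact(stream)); // {user:1=alice2}
    }
}
```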
The stream of replication events of a database is called Change Data Capture (CDC). The followers that consume the replication events do not blindly store the stream of events on disk. Instead, they apply the database updates from the stream onto their local copy of the database, after which the events from the stream can be discarded. Conceptually, the replication stream is a stream of state changes while each database node maintains a state machine; this view is particularly apt for distributed databases.
How is a stream processing service different from a PUB/SUB service? PUB/SUB mostly focuses on the routing/distribution of data, while a stream processing service can involve more processing. Kafka can be used as a PUB/SUB service, but it can also be used for more complex processing.
Background: a Kafka broker spends much of its time moving data from disk to network sockets, so the number of times the data is copied along that path matters for performance.
Traditional two-copy mechanism: the application calls read() to copy data from the kernel's page cache into a user-space buffer, then calls write() to copy it from the user-space buffer into the kernel's socket buffer before it is sent to the NIC.
One-copy mechanism: the sendfile() system call in Linux allows data to be transferred directly from the file system cache to the socket buffer. In Java, this is exposed as the transferTo() method of java.nio.channels.FileChannel. This is the main optimization that Kafka uses to reduce unnecessary copies of data.
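For illustration, a minimal sketch of the one-copy path in Java, sending a file over a socket with FileChannel.transferTo(); the file path and destination address are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendFileDemo {
    public static void main(String[] args) throws IOException {
        // Placeholder file and destination; replace with real values.
        Path path = Path.of("/tmp/segment.log");
        try (FileChannel file = FileChannel.open(path, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo() hands the copy to the kernel (sendfile on Linux),
            // avoiding the round trip through a user-space buffer.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```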
“Zero-copy" mechanism:
In theory, this would be the most efficient, but it isn't typically feasible for standard TCP/IP networking applications like Kafka. Directly copying data from the kernel page cache to the NIC's memory buffer means bypassing the kernel's socket buffer and the TCP/IP stack.