KRaft: The Kafka Raft

Introduction

Since KRaft introduction in 2021 with Kafka version 2.8.0, I have read many titles celebrating it. But I never really took the time to understand what it is and why they actually replaced Zookeeper. Maybe, as a data engineer, you wanted to raise this in your company with your manager, product owner, or wider engineering, but as if I could hear them saying:

If you have a well-running Kafka cluster with company data, why would you risk it to change it. We will need to invest time and learn nuances and do migration. This work adds no value to the company and is delaying development of more valuable feature XYZ. Let’s push it to the backlog and review next quarter.

Harold: No Rush, we will move to Kraft… eventually.

KRaft is production ready some time now, since Kafka 3.3. But recently I saw that in the Kafka version 4.0 released in 2025 was the support for Zookeeper completely dropped. KRaft has become the only metadata mode, I thought:

OK, maybe now is the time to understand what KRaft is. Let’s learn and write about it.

So I decided to dig into it and write this article to share my findings.

Why Kafka needs metadata storage

To run a Kafka cluster, it needs to be able to store metadata. This metadata is necessary for the proper functioning of the Kafka cluster. It needs to be able to store the partition state, whether they are online or offline. The state of the replicas, which one is a leader, which one is a follower. We usually have multiple topics in Kafka, and each topic can be created differently, with various configurations. Partitions are distributed across multiple brokers, so it would be great to have a central place to store all this metadata.

ZooKeeper

Since the beginning of Kafka, it used Apache ZooKeeper (configuration service) to store this metadata in a distributed coordination service. A Controller, which was running on one of the Kafka brokers, was communicating with Kafka brokers by receiving metadata updates from them and also propagating metadata updates to them. With ZooKeeper, broker liveness was implied by an ephemeral znode, and at the same time it decided which one is a leader node.

Kafka with 3 brokers and controllers communicating with zookeeper cluster

On the other side, the Controller was sending metadata further to ZooKeeper that stored the metadata in a replicated way. I’m saying that because ZooKeeper is not really a single node, but a cluster of nodes that are replicating the data between them in a Fault Tolerant way. So if one ZooKeeper node goes down, the other nodes can take over and the data is still available. Once the ZooKeeper stored the metadata, it has confirmed that back to the Controller, which then propagated the metadata to the Kafka Brokers. From the outside, the ZooKeeper was a single source of truth for the Kafka cluster metadata.

Let’s say that the Controller went down, then one of the Kafka Brokers would be elected by the remaining brokers as the new Controller. No two brokers can claim to be the Controller at the same time, because ZooKeeper would select only one of them.

So it all seems like a good idea that should work well. Well, yes, but after some time, the Kafka users started to realize some flows in this architecture.

Issues with ZooKeeper

For the start, it is less developer-friendly, if you are starting a new Kafka cluster locally, you also need to start a ZooKeeper cluster. This is not a big deal, but it is an extra step that you need to do. And if you are in a production, you also need to maintain the ZooKeeper cluster, which is an extra operational overhead.

Broker Failures

If the Kafka Broker goes down, it was probably leader of some partitions. The Controller now needs to do a partition leader election. Right after that, the Controller needs to update the metadata in ZooKeeper. Controller was sending updates one by one. Even when they later started to use batch API to propagate metadata updates, it was still slow. Much more so if there were multiple broker failures, the Controller would send multiple metadata updates to ZooKeeper, which could cause a backlog in ZooKeeper. This is O(n) complexity, where n is the number of partitions. The ZooKeeper internally needs to propagate each partition setting and come to some consensus between the ZooKeeper nodes. This could take some time, especially if there are many partitions. Once ZooKeeper confirms that the metadata is stored, the Controller can propagate the metadata to the Kafka Brokers. As you can see, during this time the Controller was overloaded and ZooKeeper was a bottleneck.

broker failure and controller updating metadata in zookeeper and propagating to brokers

Controller Failures

If the Controller goes down, one of the Kafka Brokers needs to be elected as the new Controller. ZooKeeper
confirms this election, and the new Controller needs to fetch the latest metadata from ZooKeeper. This can take even several minutes, depending on the size of the metadata. During this time, the Kafka cluster has no Controller to write offsets, etc. So the Kafka cluster is effectively down.

controller failure, and new controller is fetching metadata from zookeeper

Temporarily inconsistent metadata

I said that ZooKeeper was the source of Truth, but because Controller was always in the middle, effectively making it another source of truth, these shortly lived inconsistencies arose in metadata between Broker and Controller and the Controller and ZooKeeper.

But let’s imagine an even worse situation. Controller sends metadata update to Broker 1 and Broker 2, but before it could send to Broker 3, it crashes. Now Broker 1 and Broker 2 have a different view of the metadata than Broker 3. This could lead to inconsistencies in the Kafka cluster, which is not a good thing. When a new Controller is up and running, it fetches the latest metadata from ZooKeeper, which is different from an outdated state of Broker 3. But The new Controller doesn’t know about the outdated state of Broker 3. So it takes some time while the new Controller propagates the latest metadata to all Brokers, during which time some Brokers have outdated metadata. This could lead to inconsistencies in the Kafka cluster. And Good luck debugging this situation!

inconsistent metadata between brokers and controller and zookeeper

I believe these were the main reasons why managed Kafka services like Confluent and AWS MSK, are so popular and also why the remaining dev teams rather don’t touch Kafka with a ten-foot pole. To summarize, the Zookeeper architecture was complex, it couldn’t scale, wasn’t atomic, and caused controller instability.

KRaft - the new solution

The proposed solution is Raft in Kafka, therefore KRaft. Raft is yet another distributed consensus protocol between distributed logs. It builds a distributed fault-tolerant log in kafka, such that if any of the nodes holding that log goes down, we can tolerate that failure because the other nodes have the same log. In such a scenario we can switch to a new leader from the remaining nodes. I will not go into details about Raft, but you can read more about it. I will focus on how KRaft is implemented in Kafka.

The idea in KRaft is that we have metadata in a topic with a single partition. The controller quorum replicates this single metadata log via Raft. One controller is the Leader, the rest are Voters. Only Controllers vote and commit metadata records. Brokers don’t vote; they fetch metadata from the Leader and serve client data (topics/partitions). In production, you typically run at least 3 dedicated Controllers on nodes separate from Brokers.

Kafka with 1 Leader, 2 Voters and 3 observers

In case of Leader Controller receives a metadata update from any broker, it will append the metadata update to the metadata topic. Once the metadata update is replicated and confirmed by the majority of Voters, the Leader Controller will propagate the metadata update to all Voter Controllers and Brokers. This way we have a single source of truth for the metadata. There are far fewer metadata inconsistencies, than in the ZooKeeper architecture. Since we’re now based on a log, we can quantify the lag in terms of offsets, which means we can do useful things like fencing. Fenced Controllers that fall too behind or don’t heartbeat are illegible to become Leaders and can’t accept work until they catch up.

Handling issues

In case of Leader Controller goes down, one of the Follower Controllers will be elected as the new Leader Controller.
In case of Voter Controller goes down, the remaining Voters will still be able to form a majority and confirm metadata updates.
In case of Broker goes down, it doesn’t affect the metadata replication, since Brokers don’t participate in the quorum.
In case of Broker goes down, and later it comes back up, it will hold a majority of metadata on disk. Therefore, it can quickly catch up with the latest metadata from the Leader Controller. Compared to ZooKeeper architecture, where the broker had to fetch the all metadata from ZooKeeper, which could take several minutes.
Diverging views are not possible anymore, since there is a single topic (source of truth) that everyone is observing.

When Raft finally reaches consensus.

Decisions made around KRaft

Metadata kept in one partition.
Brokers and Controllers keep the latest metadata buffered in memory. In addition, the Leader syncs data in the metadata log (called __cluster_metadata) always to disk. That sounds logical because we don’t want to lose a source of truth.
Using Raft quorum means we don’t need all Voter replicas to be in sync to confirm the metadata update. We need just a majority of Voters to confirm the metadata update.
Voters are periodically fetching metadata updates from the Leader. That is opposite to a Raft standard, where Leader is pushing metadata to Voters. Kafka went with a pull-based approach, making Voters less disruptive. For example, if a disruptive Voter decides to start a Leader election, Leader can remove it from a Voters list. Or if a Voter is slow, it can just lag behind and pull the updates when it can, rather than being pushed updates that it cannot handle.
Brokers are also pulling metadata updates from the Leader, but they don’t participate in quorum or Leader elections.
Voter pulls metadata in batches, not one by one, like in ZooKeeper architecture, therefore less load on the Leader Controller.
while you can see __cluster_metadata behave like a topic in Kafka, you cannot consume it like a regular topic. Actually, it’s an internal Raft log replicated by the controller quorum. Therefore, conceptually, it belongs to a Controller, not to Brokers. Brokers rather observe it to update their in-memory cluster metadata state.

Metadata compaction

The log is like a queue with incremental changes in /var/libs/kafka/metadata-logs/ folder with individual metadata records. Therefore, over time, the metadata will grow. To solve this, Kafka creates snapshots. The Leader Controller periodically takes a snapshot of the current metadata state and writes it to the metadata:

/var/libs/kafka/metadata-snapshots/<epoch>-<offset>.snapshot

The offset marks the end of the log that is included in the snapshot. Epoch is a monotonically increasing integer and increases each time there is a new Leader Controller elected. Once the snapshot is written, the Leader Controller marks that snapshot as the latest snapshot and then truncates the log up to that snapshot. Then continues to append new log entries after that. When a new Controller or Broker comes up, it can fetch the latest snapshot and then apply the log entries after that snapshot to get the latest metadata state. This way, the metadata size is managed.

metadata, snapshots and log entries

Migration to KRaft

First, there was an KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum that described the migration process from ZooKeeper to KRaft. KIP-500 was conceptual and architectural design.

This was followed by KIP that described the complete migration without downtime in detail in KIP-866: ZooKeeper to KRaft Migration and finally after that, in KIP-833: Mark KRaft as Production Ready that described KRaft as production ready. In simplified terms, the migration process is as follows:

1. Update Kafka Brokers

You need to update Kafka to a version that supports KRaft (3.5 or later), the so-called bridge releases.

2. Dual-write phase

Brokers run in dual-write mode, meaning they write metadata to both ZooKeeper and KRaft. This way, both systems are kept in sync.
During this phase, you can monitor the performance and stability of KRaft while still relying on ZooKeeper.
This is the last phase where you can roll back to ZooKeeper if needed. After this phase, the migration is non-reversible.

3. Controller transition phase

Once metadata logs are fully caught up, the cluster elects a KRaft Controller quorum.
This quorum takes over the Controller duties from ZooKeeper. At this point, the Kafka Brokers are still writing metadata to both systems, but the KRaft Controllers are now responsible for managing the metadata log

4. Zookeeper decommissioning phase

Zookeeper is no longer needed and can be decommissioned.
The Kafka Brokers restart in KRaft-only mode, relying solely on the KRaft Controllers for metadata management.

5. Kafka 4.0 and beyond

Since you are now running KRaft-only mode, you can upgrade to Kafka 4.0.

Rollback Plan

This approach is considered advanced for most users, and some teams may prefer to set up a new KRaft-based Kafka cluster and migrate their data over, rather than performing an in-place migration. The choice depends on your specific risk tolerance and operational capabilities.

Conclusion

KRaft makes deploying and operating Kafka markedly simpler by removing ZooKeeper and bringing the metadata control-plane fully in-house: brokers persist most metadata locally, so restarts and leader changes recover fast without fetching entire state from ZooKeeper; quorum voting avoids ISR bottlenecks because the fastest majority of voters confirms updates; and with one log as the single source of truth, you eliminate temporary metadata inconsistencies while every broker pulls updates as quickly as it can. For data engineers and platform teams, that means fewer moving parts, faster broker recovery, and a more resilient, self-contained architecture built on Kafka’s mature log rather than an external coordinator. If you’re starting a new cluster, KRaft should be the default—and if you’re running ZooKeeper today, it’s worth planning a migration as the community moves decisively toward KRaft.

What’s next? I’d love to hear from you: are you using ZooKeeper or KRaft, and what challenges or benefits have you observed? Have you already migrated a cluster to KRaft? I’d love to hear your migration experiences. Drop a comment or connect with me on LinkedIn.

KRaft: The Kafka Raft

KRaft: The Kafka Raft

Introduction

Why Kafka needs metadata storage

ZooKeeper

Issues with ZooKeeper

Broker Failures

Controller Failures

Temporarily inconsistent metadata

KRaft - the new solution

Handling issues

Decisions made around KRaft

Metadata compaction

Migration to KRaft

Conclusion

CATALOG

FEATURED TAGS