Going Beyond the Basics with ZooKeeper

Overview

ZooKeeper is an Apache project built for large-scale distributed systems. It provides open-source services for distributed configuration management, synchronization, and naming/registration. From an architectural perspective, it achieves high availability through redundant services while following a CP-oriented consistency model.

Its core purpose is to hide the difficult and failure-prone parts of distributed consistency behind a set of efficient, reliable primitives, then expose them through simple interfaces that are practical to use.

A quick refresher on ZooKeeper fundamentals

Data model

ZooKeeper organizes data as a hierarchical namespace, much like a filesystem. Data is stored in a key-value style structure. A key is expressed as a path made up of elements separated by /, such as /node. Every node in the namespace is identified by its full path.

A few rules are important here:

Each node name under the same parent must be unique. In other words, the full path identifies a node uniquely.
Every node stores both a value and a set of state attributes, and there may be multiple attributes attached to a node.

ZooKeeper supports several node types:

PERSISTENT: the default node type; a persistent node.
PERSISTENT_SEQUENTIAL: a persistent sequential node; ZooKeeper appends a monotonically increasing suffix to the path. This is commonly used for distributed locks and leader election. Create with -s.
EPHEMERAL: a temporary node tied to a session. It is removed automatically when the session ends. Ephemeral nodes cannot have children. Create with -e.
EPHEMERAL_SEQUENTIAL: a temporary sequential node. It also gets a suffix and is removed when the session disconnects. It cannot have children. Create with -e -s.
CONTAINER: a container node. Once all of its children are deleted, the container node is removed as well. Create with -c.
PERSISTENT_WITH_TTL: a persistent node with a TTL. If it is not modified within the TTL and has no children, it expires and is deleted. Create with -t.

Common operations

The basic node commands include:

ls: list the children under a path. Optional flags: -s returns status information, -w watches for node changes, -R recursively lists the subtree.
create: create a node and assign data to it. Optional flags correspond to node types. Keep in mind that ephemeral nodes cannot create child nodes.
set: update the data stored in a node.
get: retrieve a node's data and state information. Optional flags: -s returns status information, -w returns data and sets a watch on the node.
stat: inspect node status information; -w is also supported.
delete / deleteall: remove a node. If a node is not empty, delete cannot remove it.

One detail that often matters in practice: a watcher set with -w is one-time only. Once the watched node changes, ZooKeeper sends the event and the watch is cleared.

Distributed locks

The straightforward approach

A simple way to implement a distributed lock with ZooKeeper is:

Create an ephemeral node; whichever client creates it successfully gets the lock.
Other clients watch that node for deletion.
Once the node is deleted, all waiting clients are notified and retry the same process.

This works, but it has an obvious drawback: the herd effect.

When the lock node is released, every client watching it is notified at the same time. Those clients then all attempt to acquire the lock again, creating a burst of requests and unnecessary pressure on ZooKeeper.

A better locking pattern

A more robust design avoids that problem:

Every client creates an ephemeral sequential node and writes its basic information there.
Each client reads the node list and checks whether its own node has the smallest sequence number. If it does, it owns the lock.
If it does not, it watches only the node immediately before it.
When the lock holder releases the lock, or the client holding the lock crashes, that node disappears. The next client in line is notified and repeats the check.

This reduces unnecessary notifications because only one waiting client is awakened each time the lock is released.

That pattern can be refined further by separating sequential nodes into read-lock nodes and write-lock nodes:

A read-lock node only needs to watch the previous write-lock node.
A write-lock node only needs to watch the previous node, regardless of whether that previous node is a read lock or a write lock.

With this design, ZooKeeper can support both shared locks and exclusive locks. In Java projects, these capabilities are commonly implemented through Curator.

import org.apache.curator.RetryPolicy;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessLock;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.utils.CloseableUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class DistributedLockDemo {

    // ZooKeeper 锁节点路径, 分布式锁的相关操作都是在这个节点上进行
    private final String lockPath = "/distributed-lock";

    // ZooKeeper 服务地址, 单机格式为:(127.0.0.1:2181),
    // 集群格式为:(127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183)
    private String connectString;

    // Curator 客户端重试策略
    private RetryPolicy retry;

    // Curator 客户端对象
    private CuratorFramework client;

    // client2 用户模拟其他客户端
    private CuratorFramework client2;

    // 初始化资源
    @Before
    public void init() throws Exception {
        // 设置 ZooKeeper 服务地址为本机的 2181 端口
        connectString = "192.168.200.168:2181";
        // 重试策略
        // 初始休眠时间为 1000ms, 最大重试次数为 3
        retry = new ExponentialBackoffRetry(1000, 3);
        // 创建一个客户端, 60000(ms)为 session 超时时间, 15000(ms)为链接超时时间
        client = CuratorFrameworkFactory.newClient(connectString, 60000, 15000, retry);
        client2 = CuratorFrameworkFactory.newClient(connectString, 60000, 15000, retry);
        // 创建会话
        client.start();
        client2.start();
    }

    // 释放资源
    @After
    public void close() {
        CloseableUtils.closeQuietly(client);
    }

    @Test
    public void sharedLock() throws Exception {
        // 创建共享锁
        final InterProcessLock lock = new InterProcessSemaphoreMutex(client, lockPath);
        // lock2 用于模拟其他客户端
        final InterProcessLock lock2 = new InterProcessSemaphoreMutex(client2, lockPath);

        new Thread(new Runnable() {

            @Override
            public void run() {
                // 获取锁对象
                try {
                    lock.acquire();
                    System.out.println("======== client1 get lock ========");
                    // 测试锁重入
                    Thread.sleep(5 * 1000);
                    lock.release();
                    System.out.println("======== client1 release lock ========");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();

        new Thread(new Runnable() {
            @Override
            public void run() {
                // 获取锁对象
                try {
                    lock2.acquire();
                    System.out.println("======== client2 get lock ========");
                    Thread.sleep(5 * 1000);
                    lock2.release();
                    System.out.println("======== client2 release lock ========");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();

        Thread.sleep(20 * 1000);
    }
}

Curator provides several lock abstractions for this:

InterProcessMutex: a distributed reentrant exclusive lock. Reentrancy can be tracked with a local counter such as one stored in LocalMap.
InterProcessSemaphoreMutex: a distributed exclusive lock.
InterProcessMultiLock: a container that manages multiple locks as a single unit.
InterProcessReadWriteLock: a distributed read-write lock.

ZooKeeper in cluster deployments

Why ZooKeeper clusters usually use an odd number of nodes

ZooKeeper clusters are typically deployed with an odd number of servers.

The first reason is fault tolerance. The cluster must be able to form a majority for voting:

In a 3-node cluster, at least 2 nodes must be healthy.
Since half of 3 is 1.5, a majority means at least 2 votes.
That means the cluster can tolerate 1 node failure and still continue operating.

The second reason is split-brain avoidance. If the network partitions and the cluster splits into smaller groups, only the side with a majority should be able to elect a leader:

In a 3-node cluster, the voting threshold is still above 1.5.
If one server becomes isolated from the other two, the 2-node side can still elect a leader because 2 is greater than 1.5.
The isolated single node cannot elect a leader on its own.

That is why 3 nodes is the minimum practical cluster size. With 4 nodes, a split can leave the cluster in a state where no valid leader can be elected.

Leader election and the ZAB protocol

ZooKeeper uses ZAB, a protocol derived from Paxos.

To understand that, it helps to recall the basic Paxos roles:

Proposer: initiates proposals.
Acceptor: receives proposals and may accept or reject them.
Learners: do not decide the proposal themselves; they passively learn the result, including nodes that join later.

Paxos follows a majority-based rule: proposals succeed only when accepted by more than half.

ZAB stands for ZooKeeper Atomic Broadcast. It extends the ideas behind Paxos and is designed to support atomic broadcast and crash recovery, while ensuring that the sequence of changes broadcast by the leader is processed in order.

Under ZAB, a server can be in one of four states:

LOOKING: the server is in election mode, either during startup or after the leader has failed.
FOLLOWING: the server is a follower; it synchronizes with the leader and participates in voting.
LEADING: the server is the leader.
OBSERVING: the server synchronizes with the leader but does not participate in voting.

Majority still governs the process here as well.

ZooKeeper's source code, in FastLeaderElection.java, shows the decision logic used when comparing candidates:

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug(
            "id: {}, proposed id: {}, zxid: 0x{}, proposed zxid: 0x{}",
            newId,
            curId,
            Long.toHexString(newZxid),
            Long.toHexString(curZxid));

        if (self.getQuorumVerifier().getWeight(newId) == 0) {
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
        /*
         * 对应上面代码的解释（两个节点之间使用比较的方法来决定选票给谁，三种比较规则）
         * 1- 比较 epoche(zxid高32bit):
         *     如果其他节点的epoche比自己的大，选举 epoch大的节点（理由：epoch 表示年代【投票次数越多，数据越新】，epoch越大表示数据越新）
         *     代码：(newEpoch > curEpoch)；
         * 2- 比较 zxid，:
         *     如果纪元相同，就比较两个节点的zxid的大小，选举 zxid大的节点（理由：zxid 表示节点所提交事务最大的id，zxid越大代表该节点的数据越完整）
         *     代码：(newEpoch == curEpoch) && (newZxid > curZxid)；
         * 3- 比较 serviceId:
         *     如果 epoch和zxid都相等，就比较服务的serverId，选举 serviceId大的节点（理由： serviceId 表示机器性能，他是在配置zookeeper集群时确定的，所以我们配置zookeeper集群的时候可以把服务性能更高的集群的serverId设置大些，让性能好的机器担任leader角色）
         *     代码 ：(newEpoch == curEpoch) && ((newZxid == curZxid) && (newId > curId))。
         */
        return ((newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid)
                            && (newId > curId)))));
    }

The comparison between two nodes follows three rules:

Compare epoch first, which corresponds to the high 32 bits of zxid. If another node has a larger epoch, it wins because its state is considered newer.
If the epoch values are equal, compare zxid. The node with the larger zxid is preferred because it has seen a more complete transaction history.
If both epoch and zxid are equal, compare the serverId. The larger one wins.

How ZooKeeper handles reads and writes in a cluster

Read requests

When a client sends a read request to ZooKeeper, either the Leader or a Follower can answer directly. Reads do not need to be routed through the leader.

Write requests sent to the Leader

When a client sends a write request to the Leader:

The Leader writes the data locally.
It then forwards the change to all Followers and waits for their responses.
Once the Leader receives successful acknowledgments from a majority of nodes, including itself, it returns success to the client.

Write requests sent to a Follower

When a client sends a write request to a Follower:

The Follower forwards the request to the Leader.
The Leader writes the data locally.
The Leader then sends the update to all Followers and waits for responses.
Once the Leader receives successful acknowledgments from a majority of nodes, including itself, it returns success to the forwarding Follower.
The Follower then returns success to the client.

This read/write model is a good illustration of ZooKeeper's design priorities: reads are lightweight and can be served broadly, while writes are coordinated through the leader and confirmed by a quorum to preserve consistency.