Curator's Tech Notes (Translated, with Partial Verification)


ZooKeeper watches are single threaded.


When your watcher is called, it should return as quickly as possible. All ZooKeeper watchers are serialized - processed by a single thread. Thus, no other watchers can be processed while your watcher is running. For example, a Curator user had a watcher handler something like this:


InterProcessMutex lock = ...

public void process(WatchedEvent event)
{
    lock.acquire();  // BAD: blocks the only watcher-processing thread
    ...
}

This cannot work. Curator’s InterProcessMutex relies on ZooKeeper watchers getting notified, but the code above holds the ZooKeeper watcher-processing thread while waiting for the lock, so the notification it is waiting for can never be delivered. The way to fix this is to run the code that needs the lock in a separate thread, e.g.:


InterProcessMutex lock = ...
ExecutorService service = ...

public void process(WatchedEvent event)
{
    service.submit(new Callable<Void>(){
        public Void call() throws Exception {
            lock.acquire();  // safe: runs on the executor's thread, not the watcher thread
            ...
            return null;
        }
    });
}
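
The pattern above does not depend on ZooKeeper at all, so it can be sketched with plain JDK classes. In this illustrative example (class and variable names are my own), a single-threaded executor stands in for ZooKeeper's watcher thread, and the handler hands its blocking work to a separate executor so the "watcher" thread stays free:

```java
import java.util.concurrent.*;

// Self-contained sketch: a single-threaded "event dispatcher" (standing in
// for ZooKeeper's watcher thread) must never be blocked, so the handler
// submits blocking work to its own executor and returns immediately.
public class OffloadDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService eventThread = Executors.newSingleThreadExecutor(); // the "watcher" thread
        ExecutorService service = Executors.newSingleThreadExecutor();     // worker executor
        CountDownLatch done = new CountDownLatch(1);

        eventThread.submit(() -> {
            // the handler returns at once; the slow work runs elsewhere
            service.submit(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                done.countDown();
            });
        });

        // the event thread is available again almost immediately;
        // this get() would time out if the handler had blocked for 100ms
        Future<?> next = eventThread.submit(() -> {});
        next.get(50, TimeUnit.MILLISECONDS);
        done.await();
        System.out.println("event thread stayed responsive");
        eventThread.shutdown();
        service.shutdown();
    }
}
```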

InterProcessMutex acquire() can be used to return immediately if the lock can’t be acquired.

It’s not obvious from the docs, but calling InterProcessMutex.acquire(0, unit) will return immediately (i.e. without any waiting) if the lock cannot be acquired.



InterProcessMutex lock = ...
boolean didLock = lock.acquire(0, TimeUnit.SECONDS); // any TimeUnit works, since the timeout is 0
if ( !didLock )
{
    // comes back immediately; the lock could not be acquired
}
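
The zero-timeout semantics mirror java.util.concurrent's tryLock. As an illustration (class name and timings are my own, using a plain ReentrantLock rather than Curator), a zero timeout means "try once and return immediately":

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of zero-timeout semantics with a plain JDK lock: when another
// thread holds the lock, tryLock(0, unit) returns false without waiting.
public class TryLockDemo {
    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = new Thread(() -> {
            lock.lock();
            try { Thread.sleep(200); }          // hold the lock for a while
            catch (InterruptedException ignored) {}
            finally { lock.unlock(); }
        });
        holder.start();
        Thread.sleep(50);                       // let the other thread grab the lock

        boolean didLock = lock.tryLock(0, TimeUnit.SECONDS);
        System.out.println(didLock);            // comes back immediately
        holder.join();
    }
}
```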

Dealing with session failure

ZooKeeper clients maintain a session with the server ensemble. Ephemeral nodes are tied to this session. When writing ZooKeeper-based applications you must deal with session expirations (due to network partitions, server crashes, etc.). This ZooKeeper FAQ discusses it: http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3

(Translator's verification: confirmed that ephemeral nodes are deleted when the connection and session are lost.)

For the most part, Curator shields you from the details of session management. However, Curator’s behavior can be modified. By default, Curator treats session failures the same way that it treats connection failures: i.e. the current retry policy is checked and, if permitted, operations are retried.


There are use-cases, though, where a series of operations must be tied to the ZooKeeper session. For example, an ephemeral node is created as a kind of marker and then several other ZooKeeper operations are performed. If the session fails at any point, the entire operation should fail. Curator’s default behavior doesn’t do this. When you need this behavior, use SessionFailRetryLoop.


This is similar to the standard retry loop but if a session fails, any future Curator methods (in the same thread) will also fail.
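
The "fail everything on this thread after a session failure" semantics can be sketched without Curator. This is not Curator's implementation; it is an illustrative model (all names are my own) where a per-thread flag, like Curator's internal thread registry, makes every later operation on that thread fail fast:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Conceptual sketch of SessionFailRetryLoop semantics: once a session
// failure is observed, every subsequent operation on the same thread fails.
public class SessionFailSketch {
    // per-thread "session failed" marker
    static final ThreadLocal<AtomicBoolean> sessionFailed =
        ThreadLocal.withInitial(() -> new AtomicBoolean(false));

    static String operation(boolean simulateSessionFail) {
        if (sessionFailed.get().get()) {
            return "SessionFailedException";   // fail fast: session already lost
        }
        if (simulateSessionFail) {
            sessionFailed.get().set(true);     // record the session failure
            return "SessionFailedException";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(operation(false));  // succeeds
        System.out.println(operation(true));   // session fails here
        System.out.println(operation(false));  // fails too: same thread, same session
    }
}
```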


ZooKeeper makes a very bad Queue source

The ZooKeeper recipes page lists Queues as a possible use-case for ZooKeeper. Curator includes several Queue recipes. In our experience, however, it is a bad idea to use ZooKeeper as a Queue:


  • ZooKeeper has a 1MB transport limitation. In practice this means that ZNodes must be relatively small. Typically, queues can contain many thousands of messages.


  • ZooKeeper can slow down considerably on startup if there are many large ZNodes. This will be common if you are using ZooKeeper for queues. You will need to significantly increase initLimit and syncLimit.


  • If a ZNode gets too big it can be extremely difficult to clean. getChildren() will fail on the node. At Netflix we had to create a special-purpose program that had a huge value for jute.maxbuffer in order to get the nodes and delete them.


  • ZooKeeper can start to perform badly if there are many nodes with thousands of children.


  • The ZooKeeper database is kept entirely in memory. So, you can never have more messages than can fit in memory.


Porting Netflix Curator code to Apache Curator

The APIs in Apache Curator are exactly the same as Netflix Curator. The only difference is the package names. Simply replace com.netflix.* with org.apache.*.


Friends don’t let friends write ZooKeeper recipes

Writing ZooKeeper code is on par with the difficulty of writing concurrent code. As we all know, Concurrency is Hard! For ZooKeeper in particular, there are numerous edge cases and undocumented behaviors that you must know in order to write correct recipes. In light of this, we strongly suggest you use one of the existing pre-built Curator recipes instead of writing raw ZooKeeper code yourself. At minimum, use a Curator recipe as a base for your work.


Curator Recipes Own Their ZNode/Paths

Do not use paths passed to Curator recipes. Curator recipes rely on owning those paths and the ZNodes in those paths. For example, do not add your own ZNodes to the path passed to LeaderSelector, etc.

For example:

selector = new LeaderSelector(client, "/leader", listener);
client.create().forPath("/leader/mynode"); // THIS IS NOT SUPPORTED!

Also, do not delete nodes that have been “given” to a Curator recipe.


Controlling Curator Logging

Curator logging can be customized. Use the following switches via the command line (-D) or via System.setProperty()

|Switch|Description|
|---|---|
|curator-dont-log-connection-problems=true|Normally, connection issues are logged as the warning “Connection attempt unsuccessful…” or the error “Connection timed out…”. This switch turns these messages off.|
|curator-log-events=true|All ZooKeeper events will be logged at DEBUG level.|
|curator-log-only-first-connection-issue-as-error-level=true|When this switch is enabled, the first connection issue is logged as ERROR. Additional connection issues are logged as DEBUG until the connection is restored.|
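
These switches are ordinary JVM system properties, so they can be set either on the command line or programmatically before Curator is initialized. A minimal sketch (the class name is my own; only the property names come from the table above):

```java
// Demonstrates that the logging switches are plain system properties:
// -Dcurator-log-events=true on the command line is equivalent to the
// System.setProperty call below, made before Curator starts up.
public class CuratorLogSwitches {
    public static void main(String[] args) {
        System.setProperty("curator-log-events", "true");

        // a property set to "true" reads back as a boolean true...
        System.out.println(Boolean.getBoolean("curator-log-events"));
        // ...while an unset switch defaults to false
        System.out.println(Boolean.getBoolean("curator-dont-log-connection-problems"));
    }
}
```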


PathChildrenCache now uses getData() instead of checkExists().

Curator 2.5.0 changes the internal behavior of PathChildrenCache. Now, regardless of whether “cacheData” is set to true, PathChildrenCache will always call getData() on the nodes. This is due to CURATOR-107. It has been shown that using checkExists() with watchers can cause a type of memory leak, as watchers are left dangling on non-existent ZNodes. Calling getData() works around this issue. However, it is possible that this change will affect performance. If you would like the old behavior of using checkExists(), you can set a system property: add -Dcurator-path-children-cache-use-exists=true to your command line or call System.setProperty("curator-path-children-cache-use-exists", "true").


JVM pauses can cause unexpected client state with improperly chosen session timeouts

Background discussion: http://qnalist.com/questions/6134306/locking-leader-election-and-dealing-with-session-loss

ZooKeeper/Curator recipes rely on a consistent view of the state of the ensemble. ZooKeeper clients maintain a session with the server they are connected to, sending periodic heartbeats to keep that session alive. If a heartbeat is missed, the client goes into the Disconnected state. When this happens, Curator transitions to SUSPENDED via the ConnectionStateListener. Any locks, etc. must be considered temporarily lost while the connection is SUSPENDED (see http://curator.apache.org/errors.html and the Error Handling section of each recipe’s documentation).


The implication of this is that great care must be taken to tune your JVM and choose an appropriate session timeout. Here’s an example of what can happen if this is not done:


  • A session timeout of 3 seconds is used
  • Client A creates a Curator InterProcessMutex and acquires the lock
  • Client B also creates a Curator InterProcessMutex for the same path and is blocked waiting for the lock to release
  • Client A’s JVM has a stop-the-world GC for 10 seconds
    • Client A’s session will have lapsed due to missed heartbeats
    • ZooKeeper will delete Client A’s EPHEMERAL node representing its InterProcessMutex lock
    • Client B’s watcher will fire and it will successfully gain the lock
  • After the GC, Client A will un-pause
  • For a short period of time, both Client A and Client B will believe that they hold the lock
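
The scenario above reduces to a simple inequality: heartbeats stop entirely during a stop-the-world pause, so the session survives only if the pause is shorter than the session timeout. A trivial sketch (class and method names are my own; the numbers are the 3-second timeout and 10-second pause from the example, plus Curator's stated 60-second default):

```java
// Sanity check for pause tolerance: a session lapses whenever a JVM pause
// exceeds the session timeout, because no heartbeats are sent meanwhile.
public class SessionSafetyCheck {
    static boolean sessionSurvives(long pauseMs, long sessionTimeoutMs) {
        return pauseMs < sessionTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(sessionSurvives(10_000, 3_000));  // the scenario above
        System.out.println(sessionSurvives(10_000, 60_000)); // Curator's default timeout
    }
}
```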

The remedy for this is to tune your JVM so that GC pauses (or other kinds of pauses) do not exceed your session timeout. JVM tuning is beyond the scope of this Tech Note. The default Curator session timeout is 60 seconds. Very low session timeouts should be considered risky.


Summary: there is always an edge case where VM pauses might exceed your client heartbeat and cause the client to misperceive its state for a short period once the VM un-pauses. In practice, a tuned VM that has been running within known bounds for a reasonable period will not exhibit this behavior. The session timeout must accommodate these known bounds in order to maintain consistent client state.


Curator internally wraps Watchers

When you set Watchers using Curator, your Watcher instance is not passed directly to ZooKeeper. Instead it is wrapped in a special-purpose Curator Watcher (the internal class, NamespaceWatcher). Normally, this is not an issue and is transparent to your client code. However, if you bypass Curator and set a Watcher directly with the ZooKeeper handle, ZooKeeper will not recognize it as the same Watcher set via Curator and that watcher will get called twice when it triggers.



Watcher myWatcher = ...
curator.getData().usingWatcher(myWatcher).forPath(path);  // set via Curator (wrapped internally)
curator.getZookeeperClient().getZooKeeper().getData(path, myWatcher, stat);  // set via the raw handle

// myWatcher will get called twice when the data for path is changed

// Verified by the translator: setting the same watcher twice in the same
// way is de-duplicated by ZooKeeper, so it fires only once:
curator.getZookeeperClient().getZooKeeper().getData(path, myWatcher, stat);
curator.getZookeeperClient().getZooKeeper().getData(path, myWatcher, stat);

In summary, ZooKeeper treats a watcher set through Curator and the same watcher set directly on the ZooKeeper handle as two different watchers; setting the same watcher repeatedly in the same way, however, is recognized by ZooKeeper and does not cause multiple triggers.

Curator connection semantics

The following events occur in the life cycle of a connection between Curator and ZooKeeper.

CONNECTED: This occurs when Curator initially connects to ZooKeeper. It will only ever be seen once per Curator instance.

SUSPENDED: This occurs as soon as Curator determines that it has lost its connection to ZooKeeper.

LOST: The meaning of a LOST event varies between Curator 2.x and Curator 3.x.
In all versions of Curator, a LOST event may be explicitly received from ZooKeeper if Curator attempts to use a session that has been timed out by ZooKeeper.
In Curator 2.x, a LOST event will occur when Curator gives up retrying an operation. The number of retries is determined by the specified retry policy. A LOST event of this type does not necessarily mean that the session on the server has been lost, but it must be assumed to be so.


In Curator 3.x, Curator attempts to simulate server-side session loss by starting a timer (set to the negotiated session timeout) upon receiving the SUSPENDED event. If the timer expires before Curator re-establishes a connection to ZooKeeper, Curator publishes a LOST event. If this LOST event is received, it can be assumed that the session has timed out on the server (though this is not guaranteed, as Curator has no connection to the server at this point to confirm it).
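
The timer mechanism can be sketched with a plain ScheduledExecutorService. This is not Curator's implementation, just an illustration of the idea (names and the 100 ms "session timeout" are my own): on SUSPENDED a timer is started, and if no reconnect cancels it before the session timeout elapses, LOST is published.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of Curator 3.x's simulated session loss: a timer armed on
// SUSPENDED publishes LOST unless a reconnect cancels it in time.
public class LostTimerSketch {
    public static void main(String[] args) throws Exception {
        long sessionTimeoutMs = 100;  // illustrative; Curator uses the negotiated timeout
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean lostPublished = new AtomicBoolean(false);

        // SUSPENDED received: arm the session-timeout timer
        ScheduledFuture<?> lostTask = timer.schedule(
            () -> lostPublished.set(true), sessionTimeoutMs, TimeUnit.MILLISECONDS);

        // simulate failing to reconnect within the session timeout
        Thread.sleep(300);

        // had the connection come back in time, the reconnect path would have
        // called lostTask.cancel(false) and LOST would never be published
        System.out.println(lostPublished.get());
        timer.shutdown();
    }
}
```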


RECONNECTED: This occurs once a connection has been re-established to ZooKeeper.


Guava usage in Curator


Since its creation at Netflix, Curator has used Google’s popular Guava library. Because projects that use Curator also use many different versions of Guava, there has always been the potential for conflicts. Recent versions of Guava removed some APIs that Curator uses internally, and Curator users were getting ClassNotFoundException, etc. CURATOR-200 addresses these issues by shading Guava into Curator.


Shaded But Not Gone

Unfortunately, a few of Curator’s public APIs use Guava classes (e.g. ListenerContainer’s use of Guava’s Function). Breaking public APIs would cause as much harm as solving the Guava problem. So, it was decided to shade all of Guava except for these three classes:


  • com.google.common.base.Function
  • com.google.common.base.Predicate
  • com.google.common.reflect.TypeToken

The implication of this is that Curator still has a hard dependency on Guava but only for these three classes. What this means for Curator users is that you can use whatever version of Guava your project needs without concern about ClassNotFoundException, NoSuchMethodException, etc.

  • All but three Guava classes are completely shaded into Curator
  • Curator still has a hard dependency on Guava but you should be able to use whatever version of Guava your project needs