ZeroMQ - Disconnects are Good for You

[+] jrockway|13 years ago|reply

Yeah, don't use REQ/REP.

I used to be a big ZeroMQ fan, but honestly, the library trades sanity for performance and has collapsed in on itself because the code is no longer maintainable. Last I checked, it's being completely rewritten in a different programming language. Maybe the result will be fine, but the core features of retrying connections and remembering dropped messages for clients that disappear temporarily are easy enough to write yourself.

(I do like Mongrel2's use of that feature to allow me to make an HTTP request and then start the backend server. And the right place for this is a library. It's just that ZeroMQ had too many features and too much code.)

[+] pedoh|13 years ago|reply

> Last I checked, it's being completely rewritten in a different programming language.

The rewrite you mention is, I believe, Crossroads I/O ( http://www.crossroads.io/ ). Martin Sustrik was one of the creators of ZeroMQ to begin with.

He's got a writeup as to why he should have used C to begin with ( http://www.250bpm.com/blog:4 ).

[+] seiji|13 years ago|reply

http://www.crossroads.io/ looks like the sane path to follow after the community implosion of zeromq.

[+] cageface|13 years ago|reply

> It's just that ZeroMQ had too many features and too much code.

Wasn't ZeroMQ supposed to be the simple, lightweight alternative?

I'm feeling better about my decision to ignore the whole *MQ circus.

[+] jberryman|13 years ago|reply

Do you have an alternative you use?

[+] tome|13 years ago|reply

An implementation of this kind of usage pattern is already provided in the ZMQ guide:

http://zguide.zeromq.org/page:all#Client-side-Reliability-La...

If the client has not received a response by the timeout, it should close the connection itself and open a new one. Whether this is a suitable solution for the blog author's issue I don't know, but it works well for RPC connections with little or no state.
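To make that concrete, here is a rough pyzmq sketch of the close-and-reopen idea. It is not the guide's code: the function name, the defaults, and the endpoint you pass in are placeholders of mine, and it assumes resending a request is harmless for your server.

```python
import zmq


def request_with_retry(ctx, endpoint, payload, timeout_ms=1000, retries=3):
    """Send one request; on timeout, throw the socket away and try again."""
    for _ in range(retries):
        sock = ctx.socket(zmq.REQ)
        # LINGER=0 so close() does not hang on a request nobody will read.
        sock.setsockopt(zmq.LINGER, 0)
        sock.connect(endpoint)
        try:
            sock.send(payload)
            # poll() returns 0 if nothing arrived within timeout_ms.
            if sock.poll(timeout_ms, zmq.POLLIN):
                return sock.recv()
        finally:
            # A REQ socket that missed its reply cannot send again,
            # so it is closed either way and a fresh one is opened.
            sock.close()
    raise TimeoutError("no reply from %s after %d attempts" % (endpoint, retries))
```

Opening a new socket per attempt is wasteful for high request rates, but it keeps the sketch short and sidesteps the REQ state machine entirely.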

[+] rarrrrrr|13 years ago|reply

Indeed. ZMQ has other capabilities beyond REQ/REP exactly for this situation, and helps you layer "patterns" on top of them.

I found working through all five chapters of the ZeroMQ guide unusually educational. It's full of the wisdom of people writing message oriented software for years, and includes frank discussions and solutions for several of these performance and reliability situations. (Don't miss the adventures of the suicidal snail in chapter 5!)

I found it worthwhile even to spend the time to work through all the examples in both C and Python.

In the author's situation, the normal loop of the client shouldn't be to just call blocking receive forever, as he discovered. Instead it should loop, polling the socket with some reasonable timeout, and between iterations do things like check for shutdown signals, parent process exiting, and the other typical housekeeping tasks. Then you only call receive when poll has told you there are messages waiting, and then you call it without blocking.

This sort of loop gives an obvious place to also integrate timeouts. You can also watch multiple sockets. Blocking receive forever is appropriate for a prototype sort of client but as things grow, generally more sophistication is needed.
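A minimal pyzmq sketch of the loop I mean, in case it helps. The names (`receive_loop`, `on_message`, `housekeeping`) are made up for illustration, and the shutdown check is reduced to a `threading.Event`.

```python
import threading

import zmq


def receive_loop(sock, stop: threading.Event, on_message,
                 housekeeping=None, poll_ms=100):
    """Poll with a timeout, receive without blocking, do chores in between."""
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while not stop.is_set():
        # Wait at most poll_ms; returns a list of (socket, event) pairs.
        events = dict(poller.poll(poll_ms))
        if sock in events:
            try:
                # Poll said a message is waiting, so this should not block.
                msg = sock.recv(zmq.NOBLOCK)
            except zmq.Again:
                continue
            on_message(msg)
        if housekeeping is not None:
            # Place for timeouts, parent-process checks, and so on.
            housekeeping()
```

To watch several sockets, register each one with the same poller and check which of them show up in `events`.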

[+] m0th87|13 years ago|reply

> This could probably be improved by having a background thread that uses a ZeroMQ socket for heartbeating.

Don't use heartbeats on REQ/REP, because they won't work well with the lockstep communication fashion of those socket types. Also, you have to be careful because ZeroMQ sockets are not thread-safe, so the background and active thread must coordinate through a lock, or work in an implementation that handles this implicitly for you.

In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked out pretty well in practice.

[+] tcwc|13 years ago|reply

Rather than polling, zeromq >= 2.2 allows you to set ZMQ_RCVTIMEO on the socket which seems to be what the author is after. It would be nice to be notified of disconnected peers, but the timeout + retry approach has been good enough for me.
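With pyzmq that looks roughly like this. The helper name is mine; `zmq.RCVTIMEO` and `zmq.Again` are the binding's names for the option and for the timeout error.

```python
import zmq


def recv_or_none(sock, timeout_ms):
    """Blocking recv that gives up after timeout_ms and returns None."""
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
    try:
        return sock.recv()
    except zmq.Again:
        # Nothing arrived within the timeout.
        return None
```

On a REQ socket a `None` here still leaves the socket waiting for its reply, so the retry has to close it and connect a new one before sending again.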

[+] chubot|13 years ago|reply

Great article. ZeroMQ had a "smell" that I couldn't put my finger on, and this article kind of nailed it. In retrospect I guess the smell is that it tightly couples both sides of the network to make performance claims. It sacrifices robustness for performance.

I guess that it was developed for financial trading applications. Maybe it will work fine for those -- you have a few machines and high network connectivity between them. But people started doing "data center" stuff with 0MQ. Then you have geographical separation, and WAN latency and reliability.

[+] hogu|13 years ago|reply

I think the problem is that people come to zeromq expecting a high-level library that handles all the details. zeromq does not do that; you need to handle reliability and disconnect behavior yourself. I agree that the default behavior in this case could be saner, but it's pretty easy to build reliable request-reply in many different ways, as illustrated in the guide, so I'm fine with it.

The benefit, though, is that in zeromq you get to (and are forced to) choose exactly how your messaging patterns are reliable (or not).

[+] willvarfar|13 years ago|reply

The better solution? That the 0mq libs do the right thing and don't get wedged. It shouldn't be on the users of the API to handle this.

EDITED: my point is general; it should be 0mq libs doing the timeouts and keepalives and so on and only pushing meaningful error handling like "the server has gone away and cannot reconnect" back up to the user.

[+] rumcajz|13 years ago|reply

The problem with that is that a 0MQ socket abstracts multiple underlying connections. Reporting errors would mean making the connections visible to the user. There would have to be connection IDs, an accept function, error notifications etc. In the end the whole thing would boil down to an ugly version of standard BSD TCP sockets.

The right thing to do is to re-send the request after a disconnection or after the timeout has expired. It can be done easily in the application; however, if you want it inside the library, feel free to submit a patch.

[+] sausagefeet|13 years ago|reply

I've been experimenting with being completely asynchronous (and working on being connection-less). The protocol layer just wraps up payloads and unpacks them. There is a background heartbeat, and when the heartbeat is not met there is a notification to that effect, but the user is in charge of deciding whether this should be considered a disconnection. This is mostly inspired by how Oz does distribution. I don't have any good results yet, though.

[+] noselasd|13 years ago|reply

Receiving and handling I/O errors is easy, the harder part is when something goes wrong on the peer and you don't receive an error.

When dealing with network code, you need 1) Timeouts, 2) Keepalives.

What kind of keepalives and timeouts depends entirely on your needs. The problem is that most libraries/protocols don't have either, and most example code never shows this. (Any TCP example that does a read() or write() without any form of timeout is a DoS waiting to happen.)
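For plain TCP, the bare minimum looks something like this with Python's standard `socket` module. The function names and the 5 second default are arbitrary choices of mine, so treat it as a sketch rather than a recipe.

```python
import socket


def connect_with_timeouts(host, port, timeout_s=5.0):
    """Connect with a deadline and keep that deadline for later reads/writes."""
    sock = socket.create_connection((host, port), timeout=timeout_s)
    # Ask the kernel to probe idle connections so a dead peer is noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


def read_with_timeout(sock, nbytes):
    """recv() that returns None instead of hanging when the peer goes quiet."""
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None
```

Keep in mind that kernel keepalives are slow by default (on Linux the first probe is sent after two hours of idle time), so an application-level heartbeat is often still needed on top.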

[+] StavrosK|13 years ago|reply

There's a problem when you restart servers at the wrong moment, though, as the article mentions...

[+] lucian1900|13 years ago|reply

It seems to me that the better solution might be just using Twisted and regular networking techniques.

[+] StavrosK|13 years ago|reply

So whenever there's a small problem with something, the solution is to discard the whole thing and go down a layer?

I don't like some of Python's warts, but you don't see me writing assembly.

[+] o1iver|13 years ago|reply

Sure, the REQ/REP sockets are limited, especially because they force the Request/Reply/Request/Reply/... series. I don't think any complex applications use this. I recently built an application using DEALER/ROUTER sockets, where you can send multiple requests without having to wait for responses, etc. Additionally, no application should rely on receiving a response; the poller he suggests solves this problem nicely (although I don't think it is necessary to wrap it into send/recv methods, as pyzmq offers a nice polling API).
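Roughly what I mean, as a pyzmq sketch. The request-id frame is just a convention I picked for the example, and both function names are hypothetical; the ROUTER side is a toy that upper-cases the body.

```python
import zmq


def pipeline_requests(dealer, payloads, timeout_ms=1000):
    """Send every request up front, then collect replies as they arrive."""
    for i, body in enumerate(payloads):
        # Frame 0 is our own request id so replies can arrive in any order.
        dealer.send_multipart([str(i).encode(), body])
    replies = {}
    while len(replies) < len(payloads):
        if not dealer.poll(timeout_ms, zmq.POLLIN):
            break  # give up waiting; the caller sees which ids are missing
        req_id, body = dealer.recv_multipart()
        replies[int(req_id)] = body
    return replies


def reply_to_all(router, count):
    """Toy server: read `count` requests, answer them in reverse order."""
    pending = [router.recv_multipart() for _ in range(count)]
    for identity, req_id, body in reversed(pending):
        # ROUTER prepends the sender's identity; echo it back to route the reply.
        router.send_multipart([identity, req_id, body.upper()])
```

A reply that never shows up just leaves a gap in the returned dict, so the client decides per request whether to resend or give up.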

[+] stonemetal|13 years ago|reply

> Carries messages across inproc, IPC, TCP, and multicast.

So when you are using actual sockets across the network, it uses TCP. ZMQ should therefore be able to detect disconnects rather easily.

[+] unknown|13 years ago|reply

[deleted]

[+] boothead|13 years ago|reply

The solution I use is the one I mentioned to Armin on twitter: https://gist.github.com/2994781

It's not really ideal that a synchronous connection doesn't have notification of a connection failure, but this has been working fine for us for ages.

[+] rumcajz|13 years ago|reply

Armin is right that the timeout works, but it delays the signal about connection failure from TCP. If anyone feels like implementing automatic resend inside 0MQ/XS (in case of timeout or TCP connection failure), give it a go and submit a patch. If no one submits a patch, I'll fix the problem once I have more free time.

[+] unknown|13 years ago|reply

[deleted]

[+] kephra|13 years ago|reply

This reminds me badly of my MQSeries experience.

I wonder - is there any MQ that does not suck ?

[+] freyrs3|13 years ago|reply

ZeroMQ isn't a MQ ( Message Queue ). It's a message passing library. You can use it to build message queues though.

[+] the_mitsuhiko|13 years ago|reply

> I wonder - is there any MQ that does not suck ?

ZeroMQ is not an MQ but it does not suck. That particular behavior is just confusing and should probably be pointed out in the docs, even if it's supposed to be obvious. Also it would be nice if you could poll for transport-level disconnect events.

[+] orenmazor|13 years ago|reply

I love RabbitMQ with the passion of a thousand suns right now.

(I also use 0mq but only for a disposable internal queue)

[+] shasty|13 years ago|reply

The TCP stack takes care of this problem; this is an insane attempt at POST-mature optimization.
