Indeed. ZMQ has other capabilities beyond REQ/REP exactly for this situation, and helps you layer "patterns" on top of them.
> This could probably be improved by having a background thread that uses a ZeroMQ socket for heartbeating.
Don't use heartbeats on REQ/REP, because they won't work well with the lockstep communication fashion of those socket types. Also, you have to be careful because ZeroMQ sockets are not thread-safe, so the background and active thread must coordinate through a lock, or work in an implementation that handles this implicitly for you.
In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked out pretty well in practice.
Rather than polling, zeromq >= 2.2 allows you to set ZMQ_RCVTIMEO on the socket which seems to be what the author is after. It would be nice to be notified of disconnected peers, but the timeout + retry approach has been good enough for me.
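A minimal sketch of the receive-timeout approach with pyzmq, assuming a libzmq new enough to have the option. The helper name `recv_or_none` is made up for illustration; `zmq.RCVTIMEO` and `zmq.Again` are the pyzmq spellings of the option and of the EAGAIN error:

```python
import zmq


def recv_or_none(sock, timeout_ms):
    """Wait up to timeout_ms for one message; return None if nothing arrives.

    Hypothetical helper for illustration, not part of pyzmq.
    """
    # With RCVTIMEO set, recv() raises zmq.Again instead of blocking forever.
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
    try:
        return sock.recv()
    except zmq.Again:
        return None
```

One caveat: a REQ socket that timed out this way is still waiting for its reply and will refuse another send, so the retry usually means closing the socket and opening a fresh one.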
Great article. ZeroMQ had a "smell" that I couldn't put my finger on, and this article kind of nailed it. In retrospect I guess the smell is that it tightly couples both sides of the network in order to make performance claims. It sacrifices robustness for performance.
I think the problem is that people come to zeromq expecting a high-level library that handles all the details. zeromq does not do that; you need to handle reliability and disconnect behavior yourself. I agree that the default behavior in this case could be saner, but it's pretty easy to build reliable request-reply in many different ways, as illustrated in the guide, so I'm fine with it.
The better solution? That the 0mq libs do the right thing and don't get wedged. It shouldn't be on the users of the API to handle this.
The problem with that is that a 0MQ socket abstracts multiple underlying connections. Reporting errors would mean making the connections visible to the user. There would have to be connection IDs, an accept function, error notifications, etc. In the end the whole thing would boil down to an ugly version of standard BSD TCP sockets.
I've been experimenting with being completely asynchronous (and working on being connection-less). The protocol layer just wraps up payloads and unpacks them. There is a background heartbeat, and when a heartbeat is missed the user gets a notification, but it is up to the user to decide whether this should be considered a disconnection. This is mostly inspired by how Oz does distribution. I don't have any good results yet, though.
Receiving and handling I/O errors is easy; the harder part is when something goes wrong on the peer and you don't receive an error.
Sure, the REQ/REP sockets are limited, especially because they force the Request/Reply/Request/Reply/... series. I don't think any complex applications use this. I recently built an application using DEALER/ROUTER sockets, where you can send multiple requests without having to wait for responses, etc. Additionally, no application should rely on receiving a response; the poller he suggests solves this problem nicely (although I don't think it is necessary to wrap it into send/recv methods, as pyzmq offers a nice polling API).
Armin is right that the timeout works, but it delays the signal about connection failure from TCP. If anyone feels like implementing automatic resend inside 0MQ/XS (in case of timeout or TCP connection failure), give it a go and submit a patch. If no one submits a patch, I'll fix the problem once I have more free time.
[+] [-] jrockway|13 years ago|reply
I used to be a big ZeroMQ fan, but honestly, the library trades sanity for performance and has collapsed in on itself because the code is no longer maintainable. Last I checked, it's being completely rewritten in a different programming language. Maybe the result will be fine, but the core features of retrying connections and remembering dropped messages for clients that disappear temporarily are easy enough to write yourself.
(I do like Mongrel2's use of that feature to allow me to make an HTTP request and then start the backend server. And the right place for this is a library. It's just that ZeroMQ had too many features and too much code.)
[+] [-] pedoh|13 years ago|reply
The rewrite you mention is, I believe, Crossroads I/O ( http://www.crossroads.io/ ). Martin Sustrik was one of the creators of ZeroMQ to begin with.
He's got a writeup as to why he should have used C to begin with ( http://www.250bpm.com/blog:4 ).
[+] [-] seiji|13 years ago|reply
[+] [-] cageface|13 years ago|reply
Wasn't ZeroMQ supposed to be the simple, lightweight alternative?
I'm feeling better about my decision to ignore the whole *MQ circus.
[+] [-] jberryman|13 years ago|reply
[+] [-] tome|13 years ago|reply
http://zguide.zeromq.org/page:all#Client-side-Reliability-La...
If the client has not received a response by the timeout, it should close the connection itself and open a new one. Whether this is a suitable solution for the blog author's issue I don't know, but it works well for RPC connections with little or no state.
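For what it's worth, here is a rough sketch of that close-and-reopen pattern in pyzmq. The function name, defaults, and return convention are assumptions for illustration and are not taken from the guide:

```python
import zmq


def request_with_retry(ctx, endpoint, payload, timeout_ms=1000, retries=3):
    """Send payload on a fresh REQ socket; on timeout, discard it and retry.

    Returns the reply bytes, or None if every attempt timed out.
    Hypothetical helper for illustration only.
    """
    for _ in range(retries):
        sock = ctx.socket(zmq.REQ)
        # LINGER=0 so that close() drops the unanswered request immediately.
        sock.setsockopt(zmq.LINGER, 0)
        sock.connect(endpoint)
        try:
            sock.send(payload)
            # poll() returns a non-zero event mask once a reply is readable.
            if sock.poll(timeout_ms, zmq.POLLIN):
                return sock.recv()
        finally:
            # A timed-out REQ socket cannot send again, so always replace it.
            sock.close()
    return None
```

Something like `request_with_retry(zmq.Context.instance(), "tcp://127.0.0.1:5555", b"ping")` would then either return the reply or give up after three attempts (the endpoint here is a placeholder). Note that the server may see the same request more than once, which is why this fits RPC calls with little or no state.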
[+] [-] rarrrrrr|13 years ago|reply
I found working through all five chapters of the ZeroMQ guide unusually educational. It's full of the wisdom of people writing message oriented software for years, and includes frank discussions and solutions for several of these performance and reliability situations. (Don't miss the adventures of the suicidal snail in chapter 5!)
I found it worthwhile even to spend the time to work through all the examples in both C and Python.
In the author's situation, the normal loop of the client shouldn't be to just call blocking receive forever, as he discovered. Instead it should loop, polling the socket with some reasonable timeout, and between iterations do things like check for shutdown signals, parent process exiting, and the other typical housekeeping tasks. Then you only call receive when poll has told you there are messages waiting, and then you call it without blocking.
This sort of loop gives an obvious place to also integrate timeouts. You can also watch multiple sockets. Blocking receive forever is appropriate for a prototype sort of client but as things grow, generally more sophistication is needed.
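A minimal sketch of such a loop in pyzmq, assuming the caller supplies two callbacks: `on_message` for each received message and `should_stop` for the housekeeping check (both names are made up for illustration):

```python
import zmq


def receive_loop(sock, on_message, should_stop, poll_ms=250):
    """Poll with a timeout and only receive when a message is waiting.

    should_stop() runs once per iteration; it is the place for shutdown
    flags, parent-process checks, and similar housekeeping.
    """
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while not should_stop():
        # Wait at most poll_ms; an empty result means the poll timed out.
        ready = dict(poller.poll(poll_ms))
        if ready.get(sock, 0) & zmq.POLLIN:
            try:
                # Non-blocking receive: poll already reported a message.
                on_message(sock.recv(zmq.NOBLOCK))
            except zmq.Again:
                pass  # nothing to read after all; go around again
```

Registering more sockets on the same poller is how the loop would grow to watch several of them, and counting consecutive empty polls is one way to layer a reply timeout on top.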
[+] [-] m0th87|13 years ago|reply
Don't use heartbeats on REQ/REP, because they won't work well with the lockstep communication fashion of those socket types. Also, you have to be careful because ZeroMQ sockets are not thread-safe, so the background and active thread must coordinate through a lock, or work in an implementation that handles this implicitly for you.
In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked out pretty well in practice.
[+] [-] tcwc|13 years ago|reply
[+] [-] chubot|13 years ago|reply
I guess that it was developed for financial trading applications. Maybe it will work fine for those -- you have a few machines and high network connectivity between them. But people started doing "data center" stuff with 0MQ. Then you have geographical separation, and WAN latency and reliability.
[+] [-] hogu|13 years ago|reply
The benefit though, is that in zeromq you get to (and are forced to) choose exactly how your messaging patterns are reliable (or not)
[+] [-] willvarfar|13 years ago|reply
EDITED: my point is general; it should be 0mq libs doing the timeouts and keepalives and so on and only pushing meaningful error handling like "the server has gone away and cannot reconnect" back up to the user.
[+] [-] rumcajz|13 years ago|reply
The right thing to do is to re-send the request after a disconnection or after the timeout has expired. It can be done easily in the application; however, if you want it inside the library, feel free to submit a patch.
[+] [-] sausagefeet|13 years ago|reply
[+] [-] noselasd|13 years ago|reply
When dealing with network code, you need 1) Timeouts, 2) Keepalives.
What kind of keepalives and timeouts depends entirely on your needs. The problem is that most libraries/protocols don't have either, and most example code never shows this. (Any TCP example that does a read() or write() without any form of timeout is a DoS waiting to happen.)
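As a rough illustration with plain TCP sockets from Python's standard library (the helper names and default values are assumptions, pick whatever suits your protocol):

```python
import socket


def open_guarded_connection(host, port, timeout_s=5.0):
    """Connect with a timeout and switch on TCP keepalive probes."""
    # The timeout applies to connect() and to every later send()/recv().
    sock = socket.create_connection((host, port), timeout=timeout_s)
    # OS-level keepalive; the probe intervals are system settings and are
    # often hours by default, so this complements rather than replaces
    # an application-level heartbeat.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


def read_reply(sock, max_bytes=4096):
    """Return received bytes, or None if the peer stayed silent too long."""
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        return None
```

The point of the `None` return is that a silent peer becomes an ordinary condition the caller has to handle, instead of a read that hangs forever.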
[+] [-] StavrosK|13 years ago|reply
[+] [-] lucian1900|13 years ago|reply
[+] [-] StavrosK|13 years ago|reply
I don't like some of Python's warts, but you don't see me writing assembly.
[+] [-] o1iver|13 years ago|reply
[+] [-] stonemetal|13 years ago|reply
so when you are using actual sockets across the network, it uses TCP. So ZMQ should be able to detect disconnects rather easily.
[+] [-] unknown|13 years ago|reply
[deleted]
[+] [-] boothead|13 years ago|reply
It's not really ideal that a synchronous connection doesn't have notification of a connection failure, but this has been working fine for us for ages.
[+] [-] rumcajz|13 years ago|reply
[+] [-] unknown|13 years ago|reply
[deleted]
[+] [-] kephra|13 years ago|reply
I wonder: is there any MQ that does not suck?
[+] [-] freyrs3|13 years ago|reply
[+] [-] the_mitsuhiko|13 years ago|reply
ZeroMQ is not an MQ, but it does not suck. That particular behavior is just confusing and should probably be pointed out in the docs, even if it's supposed to be obvious. Also, it would be nice if you could poll for transport-level disconnect events.
[+] [-] orenmazor|13 years ago|reply
(I also use 0mq but only for a disposable internal queue)
[+] [-] shasty|13 years ago|reply