adpowers | 14 years ago
The workers were configured to fetch new tasks every 30 seconds. With 10 workers, you'd expect a fetch from the queue roughly every 3 seconds on average, but that is not what we were seeing. Tasks were only getting picked up on 30-second boundaries. What was going on?
It turns out that the workers were piling up. As soon as one worker tried to update at the same time as another, it would block on the database lock. Its own transaction would then run almost immediately after the first one finished. But because the two had now run in immediate succession, they were synced for life: both sleep for exactly 30 seconds, the first wakes up a few tens of milliseconds earlier and grabs the lock, the second wakes up and blocks on the lock, and this repeats in perpetuity. Eventually, through small random timing variations, all of the workers fell into lockstep and became a small thundering herd against the database.
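A toy model of that pile-up, for anyone who wants to see it happen. This is only a sketch under my own assumptions, not our actual worker code: there is a single lock, every update holds it for a fixed `hold` seconds, and each worker starts its 30-second sleep when its own update finishes. The function name and parameters are made up for illustration.

```python
import heapq


def simulate_wakeups(offsets, period=30.0, hold=0.05, cycles=5):
    """Return {worker_index: [wake times]} for workers sharing one lock.

    Assumptions (hypothetical, for illustration only): one lock, a fixed
    hold time per update, and a sleep of exactly `period` seconds that
    starts when the worker's own update finishes.
    """
    # Min-heap of (next wake time, worker index), earliest wake first.
    heap = [(t, i) for i, t in enumerate(offsets)]
    heapq.heapify(heap)
    lock_free_at = 0.0
    history = {i: [] for i in range(len(offsets))}

    for _ in range(cycles * len(offsets)):
        wake, i = heapq.heappop(heap)
        history[i].append(wake)
        # If the lock is held, the worker blocks until it is released.
        start = max(wake, lock_free_at)
        lock_free_at = start + hold
        # The worker sleeps for a full period after finishing its update.
        heapq.heappush(heap, (lock_free_at + period, i))
    return history


if __name__ == "__main__":
    # Two workers that collide once, 10 ms apart.
    wakeups = simulate_wakeups([0.0, 0.01], cycles=4)
    for a, b in zip(wakeups[0], wakeups[1]):
        print(f"worker 0 wakes at {a:7.2f}s, worker 1 at {b:7.2f}s")
```

In this model, two workers that collide once end up waking exactly one hold time apart on every later cycle. Nothing ever pushes them apart again, which is the "synced for life" behavior described above.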
A developer noticed this and fixed it by adding a small jitter to the sleep time. After the push, our tasks were picked up within about three seconds and our end-to-end workflow time got substantially shorter.
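For reference, a jittered sleep can look roughly like the sketch below. The names (`jittered_interval`, `worker_loop`, `fetch_and_run`) and the 10% spread are assumptions for the example, not the values or code we actually shipped.

```python
import random
import time


def jittered_interval(base=30.0, jitter_fraction=0.1, rng=random):
    """Return a sleep time drawn uniformly from base +/- jitter_fraction."""
    spread = base * jitter_fraction
    return base + rng.uniform(-spread, spread)


def worker_loop(fetch_and_run, base=30.0, jitter_fraction=0.1, iterations=3):
    """Call the hypothetical fetch_and_run callback, then sleep a jittered interval."""
    for _ in range(iterations):
        fetch_and_run()
        # A different sleep each cycle keeps workers from staying in lockstep.
        time.sleep(jittered_interval(base, jitter_fraction))
```

With a 30-second base and a 10% spread, each sleep lands somewhere between 27 and 33 seconds, so two workers that happen to collide once drift apart again over the next few cycles instead of staying glued together.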