top | item 36504155


fridental | 2 years ago

I am not sure the author is familiar with RabbitMQ.

I am also not sure the author's assumptions about the business requirements are valid: thumbnails not being generated doesn't look to me like a reason to be woken up at night.

Also, you usually have an intern or two on the support team who can re-run failed jobs, so you don't need to waste the time of your devs, ops, or devops on that.


barrkel|2 years ago

Re-running failed jobs should be automated wherever possible. Expected, routine failures should not require manual intervention. If you don't have this attitude, toil will gradually increase over time until all anyone ever does is put out small fires.
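To make "automated" concrete, here is a minimal sketch of retrying a failed job with exponential backoff before anyone has to step in. The name `run_with_retries` and its parameters (`max_attempts`, `base_delay`, `sleep`) are made up for illustration and are not from the article or any particular queue library; a real message broker would more likely handle this through its own redelivery or dead-letter settings.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() and retry on failure with exponential backoff.

    Hypothetical helper: returns the job's result, or re-raises the last
    exception once max_attempts is used up. That final failure is the
    point where a human (or a dead-letter queue) should take over.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # routine retries are exhausted; escalate
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only jobs that still fail after the last attempt would then need manual attention, which keeps routine transient failures out of the support queue.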

Thumbnails not being generated might not be worth an early morning alarm, but running out of disk space might be, and so might other work being blocked because thumbnail generation failed.

fridental|2 years ago

Nothing should be done if it is not economical. In my experience, issues with message buses happen very often in the first weeks after their rollout and then disappear for a while or forever.

This means: merge the PR first, let it go live, use your working students or interns to rerun stuff, and wait for a month. If it is still happening, then you have proof of a problem that needs to be fixed.

Disk space: use your monitoring tool to proactively warn you when free disk space drops below 20% or is shrinking too quickly.
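As a rough sketch of what such a check boils down to (a real monitoring tool would do this for you), the two conditions can be written as below. The 20% floor is the figure from above; the `max_drop` value of 5 percentage points between two samples is purely an assumed placeholder for "too quickly", and the function names are hypothetical.

```python
import shutil


def free_fraction(path="/"):
    """Return the free share of the filesystem holding path, from 0.0 to 1.0."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_alerts(prev_free, curr_free, min_free=0.20, max_drop=0.05):
    """Compare two samples of the free fraction and return warnings.

    min_free is the 20% floor; max_drop is an assumed threshold for how
    much the free fraction may fall between two consecutive samples.
    """
    alerts = []
    if curr_free < min_free:
        alerts.append("low free space")
    if prev_free - curr_free > max_drop:
        alerts.append("free space dropping quickly")
    return alerts
```

You would sample `free_fraction()` on a schedule, keep the previous value, and send whatever `disk_alerts()` returns to your usual alerting channel.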

If some other work is blocked by failed thumbnails, that is a logic bug and not a consequence of the message bus. That work was already being blocked before the message bus was introduced anyway.