I'm also a fan of logs. If you have some more examples of how you typically log things to be most effective, I'd love to see 'em! I'm still developing a sense for when it's too much versus too little, the best way to incorporate runtime data, and how to structure log messages so they work well with other systems. Hearing from others and seeing battle-tested examples would surely help. Or if you're down to chat a bit, I can send you an email and continue the conversation. Will check out Pachyderm in the meantime~
jrockway|2 years ago
To me the golden rule is "show your work". Every operation that can start and end should log the start and the end. If your process is using CPU but not logging anything, something has gone wrong. Aim to log something about ongoing requests/operations every second or so. (This is spammy if you're doing 100,000 things concurrently. I use zap, and zap's log sampling keys on the message, so if your message is "incoming request" and 100,000 of them are arriving per second, you can have it write the logs for only one of them each second. I hate to sample, but it's a necessity for large instances and hasn't caused me any problems yet.)
I also like to keep log levels simple: DEBUG for things interesting to the dev team, INFO for things interesting to the operations team, ERROR for things that require human intervention. People often ask me "why don't we have a WARN level?", and it's because I think warnings in logs either get ignored or are really fatal. Warnings ("your object storage configuration will be deprecated in 2.10 and removed in 2.11, please migrate according to these docs") should appear in the user-facing UI, not in the logs. They do require human action eventually.
Overall, I'm more of a "print" debugger than a "step through the code with breakpoints" debugger. To me, this is an essential skill when you're running code on someone else's infrastructure; you will be 1000 times slower at operating the debugger when you are telling someone via a support ticket which commands to run. (Even if the servers are yours, I don't love sshing into production and mutating it.) So ultimately, the logs need to collect whatever you'd be looking for if you had a reproduction locally and were trying to figure out the problem. It's an art and not a science; you will get it wrong sometimes, and your resolution for the underlying bug will include better observability as part of the fix. This is usually enough to never have a problem with that subsystem again ;)