komuW | 3 years ago
To quote from Annotations to Eldon Hall's Journey to the Moon[1]: "The Coroner recorded every instruction executed, with its inputs and results, writing over the oldest record when it filled up. When a program crashed, you could punch out a full record of what it was doing in most of its last second and analyze the problem at your ease. I have often wished that PCs offered such an advanced feature."
So essentially, buffer all logs in an in-memory circular buffer of capacity N. If a log record is emitted with a certain severity/level, flush all records from the buffer to disk/ClickHouse/Grafana/whatever.
Python's MemoryHandler[2] almost implements this technique, except that it also flushes when the buffer is full, which is not what I would want.
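A minimal sketch of that variant, assuming the standard `logging` module: records go into a bounded deque and are forwarded only when a record at or above a threshold arrives. The names `RingBufferHandler`, `capacity`, `target`, `flush_level`, and `dump` are made up for illustration and are not part of the standard library.

```python
import collections
import logging


class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` records in memory and forward them to
    `target` only when a record at or above `flush_level` is emitted."""

    def __init__(self, capacity, target, flush_level=logging.ERROR):
        super().__init__()
        # A deque with maxlen silently drops the oldest record when full,
        # so a full buffer never triggers a write.
        self.buffer = collections.deque(maxlen=capacity)
        self.target = target
        self.flush_level = flush_level

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.flush_level:
            self.dump()

    def dump(self):
        # Deliberately not named flush(): logging.shutdown() calls flush()
        # on every handler at exit, which would write the buffer out even
        # when nothing went wrong.
        for record in self.buffer:
            self.target.handle(record)
        self.buffer.clear()
```

Attached to a logger set to DEBUG, this writes nothing until the first error, and then writes at most `capacity` records, ending with the error itself.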
I also wrote a blog post[3] about how to log without losing money or context, ~3 years ago.
1. https://authors.library.caltech.edu/5456/1/hrst.mit.edu/hrs/...
2. https://github.com/python/cpython/blob/v3.11.1/Lib/logging/h...
3. https://www.komu.engineer/blogs/09/log-without-losing-contex...
twic|3 years ago
So, i did something a bit like the coroner. When the program started, it created a fresh log file, extended it to a certain size, and memory-mapped it. It then logged into this buffer, with new logging overwriting old (it wasn't actually a circular buffer; the program dropped a big blob of logging into the buffer at the top of its main loop).
While alive, the process never closed or msynced the mapping, and it was fixed size, so the kernel was under no particular pressure to write the contents to disk. But when the process crashed, the kernel would preserve the contents.
I admit i never benchmarked this, so i don't know whether it actually avoided excessive writes. But it seemed like a neat idea in principle!
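A rough sketch of the memory-mapped log idea, using Python's `mmap` module. Note that this version wraps around like a ring, which the description above says the original did not do; `MappedLog`, `path`, and `size` are hypothetical names. It relies on the mapping being shared with the file, so pages written by the process stay in the kernel's page cache if the process dies.

```python
import mmap
import os


class MappedLog:
    """Fixed-size, memory-mapped log file; writes wrap around and
    overwrite the oldest bytes. Never closed or synced on purpose."""

    def __init__(self, path, size=1 << 20):
        self.size = size
        # Create a fresh file and extend it to its final size up front.
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
        os.ftruncate(self.fd, size)
        self.map = mmap.mmap(self.fd, size)
        self.pos = 0  # next write offset

    def write(self, message):
        data = (message.encode("utf-8") + b"\n")[-self.size:]
        # Copy up to the end of the mapping, then wrap to the start.
        first = min(len(data), self.size - self.pos)
        self.map[self.pos:self.pos + first] = data[:first]
        rest = len(data) - first
        if rest:
            self.map[0:rest] = data[first:]
        self.pos = (self.pos + len(data)) % self.size
```

A reader of the file after a crash would need some way to find the newest record (for example a sequence number in each line), since the write offset itself is not stored in this sketch.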
kevin_nisbet|3 years ago
As I remember it, they wrote their crash handler to include the ring buffer of recent messages sent to the services. So whenever they'd get into an unexpected state, they'd just crash the process, and collect the ring buffer of recent messages along with the other normal things in a mini core. Made it so easy to track down those unexpected / corner cases in that platform.
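Something in that spirit, as a small Python sketch rather than a reconstruction of that platform: every inbound message is recorded before it is handled, and the crash path serializes the recent messages with the error. The message shape, the `"unexpected"` type, and the names `receive` and `crash_report` are all assumptions for illustration.

```python
import collections
import json
import traceback

# Ring of the most recent inbound messages (capacity is arbitrary here).
recent_messages = collections.deque(maxlen=4)


def receive(message):
    """Record the message first, so a crash report can include it."""
    recent_messages.append(message)
    if message.get("type") == "unexpected":
        # Crash on purpose instead of limping on in an unknown state.
        raise RuntimeError("unexpected state")


def crash_report(exc):
    """Build a small JSON report: the error plus the recent messages."""
    return json.dumps({
        "error": repr(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "recent_messages": list(recent_messages),
    }, indent=2)
```

In a real service the report would be written from a top-level exception hook or a signal handler; here the caller is expected to catch the exception and call `crash_report` itself.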
foobiekr|3 years ago
1. A ring of log-like objects (obviously not rendered strings, since that is a waste of CPU) that can be optionally included in a crash report in structured form that can be dissected later.
2. Compiler-generated enter/exit counters and a corresponding table per module, with modules linking themselves at init time to the master table, for performance counters [invocations or time spent]; dumpable on demand; lightweight and always on.
3. A ring of logs - these actually being rendered logs plus indices into (1) - that have been otherwise rendered, so the retention cost is minimal and you can map back to log files otherwise provided.
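Item (2) describes compiler-generated instrumentation, which Python cannot reproduce directly. As a loose stand-in, a decorator can register a per-module table in a master table when the module is imported and count invocations and time spent. `MASTER_TABLE`, `counted`, and `dump_counters` are invented names for this sketch.

```python
import functools
import time

# Master table: module name -> {function name -> [invocations, seconds]}
MASTER_TABLE = {}


def counted(func):
    """Register func in its module's table at definition (import) time
    and count every call, including calls that raise."""
    table = MASTER_TABLE.setdefault(func.__module__, {})
    stats = table.setdefault(func.__name__, [0, 0.0])

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats[0] += 1
            stats[1] += time.perf_counter() - start

    return wrapper


def dump_counters():
    """Snapshot of all counters, dumpable on demand."""
    return {
        module: {name: tuple(stats) for name, stats in table.items()}
        for module, table in MASTER_TABLE.items()
    }
```

This is far heavier per call than a compiler-inserted counter increment, so it only illustrates the table layout, not the cost profile.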
The distinction between (1) and (3) should be obvious, but in case it is not, short circuiting log rendering for logs that should otherwise be dropped is a very important practice to avoid debug-level logs consuming the majority of CPU time.
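To make the distinction concrete, here is a small sketch in which the ring stores the format string and arguments unrendered, and string formatting happens only for records that pass the level check or when a crash dump is requested. The level constants and the names `StructuredRing`, `render_level`, and `crash_dump` are assumptions for illustration.

```python
import collections

DEBUG, INFO, ERROR = 10, 20, 40


class StructuredRing:
    def __init__(self, capacity, render_level=INFO):
        # (1): ring of unrendered records, kept for crash reports.
        self.records = collections.deque(maxlen=capacity)
        self.render_level = render_level
        self.rendered = []  # stand-in for the real log sink

    def log(self, level, fmt, *args):
        # Cheap path: store references only, no string formatting.
        self.records.append((level, fmt, args))
        # Short-circuit: records below the threshold are never rendered.
        if level >= self.render_level:
            self.rendered.append(fmt % args)

    def crash_dump(self):
        """Render everything still in the ring, including debug records
        that were dropped from the normal log."""
        return [fmt % args for _, fmt, args in self.records]
```

One caveat of deferring the formatting: arguments are stored by reference, so a mutable argument that changes after the call will show its later value in the dump.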
Traditionally, all of these are trivially inspectable in a core dump, but usually you'd like a reduced crash report instead: less wear and tear on the flash and easier for users [and bug management systems] to juggle. Crash reports and cores obviously need to include an unambiguous version [typically a hash of the code rather than a manually managed version #; for dynamically linked ELFs, fingerprints of all libraries as well]. For cores, you just make sure to compute this at start and keep it in memory reachable from a pointer out of main().
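The fingerprinting part can be sketched with `hashlib`: hash the program and each library it loads once at startup, and keep the result in a long-lived variable so every crash report can include it. The function names and the choice of SHA-256 are assumptions; the comment above does not name a hash.

```python
import hashlib


def fingerprint(path):
    """SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_fingerprints(paths):
    """Map each path (executable, shared libraries) to its fingerprint.
    Intended to run once at startup, with the result kept for the life
    of the process."""
    return {path: fingerprint(path) for path in paths}
```

At startup this might be called as `build_fingerprints([sys.executable])` plus whatever library paths the program knows about; how to enumerate loaded libraries is platform-specific and left out here.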
pmalynin|3 years ago
foobiekr|3 years ago
A problem with time travel debugging is that you generally can't use it in production [of course, there are people who think devs should have direct access to prod; for them there is no help], and you 100% cannot use it for anything deployed at a customer (so for embedded, devices, actual non-SaaS software, etc.).
It's better to shore up your tools so that the workflow is very straightforward and leave stuff like time travel for people doing work on a very narrow subset of very hard to understand bugs.