
Nanolog supports logging with 7 ns median latency

123 points | eventhelix | 1 year ago | github.com

71 comments


jnordwick|1 year ago

It uses a background thread to do most of the work, and it appears the 7ns latency numbers are a little cooked:

1. The paper's 7ns-like number is 8ns for microbenchmarks but 18ns in applications. The 7ns number, I'm guessing, is from microbenchmarks, and the true application-level number is probably more in the 17ns range.

2. It isn't precisely clear what that number is measuring. The paper says it is the invocation time of the logging call. Considering the thread making the call just passes most of the work to a background thread through a multi-producer single-consumer queue of some sort, this is likely the time to dump the message into the queue.

So you really aren't logging in 7ns. The way I'm reading this is you're dumping onto a queue in 17ns and letting a background thread do the actual work.

The workload is cut down by a preprocessing step that creates a dictionary of static elements to reduce the I/O cost of the thread doing the actual writing (I assume this just means taking the format strings and indexing them, which you could build at runtime, so I'm not sure the pre-processing step is really needed).

My logger, which dumps binary blobs onto a ring buffer for another process to write out, might be able to beat this invocation latency. This isn't really groundbreaking. I know a few places that log binary blobs and format them later. None of them do the dictionary part, but when that work is going to a background thread, I'm not sure how much it matters.

gpderetta|1 year ago

Yes, the overhead in the logging thread is what this is trying to minimize. The background thread is considered "free". This sort of async logging is a common setup for some class of applications.

And yes, it boils down to writing data to a message queue. Most of the overhead is probably the call to the hardware timestamp counter.

DesiLurker|1 year ago

I did something similar about a decade ago. The main active costs for logging (assuming you're writing into a fixed-size lockless ring buffer) are string formatting and timestamp fetching. I did my own timestamp interpolation with rdtsc and periodic clock fetches to minimize the latter. String formatting you can avoid entirely by preregistering messages and storing away the arguments.

I have also seen people make the logger into a DSO so they can dump strings from the main app binary and record addresses, avoiding the hassle of preregistering. But with preregistering I was able to get the perf all the way down to ~40ns per message without a separate log-dumper thread. At that point you actually start running into disk I/O limitations.

One additional thing I did was to create the ring buffer in a shared memory segment and log to it from all services in the system; that way you don't even have to think about crashing services. Anyway, the point is this hardly seems groundbreaking. What I'd like to see is this evolve into a fast syslog-type service where you freely log trace messages throughout without worrying about the cost, and use them for postmortem analysis or filter at dump time.

szundi|1 year ago

The only claim that makes sense is that the thread sending the logs is blocked for 7ns; otherwise too much context-dependent overhead comes in to make a claim like this.

renewiltord|1 year ago

Yeah. But that’s okay. For the applications that care about this, it’s just a question of how long the application thread is blocked. The logging thread can spin on its core and take as long as it wants so long as it can burn down the buffer faster than the application thread can add to it.

I thought log4j2 used a disruptor in between app thread and logging thread but maybe I’m wrong.

hinkley|1 year ago

One of the reasons I think logging as a production tool has risen in status in the last fifteen years is that in the 90's you had languages where it was decided that if your app crashes, we want to know what the last operation in flight was when it crashed.

As we now know the syscalls to force flush to disk often didn’t, so if you were trying to figure out what happened right before a kernel panic, good fucking luck. But if it was a purely userspace problem, you might be okay.

These days we care more about throughput. I’d personally rather see us commit to stats for use cases that require that sort of logging traffic, but I think that’s one of those conversations that’s nearing the top of the heap. Maybe we’ll have that argument as soon as we stop flooding the literature with talk of AI.

So 7ns tells me something about throughput, but I think I'd rather just know the messages per second per CPU. And perhaps more importantly, with a CPU that's 80% saturated doing non-logging tasks. Because I need to make forward progress between the log messages.

fra|1 year ago

This is a common technique in embedded software. A few other examples:

1. Trice (already mentioned in the comments) https://github.com/rokath/trice

2. Pigweed's Tokenizer (from Google) https://pigweed.dev/pw_tokenizer/

3. Memfault's Compact Logs https://docs.memfault.com/docs/mcu/compact-logs

4. Defmt by Ferrous Systems https://defmt.ferrous-systems.com/

lokar|1 year ago

The google logging library also defers formatting

cmptrnerd6|1 year ago

I've used https://github.com/rokath/trice which is similar but targeting microcontrollers. It isn't immediately clear to me if nanolog could run on a microcontroller with its output directed over RTT/uart/etc or not.

geertj|1 year ago

The consumer side of this would be polling a memory location for new logs, correct? It would not be possible to wake up the consumer in 7ns, as that would take a FUTEX_WAKE system call, which is O(microseconds).

I've been wondering about a FUTEX_WAKE that does not require a system call. Possibly, the kernel could poll a global memory area. Or maybe there is some low-level debugging API available where the kernel could be notified of a memory write by a process?

jnordwick|1 year ago

The background thread can just sleep if there's no data in the queue.

Since you are preprocessing the log messages, you know the size bounds. Say the thread sleeps for 1 ms. Be generous and say you can push to the queue at most once every 5ns, and if you know the largest message you push will be 200 bytes, you can statically determine that a 40MB ring buffer will have enough space to absorb a full sleep even at max rate with the largest messages.

And that's just a simplistic scheme. If you have a pool of logging structs so your queue holds just pointers to them, you can get away with even less.

And I can think of faster ways too. It isn't that difficult to avoid using a semaphore.

gpderetta|1 year ago

There isn't a significant advantage in having the kernel do the polling; it would still be busy-polling.

If you just don't want to burn power but you can still dedicate a core, there is https://www.felixcloutier.com/x86/mwait.

sriram_malhar|1 year ago

The consumer (background thread) only polls; there is no need to wake up the consumer.

At steady state, the consumer is either blocked on I/O or is scanning the producer buffers. When all producer buffers are empty, it can just sleep for a short time.

toxik|1 year ago

O(microseconds) = O(years), this is not what big O notation means.

loeg|1 year ago

7ns latency is in the ballpark of small writes to L1 cache. I.e., an in-memory-only append of a single cache line or less to a location predicted by the prefetcher. So yeah, some sort of ringbuffer log could definitely support this kind of figure. The question is how much throughput your persistence channel has, how much memory you can devote to logging while persistence runs async, whether you're OK with a somewhat volatile log, etc.

jnordwick|1 year ago

You don't really wait on L1 cache writes, though. The store buffer absorbs the write, and if the data is needed it can be forwarded from there before the write to cache even happens.

Most x64 L1d caches have a 4-6 cycle latency depending on the CPU; that's 1 to 2 ns depending on frequency.

zdw|1 year ago

Doesn't a logging system need a storage system that can keep up with it, if the goal is to persist logs for later analysis? What storage could keep up with this?

rfoo|1 year ago

This is for very bursty logs. You don't log every 7 ns. On average you are not generating a huge amount of logs.

But you need each logging call to be very fast, because logging calls are usually synchronous.

wdfx|1 year ago

I think the idea here is to separate the log call site in application code from the log processing/persistence?

So, the nanosecond values quoted are the impact exposed to your application code, but some other process takes over the bulk of the work of the logging.

So as long as the offloaded process can keep up with the average log rate, it'll be fine - but also as a bonus the application does not see increased latency due to logging.

cma|1 year ago

Battery backed SRAM

packetlost|1 year ago

I have ideas for a logging/wide-metrics system that uses this technique and some others stolen from DNS and IP. It's largely inspired by a system I built at my day job that implements distributed command & control for servo-ing, monitoring, etc. It's been really successful, but the hardest part is mapping a unique numeric identifier to a human-readable string in a way that is dynamic and efficient enough. It really seems like the exact same problem as DNS, which leads me to believe there's likely no way around a persistent centralized registry/database.

bazzargh|1 year ago

The paper says a lot of the secret sauce is dumping a dictionary of the static content and then logging in a binary format. That format looks a lot like gzip, if you squint. Could something like this use the _actual_ gzip format, but writing with a static dictionary, to make life easier for tools? (gzip has a trailer, but I'm not sure how much attention is paid to that, since it's often used for streams)

swah|1 year ago

So you're actually spawning threads in this simple C++ example? I thought this was frowned upon in C++ land...

  #include "NanoLogCpp17.h"
  using namespace NanoLog::LogLevels;

  int main() 
  {
    NANO_LOG(NOTICE, "Hello World! This is an integer %d and a double %lf\r\n", 1, 2.0);
    return 0;
  }

Validark|1 year ago

Amazing work! I was wondering just a few months ago whether someone ever made a logger that deferred all the expensive work of string formatting to consumption time.

~~I'm a bit surprised that it didn't come along sooner, though. How come nobody at Google or VMware who noticed this was a problem solved it? Or any other major tech company? Maybe this is partially an issue with our programming languages and build tools. I'm a Zig enthusiast, so in my head it's trivial, but I guess C++ won't get potentially comparable comptime facilities until C++26.~~ I'm surprised Go doesn't work like this by default, though. For a language like Go, I'd have made a builtin log keyword that does this.

EDIT: Looks like other implementations of similar ideas do exist. Still awesome though!

yuliyp|1 year ago

You have to be careful in deferring such work. It may end up more expensive if it means you have multiple threads accessing that data, and/or needing to extend the lifetime of an object so the logger can access it.

lokar|1 year ago

The google logging library has deferred the formatting for years

newobj|1 year ago

The real headline here is that log4j2 is faster than Boost.Log

eska|1 year ago

There’s a reason why many avoid boost.

username81|1 year ago

Are there libraries like this, but in Rust? As far as I understand, it relies on C's preprocessor, so it is impossible to create bindings for another language.

andrepd|1 year ago

Sounds like something that would be doable in Rust. I'm not sure how you would go about building the compile-time table of log locations and format strings; that sounds like the hardest part.

kolbe|1 year ago

I could swear I did a deep dive into Spdlog vs Nanolog six months ago, and the performance differences weren't nearly this stark

linhns|1 year ago

I believe the performance reported in their paper is circumstantial. It’s not that much faster when I tried it, and not worth the horrible macro syntax.

synergy20|1 year ago

what do you mean? considering spdlog is the de facto logger for c++

perching_aix|1 year ago

that sounds deliciously performant, love projects like these

yas_hmaheshwari|1 year ago

I was also thinking the same! How come such a good idea is not already part of standard logging libraries -- letting you configure offload to another process or message queue?

Loved the idea