Increasingly, the performance limit for modern CPUs is the amount of data you can feed through a single core: basically memcpy() speed. On most x86 cores that limit is around 6 GB/s; on Apple M chips it's about 20 GB/s.
When you see advertised numbers like '200 GB/s', that is total memory bandwidth, i.e. all cores combined. For an individual core, the limit will still be around 6 GB/s.
This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.
If, however, you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3
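To illustrate the general idea with a toy fixed-layout example (this is not the actual Lite³ encoding; a real format also has to handle alignment, endianness, and bounds checking):

    #include <stdint.h>

    /* Toy zero-copy read: if the wire format is just a fixed in-memory
       layout, reading one field is a cast plus an offset. The payload
       bytes are never touched, so they never cost memory bandwidth. */
    struct record {
        uint64_t id;
        uint64_t timestamp;
        uint8_t  payload[4096];
    };

    static uint64_t read_timestamp(const void *buf) {
        return ((const struct record *)buf)->timestamp; /* no parse, no copy */
    }

A JSON or Protobuf decoder, by contrast, has to scan every byte up front before it can hand you the timestamp.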
your single core numbers seem way too low for peak throughput on one core, unless you stipulate that all cores are active and contending with each other for bandwidth - e.g. dual channel zen 1 showing 25GB/s on a single core: https://stackoverflow.com/a/44948720
I wrote some microbenchmarks for single-threaded memcpy:

zen 2 (8-channel DDR4):
    naive C: 17GB/s
    non-temporal AVX: 35GB/s

Xeon D-1541 (2-channel DDR4, my weakest system, ten years old):
    naive C: 9GB/s
    non-temporal AVX: 13.5GB/s

apple silicon tests (warm = generate new source buffer, memset(0) the output buffer, add a memory fence, then run the same copy again):

m3:
    naive C: 17GB/s cold, 41GB/s warm
    non-temporal NEON: 78GB/s cold+warm

m3 max:
    naive C: 25GB/s cold, 65GB/s warm
    non-temporal NEON: 49GB/s cold, 125GB/s warm

m4 pro:
    naive C: 13.8GB/s cold, 65GB/s warm
    non-temporal NEON: 49GB/s cold, 125GB/s warm
(I'm not actually sure offhand why apple silicon warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/linux didn't show any cold/warm difference. My guess would be that it's something about kernel page accounting and not related to the cpu)
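for reference, the non-temporal x86 variant boils down to a loop of this shape (a minimal sketch, not the exact benchmark harness; assumes dst is 32-byte aligned and n is a multiple of 32):

    #include <immintrin.h>
    #include <stddef.h>

    /* streaming (non-temporal) stores bypass the cache entirely, which
       is where the non-temporal numbers above come from */
    static void copy_nt_avx(char *dst, const char *src, size_t n) {
        for (size_t i = 0; i < n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_stream_si256((__m256i *)(dst + i), v); /* needs 32B alignment */
        }
        _mm_sfence(); /* order streaming stores before any later access */
    }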
I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on apple silicon
What makes M-series have 3x the bandwidth (per core), over x86?
On quite a few recent chips (including, AIUI, Apple M series) you can only saturate memory bandwidth by resorting to the iGPU (which has access to unified memory), CPU cores on their own won't do it. It means that using the iGPU as a blitter for huge in-memory transfers and for all throughput-limited computation (including such things as parallel parsing or de/compression workloads) is now the technically advisable choice, provided that this can be arranged.
Samsung is selling NVMe SSDs claiming 14 GB/s sequential read speed.
> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
Of course the "skipping" is by cachelines. A cacheline is effectively a self-contained block of data from a memory throughput perspective, once you've read any part of it the rest comes for free.
> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
You still have to load a 64-byte cache line at a time, and most CPUs do some amount of readahead, so you'll need a pretty large "blank" space to see these gains, larger than typical protobufs.
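A toy way to see this (hypothetical functions, assuming 64-byte lines): both loops below pull exactly the same set of cache lines from memory, even though the second "reads" 64x less data.

    #include <stddef.h>
    #include <stdint.h>

    /* touches every byte: n/64 cache line fills */
    static uint64_t sum_all(const uint8_t *buf, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++) s += buf[i];
        return s;
    }

    /* touches one byte per 64-byte line: still n/64 cache line fills */
    static uint64_t sum_one_per_line(const uint8_t *buf, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i += 64) s += buf[i];
        return s;
    }

Skipping only starts to pay once whole lines (plus whatever the prefetcher speculatively pulls in around them) go untouched.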
Would you mind sharing what problems motivated Lite³? Curious what the typical use cases are for selective reading / in-place modification of serialized data. My understanding is that for cases that really want all of the fields, the zero-copy solutions aren’t much better than JSON / Protobuf, so these are solutions to different problems.
cool, do you think it's possible to add a schema mode to lite3 to remove the message size tradeoff? I think most people will still want to use lite3 with hard schemas during both serialization and deserialization. It's nice that it also works in a schemaless mode though.
That being said, storing trees as serializable flat buffers is definitely useful, if only because you can release them very cheaply.
> This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.
Which file formats allow partial parsing?
Since no one else seems to have pointed this out - the OP seems to have misunderstood the output of the 'time' command.
$ time ./wc-avx2 < bible-100.txt
82113300
real 0m0.395s
user 0m0.196s
sys 0m0.117s
"System" time is the amount of CPU time spent in the kernel on behalf of your process, or at least a fairly good guess at that. (e.g. it can be hard to account for time spent in interrupt handlers) With an old hard drive you would probably still see about 117ms of system time for ext4, disk interrupts, etc. but real time would have been much longer.
$ time ./optimized < bible-100.txt > /dev/null
real 0m1.525s
user 0m1.477s
sys 0m0.048s
Here we're bottlenecked on CPU time - 1.477s + 0.048s = 1.525s. The CPU is busy for every millisecond of real time, either in user space or in the kernel.
In the optimized (wc-avx2) case:
real 0m0.395s
user 0m0.196s
sys 0m0.117s
0.196 + 0.117 = 0.313, so we used 313ms of CPU time but the entire command took 395ms, with the CPU idle for 82ms.
In other words: yes, the author managed to beat the speed of the disk subsystem. With two caveats:
1. not by much - similar attention to tweaking of I/O parameters might improve I/O performance quite a bit.
2. the I/O path is CPU-bound. Those 117ms (38% of all CPU cycles) are all spent in the disk I/O and file system kernel code; if both the disk and your user code were infinitely fast, the command would still take 117ms. (but those I/O tweaks might reduce that number)
Note that the slow code numbers are with a warm cache, showing 48ms of system time - in this case only the ext4 code has to run in the kernel, as data is already cached in memory. In the cold cache case it has to run the disk driver code, as well, for a total of 117ms.
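(If you want this accounting from inside a program rather than from 'time', a minimal POSIX sketch - the work being measured is elided:)

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    /* real time comes from the wall clock; user/sys from the kernel's
       per-process CPU accounting; the leftover is idle time, e.g. the
       process blocked waiting on disk I/O */
    int main(void) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        /* ... do the work being measured ... */
        gettimeofday(&t1, NULL);

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        double real = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        printf("real %.3f user %.3f sys %.3f idle %.3f\n",
               real, user, sys, real - user - sys);
        return 0;
    }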
Hello, a couple years ago I participated in a contest to count word frequencies and generate a sorted histogram. There's a cool post about it featuring a video discussing the tricks used by some participants. https://easyperf.net/blog/2022/05/28/Performance-analysis-an...
Some other participants said that they measured 0 difference in runtime between pshufb+eq and eqx3+orx2, but I think your problem has more classes of whitespace, and for the histogram problem, considerations about how to hash all the words in a chunk of the input dominate considerations about how to obtain the bitmasks of word-start or word-end positions.
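(for concreteness, eqx3+orx2 for a three-character whitespace class looks roughly like this with AVX2 - a sketch, not anyone's actual contest code:)

    #include <immintrin.h>
    #include <stdint.h>

    /* classify 32 input bytes at once: one compare per whitespace
       character, OR the results, then compress to a 32-bit mask */
    static uint32_t whitespace_mask(const uint8_t *p) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)p);
        __m256i ws = _mm256_or_si256(
            _mm256_or_si256(_mm256_cmpeq_epi8(v, _mm256_set1_epi8(' ')),
                            _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\n'))),
            _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\t')));
        return (uint32_t)_mm256_movemask_epi8(ws);
    }

With more whitespace classes the compare count grows linearly, which is where the single pshufb table lookup starts to win.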
It's not about memory/CPU/IO, but latency vs throughput. Most software is slow because it ignores latency. If your program serially waits for _whatever_, it is going to be slow. If you scatter your data around memory, or read from disk in small chunks, or make tons of tiny DB queries serially, your software will spend 99.9% of its time idle, waiting for something to finish. That's it. If you can organize your data linearly in memory, and/or work on batches of it at a time, and/or parallelize stuff, and/or batch your IO, it is going to be fast.
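A minimal illustration of the in-memory half of this (hypothetical types):

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* latency-bound: each load depends on the previous one, so the core
       mostly sits waiting on cache misses */
    static long sum_list(const struct node *n) {
        long s = 0;
        for (; n != NULL; n = n->next) s += n->value;
        return s;
    }

    /* throughput-bound: independent, predictable loads that the hardware
       prefetcher can stream in ahead of time */
    static long sum_array(const long *v, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }

The same shape applies to disk reads and DB queries: dependent round trips serialize on latency, batches amortize it.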
I read tons of comments like "It's not [this], it's [that] instead!", which are also wrong.
The performance bottleneck is whatever resource hits saturation first under the workload you actually run: CPU, memory bandwidth, cache/allocations, disk I/O, network, locks/coordination, or downstream latency.
Measure it, prove it with a profile/trace, change one thing, measure again.
*Unless you're in the cloud, where it's a metric to nickel and dime you with throttling!
On a more serious note, the performance of hardware today is mind-boggling compared to what we all encountered way back when. What I struggle to comprehend, though, is how some software (particularly Windows as an OS, instant messaging applications, etc.) feels less performant now than it ever did.
Both Telegram and FB messenger are snappy; I didn't use anything else seriously as of late. (Especially not Teams, nor the late Skype.)
The performance of hardware today is even more mind-boggling compared to what most people (SRE managers, devs, CTOs) are willing to pay for when it comes to cloud compute.
Even more so when considered in the context of dev 'remote workstations'. I benchmarked AWS instances that were at least 5x slower than an average M1 MacBook and cost hundreds of dollars per dev per month (easily), while the MacBook was a sunk cost!
The real joke is on me though, some of these GPU servers actually have 2 TB of RAM now. Crazy engineering!
Not a new idea, but it's intriguing to think about an architecture that's just: CPU <-> caches <-> nonvolatile storage
What if you could take it for granted that mmap()ing a file has the exact same performance characteristics as malloc(), aside from the data not going away when you free the address space? What if arbitrary program memory could be given a filename and casually handed off to the OS to make persistent? A lot of basic software design assumptions are still based on the constraints of the spinning rust era...
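Linux can already approximate this today - a sketch (error handling omitted; `scratch.bin` is a made-up name, and durability still costs an explicit sync):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("scratch.bin", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, 1 << 20);                  /* 1 MiB backing file */
        char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        strcpy(p, "survives the process");       /* plain stores into a file */
        msync(p, 1 << 20, MS_SYNC);              /* the slow part: persistence */
        munmap(p, 1 << 20);
        close(fd);
        return 0;
    }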
> A lot of basic software design assumptions are still based on the constraints of the spinning rust era...
fsync() is still slow, and you need that for real persistence. It's not just about spinning rust, there's very good reasons for wanting a different treatment of clearly ephemeral/scratchpad storage.
You can get something like this from Linux today. (And mmap is actually how you request memory from the kernel in almost all cases.)
It's just that mmap is slower than using read/write, because the kernel knows less about your data access patterns and thus has to guess for how to populate caches etc.
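You can narrow that gap by telling the kernel the access pattern explicitly, e.g. (a sketch):

    #include <stddef.h>
    #include <sys/mman.h>

    /* hint that a mapped region will be scanned front to back */
    static void hint_sequential(void *p, size_t len) {
        madvise(p, len, MADV_SEQUENTIAL); /* read ahead more aggressively */
        madvise(p, len, MADV_WILLNEED);   /* start faulting pages in now */
    }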
When I took my first database course one topic was IO performance as measured in seek time on regular old hard drives.
You can’t really optimise your code for faster sequential reads than the IO is capable of, so the only thing really worth focusing on is how to optimise everything that isn’t already sequential.
In process monitoring you just see 100% "cpu" use with the processor cores running in their low-medium frequency range and no real thermal issues (fans aren't spinning up). You can use perf indicators to specifically look at whether memory bandwidth is the issue.
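For example (a sketch; exact event and metric names vary by CPU and perf version, and ./your-program is a placeholder):

    $ perf stat -e cycles,instructions,cache-misses ./your-program

Low instructions-per-cycle combined with high cache-miss counts at 100% CPU is the classic signature of a memory-bound workload; perf's top-down metrics (perf stat -M, perf stat --topdown) can break it down further.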