krka | 12 years ago

Thanks, that's interesting data!

I am not sure why you changed the key format to "key_%09d" - is that an optimization for lmdb, to make sure the insertion order is the same as the internal tree ordering? If so, why is that needed for the benchmark?

I noticed that the wall time and cpu time for the sparkey 100M benchmarks were a bit disjoint; it would seem that your OS was evicting many pages or stalling on disk writes. The Sparkey files were slightly larger than 4 GB while lmdb's were slightly smaller, but I am not sure that really explains it on an 8 GB machine.

I am not sure I agree about the non-linear creation time difference; the benchmarks indicate that both sparkey and lmdb are non-linear. The sparkey creation throughput went from 1206357.25 to 1109604.25 (-8.0%) while lmdb's went from 2137678.50 to 2033329.88 (-4.9%).

Regarding the lookup performance "dropping off a cliff", I think that is related to the large difference in wall time vs cpu time, which indicates a lot of page cache misses.

lmdb seems really interesting for large data sets, but I think it's optimized for different use cases. I'd be curious to see how it behaves with more randomized keys and insertion order. I didn't think of doing that in the benchmark since sparkey isn't really affected by it, but it makes sense when benchmarking a B-tree implementation.

Sparkey is optimized for our use case where we mlock the entire index file to guarantee cache hits, and possibly also mlock the log file, depending on how large it is.
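
For reference, that pinning setup is just plain POSIX mmap + mlock. A minimal sketch (the file name is a placeholder and error handling is kept terse):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "index.spi";  /* placeholder path */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Pin the whole mapping so lookups never fault to disk; this
           needs a sufficient RLIMIT_MEMLOCK (or privileges). */
        if (mlock(base, st.st_size) != 0) { perror("mlock"); return 1; }

        /* ... serve lookups from base ... */

        munlock(base, st.st_size);
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }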

The way you append data to sparkey (first fill up a log, then build a hash table as a finalization step) is really useful when you need lots of memory while building and can't afford random-seek file operations; at the end, when most of the work is done and your memory is free again, you finalize the database. Of course, you could do the same thing with lmdb: first write a log, then convert that into an lmdb file.
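
A rough sketch of that two-phase pattern, purely as an illustration (the record type and helper functions here are made up, not Sparkey's actual API):

    #include <stddef.h>
    #include <stdio.h>

    struct record;                                             /* opaque for the sketch */
    void append_record(FILE *log, const struct record *r);     /* hypothetical helper */
    void write_hash_index(const char *log, const char *idx);   /* hypothetical helper */

    void build(const char *log_path, const char *idx_path,
               const struct record *recs, size_t n) {
        /* Phase 1: sequential appends only, so writing stays cheap
           even while the rest of the job is using most of the RAM. */
        FILE *log = fopen(log_path, "wb");
        for (size_t i = 0; i < n; i++)
            append_record(log, &recs[i]);
        fclose(log);

        /* Phase 2: once memory is free again, scan the log once and
           write out a hash table of key -> log offset beside it. */
        write_hash_index(log_path, idx_path);
    }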

Thanks for taking the time to adapt the benchmark code to lmdb, it's been very interesting.

hyc_symas | 12 years ago

Yes, I changed the key format to allow using the MDB_APPEND option for bulk loading. (That's only usable in LMDB for sequential inserts.) Otherwise, for random inserts, things will be much slower. (Again, refer to the microbench to see the huge difference this makes.) If you don't have your data ordered in advance then this comparison is invalid, and we'd have to just refer to the much slower random insert results.
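
For anyone curious, the bulk load is roughly this shape (a simplified sketch against the LMDB C API, not the benchmark's actual code; error handling is omitted and the 10 GB map size is a placeholder):

    #include <stdio.h>
    #include <lmdb.h>

    /* Bulk load with MDB_APPEND: each key must sort after the previous
       one, which lets LMDB skip the tree descent and just append. */
    int bulk_load(const char *dir, long count) {
        MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
        char keybuf[32], valbuf[32];

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 10UL * 1024 * 1024 * 1024);
        mdb_env_open(env, dir, 0, 0664);

        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);

        for (long i = 0; i < count; i++) {
            MDB_val key, val;
            key.mv_size = sprintf(keybuf, "key_%09ld", i);
            key.mv_data = keybuf;
            val.mv_size = sprintf(valbuf, "val_%ld", i);
            val.mv_data = valbuf;
            /* MDB_APPEND fails with MDB_KEYEXIST on unsorted input */
            mdb_put(txn, dbi, &key, &val, MDB_APPEND);
        }
        mdb_txn_commit(txn);
        mdb_env_close(env);
        return 0;
    }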

Still don't understand what happened to sparkey at 100M. The same thing happens using snappy, and the compressed filesize is much smaller than LMDB's, so it can't be pagecache exhaustion.

Also suspicious of the actual time measurements. Both of these programs are single-threaded, so there's no way the CPU time measurement should be greater than the wall-clock time. I may take a run at using getrusage and gettimeofday instead; these clock_gettime results look flaky.
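
Something like this (an untested sketch using only the standard gettimeofday/getrusage calls):

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double secs(struct timeval tv) {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void) {
        struct timeval w0, w1;
        struct rusage r0, r1;

        gettimeofday(&w0, NULL);
        getrusage(RUSAGE_SELF, &r0);

        /* ... benchmark body ... */

        gettimeofday(&w1, NULL);
        getrusage(RUSAGE_SELF, &r1);

        /* CPU time = user + system, summed over the whole process */
        double cpu = (secs(r1.ru_utime) - secs(r0.ru_utime))
                   + (secs(r1.ru_stime) - secs(r0.ru_stime));
        printf("wall %.3fs cpu %.3fs\n", secs(w1) - secs(w0), cpu);
        return 0;
    }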

krka | 12 years ago

Could be due to a bug related to reading uninitialized data on the stack. That could lead to using the wrong number of bits for the hash, causing an unnecessarily high number of hash collisions, which makes lookups more expensive due to false positives that need to be verified. I think it's fixed in the latest master, and the benchmark code now prints the number of collisions per test case, which could be useful debug data.
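
To illustrate the effect (made-up numbers, not the actual Sparkey code): two distinct hashes that only agree in the low byte land in different slots under the intended mask, but collide under a bogus narrow one, and each collision costs an extra log read to rule out.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t h1 = 0x1234ef, h2 = 0xabcdef;  /* hashes of two keys */
        /* intended 24-bit table: different slots */
        printf("24-bit slots: %llx vs %llx\n",
               (unsigned long long)(h1 & 0xffffff),
               (unsigned long long)(h2 & 0xffffff));
        /* bit count read from uninitialized stack, say 8: collision */
        printf(" 8-bit slots: %llx vs %llx\n",
               (unsigned long long)(h1 & 0xff),
               (unsigned long long)(h2 & 0xff));
        return 0;
    }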

Also, I think it would be more interesting to see a comparison with lmdb using random writes instead of sequential.

As for the cpu time measurement: the wall clock is very imprecise, so cpu time could come out some small quantum larger than wall time, but the difference should never be more than the system-specific wall clock quantum.

hyc_symas | 12 years ago

(I've updated my repo using gettimeofday/getrusage).