top | item 43284039

(no title)

cxie | 1 year ago

Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?

The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.

This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.

discuss

order

MortyWaves|1 year ago

When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.

Does your language have the concept of streaming files?

crazygringo|1 year ago

If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.

But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.

Nothing prevents streaming in theory, it's just far more complicated to write that library.

cess11|1 year ago

The low hanging fruit in this area is to do partial unmarshalling or using a SAX parser on a stream. It's likely you'll have to do this to retrieve data and put it in whatever succinct or otherwise efficient data structure.

In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.

I'd also like to mention that XSLT is an often underappreciated approach.

leafmeal|1 year ago

There's a create blog from the creator of RhymeBrain that talks about Succinct Tries: https://stevehanov.ca/blog/index.php?id=120

I'm pretty sure these were used to store the built in dictionaries on early mobile phones, especially for the implementation of T9 word and similar programs.

CrimsonCape|1 year ago

This might be a little over my head, but i'm not understanding how the balanced parenthesis is conveying anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?

senderista|1 year ago

You store the structural information separately from the data. The data can be stored sequentially in some traversal order.

neuroelectron|1 year ago

He touches on indexes but doesn't really mention the implementation. This is about the primitives.