top | item 45286355

cHaOs667 | 5 months ago

That's what you call a DOM parser. The problem with them is that, because they materialize every element as an in-memory object, bigger XML files tend to eat up all of your RAM. And this is where SAX2 parsers come into play: you define event-based callbacks (element start/end, character data) and process the data as it streams past, without ever holding the whole tree.
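To make the contrast concrete, here is a minimal sketch of the SAX side using Python's stdlib `xml.sax` (the `<record>` element name is just an illustration): the parser invokes callbacks as it streams through the input, so memory use stays flat regardless of document size.

```python
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Count <record> elements via streaming callbacks,
    never building a full in-memory tree."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams the input.
        if name == "record":
            self.count += 1

handler = RecordCounter()
xml.sax.parseString(b"<root><record/><record/></root>", handler)
# handler.count is now 2
```

With a DOM parser the same document would first be deserialized entirely into objects before you could touch it; here only the current event lives in memory.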

mort96 | 5 months ago

The solution is simple: don't have XML files that are many gigabytes in size.

iberator | 5 months ago

A lot of telco equipment dumps multi-GB XML hourly, per BTS. We process a few TB of XML files on one server daily.

It's doable, just use the right tools and hacks :)

Processing schema-less or broken schema stuff is always hilarious.

Good times.

cHaOs667 | 5 months ago

Depending on the XML structure and the server's RAM, it can already happen as you approach 80-100 MB file sizes. And to be fair, in the enterprise context you are quite often not in a position to decide how big another system's export is. But yes, back in 2010 we built preprocessing systems that checked XMLs and split them up into smaller chunks if they exceeded a certain size.

lyu07282 | 5 months ago

Tell that to Wikimedia; I've used libxml's SAX parser in the past to parse 80 GB+ XML dumps.

stuaxo | 5 months ago

Some formats just are this big, and they're historical formats, so there's no changing them.