JamesonNetworks | 10 months ago

30 minutes seems long. Is there a lot of data? I've been working on bootstrapping SQLite DBs off of lots of JSON data, and by holding a list of values and then inserting 10k at a time, I've found a good perf sweet spot where I can insert plenty of rows (millions) in minutes. I had to use some tricks with Bloom filters and LRU caching, but I can build a 6 GB db in about 20 minutes now.
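A minimal sketch of the batched-insert approach described above, using Python's sqlite3. The table shape, batch size, and generated rows are illustrative assumptions, not the commenter's actual code; the point is accumulating rows in memory and flushing 10k at a time inside one transaction.

```python
import sqlite3

BATCH_SIZE = 10_000  # assumed batch size; tune for your workload

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, value TEXT)")

# Stand-in for rows parsed out of JSON files
rows = [(i, f"value-{i}") for i in range(100_000)]

with conn:  # one transaction around the whole load
    for start in range(0, len(rows), BATCH_SIZE):
        conn.executemany(
            "INSERT INTO items (id, value) VALUES (?, ?)",
            rows[start : start + BATCH_SIZE],
        )

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)  # 100000
```

Batching with `executemany` mainly saves Python-level statement overhead; the transaction boundary is what avoids an fsync per insert.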

zeroq | 10 months ago

It's roughly 10 GB across several CSV files.

I create a new in-memory db, run the schema, and then import every table in one single transaction (in my testing it didn't matter whether it's a single batch or multiple single inserts, as long as they're part of a single transaction).

I do a single string replacement per CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table width and complexity). 450k inserts per second is a magic barrier I can't break.

I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.

Here's quite recent log (on stock Ryzen 5900X):

   08:43 import
   13:30 delete non-essentials
   18:52 delete orphans
   19:23 create indexes
   19:24 optimize
   20:26 vacuum
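The pipeline above can be sketched roughly as follows. This is a hedged reconstruction, not the commenter's code: the table, the `\N`-style NULL marker used for the per-line string replacement, and the cleanup query are all assumptions; only the overall shape (in-memory db, schema, single-transaction import, then deletes, indexes, optimize, vacuum) comes from the comment.

```python
import csv
import sqlite3

# isolation_level=None gives manual transaction control (autocommit otherwise)
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# Stand-in for the CSV files; "\N" is an assumed dump-style NULL marker
csv_lines = ["1,alpha", "2,\\N", "3,gamma"]

conn.execute("BEGIN")  # single transaction: this is what makes the import fast
for line in csv_lines:
    line = line.replace("\\N", "")  # one string replacement per line (edge case)
    conn.execute("INSERT INTO t VALUES (?, ?)", next(csv.reader([line])))
conn.execute("COMMIT")

# Post-import steps, in the same order as the log above
conn.execute("DELETE FROM t WHERE name = ''")       # delete non-essentials/orphans
conn.execute("CREATE INDEX idx_t_name ON t(name)")  # create indexes after the load
conn.execute("PRAGMA optimize")                     # optimize
conn.execute("VACUUM")                              # vacuum (needs no open txn)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])
```

Creating indexes after the bulk load, rather than before, is what keeps the insert loop fast; VACUUM then reclaims the space freed by the deletes.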

thechao | 10 months ago

Millions of rows in minutes doesn't sound right, unless your tables have a large number of columns. A good rule of thumb is that SQLite's insertion throughput should be at least 1% of your disk's sustained max write bandwidth; preferably 5% or more. On my last bulk table insert I was seeing 20%+ sustained; that came to ~900k inserts/second for an 8-column INT table (small integers).
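A back-of-envelope check of the rule of thumb above. The per-row byte count and disk bandwidth here are assumptions (small SQLite integers are 1–2 byte varints plus record and rowid overhead), and this counts only raw row payload; actual page-level writes (b-tree pages, journal/WAL) are several times higher, which is how the sustained fraction can reach 20%+.

```python
# All three constants are illustrative assumptions, not from the comment
ROW_BYTES = 25                   # ~8 small-int columns: varints + record overhead
INSERTS_PER_SEC = 900_000        # figure quoted in the comment
DISK_WRITE_BPS = 500 * 1024**2   # assumed 500 MiB/s sustained sequential write

payload_bps = INSERTS_PER_SEC * ROW_BYTES
fraction = payload_bps / DISK_WRITE_BPS
print(f"{payload_bps / 1024**2:.1f} MiB/s of row payload "
      f"~ {fraction:.0%} of disk write bandwidth")
```

So raw payload alone sits at a few percent of disk bandwidth under these assumptions; the gap up to the observed 20%+ is the amplification from page and journal writes.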

pessimizer | 10 months ago

Saying that 30 minutes seems long is like saying that 5 miles seems far.