
justinsaccount | 7 months ago

I have an older breach data set that I loaded into ClickHouse:

  SELECT *
  FROM passwords
  WHERE (password LIKE '%password%') AND (password LIKE '%123456%')
  ORDER BY user ASC
  INTO OUTFILE '/tmp/res.txt'

  Query id: 9cafdd86-2258-47b2-9ba3-2c59069d7b85

  12209 rows in set. Elapsed: 2.401 sec. Processed 1.40 billion rows, 25.24 GB (583.02 million rows/s., 10.51 GB/s.)
  Peak memory usage: 62.99 MiB.

And this is on a Xeon W-2265 from 2020.

If you don't want to use ClickHouse, you could try DuckDB or DataFusion (which is also written in Rust).

In general, the way I'd make your program faster is to stop reading the data line by line. Read much bigger chunks, trim each one back to the last line boundary, then search the whole chunk for your strings. Or look into mmap, which lets you search the file contents directly without issuing explicit read calls.
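
To make the chunked idea concrete, here is a rough std-only sketch. The names `search_chunked`, `scan` and `contains` are made up for illustration, and I'm assuming newline-delimited records and byte-string needles since I haven't seen your code. The naive `contains` is only there to keep it dependency-free; a real version would probably use a SIMD substring search from a crate.

```rust
use std::io::{self, Read};

/// Reads `reader` in large chunks and returns every line that contains
/// all of `needles`. Hypothetical helper, not taken from any library.
fn search_chunked<R: Read>(
    mut reader: R,
    needles: &[&[u8]],
    chunk_size: usize,
) -> io::Result<Vec<Vec<u8>>> {
    let mut buf = vec![0u8; chunk_size.max(1)];
    let mut filled = 0; // number of valid bytes at the start of `buf`
    let mut hits = Vec::new();
    loop {
        // A single line longer than the buffer: grow and keep reading.
        if filled == buf.len() {
            buf.resize(buf.len() * 2, 0);
        }
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            // EOF: whatever is left is a final line without a newline.
            if filled > 0 {
                scan(&buf[..filled], needles, &mut hits);
            }
            return Ok(hits);
        }
        filled += n;
        // Cut the chunk at its last newline so only whole lines are searched.
        if let Some(pos) = buf[..filled].iter().rposition(|&b| b == b'\n') {
            scan(&buf[..pos], needles, &mut hits);
            // Carry the partial trailing line over to the next round.
            buf.copy_within(pos + 1..filled, 0);
            filled -= pos + 1;
        }
    }
}

/// Checks every line of a block that is known to end on a line boundary.
fn scan(block: &[u8], needles: &[&[u8]], hits: &mut Vec<Vec<u8>>) {
    for line in block.split(|&b| b == b'\n') {
        if needles.iter().all(|n| contains(line, n)) {
            hits.push(line.to_vec());
        }
    }
}

/// Naive substring test on bytes (assumption: good enough for a sketch).
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```

You'd call it with something like `search_chunked(File::open(path)?, &needles, 8 << 20)`, where `needles` is a `[&[u8]; 2]` holding `b"password"` and `b"123456"`. The only allocations are the one reusable buffer and the matching lines, so memory stays flat no matter how big the file is.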
