I have a breach parser that I wrote to search through over 3 billion rows of compressed data (by parsing I simply mean searching for a particular substring). I've tried multiple LLMs to make it faster (it currently finishes in under 45 seconds on an M3 Pro Mac), but none have managed to yet.
For simple string search (i.e., not regular expressions) ripgrep is quite fast. I just generated a simple 20 GB file with 10 random words per line (from /usr/share/dict/words). `rg --count-matches funny` takes about 6 seconds on my M2 Pro. Compressing it using `zstd -0` and then searching with `zstdcat lines_with_words.txt.zstd | rg --count-matches funny` takes about 25 seconds. Both timings start with the file not cached in memory.
I have an older breach data set that I loaded into ClickHouse:
SELECT *
FROM passwords
WHERE (password LIKE '%password%') AND (password LIKE '%123456%')
ORDER BY user ASC
INTO OUTFILE '/tmp/res.txt'
Query id: 9cafdd86-2258-47b2-9ba3-2c59069d7b85
12209 rows in set. Elapsed: 2.401 sec. Processed 1.40 billion rows, 25.24 GB (583.02 million rows/s., 10.51 GB/s.)
Peak memory usage: 62.99 MiB.
And this is on a Xeon W-2265 from 2020.
If you don't want to use ClickHouse, you could try DuckDB or DataFusion (which is also written in Rust).
In general, the way I'd make your program faster is to not read the data line by line. Read much bigger chunks, make sure each chunk still ends on a line boundary, then search those larger chunks for your strings. Or look into mmap and search the file without explicitly reading it at all.
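A minimal sketch of the chunked approach above, in Python for illustration (the OP's parser language isn't stated; the needle is assumed to be a plain byte substring in a newline-delimited file). Bytes after the last newline of each chunk are carried into the next read so no line is ever split across a search:

```python
def count_matches(path, needle, chunk_size=8 * 1024 * 1024):
    """Count occurrences of `needle` by scanning large chunks, not lines."""
    count = 0
    tail = b""  # partial last line carried over to the next chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            # Cut at the last newline so a line never straddles two searches.
            nl = buf.rfind(b"\n")
            if nl == -1:
                tail = buf  # no newline yet; keep accumulating
                continue
            searchable, tail = buf[: nl + 1], buf[nl + 1 :]
            count += searchable.count(needle)
    count += tail.count(needle)  # bytes after the final newline
    return count
```

The same structure works with a faster substring engine (e.g. Rust's `memchr::memmem`) substituted for `bytes.count`; the chunking and carry-over logic stays identical.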
What about AlphaEvolve / OpenEvolve https://github.com/codelion/openevolve? It has a more structured way of improving / evolving code, as long as you setup the correct evaluator.
I would start by figuring out where there is room for improvement. Experiments to do:
- how long does it take to just iterate over all bytes in the file?
- how long does it take to decompress the file and iterate over all bytes in the file?
To ensure the compiler doesn't outsmart you, you may have to do something with the data you read. Maybe XOR all 64-bit words in the data and print the result?
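The XOR trick above could look like this (a sketch in Python, where the concern is less the compiler than making the probe actually touch every byte; the function name and chunk size are arbitrary):

```python
import struct

def xor64(path, chunk_size=8 * 1024 * 1024):
    """Read the whole file and XOR every little-endian 64-bit word,
    returning the result so the work can't be skipped."""
    acc = 0
    buf = b""
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            buf += chunk
            usable = len(buf) & ~7  # largest multiple of 8 bytes available
            for (word,) in struct.iter_unpack("<Q", buf[:usable]):
                acc ^= word
            buf = buf[usable:]  # at most 7 leftover bytes
    if buf:
        acc ^= int.from_bytes(buf, "little")  # zero-pad the final partial word
    return acc
```

Timing this gives the raw read-and-touch baseline; piping through the decompressor first and timing the same probe gives the second number to compare against.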
You don’t mention file size but I guess the first takes significantly less time than 45 seconds, and the second about 45 seconds. If so, any gains should be sought in improving the decompression.
Other tests that can help locate the bottleneck are possible. For example, instead of processing a huge N megabyte file once, you may process a 1 MB file N times, removing disk speed from the equation.
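The repeat-a-small-buffer test could be sketched like this (Python for illustration; the buffer contents and repeat count are arbitrary). Since the data stays in memory, the measured rate is pure search throughput with disk speed removed:

```python
import time

def search_throughput(data: bytes, needle: bytes, repeats: int = 100) -> float:
    """Search the same in-memory buffer repeatedly; returns MB/s of pure search."""
    start = time.perf_counter()
    hits = 0
    for _ in range(repeats):
        hits += data.count(needle)
    elapsed = time.perf_counter() - start
    return (len(data) * repeats / 1e6) / elapsed
```

Comparing this number against the end-to-end rate (file size divided by the 45 seconds) shows how much of the budget goes to I/O and decompression rather than the search itself.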
You can't just tell an LLM "make it faster, no mistakes or else". You may need to nudge it toward specific techniques (it's a good idea to first ask what techniques it is aware of), then show it before-and-after comparisons, maybe with assembly. You can even hand the assembly output to another LLM session and ask it to count cycles, then feed the result back. You can also look yourself for what seems excessive, consult CPU datasheets, and nudge the LLM to work on that area.
This workflow isn't much faster than optimising by hand, but if you are bored with typing code it is a bit refreshing: you can focus on the high level and the LLM does the rest.
lawlessone|7 months ago
Just told the LLM to create a GUI in visual basic. I am a hacker now.