top | item 47149752

100M-Row Challenge with PHP

185 points| brentroose | 4 days ago |github.com

101 comments

order

brentroose|4 days ago

A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process with so much fun, and so many people pitched in with their ideas; so I eventually decided I wanted to do something more.

That's why I built a performance challenge for the PHP community

The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prize is the very sought-after PhpStorm Elephpant, of which we only have a handful left).

I hope people will have fun with it :)

Tade0|4 days ago

Pitch this to whoever is in charge of performance at Wordpress.

A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.

gib444|4 days ago

> A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds

That's a huge improvement! How much was low hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries etc)

NorwegianDude|3 days ago

Fun challenge, but running the benchmark on Apple hardware is a weird decision as Apple doesn't even have server hardware. Would make much more sense to run it on a dedicated Linux box as that is more accessible and more realistic.

contingencies|4 days ago

Hehe. Optimization ... it's a good way to learn. Earlier in my career I did a lot of PHP. Usually close to bare.

Other than the obvious point that writing an enormous JSON file is a dubious goal in the first place (really), while PHP can be very fast this is probably faster to implement in shell with sed/grep, or ... almost certainly better ... by loading to sqlite then dumping out from there. Your optimization path then likely becomes index specification and processing, and after the initial load potentially query or instance parallelization.

The page confirms sqlite is available.

If the judges whinge and shell_exec() is unavailable as a path, as a more acceptable path that's whinge-tolerant, use PHP's sqlite feature then dump to JSON.

If I wanted to achieve this for some reason in reality, I'd have the file on a memory-backed blockstore before processing, which would yield further gains.

Frankly, this is not much of a programming problem, it's more a system problem, but it's not being specced as such. This shows, in my view, immaturity of conception of the real problem domain (likely IO bound). Right tool for the job.

ge96|4 days ago

5 days to 30 seconds? What kind of factor/order of magnitude is that damn

What takes 5 days to run

CyberDildonics|4 days ago

Using a language that is 100x slower than naive native programs to do a "speed challenge" is like spending your entire day speed walking to run errands when you can just learn how to drive a car.

lofaszvanitt|4 days ago

Do not update the leaderboard.... at all.

user3939382|4 days ago

exec(‘c program that does the parsing’);

Where do I get my prize? ;)

onion2k|4 days ago

A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds.

When people say leetcode interviews are pointless I might share a link to this post. If that sort of optimization is possible there is a structures and algorithms problem in the background somewhere.

pxtail|4 days ago

Side note - I wasn't aware that there is active collectors scene for Elephpants, awesome!

https://elephpant.me/

t1234s|4 days ago

Elephpants should be for second and third place. First place should be the double-clawed hammer.

thih9|4 days ago

Excellent project. My favorites: the joker, php storm, phplashy, Molly.

semiquaver|4 days ago

Are they just confused about what characters require escaping in JSON strings or is PHP weirder than I remember?

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }

daviddoran|4 days ago

PHP has always escaped forward slashes to help prevent malicious JSON from injecting tags into JavaScript I believe. Because it was common for PHP users to json_encode some data and then to write it out into the HTML in a script tag. A malicious actor could include a closing script tag, and then could inject their own HTML tags and scripts etc.

CapitaineToinon|4 days ago

That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.

poizan42|4 days ago

> The output should be encoded as a pretty JSON string.

So apparently that is what they consider "pretty JSON". I really don't want to see what they would consider "ugly JSON".

(I think the term they may have been looking for is "pretty-printed JSON" which implies something about the formatting rather than being a completely subjective term)

Xeoncross|4 days ago

This is why I jumped from PHP to Go, then why I jumped from Go to Rust.

Go is the most battery-included language I've ever used. Instant compile times means I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python.

Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, it's syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but can't beat it's performance and safety. Removes a whole lot of tests as those situations just aren't possible.

If Rust scares you use Go. If Go scares you use Rust.

thinkingtoilet|4 days ago

It's almost comical how often bring up Rust. "Here's a fun PHP challange!" "Let's talk about Rust..."

tracker1|4 days ago

Can't speak for go... but for the handful of languages I've thrown at Claude Code, I'd say it's doing the best job with Rust. Maybe the Rust examples in the wild are just better compared to say C#, but I've had a much smoother time of it with Rust than anything else. TS has been decent though.

codegeek|4 days ago

I am not that smart to use Rust so take it with a grain of salt. However, its syntax just makes me go crazy. Go/Golang on the other hand is a breath of fresh air. I think unless you really need that additional 20% improvement that Rust provides, Go should be the default for most projects between the 2.

tzs|4 days ago

What's a decent time for this?

I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.

The first part, using this command [1], produces output lines that look like this:

  219,/blog/php-81-before-and-after,2021-06-21
and is sorted by URL path and then date.

With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.

Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array).

So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB and the are under 300 pages and the largest page count is under 1000 so building the output up in memory as the unsorted input is scanned and then sorting it would clearly by way way faster.

[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'

[2]

  #!/bin/zsh
  echo "{"
  PAGE=none
  while read LINE; do
      COLS=("${(@s/,/)LINE}")
      COUNT=${COLS[1]}
      URL=${COLS[2]}
      DATE=${COLS[3]}
      if [ $URL != $PAGE ]; then
          if [ $PAGE != "none" ]; then
              echo
              echo "    },"
          fi
          PAGE=$URL
          echo "    \"\\$URL\": {"
          FINISHDATE=no
      else
          if [ $FINISHDATE = "yes" ]; then
              echo ","
          fi
      fi
      echo -n "        \"$DATE\": $COUNT"
      FINISHDATE=yes
  done
  echo
  echo "}"

chrismarlow9|4 days ago

I don't have time to put together a submission but I'm willing to bet you can use this:

https://github.com/kjdev/php-ext-jq

And replicate this command:

jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv

And it will be faster than anything you can do in native php

Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it

Retr0id|4 days ago

The rules exclude FFI etc.

Twirrim|4 days ago

I took a quick look, the dependency on php 8.5 is mildly irritating, even Ubuntu 26.04 isn't lined up to ship with that version, it's on 8.4.11.

You mention in the README that the goal is to run things in a standard environment, but then you're using a near bleeding edge PHP version that people are unlikely to be using?

I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to go dig into building my own PHP packages, or compiling my own version from scratch to even begin to look at things?

tveita|4 days ago

> Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

Except that the generator script generates dates relative to time() ?

brentroose|4 days ago

True, it's a bug that I'm going to fix, but it only impacts local test data sets and not the real benchmark :)

tzs|4 days ago

You should say in the output formatting rules that the pages should be output in the order that the pages are in the input file. Currently it only specifies the order of the visits within the entry for each page.

spiderfarmer|4 days ago

Awesome. I’ll be following this. I’ll probably learn a ton.

Retr0id|4 days ago

How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)

lofaszvanitt|4 days ago

Submit at the very end, so others wouldn't know you have a better solution.

csjh|4 days ago

Obligatory DuckDB solution:

> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"

Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.

poizan42|4 days ago

> The output should be encoded as a pretty JSON string.

...

> Your parser should store the following output in $outputPath as a JSON file:

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
They don't define what exactly "pretty" means, but superflous escapes are not very pretty in my opinion.

kijin|4 days ago

They probably mean "Should look like the output of json_encode($data, JSON_PRETTY_PRINT)". Which most PHP devs would be familiar with.

tzs|3 days ago

I'm looking at the leaderboard and it raises some interesting questions. Currently the fastest are ~3.4 seconds.

Yesterday the README said that benchmarks were run on a "Premium Intel Digital Ocean Droplet with 2vCPUs and 1.5GB of available memory".

Today it says they are run on a "Mac Mini M1 with 12GB of RAM of available memory", which if the net is to be believed is quite a bit faster than the DO Droplet they said they had been using. I'm going to assume those 3.4 seconds results on the leaderboard were benchmarked on the Mac.

I've got an M2 Max Mac Studio which should be faster than the Mac Mini.

A program to do this challenge must read the entire input file, and it is going to have to at least some computation for every character in the file while parsing.

So I thought to try to get an idea of what an upper limit might be for how fast this could be done. One idea for that was this:

  $ time WC_ALL=C wc -l data.csv
The idea is wc should be written in C or C++, and counting lines just requires checking each character to see if it is newline so it is pretty minimal computation. WC_ALL=C should keep any Unicode stuff from happening which might slow it down.

This is taking 7.1 seconds. (Same without WC_ALL=C BTW).

OK, that was unexpected. I then wrote a line counter in C. Allocate a buffer of size N, loop doing (read N bytes from stdin into buffer, scan those bytes counting '\n's) until no more input. With a 1 MiB buffer it took 1 second. With a 1024 byte buffer it took 4.3 seconds. With a 512 byte buffer it took 7.1 seconds.

So...maybe wc just has a small buffer?

Then I decided to try "wc -c". That's 0.008 seconds. That's faster than "cat > /dev/null" (0.6) seconds, suggesting the "wc -c" is not reading the file. Someone probably decided to special case requests for just the number of characters and just use stat/fstat to get the file size or seeks to the end and gets the offset or something like that.

I then looked at the source for wc [1]. It does indeed special case things like -c. It also special cases -l, because lines, unlike words, can be counted without having to deal with locale stuff.

But my guess it is using a small buffer is wrong. Buffer size is 1 MiB same as mine. So why is my line counter 1 seconds and "wc -l" is 7.1 seconds?

Looking at it I see that wc is also finding the longest line, even if you have only asked for the number of lines. When I add finding the longest line to mine it then takes 5.1 seconds.

There is also more error handling in wc. Mine just loops as long as read() > 0 and then prints the stats and exits, where as wc loops as long as read() != 0, and then in the loop does an "if (len < 0)" to see if there was an error.

There is also a check in the loop in wc to see if a flag that gets set on SIGINFO is set. If it is then wc prints the current stats.

Still, on the 7 GB data.csv file, with a 1 MiB read buffer, the read loop should run under 7000 times so that "if (len < 0)" and "if (siginfo)" are only going to happen under 7000 times, and their enclosed code is only going to run if there is a read error for the first and every time I hit CTRL-T for second. In my tests that's 0 times for both of those.

That's not nearly enough to explain why it is 2.1 seconds slower than my line counter which now has the same buffer size, finds the longest line too, and aside from those two under 7000 times not taken if statements is essentially the same loop.

Maybe latter I'll see what it takes to build wc locally and try to find where the time is going.

[1] https://github.com/apple-oss-distributions/text_cmds/blob/te...

tomaytotomato|4 days ago

Tempted to submit a Java app wrapped in PHP exec() :D

brentroose|4 days ago

The rules state that FFI and the likes isn't allowed because the goal is to do it in PHP :)