
Tracking supermarket prices with Playwright

467 points | sakisv | 1 year ago | sakisv.net

210 comments

[+] brikym|1 year ago|reply
I have been doing something similar for New Zealand since the start of the year with Playwright/TypeScript, dumping parquet files to cloud storage. I've just been collecting the data; I have not yet displayed it. Most of the work is getting around reverse proxy services like Akamai and Cloudflare.

At the time I wrote it I thought nobody else was doing it, but now I know of at least 3 startups doing the same in NZ. It seems inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment brand-loyal vs price-sensitive people: there might be 3 popular brands of chocolate and every week only one of them will be sold at a fair price.

[+] RasmusFromDK|1 year ago|reply
Nice writeup. I've been through similar problems that you have with my contact lens price comparison website https://lenspricer.com/ that I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.

One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
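A minimal sketch of that kind of matching (the product names, units, and threshold here are made up for illustration), using Python's stdlib `difflib` to surface candidates before falling back to manual mapping:

```python
import difflib
import re

def normalize(name: str) -> str:
    """Lowercase and strip sizes/units/punctuation so brand names line up."""
    name = name.lower()
    name = re.sub(r"\b\d+([.,]\d+)?\s*(ml|cl|l|g|kg)\b", "", name)
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())

def similarity(a: str, b: str) -> float:
    """Fuzzy ratio between two normalized product names (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def candidate_matches(name: str, catalog: list, threshold: float = 0.8) -> list:
    """Return likely matches for manual review rather than auto-linking them."""
    return [c for c in catalog if similarity(name, c) >= threshold]
```

Anything above the threshold still goes to a human (or an AI pass with human verification, as described), which keeps false merges out of the comparison data.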

I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.

A fun project, but challenging at times, and annoying problems to fix.

[+] batata004|1 year ago|reply
I created a similar website which got lots of interest in my city. I scrape both app and website data using a single Linode server with 2GB of RAM, 5 IPv4 addresses and 1000 IPv6 addresses (which are free), and every single product is scraped at most every 40 minutes, never more than that, with an average interval of 25 minutes. I use curl-impersonate and scrape JSON as much as possible, because 90% of markets serve prices via Ajax calls; for the other 10% I use regexes to easily parse the HTML. You can check it at https://www.economizafloripa.com.br
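The JSON-first approach usually amounts to "find the Ajax endpoint and parse its payload". A toy sketch of the parsing half (the payload shape below is invented; every store's API differs):

```python
import json

# Hypothetical shape of a store's Ajax price feed; real payloads differ per site.
sample = """{"products": [{"name": "Milk 1L", "price_cents": 189},
                          {"name": "Bread", "price_cents": 250}]}"""

def extract_prices(payload: str) -> dict:
    """Pull name -> price (in currency units) out of a JSON price feed."""
    data = json.loads(payload)
    return {p["name"]: p["price_cents"] / 100 for p in data["products"]}
```

Compared with scraping rendered HTML, a JSON feed changes far less often and needs no browser at all, which is why it scales to frequent re-scrapes on a small server.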
[+] latexr|1 year ago|reply
> I scrape both app and website data

And then try to sell it back to businesses, even suggesting they use the data to train AI. You also make it sound like there’s a team manually doing all the work.

https://www.economizafloripa.com.br/?q=parceria-comercial

That whole page makes my view of the project go from “helpful tool for the people, to wrestle back control from corporations selling basic necessities” to “just another attempt to make money”. Which is your prerogative, I was just expecting something different and more ethically driven when I read the homepage.

[+] siamese_puff|1 year ago|reply
How does the ipv6 rotation work in this flow?
[+] maerten|1 year ago|reply
Nice article!

> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.

I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
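A minimal sketch of that split (the directory layout is made up): persist the raw payload at scrape time, and make parsing a separate pass you can replay after fixing the parser:

```python
import datetime
import pathlib

RAW_DIR = pathlib.Path("raw")  # hypothetical layout: raw/<date>/<page-id>.html

def save_raw(page_id: str, html: str) -> pathlib.Path:
    """Persist the raw payload first; parsing happens in a separate pass."""
    day = datetime.date.today().isoformat()
    path = RAW_DIR / day / f"{page_id}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path

def parse_all(day: str, parse_fn):
    """Replayable: fix parse_fn and re-run it over any day's snapshots."""
    for path in sorted((RAW_DIR / day).glob("*.html")):
        yield path.stem, parse_fn(path.read_text(encoding="utf-8"))
```

When a site silently changes its markup, you only lose parsed output, not data: patch the parser and replay the affected days.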

I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/

Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, the parsers and the APIs are built using Haskell and run on AWS ECS. The website is NextJS.

The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...

It works for the most part, as long as at least one correct barcode number is provided for a product.

[+] pcblues|1 year ago|reply
This is interesting, because I believe the two major supermarkets in Australia can create a duopoly in anti-competitive pricing just by employing price-analysis AI algorithms on each side; the algorithms will likely end up cooperating to maximise profit. This can probably be done legally through publicly obtained prices, and illegally by sharing supply-cost or profit-per-product data. The result is likely to be similar either way: two trained AIs will maximise profit in weird ways using (super)multidimensional regression analysis (which is all AI is), and the consumer will pay for the maximised profits of ostensible competitors. If the pricing data can be obtained like this, not much more is needed to implement a duopoly-focused pair of machine learning implementations.
[+] TrackerFF|1 year ago|reply
Here in Norway, what is called the "competition authority" (https://konkurransetilsynet.no/norwegian-competition-authori...) is frequently critical of open and transparent (food) price information, for exactly that reason.

The rationale is that if all prices are out there in the open, consumers will end up paying a higher price, as the actors (supermarkets) will end up pricing their stuff equally, at a point where everyone makes a maximum profit.

For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.

Here in Norway you will often notice that supermarket A will have sales/rebates on certain items one week, then the next week or the one after supermarket B will have something similar, to attract customers.

[+] pcblues|1 year ago|reply
The word I was looking for was collusion, but done with software and without people-based collusion.
[+] seanwilson|1 year ago|reply
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199. To catch these changes I rely on my transformation step being as strict as possible with its inputs.

You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
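Such checks can be a few lines in the transformation step. A sketch using the thresholds suggested above (which would need tuning per store):

```python
def price_sane(old_price: float, new_price: float, max_jump: float = 1.0) -> bool:
    """Reject a sync if a price moved more than max_jump (100%) in one scrape."""
    if old_price <= 0 or new_price <= 0:
        return False
    return abs(new_price - old_price) / old_price <= max_jump

def catalog_sane(old_count: int, new_count: int, max_drift: float = 0.2) -> bool:
    """Reject a sync if the number of active products drifted more than 20%."""
    if old_count == 0:
        return new_count == 0
    return abs(new_count - old_count) / old_count <= max_drift
```

A failed check would hold the sync back for manual review instead of publishing a €199 bag of chips.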

[+] z3t4|1 year ago|reply
Sanity checks in programming are underrated: not only are they cheap performance-wise, they catch bugs early that would otherwise poison the state.
[+] sakisv|1 year ago|reply
Yeah I thought about that, but I've seen cases where a product jumped more than 100%.

I used this kind of heuristic to check whether a scrape was successful, by checking that the number of products scraped today is within ~10% of the average of the last 7 days or so.
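That heuristic is tiny to implement (a sketch; `history` would hold the daily product counts for the trailing window):

```python
def scrape_looks_complete(today_count: int, history: list, tolerance: float = 0.10) -> bool:
    """Is today's product count within ~10% of the trailing (e.g. 7-day) average?"""
    if not history:
        return True  # nothing to compare against yet
    avg = sum(history) / len(history)
    return abs(today_count - avg) / avg <= tolerance
```

It catches partial scrapes (crawler died halfway, a category page silently broke) without flagging normal day-to-day churn.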

[+] langsoul-com|1 year ago|reply
The hard thing is not scraping, but getting around the increasingly sophisticated blockers.

You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you can't just grab that API response.

Even then, MITM-ing the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.

I tried but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price-comparison services just have (cheap labour) people scrape it manually.

[+] __MatrixMan__|1 year ago|reply
I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.
[+] sakisv|1 year ago|reply
Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.

BTW, how could the network request not appear in the network tab?

For me the hardest part is correlating and comparing products across supermarkets.

[+] seanthemon|1 year ago|reply
Couldn't you use OCR and simply take an image of the product list? Not ideal, but difficult or impossible for the site to track, depending on your method.
[+] xyst|1 year ago|reply
It would be nice to have price transparency for goods. It would make processes like this much easier to track by store and region.

For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).

On that note, it seems you are tracking price, but are you also checking the cost per gram (or ounce)? The manufacturer or store could keep the price the same but offer less to the consumer. I wonder if your tool would catch this.

[+] sakisv|1 year ago|reply
I do track the price per unit (kg, litre, etc.) and I was a bit on the fence about whether I should show and graph that number instead of the price someone would pay at the checkout, but I opted for the latter to keep it more "familiar" to the prices people see.

Having said that, that's definitely something I could add, and it would show when shrinkflation occurred, if any.
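Detecting shrinkflation from tracked data could be as simple as comparing unit prices across snapshots. A sketch (the tuple shape is an assumption: `(price, quantity)` with quantity in a fixed unit such as kg):

```python
def unit_price(price: float, quantity: float) -> float:
    """Price per unit, e.g. EUR per kg."""
    return price / quantity

def is_shrinkflation(old: tuple, new: tuple) -> bool:
    """Shelf price same or lower, but unit price up => smaller package."""
    return new[0] <= old[0] and unit_price(*new) > unit_price(*old)

# e.g. 500 g at 3.00 replaced by 450 g at 3.00: shelf price unchanged, unit price up
```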

[+] barbazoo|1 year ago|reply
Grocers not putting per unit prices on the label is a pet peeve of mine. I can’t imagine any purpose not rooted in customer hostility.
[+] candiddevmike|1 year ago|reply
Imagine mandating transparent cost of goods pricing. I'd love to see farmer was paid X, manufacturer Y, and grocer added Z.
[+] grafraf|1 year ago|reply
We have been doing this for the Swedish market for more than 8 years. We have a website, https://www.matspar.se/, where the customer can browse all the products of all major online stores, compare prices and add the products they want to buy to the cart. At the end of the journey, the customer can compare the total price of that cart (including shipping fees) and export the cart to the store they want to order from.

I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintenance over the years. We are scraping over 30 million prices daily.

[+] showsover|1 year ago|reply
Do you have a technical writeup of your scraping approach? I'd love to read more about the challenges and solutions for them.
[+] odysseus|1 year ago|reply
I used to price track when I moved to a new area, but now I find it way easier to just shop at 2 markets or big box stores that consistently have low prices.

In Europe, that would probably be Aldi/Lidl.

In the U.S., maybe Costco/Trader Joe's.

For online, CamelCamelCamel/Amazon. (for health/beauty/some electronics but not food)

If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.

[+] bufferoverflow|1 year ago|reply
> In the U.S., maybe Costco/Trader Joe's.

Costco/Walmart/Aldi in my experience.

Trader Joe's is higher quality, but generally more expensive.

[+] dexwiz|1 year ago|reply
You can find ALDIs in the USA, but they are regional. Trader Joe’s is owned by the same family as ALDIs, and until recently (past 10 years) you wouldn’t see them in the same areas.
[+] andrewla|1 year ago|reply
One problem that the author notes is that so much rendering is done client side via javascript.

The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.

[+] sakisv|1 year ago|reply
Initially that's what I wanted to do, but the first supermarket I did sends back HTML rendered on the server side, so I abandoned this approach for the sake of "consistency".

Lately I've been thinking of biting the bullet and Just Doing It, but since it's working I'm a bit reluctant to touch it.

[+] ikesau|1 year ago|reply
Ah, I love this. Nice work!

I really wish supermarkets were mandated to post this information whenever the price of a particular SKU changes.

The tools that could be built with such information would do amazing things for consumers.

[+] xnx|1 year ago|reply
Scraping tools have become more powerful than ever, but bot restrictions have become equally strict. It's hard to scrape reliably under any circumstances, or even consistently, without residential proxies.
[+] gadders|1 year ago|reply
This reminds me a bit of a meme that said something along the lines of "I don't want AI to draw my art, I want AI to review my weekly grocery shop, work out which combination of shops saves me money, and then schedule the deliveries for me."
[+] ElCapitanMarkla|1 year ago|reply
Something I was talking over with a friend a while ago was along these lines.

You could set a list of various meals that you like to eat regularly, say 20 meal options, and then the app fetches the pricing for all the ingredients and works out which meals are the best value that week.

You kind of end up with a DIY HelloFresh / meal in a box service.

[+] sakisv|1 year ago|reply
Ha, you can't imagine how many times I've thought of doing just that - it's just that it's somewhat blocked by other things that need to happen before I even attempt to do it
[+] ptrik|1 year ago|reply
> My CI of choice is [Concourse](https://concourse-ci.org/) which describes itself as "a continuous thing-doer". While it has a bit of a learning curve, I appreciate its declarative model for the pipelines and how it versions every single input to ensure reproducible builds as much as it can.

What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?

[+] sakisv|1 year ago|reply
Well, I'm just thinking of Concourse the same way it describes itself: "a continuous thing-doer".

I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to tailscale, run the 3 scraping jobs and then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.

Of course I'm also using it for continuously deploying my app across 2 environments, or its monitoring stack, or running terraform etc.

Basically it runs everything for me so that I don't have to.

[+] jfil|1 year ago|reply
I'm building something similar for 7 grocery vendors in Canada and am looking to talk with others who are doing this - my email is in my profile.

One difference: I'm recording each scraping session as a HAR file (to prove provenance). mitmproxy (mitmdump) is invaluable for that.

[+] nosecreek|1 year ago|reply
Very cool! I did something similar in Canada (https://grocerytracker.ca/)
[+] snac|1 year ago|reply
Love your site! It was a great source of inspiration with the amount of data you collect.

I did the same and made https://grocerygoose.ca/

Published the API endpoints that I “discovered” to make the app https://github.com/snacsnoc/grocery-app (see HACKING.md)

It’s an unfortunate state of affairs when devs like us have to go to such great lengths to track the price of a commodity (food).

[+] kareemm|1 year ago|reply
Was looking for one in Canada. Tried this out and it seems like some of the data is missing from where I live (halifax). Got an email I can hit you up at? Mine's in my HN profile - couldn't find yours on HN or your site.
[+] sakisv|1 year ago|reply
Oh nice!

A thorny problem in my case is that the same item is named in 3 different ways between the 3 supermarkets which makes it very hard and annoying to do a proper comparison.

Did you have a similar problem?

[+] PigiVinci83|1 year ago|reply
Nice article, enjoyed reading it. I'm Pier, co-founder of https://Databoutique.com, a marketplace for web-scraped data. If you're willing to monetize your data extractions, you can list them on our website. We just started with the grocery industry and it would be great to have you on board.
[+] bob_theslob646|1 year ago|reply
This looks like a really cool website, but my only critique is: how are you verifying that the data is actually real and not just randomly generated?
[+] redblacktree|1 year ago|reply
Do you have data on which data is in higher demand? Do you keep a list of frequently-requested datasets?
[+] lotsofpulp|1 year ago|reply
In the US, retail businesses are offering individualized and general coupons via the phone apps. I wonder if this pricing can be tracked, as it results in significant differences.

For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.

I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.

These price obfuscation tactics are seen in many businesses, making price tracking very difficult.

[+] mcoliver|1 year ago|reply
I wrote a chrome extension to help with this. Clips all the coupons so you don't have to do individual searches. Has resulted in some wild surprise savings when shopping. www.throwlasso.com
[+] hnrodey|1 year ago|reply
Nice job getting through all this. I kind of enjoy writing scrapers and browser automation in general. Browser automation is quite powerful and under explored/utilized by the average developer.

Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).

The boost is that you let the website/webapp make the API calls and focus the scraper on the data (rather than waiting for the page to render DOM updates).

This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
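In Playwright's Python API that looks roughly like the sketch below; `page.on("response", ...)` is the real hook, while the `/api/products` endpoint fragment is a made-up example:

```python
def is_product_response(url: str, content_type: str) -> bool:
    """Keep only JSON responses from the (hypothetical) product endpoint."""
    return "/api/products" in url and "application/json" in content_type

def scrape(url: str) -> list:
    # Imported here so the filter above stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Register the listener before navigating so early XHRs are not missed.
        page.on(
            "response",
            lambda r: captured.append(r.json())
            if is_product_response(r.url, r.headers.get("content-type", ""))
            else None,
        )
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

The scraper then works with the structured payloads the page fetched for itself, instead of re-deriving them from the DOM.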

[+] sakisv|1 year ago|reply
...or worse, if there _is_ an API call but the response is HTML instead of a json