Show HN: I scraped 25M Shopify products to build a search engine
317 points | pencildiver | 2 years ago | searchagora.com
My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'The Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible through a search interface. There are a few additional features to enhance the buying experience, including saving products, filters, reviews, and popular products.
I've started exclusively with Shopify stores and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping search speed and performance strong as the data set becomes larger. There are about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise the search speed and user experience.
I'd love any feedback!
senecaso | 2 years ago
A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine for products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing; over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.
I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.
Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.
DeathArrow | 2 years ago
In the EU there are many price comparison engines with millions or billions of products. I don't know how popular they are. Some monetize through ads; some have partnerships with stores, and you can buy directly from the search results.
I generally search first on the local Amazon equivalent; if I don't like what I see, I search on a smaller store. If I still can't find what I want, or dislike the products or prices, I search Google. If I'm still not content with the results, I go search on comparison engines.
And I also have a browser extension called Pricy which polls the comparison engines, so once I land on a product page I know which store has the best price and what the price history was over the last year.
Probably many people have similar patterns. I expect people in the US to search Amazon first, unless it's a very niche product they are after.
I think you could have a better monetization proposition if, instead of just search, you built a sales platform, so people can buy directly after searching without hopping between various websites.
bruce511 | 2 years ago
What plans did you have for generating revenue from the site? (Serious question - given your low costs, it would seem like a tiny amount of revenue would have been enough.)
pencildiver | 2 years ago
bytearray | 2 years ago
grumpyviscacha | 2 years ago
screye | 2 years ago
I have always used standard Python tools like Selenium, bs4 and the like. But I'm guessing none of these work at scale.
Could you talk a bit about your process and the key bottlenecks at that scale? Also, how much did it cost?
______________
A recommendation for how to improve search.
Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.
Then, for search, a simple vector store index would be a great retrieval solution here; it's better to run search against those captions as well.
Both are pretty cheap and can each be done reliably in 20-30 lines of Python. Third-party tools for these are pretty stable.
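For the retrieval half, here's a toy sketch (in JavaScript, the same language as the OP's crawler): once captions are embedded as vectors, rank them by cosine similarity against the query embedding. This is a linear scan over made-up data, a stand-in for a real vector store, not an implementation of one.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k best-matching entries from a list of { id, vec } records.
// A real vector store replaces this O(n) scan with an ANN index.
function topK(query, index, k = 3) {
  return index
    .map(({ id, vec }) => ({ id, score: cosine(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The point of captioning first is that it turns image-only listings into text that can be embedded with any off-the-shelf text embedding model.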
pencildiver | 2 years ago
For scraping: I found that every Shopify store has a public JSON file available at the same route: [Base URL]/products.json. For example, the store for Wildfox has its JSON file available here: https://www.wildfox.com/products.json.
I built a crawler in plain JavaScript to run through a list of stores I bought on a site called BuiltWith, access each store's JSON file with the product listing data, and scrape exactly the data we want for Agora. I'm storing it in Mongo and currently using Mongo Atlas Search (I saw they released Vector Search but haven't looked at it yet). It has been a process of trial and error to pick the data fields the front-end experience requires without drastically increasing the size of the data set. And after initially using React, I switched to Next.js to make it easier to structure the URLs of each product listing page.
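A minimal sketch of that crawl loop. The field names follow the public products.json shape, and the ?limit=250&page=N pagination is the usual storefront behavior, but both should be verified against a live store:

```javascript
// Keep only the fields the front end needs, so documents stay small.
function extractProduct(storeUrl, p) {
  return {
    id: p.id,
    title: p.title,
    url: `${storeUrl}/products/${p.handle}`,
    price: p.variants?.[0]?.price ?? null,
    image: p.images?.[0]?.src ?? null,
    vendor: p.vendor ?? null,
  };
}

// Walk the paginated products.json feed until an empty page comes back.
async function crawlStore(storeUrl) {
  const products = [];
  for (let page = 1; ; page++) {
    const res = await fetch(`${storeUrl}/products.json?limit=250&page=${page}`);
    if (!res.ok) break;
    const batch = (await res.json()).products;
    if (!batch || batch.length === 0) break;
    products.push(...batch.map((p) => extractProduct(storeUrl, p)));
  }
  return products;
}
```

Trimming to a fixed projection like this is also what keeps the Mongo document size (and therefore the Atlas bill) predictable as the store count grows.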
Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.
A few improvements that have helped so far:
- Having 2 separate search indexes, one for the 'brand' and one for the 'product'. There's a second public JSON file with relevant store data available on all Shopify stores at [Base URL]/meta.json. For example: https://wildfox.com/meta.json
- Removing the "tags" that are provided by store owners on Shopify. I believe these are placed for SEO reasons. They ran 1-50 words per product, so removing them reduced the size of the data set we're dealing with. The tradeoff is that they can't be used to improve the search experience now.
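A hedged sketch of the meta.json fetch feeding the 'brand' index. The name/description fields are an assumption about the response shape; inspect a real response before relying on them:

```javascript
// Shape the store metadata into a small document for the 'brand' index.
// Field names here are assumptions about what meta.json exposes.
function toBrandDoc(storeUrl, meta) {
  return {
    url: storeUrl,
    name: meta.name ?? null,
    description: meta.description ?? null,
  };
}

// Fetch a store's public meta.json and shape it, or null on failure.
async function fetchBrand(storeUrl) {
  const res = await fetch(`${storeUrl}/meta.json`);
  return res.ok ? toBrandDoc(storeUrl, await res.json()) : null;
}
```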
Hope this helps. Still wrapping my head around all of this.
Ninjinka | 2 years ago
helsinki | 2 years ago
DeathArrow | 2 years ago
There's nothing to scrape. You just download a JSON file the site owners kindly put at your disposal.
Scraping is a more complex process, where you have to work around rate limiting and captchas. For the tool I built, I wrote tens of thousands of lines of code, and I still find daily issues I have to deal with if I want to scrape a particular web page; issues I don't always have the time to solve.
joshuamcginnis | 2 years ago
There are obviously some rough edges (duplicate products, product links leading to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.
Keep going! At the least, you'll come out of this with an excellent project in your portfolio.
pencildiver | 2 years ago
pitched | 2 years ago
senecaso | 2 years ago
I remember in one case, I found what appeared to be an escort service listing "models" on Shopify. It was super creepy. I needed to get in front of that one pretty quickly as well, as it was turning up in results.
hackideiomat | 2 years ago
well I get mostly black shoes lol
Edit: ah no, they just use half a page for shoe shops first with black shoes as logo??
callmeed | 2 years ago
Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)
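A sketch of building those links. The /cart/<variantId>:<quantity> path matches the example above; the ?discount= query param is the usual way coupon codes are passed, but treat that part as an assumption to verify:

```javascript
// Build a Shopify cart permalink that sends the user straight to checkout.
// Example shape: https://store.com/cart/42165521907955:1?discount=SAVE10
function buyNowUrl(storeUrl, variantId, quantity = 1, discountCode) {
  const base = `${storeUrl}/cart/${variantId}:${quantity}`;
  return discountCode
    ? `${base}?discount=${encodeURIComponent(discountCode)}`
    : base;
}
```

The variant ID would come from the `variants` array already present in each store's products.json feed, so no extra crawling is needed to wire up a BUY NOW button.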
A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad Shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
pencildiver | 2 years ago
And I agree on quality over quantity. I'm writing a script to remove all stores that are shut down, products that are sold out, and a few other characteristics. Heavily focusing on the search algorithm and data quality now.
senecaso | 2 years ago
DeathArrow | 2 years ago
You mean like Amazon?
konschubert | 2 years ago
https://shop.invisible-computers.com
pencildiver | 2 years ago
https://www.searchagora.com/products/invisible-calendar-6266...
I'm thinking we should have a page where store owners can submit their URL to be crawled.
pencildiver | 2 years ago
shubham_sinha | 2 years ago
jillesvangurp | 2 years ago
Two points here.
- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you set it up properly, and there are plenty of equally capable solutions. I have worked with logging clusters that process log entries in those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this, but a couple of simple servers with decent CPUs, memory, and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used; anything below that is easy to deal with.
- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query, and your job is to pick the best 3, 5, or 10 (whatever fits on your screen). This is hard.
So, what makes for a good answer is the key question to answer. All the naive solutions to this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low-quality search engine not quite solving the problem. The bar is high these days for a good search engine, and most of the better e-commerce companies have highly skilled search teams working on this.
quaxar | 2 years ago
Some observations:
- Don't use infinite scrolling; it's an outdated UI practice that leads to a bad user experience. It also makes the footer entirely unreachable.
- Clicking on a product card image does not reliably open the product. I have to click it a few times (Chrome, Brave).
- Clicking on the product card image vs. the title leads to different actions, which is a bit unexpected; there should be some hint of the difference.
- The product page pop-up resets the search list when closed, which messes up my search navigation and breaks the flow of browsing.
twothamendment | 2 years ago
I'm not at a computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to reload the search results.
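A sketch of what that could look like. The TTL values are arbitrary examples, and the Next.js handler mentioned in the comment is hypothetical, not the site's actual code:

```javascript
// Build a short-lived Cache-Control value so the browser (and a CDN) can
// reuse identical search responses, e.g. when the user navigates back.
function searchCacheControl(maxAgeSeconds = 60, staleSeconds = 300) {
  return `public, max-age=${maxAgeSeconds}, stale-while-revalidate=${staleSeconds}`;
}

// In a hypothetical Next.js API route, this would be applied as:
//   res.setHeader('Cache-Control', searchCacheControl());
//   res.status(200).json(results);
```

stale-while-revalidate lets a cache serve the old result instantly while refetching in the background, which suits search pages where slightly stale results are fine.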
pencildiver | 2 years ago
Redster | 2 years ago
pencildiver | 2 years ago
TekMol | 2 years ago
So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?
ltbarcly3 | 2 years ago
unknown | 2 years ago
[deleted]
muratsu | 2 years ago
From a technical perspective, crawling 25M products is impressive, but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure, I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.
pencildiver | 2 years ago
Re: Value Proposition. Absolutely, I think focusing on the SMB-angle and 'local shopping' will help direct users better. I'll definitely take this into account.
unknown | 2 years ago
[deleted]
virtuosarmo | 2 years ago
xnx | 2 years ago
secabeen | 2 years ago
pencildiver | 2 years ago
yoru-sulfur | 2 years ago
senecaso | 2 years ago
Asparagirl | 2 years ago
pencildiver | 2 years ago
xnx | 2 years ago
patatero | 2 years ago
Shopify sites also have shop-name.com/products.json which has URLs that point to cdn.shopify.com
cmcconomy | 2 years ago
https://beangrid.mcconomy.org/
misterbwong | 2 years ago
ashvardanian | 2 years ago
As you scale, you may benefit from these two projects, which I maintain and Big Tech uses :)
https://github.com/unum-cloud/usearch - for faster search
https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings
Feel free to reach out with feedback and feature requests!
thih9 | 2 years ago