top | item 38635695

Show HN: I scraped 25M Shopify products to build a search engine

317 points | pencildiver | 2 years ago | searchagora.com | reply

Hi HN! I built Agora as a side project in the lead-up to the holiday season. I wanted an easier way to find Christmas gifts, without needing to go store by store.

My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'The Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible through a single search interface. There are a few additional features to enhance the buying experience, including saved products, filters, reviews, and popular products.

I've started with Shopify stores exclusively and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping search speed and performance strong as the data set grows. There are about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise search speed and user experience.

I'd love any feedback!

268 comments

[+] senecaso|2 years ago|reply
I hope you have better luck than I did!

A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine for products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing... over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.

I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.

Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.

[+] DeathArrow|2 years ago|reply
>but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users

In the EU there are many price comparison engines with millions or billions of products. I don't know how popular they are. Some monetize through ads, some have partnerships with stores so you can buy directly from the search results.

I generally search first on the local Amazon equivalent; if I don't like what I see, I search on a smaller store. If I still can't find the products, or dislike them or their prices, I search Google. If I am still not contented with the results, I go search on comparison engines.

And I also have a browser extension called Pricy that polls the comparison engines, so once I land on a product page I know which store has the best price and what the price history was over the last year.

Probably many people have similar patterns. I expect people in the US to search Amazon first, unless it's a very niche product they are after.

I think you could have a better monetization proposition if, instead of just search, you build a sales platform, so people can buy directly after searching, without hopping to various websites.

[+] bruce511|2 years ago|reply
I'm curious why you consider lack of users to be the problem. I would have described it as lack of revenue.

What plans did you have for generating revenue from the site? (Serious question - given your low costs, it would seem like a tiny amount of revenue would have been enough.)

[+] pencildiver|2 years ago|reply
Thanks for sharing this! If you're up for it, I'd love to talk more about your experience, especially the technical tooling. I'm working as fast as I can to understand the right way to approach the tech, as there are tradeoffs between performance and price. I'm at support @ searchagora .com
[+] bytearray|2 years ago|reply
What strategies did you consider or implement to attract more users, and what would you do differently now to ensure better user acquisition?
[+] grumpyviscacha|2 years ago|reply
Wow, it's cool to see this idea trending on HN! Full disclosure, I'm one of the co-founders at https://www.marmalade.co. Speaking from personal experience, it’s been a long road getting from the universe of all Shopify products to a curated inventory that’s easy for people to shop on. While ChatGPT isn't going to replace human curation anytime soon, the AI tailwind has made it much easier to build search and recommendation systems. On our end, we've definitely caught the semantic search bug. Watch out for it - you’ll wake up one day with a cross-modal hybrid search index on pinecone and any number of models on huggingface :). However, as you rightly point out, user growth is still the key. We're working toward launching a community aspect of the platform in the coming months as a solution.
[+] screye|2 years ago|reply
What was the process for scraping 25M products ?

I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.

Could you talk about your process and key bottlenecks at that scale a little bit ? Also, how much did it cost ?

______________

A recommendation for how to improve search.

Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.

Then, for search, a simple vector store index would be a great retrieval solution here; it's better to run the search over those captions as well.

Both are pretty cheap and can each be done reliably in 20-30 lines of Python. The 3rd-party tools for these are pretty stable.
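For anyone curious what the vector-index part looks like, here's a minimal, self-contained sketch. The bag-of-words "embedding" below is a toy stand-in for a real caption-embedding model, and the captions and SKUs are made up:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would run the
    # generated captions through a sentence-embedding model instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CaptionIndex:
    """Brute-force vector index over caption embeddings."""

    def __init__(self):
        self.items = []  # list of (product_id, vector) pairs

    def add(self, product_id, caption):
        self.items.append((product_id, embed(caption)))

    def search(self, query, k=3):
        q = embed(query)
        scored = [(cosine(q, vec), pid) for pid, vec in self.items]
        scored.sort(reverse=True)
        return [pid for score, pid in scored[:k] if score > 0]

idx = CaptionIndex()
idx.add("sku-1", "red leather shoes with white laces")
idx.add("sku-2", "blue denim jacket")
idx.add("sku-3", "pair of red canvas sneakers")
print(idx.search("red shoes"))  # sku-1 ranks first on word overlap
```

A hosted vector store (Pinecone, Mongo's Vector Search, etc.) replaces the brute-force loop with approximate nearest-neighbor search, but the add/query shape stays the same.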

[+] pencildiver|2 years ago|reply
Great suggestions, looking into this right now. First time building something like this so definitely new to some of these tools.

For scraping: I found that every Shopify store has a public JSON file available at the same route: [Base URL]/products.json. For example, the store Wild Fox has its JSON file available here: https://www.wildfox.com/products.json.

Built a crawler in simple JavaScript to run through a list that I bought on a site called "Built With", access each store's JSON file with the product listing data, and scrape exactly the data we want for Agora. I'm storing it in Mongo and currently using Mongo Atlas Search (i.e., I saw they released Vector Search but haven't looked at it yet). It has been a process of trial and error to pick the data fields that are required for the front-end experience without drastically increasing the size of the data set. And after initially using React, I switched to NextJS to make it easier to structure the URLs of each product listing page.
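The field-picking step pencildiver describes can be sketched roughly like this (in Python rather than the JavaScript the crawler actually uses; the field names mirror Shopify's public products.json format, but the store and products below are invented):

```python
import json

# Sample payload mirroring the shape of a Shopify /products.json feed
# (field names are Shopify's; this store and product are made up).
sample = json.loads("""
{"products": [
  {"id": 101, "title": "Red Canvas Sneaker", "handle": "red-canvas-sneaker",
   "vendor": "DemoShoeCo", "product_type": "Shoes",
   "tags": ["seo", "filler", "keywords"],
   "variants": [{"id": 9001, "price": "49.00", "available": true}],
   "images": [{"src": "https://cdn.shopify.com/s/files/demo.jpg"}]}
]}
""")

def slim(product, base_url):
    # Keep only the fields the front end needs; drop the bulky SEO tags.
    first_variant = product["variants"][0]
    return {
        "title": product["title"],
        "vendor": product["vendor"],
        "price": first_variant["price"],
        "url": f"{base_url}/products/{product['handle']}",
        "image": product["images"][0]["src"] if product["images"] else None,
    }

docs = [slim(p, "https://example-store.com") for p in sample["products"]]
print(docs[0]["url"])  # https://example-store.com/products/red-canvas-sneaker
```

Note the `/products/<handle>` path: Shopify product pages live at that route, which is what makes reconstructing the product URL from the feed possible.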

Mongo will run me about $1,500/month at the current CPU level. AWS all-in will be about $700. I'm currently not storing the image files, which reduces the cost as well.

A few improvements that have helped so far:

- Having 2 separate search indexes, one for the 'brand' and one for the 'product'. There's a second public JSON file available on all Shopify stores with relevant store data at [Base URL]/meta.json. For example: https://wildfox.com/meta.json

- Removing the "tags" that are provided by store owners on Shopify. I believe these are added for SEO reasons. They were 1-50 words per product, so removing them reduced the data size we're dealing with. The tradeoff is that they can't be used to improve the search experience now.

Hope this helps. Still wrapping my head around all of this.

[+] Ninjinka|2 years ago|reply
As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
[+] helsinki|2 years ago|reply
25 million products is really not much at all to scrape.
[+] DeathArrow|2 years ago|reply
>I have always used standard python tools like selenium, bs4 and the like

There's nothing to scrape. You just download a JSON file the site owners kindly put at your disposal.

Scraping is a more complex process, where you have to work around rate limiting and captchas. For the tool I built, I wrote tens of thousands of lines of code, and I still find daily issues I have to deal with if I want to scrape a particular web page, issues I don't always have the time to solve.

[+] joshuamcginnis|2 years ago|reply
I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.

There are obviously some rough edges (multiple duplicate products, product links leading to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.

Keep going! At the least, you'll come out of this with an excellent project in your portfolio.

[+] pencildiver|2 years ago|reply
Thank you, that means a lot. It has definitely been a whirlwind of emotions since posting on HN but glad I did. It's definitely an MVP so going to work fast to improve it.
[+] pitched|2 years ago|reply
Shopify has tried a few times to build a tool like this but hasn’t ever managed to get any traction. I think the lack of any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.
[+] senecaso|2 years ago|reply
Ya, curation is sadly required in the Shopify ecosystem. There are millions of shops, and there is a tonne of garbage. It's also difficult (but not impossible) to properly classify items so that you can better target results for a given query. One of the first problems anyone attempting this will run into is the amount of mature content available on Shopify shops. Innocent queries turn up many NSFW images that may offend some users, so you have to get on top of that one pretty quickly.

I remember in one case, I found what appeared to be an escort service listing "models" on Shopify. It was super creepy. I needed to get in front of that one pretty quickly as well, as it was turning up in results.

[+] hackideiomat|2 years ago|reply
> a query for red shoes is mostly red shoes

well I get mostly black shoes lol

Edit: ah no, they just use half a page for shoe shops first with black shoes as logo??

[+] callmeed|2 years ago|reply
I built this a couple of years ago (now defunct) for the same reason :) The public JSON endpoints on Shopify stores make it pretty easy to get the data. You mentioned using Mongo, but it sounds expensive. I honestly think you could do this with just Elastic or even Postgres full-text search and save money.

Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)
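A small sketch of building those permalinks (the `/cart/<variant_id>:<quantity>` path and the `?discount=` query string are behaviors observed on Shopify stores rather than a documented API, so treat them as assumptions):

```python
def cart_permalink(base_url, variant_id, quantity=1, discount=None):
    # Shopify cart permalink: <shop>/cart/<variant_id>:<quantity>.
    # The optional ?discount= query string applies a coupon code
    # (observed behavior, not an official API contract).
    url = f"{base_url.rstrip('/')}/cart/{variant_id}:{quantity}"
    if discount:
        url += f"?discount={discount}"
    return url

print(cart_permalink("https://hapaboardshop.com", 42165521907955, 2))
# https://hapaboardshop.com/cart/42165521907955:2
```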

A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad Shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.

[+] pencildiver|2 years ago|reply
Thanks for the heads up! I spent some time trying to get the cart route to work. Doesn't seem to be supported anymore (link you sent leads to a 404 page). Tried it with every combination of Product ID, Variant ID, etc. Let me know if you have any ideas on how to get this to work. It would be a great feature to add to Agora.

And I agree on quality over quantity. I'm writing a script to remove all stores that are shut down, products that are sold out, and a few other characteristics. Heavily focusing on the search algorithm and data quality now.

[+] senecaso|2 years ago|reply
I didn't know about the link to checkout. That's a slightly nicer user experience for sure. Still, it's confusing for users who want to do more shopping at the same time. I had users who clicked on a number of items, clicked "add to cart" in each one (all different shops), and then couldn't figure out how to check out on the main site afterwards! Obviously people were looking for a more complete one-stop shopping experience than I was providing at the time.
[+] DeathArrow|2 years ago|reply
>otherwise you risk your site being littered with kitchy t-shirts or drop-shipping garbage.

You mean like Amazon?

[+] konschubert|2 years ago|reply
Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?

https://shop.invisible-computers.com

[+] pencildiver|2 years ago|reply
Super cool product! I'm currently using a list of Shopify stores, so it's still limited (i.e. wanted to start with a relatively small list to focus on the search experience). I'll submit your URL to the crawler now. If you want to reach out to support @ searchagora.com , I'd love to get your feedback as a Shopify store owner.
[+] shubham_sinha|2 years ago|reply
Hi, you could drop an email to [email protected] and we will be happy to onboard you. Please add your target geography, e.g. whether you would like to target the Indian market or the US market.
[+] jillesvangurp|2 years ago|reply
There are a few conferences dedicated to e-commerce search. MICES is pretty good. I did not go there this year, but I know some of the people behind it. Good community and lots of stuff happening.

Two points here.

- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you configure it properly. And there are plenty of equally capable solutions. I have worked with logging clusters that process that many log entries on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this, but a couple of simple servers with decent CPUs, memory, and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.

- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query, and your job is to pick the best 3, 5, or 10 (whatever fits on the screen). This is hard.

So, what makes for a good answer is the key question. All the naive solutions to this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low-quality search engine not quite solving the problem. The bar is high these days for a good search engine, and most of the better e-commerce companies have highly skilled search teams working on this.

[+] quaxar|2 years ago|reply
Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.

Some observations:

- Don't use infinite scrolling; it's an outdated UI practice that leads to a bad user experience. It also makes the footer entirely unviewable.

- Clicking on a product card image does not reliably open the product. I have to click it a few times (Chrome, Brave).

- Clicking on the product card image vs. the title leads to different actions; this is a bit unexpected and should show some hint of the difference.

- The product page pop-up resets the search list when closed, which messes up my search navigation and breaks the flow of browsing.

[+] twothamendment|2 years ago|reply
Searching is slow (kind of expected right now), but after clicking a product and then hitting back, I have to wait for the search again.

I'm not at a computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load the search results again.
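A sketch of what such a response header could look like (the max-age and stale-while-revalidate values are illustrative, not a recommendation; `stale-while-revalidate` lets the browser reuse a slightly stale result while it refetches in the background):

```python
def search_cache_headers(max_age=60, swr=300):
    # Short-lived public caching so a back-navigation reuses the
    # search response instead of re-querying the backend.
    return {
        "Cache-Control": f"public, max-age={max_age}, stale-while-revalidate={swr}"
    }

print(search_cache_headers()["Cache-Control"])
# public, max-age=60, stale-while-revalidate=300
```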

[+] pencildiver|2 years ago|reply
Just upgraded the storage and put in a few fixes so it's working a bit faster now. Working on caching some responses locally as we speak. Great idea.
[+] Redster|2 years ago|reply
Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!
[+] pencildiver|2 years ago|reply
Amazing, glad you were able to find it. I also just learned about what a "Dala" is :)
[+] TekMol|2 years ago|reply
The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".

So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?

[+] ltbarcly3|2 years ago|reply
Jaggi is a fake-it-until-you-make-it fake portfolio. Most of the companies it runs are just lorem-ipsum fake sites. I think it is likely true that this is a solo dev.
[+] muratsu|2 years ago|reply
Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)

From a technical perspective, crawling 25M products is impressive, but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure, I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be better positioning for Agora than simply being a search engine.

[+] pencildiver|2 years ago|reply
Definitely have not solved the problem yet! The search algorithm prioritizes the brand called "Red Wing Shoe", so I'm still figuring out ways to surface real red shoes. I've been thinking about passing the images through a detection tool and tagging them to enhance the search experience.

Re: Value Proposition. Absolutely, I think focusing on the SMB-angle and 'local shopping' will help direct users better. I'll definitely take this into account.

[+] virtuosarmo|2 years ago|reply
I believe Shopify built their own app / website where you can search for products exclusively from Shopify merchants. https://shop.app/
[+] xnx|2 years ago|reply
Great project. If you continue to crawl the data, be sure to save it so you can detect price changes a la camelcamelcamel.
[+] secabeen|2 years ago|reply
For all of Amazon's faults, the fact that they tolerate CCC does drive a lot of my online purchases there. CCC used to track other sites, and was eventually blocked on all of them. If more sites want my business, showing their pricing history (either from internal data, or by letting someone build the DB) would go a long way.
[+] pencildiver|2 years ago|reply
Great call! I am doing back-ups on Mongo, and this is a good use case for tracking changes. Also trying to figure out how to detect if a product is sold out or no longer being sold.
[+] yoru-sulfur|2 years ago|reply
For those unaware, Shopify already has platform wide search. You can use https://shop.app/ (or the app), and it also has some chatbot thing that can offer suggestions
[+] senecaso|2 years ago|reply
Yes, this has been available for a few years now. Initially, they only indexed a very small number of shops, so it was less useful. Based on a few queries, it seems like they are still using some form of text-based search with rank boosting. It seems like they still aren't searching their entire base of shops, but they have increased the number of shops for sure, and they seem to be continuing to invest in the product, which is nice. It seems more useful now than it did the last time I checked!
[+] Asparagirl|2 years ago|reply
Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?
[+] pencildiver|2 years ago|reply
Bought an initial list of 2M stores for a few hundred dollars from a website called "Built With". I think they're used for building sales outreach lists. Then I narrowed the focus to US-only stores with between $100k and $1m in revenue, to keep the initial data set manageable (and the CPU/storage costs reasonable).
[+] patatero|2 years ago|reply
Shopify shops always have /collections, /products, and /pages in their URLs. If you have a regular Shopify site, you're not allowed to change them. I don't know if Shopify Plus clients can change them.

Shopify sites also have shop-name.com/products.json which has URLs that point to cdn.shopify.com
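Those two tells (the fixed routes and the cdn.shopify.com asset host) suggest a cheap detection heuristic, sketched below. This is purely illustrative; a heavily themed or proxied shop could evade it:

```python
def looks_like_shopify(html, url_path=""):
    # Cheap heuristic for spotting a Shopify storefront: asset URLs on
    # cdn.shopify.com, or the canonical /collections|/products|/pages
    # routes that regular Shopify shops cannot change.
    if "cdn.shopify.com" in html:
        return True
    return any(url_path.startswith(p) for p in ("/collections", "/products", "/pages"))

page = '<img src="https://cdn.shopify.com/s/files/1/001/shoe.jpg">'
print(looks_like_shopify(page))  # True
```

In practice you would confirm a hit by fetching `/products.json` and checking that it parses as a product feed before adding the domain to the crawl list.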

[+] misterbwong|2 years ago|reply
What technology did you use to build the scraper and how did you get around the usual challenges (anti bot, ip banning, etc) with scraping large amounts of data?
[+] thih9|2 years ago|reply
When I search for “op-1”, partial match like “Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op” gets ranked higher than “teenage engineering op-1”. I would expect the opposite.