
Scraping Recipe Websites

463 points| benawad | 5 years ago |benawad.com | reply

196 comments

[+] selecsosi|5 years ago|reply
A highly useful tool in my household for dealing with the SEO/tracking scourge that recipe blogs have become is https://www.paprikaapp.com/.

Hoping someday to have some spare time to integrate this with https://grocy.info/ and have a pipeline for recipe -> preparation automation.

[+] taude|5 years ago|reply
Big fan of this app; I love that I don't have to keep revisiting the sites. This is one of the few apps I've purchased multiple times: for my iOS ecosystem (I typically cook with my iPad), my Android phone (so I can add recipes on the go), and my partner's devices so she can add recipes to our list.

We've developed a little workflow where we put all the recipes we want to try into an "incoming" category, and then move each one to one of our custom categories once we've made it and decided it's worth keeping. This is a reaction to becoming recipe hoarders when using a site like Pinterest for something similar.

The iOS version has a really subtle but nice feature when you're cooking with the app: it prevents the iPad from going to sleep and locking, which matters when you have messy fingers.

[+] vitiell0|5 years ago|reply
You might want to check out an app I built called Cooklist. It has the features of Paprika + Grocy + Instacart + Pinterest all in one. https://cooklist.co
[+] joelrunyon|5 years ago|reply
We built this at https://ultimatemealplans.com with an eye towards consistency & simplicity.

You can

- plan your meals for the week in seconds

- generate your shopping list

- exclude foods you don't like/want

- checkout online with amazon fresh/instacart and get your groceries delivered.

Happy to demo it for any HNers who want to give it a try.

[+] neo1691|5 years ago|reply
I’m a very happy paprika user too. I would like to know more about your Workflows with paprika.
[+] syntaxing|5 years ago|reply
How much time did/do you spend for home grocery management? I'm super interested in doing this for my family but I'm worried I won't have time to maintain it.
[+] dividedbyzero|5 years ago|reply
I love Paprika, but what keeps frustrating me is the inability to share recipes with my partner. I can AirDrop a single recipe to her, but doing this for all of them, one by one, is super tedious, and then she makes some changes which I'd like to have too, and there seems to be no reliable way to get her changes back on my devices. It's all the more frustrating as Paprika does sync really well, but apparently just within a single Apple ID.
[+] mc32|5 years ago|reply
I miss punchfork. Yummly is the yelp where punchfork was the craigslist but with a modernish interface for recipes.
[+] jedieaston|5 years ago|reply
Paprika 3 (I use the iOS version, but I believe the Mac version has the same function) has a fantastic web scraper for recipes. I've had to correct maybe 1-2 errors across 100 recipes I've brought in from a bunch of different sites. It's super helpful to look through them in a standardized way (and you can sort by ingredient/category) to figure out what to make.
[+] zimpenfish|5 years ago|reply
Tried this out and I have to say I'm impressed on the first recipe. Scraped it correctly (albeit from the BBC which has a reasonably sane layout) and, since I've only got 75g of desiccated coconut instead of the 85g required, I wanted to scale it by 75/85 ... which worked. I just typed in 75/85 and it worked. Amazing.
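For anyone curious what that kind of scaling amounts to, here is a minimal sketch. It is not Paprika's code; the function name and the ingredient amounts are made up for illustration.

```python
from fractions import Fraction

def scale_recipe(ingredients, factor):
    """Multiply every quantity by `factor`, given as a string such as '75/85'."""
    ratio = Fraction(factor)  # Fraction parses "75/85" exactly, with no rounding
    return {name: float(qty * ratio) for name, qty in ingredients.items()}

# Hypothetical quantities in grams.
recipe = {"desiccated coconut": 85, "butter": 170}
scaled = scale_recipe(recipe, "75/85")
print(scaled)
```

Using an exact fraction rather than a pre-rounded decimal means the 85 g of coconut comes out as exactly 75 g.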
[+] linsomniac|5 years ago|reply
Thank you all for the Paprika recommendation. I just grabbed it and imported the recipe I did for Mother's Day Eve, and it looks great! That recipe wasn't one of the worst offenders, but it's off to a good start for Paprika.
[+] karatestomp|5 years ago|reply
I think most recipes are published using a microformat that makes this pretty easy, and that's why Paprika (I use it too!) so rarely screws up.
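A rough sketch of why that makes things easy, assuming the markup in question is a schema.org Recipe embedded as JSON-LD (one common form; sites may use others). The class name, function name and sample page are made up for illustration, and this is not how Paprika is known to work.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page

def find_recipes(html):
    """Return every JSON-LD object on the page whose @type includes Recipe."""
    parser = JsonLdExtractor()
    parser.feed(html)
    recipes = []
    for doc in parser.documents:
        # A block may hold one object, a list of objects, or an "@graph" list.
        if isinstance(doc, dict):
            items = doc.get("@graph", [doc])
        elif isinstance(doc, list):
            items = doc
        else:
            items = []
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                recipes.append(item)
    return recipes

# Hypothetical page: the structured data sits apart from the blog prose.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Pasta Carbonara",
 "recipeIngredient": ["1 lb pasta", "2 oz pancetta", "5 egg yolks"]}
</script>
</head><body><p>A long story about a camping trip...</p></body></html>
"""

recipes = find_recipes(SAMPLE_PAGE)
print(recipes[0]["name"], recipes[0]["recipeIngredient"])
```

When the metadata is present, the scraper never has to look at the page layout at all.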
[+] kmbfjr|5 years ago|reply
But how will I read about "Dakota", an avid yoga enthusiast who just happens to be a mom, who enjoys making healthy and savory meals for her family while blogging?

Seriously, I hope this spells an end to the Google ranking imposed nonsense that makes the simple act of searching for a recipe so insufferable.

[+] SamBam|5 years ago|reply
It's definitely grown worse now, but I think that this originated from recipe sites that people actually used to follow, because the blogs were interesting and we got to know the writers, and what's changed is more that we're jumping to the first Google hit and we expect them just to grant us the information we wanted.

There is a difference between opening up a recipe site, like a favorite blog, or the New York Times (which does the same kind of spiel before its recipes), just to read and find out what interesting thing they have posted, vs doing a search for "pasta carbonara," clicking on the first link, and having to read a life-story.

I never mind opening up the recipe section of the New York Times and reading about what's so interesting about this recipe, and memorable times it was served. That's because I trust the article to be vaguely interesting, and reading it is a form of entertainment. There's a reason why no newspaper's recipe section has ever simply been: "Pasta Carbonara: 1 lb pasta. 2 oz Pancetta. 5 egg yolks. Cheese. Combine as directed below."

So I feel like the in-vogue hatred of these recipe site styles reflects how expectations around searching for and consuming recipes have changed, more than any significant change in how recipes have always worked.

[+] wastedhours|5 years ago|reply
It's gone to extremes now, but to be honest - if someone wants to blog and post a recipe at the same time, that's their prerogative?

Some people actually enjoy reading those things too. There's a place for straight recipe sites, and a place for personal word-vomit blogs with a recipe at the bottom.

The web would be a sad place if you were only allowed to write your recipes in LaTeX.

[+] jurip|5 years ago|reply
If a recipe is too hard to find, just move on. If there's a few paragraphs of things you don't care about before the recipe, press page down. Maybe let Dakota write what she wants on her own site.

One reason I haven't seen mentioned here for that personal content: it's also a question of building context and trust. If I go to Bon Appetit for a recipe, I know they've tested it a few times and it should be more or less OK even if I don't recognize the author. If I go to a barebones anonymous just-the-recipes site, I have no faith that it ever worked and if it did, it wasn't just a fluke and it was written down right.

Having some detail around a recipe from a previously unknown source allows me to build a connection to a persona in my head, genuine or otherwise. If a recipe doesn't work I'll know to avoid the site in the future. If it does work I can remember that connection and come back to the site again with some more confidence.

[+] zwieback|5 years ago|reply
Dakota also owns many beautiful bowls and whisks handmade from sustainable materials, featured in 20 photos before the recipe hidden behind another link

I hate blogger recipes, luckily there are enough cooking sites that are well curated.

[+] tclancy|5 years ago|reply
Now we need a bot that parses the comments and applies AI to do . . . something about the people who say they substituted half the ingredients for what they had and left the others out: reply and tell them what recipe they actually made, and that they can stuff their 1 star review.
[+] SkyBelow|5 years ago|reply
Does it pose an actual problem?

When I search for recipes, I type in the food I want + recipe and then open the top 5 or so links. I quickly scan for a list of ingredients. If I don't easily spot one in a few seconds I move on. I'll do this until I have a couple different lists of ingredients for making the item. This ends up taking less than a minute or two. That just isn't a significant portion of time compared to how long I'll spend comparing different recipes to find a common theme to follow.

Maybe it is because I never follow a single recipe but instead combine the common themes from a couple that the whole life story before the recipe shtick isn't something that bothers me.

[+] tmountain|5 years ago|reply
It's a running joke in our house. I start off wanting to make some mashed potatoes, and time and time again, I have to suffer through someone's life story--the camping trip in North Dakota when Susan's husband first discovered his love of homemade sour cream--etc. Makes me wonder if a super barebones recipe site that literally just has recipes and absolutely no fluff would be something people would gravitate towards.
[+] memset|5 years ago|reply
Interesting! I wrote https://plainoldrecipe.com (open source!) to solve this, and inadvertently discovered many of the metadata tags described here.

The irony is that the content is required for SEO purposes, but once you’ve landed on the page you don’t want to see it. I wonder if there would be a way to write SEO that only the google bot sees and hide it from humans...

[+] m_ke|5 years ago|reply
Are there any legal issues with scraping recipe sites in a commercial app like that?

I'm assuming ingredients and directions are "facts" so can't be copyrighted, but what about the pictures?

[+] MatthewWilkes|5 years ago|reply
While a recipe isn't protected by copyright in the US (and many other countries, including the UK), the wording of the recipe could well be an original literary work, the layout of the page could attract a copyright (as it does in cookbooks) and you're right that the images would be protected.

All that said, if the import is being used for personal use only and not being edited, then it's little different to printing it out and putting it in a binder. I don't know much about US fair-use laws, but in the UK it would seem that reproducing a recipe in an app for your own use would qualify as fair dealing thanks to being personal study.

That only applies if the imports are specific to the person importing them, of course. If they're shared or published, then it's a different story. Also, if you're importing more than one recipe, so it's a significant amount of the published work, then that'd be an issue too. You can't import a whole cookbook and claim it's personal study, but one recipe out of dozens is probably fine.

[+] ApolloFortyNine|5 years ago|reply
I assume this would go into DMCA territory, since you're hosting user-submitted content. As long as you don't host the scraped recipes and images publicly, I imagine it would live in a legal grey area if you had a notice that you must be allowed to use the image you upload in your jurisdiction.

It'd be similar to trying to go after google because someone uploaded a copyrighted work to their google drive. I know they have to deal with it if you share the link, but they don't go out of their way to remove content you uploaded to your google drive and never shared.

[+] thinkloop|5 years ago|reply
Scraping is LEGAL; all search engines scrape to some degree, for example. There is a fair use component, so you can't "scrape" 100% of a site and stick it on your domain, but you can still scrape more than zero. In general it is leaning toward more acceptable rather than less.
[+] dilliwal|5 years ago|reply
If the robots.txt file has no restrictions, parsing and scraping is fine. Of course not all scrapers respect robots.txt, but they should. And as a good internet citizen, it's better to always reference the source.
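A minimal sketch of checking robots.txt before fetching, using Python's standard urllib.robotparser. The robots.txt contents, the "recipe-bot" user agent and the example.com URLs are all made up for illustration; a real scraper would download the site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "recipe-bot", "https://example.com/recipes/carbonara"))
print(allowed(ROBOTS_TXT, "recipe-bot", "https://example.com/private/drafts"))
```

The first URL is outside the disallowed path and the second is inside it, so a polite scraper would fetch only the first.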
[+] logfromblammo|5 years ago|reply
The simple truth is that the core recipes are fact-based and non-copyrightable, and the 1000-word blogspam recipe header is both copyrightable and garners better search result rankings.

So the business model is to take facts from the public domain, wrap it in bullshit prose, and then SEO the bullshit to have higher ranking than the naked source facts, for more unique visitors and ad revenue.

Making comments about "providing recipes for free" are exactly as useful as comments about "providing phone numbers for free" or "providing mailing addresses for free" or "providing the original text of 'Little Women' for free" or "providing the steps of the long division algorithm for free".

Obfuscating the public domain is not a valuable service. Automatically removing the obfuscation is valuable. A "Project Gutenberg" style repository of recipes would be recurringly donation-worthy.

[+] stx|5 years ago|reply
This could also be useful for websites that do not print well. I have run into a few occasions where ads and other website elements printed with the actual recipe. The result was a small recipe divided on several pages mostly covered with other content. There were pictures and text formatting that I could not copy out. Often for stuff like that I just pull the HTML and edit until it prints well but I would rather have an easier way.
[+] WrtCdEvrydy|5 years ago|reply
Here's the question... why is it so difficult to do this in Android?

Seriously, AndroidDriver for Selenium was last updated 2013... and importing it throws an HttpClient error now. Update that client and you get a class duplication hell that is impossible to exit.

All I needed was to interact with 2-3 fields on a webpage but it's been eight hours and now I hate my life.

[+] openthc|5 years ago|reply
Check out BrowserStack -- it's dead easy -- and even if you're not using their platform, their docs are good for showing the Selenium/Driver usage.
[+] MatthewWilkes|5 years ago|reply
I believe that the Webdriver/Chromedriver approach is the current recommended way of doing this.
[+] zwieback|5 years ago|reply
Cool, now the next interesting step would be to categorize recipes, maybe some kind of clustering algorithm, to see how similar they are and whether they have a common ancestor.

When I look at a recipe and notice some unusual proportions I usually check against Joy of Cooking or some other standard book. I've noticed that often everything old is new again.
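One simple starting point for "how similar are they" is pairwise overlap between ingredient sets. The sketch below uses Jaccard similarity, which is a stand-in chosen here for illustration rather than a full clustering algorithm, and the ingredient lists are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of ingredients two recipes have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical ingredient lists.
recipes = {
    "carbonara": ["pasta", "pancetta", "egg yolk", "cheese"],
    "cacio e pepe": ["pasta", "cheese", "black pepper"],
    "mashed potatoes": ["potato", "butter", "milk"],
}

# Score every pair and list the most similar first.
pairs = sorted(
    ((jaccard(recipes[x], recipes[y]), x, y) for x, y in combinations(recipes, 2)),
    reverse=True,
)
for score, x, y in pairs:
    print(f"{x} / {y}: {score:.2f}")
```

A matrix of such scores could then be fed to a clustering method to group likely relatives.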

[+] qrv3w|5 years ago|reply
This is great! It's a wonderful write-up.

I've also made something almost identical - Go libraries that scrape recipe ingredients [1] and instructions [2]. Instead of the LCA method here, in my version I try to find the longest sequence of highest scoring HTML tags and those are "ingredients" or "instructions". It works very well (although I think this one works better).

Like the article mentioned, I found that the heuristics for finding HTML elements with ingredients turn out to be surprisingly simple - they usually include just a number, a measurement, and a food! This simple heuristic worked better than other sophisticated things I tried.

[1]: https://github.com/schollz/ingredients

[2]: https://github.com/schollz/instructions
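A much-simplified sketch of the idea in this comment, written in Python rather than Go and not taken from either library: score each line with a "number, measurement, food" pattern, then keep the longest consecutive run of matches. The unit list, the binary score and the sample lines are assumptions for illustration.

```python
import re

# Assumed, deliberately short unit list; a real one would be much longer.
UNITS = r"(?:cups?|tbsp|tsp|oz|lbs?|g|kg|ml|l|cloves?|pinch)"
INGREDIENT = re.compile(rf"^\s*\d+(?:[./]\d+)?\s*{UNITS}\b\s+\w+", re.IGNORECASE)

def looks_like_ingredient(text):
    """True when the text starts with a number, a unit, then a word (the food)."""
    return bool(INGREDIENT.match(text))

def longest_ingredient_run(lines):
    """Return the longest consecutive stretch of lines that look like ingredients."""
    best, current = [], []
    for line in lines:
        if looks_like_ingredient(line):
            current.append(line)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []  # a non-matching line ends the current run
    return best

# Hypothetical text extracted from consecutive HTML elements.
page_lines = [
    "My trip to North Dakota",
    "1 lb pasta",
    "2 oz pancetta",
    "100 g cheese",
    "Boil the pasta for 10 minutes",
    "1 tsp salt",
]
print(longest_ingredient_run(page_lines))
```

Lines without a recognised unit, such as "5 egg yolks", are missed by this pattern, which is the kind of gap a longer unit list or a softer score would need to cover.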

[+] fulldecent2|5 years ago|reply
I saw all the terrible SEOd recipe websites and my first thought was: I should make a better recipe website that is simpler and is better SEOd.

---

FIRST EXAMPLE:

How to cook chicken on a skillet

Step 1 -- get this much chicken [picture]

Step 2 -- cook on skillet for 5 minutes

OPTIONAL -- here are seasonings you may add [pictures]

RELATED:

- How to cook a lot of chicken on a skillet [LINK]

- How to fry chicken breast [LINK]

---

But then I didn't understand how any of these websites are making money so I didn't do it.

[+] RhodesianHunter|5 years ago|reply
The reason all of these websites are so terrible with the long winded intro-stories is precisely because they do better with SEO.
[+] nicbou|5 years ago|reply
I just started transcribing every recipe I make. Even if you can extract all the essential information from a recipe site, some changes are needed:

- I need to convert recipes to metric. I am neither equipped nor inclined to cook in freedom units.

- A "can" or a "packet" is not a standard unit of measurement.

- Package sizes vary between countries. I often adjust recipes to avoid wasting food.

- I cook by mass, not volume. I convert the units then round them.

- Instructions are sometimes too verbose. I make them easier to follow while my hands are busy.

- I will make my own changes and I must write them down somewhere.

Besides, sites go down and links break. Food.com broke many of my bookmarks a few years ago. Other sites went dark. My recipes are plain text. They are editable, searchable, and available offline.
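The convert-then-round step in the list above could be sketched like this. The densities are rough, illustrative values, the rounding to the nearest 5 g is an arbitrary choice, and the function name is made up.

```python
# Assumed conversion factors. Densities (grams per millilitre) are approximate.
GRAMS_PER_OUNCE = 28.35
ML_PER_CUP = 236.6
GRAMS_PER_ML = {"flour": 0.53, "sugar": 0.85, "water": 1.0}

def to_grams(quantity, unit, ingredient):
    """Convert a US quantity to grams, rounded to the nearest 5 g."""
    if unit == "oz":
        grams = quantity * GRAMS_PER_OUNCE
    elif unit == "lb":
        grams = quantity * 16 * GRAMS_PER_OUNCE
    elif unit == "cup":
        # Volume to mass needs a per-ingredient density.
        grams = quantity * ML_PER_CUP * GRAMS_PER_ML[ingredient]
    else:
        # "can" and "packet" land here: there is nothing standard to convert.
        raise ValueError(f"no conversion for unit {unit!r}")
    return 5 * round(grams / 5)

print(to_grams(1, "lb", "pasta"))
print(to_grams(2, "oz", "pancetta"))
print(to_grams(1, "cup", "water"))
```

Volume units are the awkward case, since every ingredient needs its own density before a cup can become grams.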

[+] tincholio|5 years ago|reply
I wish I had the willpower to do this consistently...
[+] mark_l_watson|5 years ago|reply
Hey Ben, thanks for that write up! You may not have time for this, but your article and the intersection of food/recipes and computer science would make a good book, at least I would read it.

I wrote [1] about 12 years ago in Clojure because for health reasons I had to track my intake of vitamin K, then decided to track all nutrients in the USDA nutrition database. I am working on a semantic web product (with another semantic product in planning) but maybe by the end of this year I will get to rewriting my food web app in Common Lisp and as a macOS app. I am adding a link to your article and these comments to my notes for that project. Useful stuff.

[1] http://cookingspace.com

[+] welanes|5 years ago|reply
Neat write-up, and thanks for putting me on to jsonld.js - looks useful.

I'm building https://simplescraper.io and we're trying to create heuristics to update CSS selectors whenever a website changes. People become unhappy when a scrape task that ran smoothly on Monday suddenly returns nothing on Tuesday so while it's a tough nut to crack it's super important.

We use a combination of XPath, historical data and data type (the value may change but the type and length often remain the same or similar) to narrow down the options.

Of course there are more sophisticated methods using machine learning etc., but it's fun to try different approaches to solve this problem.
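A toy version of the "type and length" part of that idea might look like the following. It is not Simple Scraper's implementation, it leaves out the XPath side entirely, and the type categories, selectors and values are all made up for illustration.

```python
import re

def value_type(text):
    """Classify a scraped value into a coarse type."""
    text = text.strip()
    if re.fullmatch(r"[$€£]\s?\d+(?:[.,]\d+)?", text):
        return "price"
    if re.fullmatch(r"\d+(?:[.,]\d+)?", text):
        return "number"
    return "text"

def best_candidate(history, candidates):
    """Pick the (selector, text) candidate most like previously scraped values.

    Candidates must share the type of the most recent historical value;
    among those, the one closest to the average historical length wins.
    """
    expected_type = value_type(history[-1])
    expected_len = sum(len(v) for v in history) / len(history)
    matching = [c for c in candidates if value_type(c[1]) == expected_type]
    if not matching:
        return None
    return min(matching, key=lambda c: abs(len(c[1].strip()) - expected_len))

# Hypothetical data: the old selector broke, so every element is a candidate.
history = ["$12.99", "$13.49"]
candidates = [
    ("div.title", "Cast iron skillet"),
    ("span.cost", "$14.25"),
    ("span.stock", "7"),
]
print(best_candidate(history, candidates))
```

The value changed but its type and length did not, which is enough here to single out the renamed element.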

[+] kevindong|5 years ago|reply
I personally just find a recipe, make it as written on the website, and then (if I actually like it) convert it into something sane to follow and save it in Apple Notes.

What I mean by that is most recipes call for using wwwaaayyy more intermediary bowls/plates than actually required (e.g. if spices, chopped veggies, and minced garlic are going into the pot at the same time, there's no point in using three bowls) or list ingredients out of order of how you'd actually use them.
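The reordering part can be roughed out in a few lines: sort ingredients by where each is first mentioned in the instructions. This is a naive substring match with a made-up function name and sample recipe, so it is only a sketch.

```python
def order_by_first_use(ingredients, instructions):
    """Sort ingredient names by where they first appear in the instructions."""
    text = instructions.lower()

    def first_use(name):
        position = text.find(name.lower())
        # Ingredients never mentioned in the instructions go last.
        return position if position != -1 else len(text)

    return sorted(ingredients, key=first_use)

# Hypothetical recipe with ingredients listed out of order.
ingredients = ["salt", "garlic", "onion", "olive oil"]
instructions = "Heat the olive oil, soften the onion, then add the garlic. Season with salt."
print(order_by_first_use(ingredients, instructions))
```

Ingredients that end up adjacent and are added in the same step are the ones that could share a bowl.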

[+] peterwwillis|5 years ago|reply
So far the best way I've found to search for recipes is to search in a foreign language. Translate what you're looking for, then search and translate back to English. There are still recipe blogs, but 5 instead of 5,000, and usually an authentic dish, not what Michelle The Stir Fry Queen From Michigan thinks constitutes a "Moroccan" dish because it has cinnamon and tomatoes.

Would love to see someone put together a search engine that excludes recipe blogs and penalizes SEO.

[+] jangstrom|5 years ago|reply
This is pretty interesting. I wonder how the recipe parsers from MyFitnessPal or Pinterest compare to this. Sometimes I think they do pretty well, but often they miss the mark. My guess is that Pinterest only treats something as a Recipe if it contains the metadata mentioned in the article, and does the easy parse if so. MFP seems to try something a bit more advanced, but I've never been super-impressed with its parsing abilities.
[+] imgabe|5 years ago|reply
This is great. I made a similar product at No Nonsense Recipes https://nononsense.recipes because I was also tired of dealing with all the dreck on recipe sites. I did scrape some recipes to seed the site with but haven't integrated it as a feature yet.

I did ignore the photos though, since while recipes are not subject to copyright, photos are.