top | item 15539621

Introduction to web scraping with Python

373 points | weenkus | 8 years ago | datawhatnow.com | reply

60 comments

[+] Loic|8 years ago|reply
It is making one mistake: it parses and scrapes in the same loop. You should pull the data, store it, and have another process access the data store and perform the parsing and understanding of the data. A "quick" parse can be done to pull the links and build your frontier, but the data should be pulled and stored for the main parsing.

This allows you to test your parsing routines independently of the target website, to compare later runs with previous versions, and to reparse everything in the future, even after the original website is long gone.

My recommendation is to use the WARC archive format to store the results. This way you are on the safe side (the storage is standardized), it compresses very well, and WARC files are easy to handle (they are immutable stores, nice for backups).

[+] jackschultz|8 years ago|reply
1000% this. I write about Python web scraping a lot, and the big one is that there are two parts. First is gathering the pages you need to scrape and saving them locally; second is scraping the pages you've saved. You need to separate those two to avoid hitting their servers over and over while you're trying to debug the scraping code. My way is to write the first in a file called gather.py and the other in scrape.py. Have that in mind before doing any heavy scraping.

Since my scrapes aren't always the biggest, feel free to just save the HTML in a local folder and then scrape from there. This part of the project depends on how many pages you need to scrape, the size of the files, whether you need to store the data, whether it's a one-time scrape or a cron job, etc. Either way, save the files and scrape from there.
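A minimal stdlib-only sketch of that split (file and function names are just illustrative, and the fetch step is stubbed out with a literal page):

```python
import pathlib
import tempfile
from html.parser import HTMLParser

# "gather.py" side: persist raw pages so the live site is hit only once.
def save_page(folder, name, html_text):
    folder.mkdir(exist_ok=True)
    (folder / name).write_text(html_text, encoding='utf-8')

# "scrape.py" side: parse only the saved copies, never the live site.
class TitleParser(HTMLParser):
    """Extract the <title> text from a saved page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

pages = pathlib.Path(tempfile.mkdtemp())
save_page(pages, 'example.html',
          '<html><head><title>Hello</title></head><body></body></html>')

parser = TitleParser()
parser.feed((pages / 'example.html').read_text(encoding='utf-8'))
print(parser.title)  # → Hello
```

While debugging the parser you only ever re-run the second half, so the target server sees each page exactly once.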

Here are some of the posts I've done on the subject if people reading the comments want to see more about scraping.

- https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...

- https://bigishdata.com/2017/06/06/web-scraping-with-python-p...

[+] ivansavz|8 years ago|reply
For a simple caching solution that works well with requests, you can look at cachecontrol:

    import requests
    from cachecontrol import CacheControl

    sess = requests.session()
    cached_sess = CacheControl(sess)
    response = cached_sess.get('http://google.com')
Very good for interactive debugging when you have to make multiple GET requests. The first time you'll hit the webserver; after that it's all served from the cache.
[+] dmn001|8 years ago|reply
There is no issue with parsing and scraping in the same loop as long as there is caching in there as well. You don't want to be hitting the server repeatedly whilst you're debugging.

A project like Scrapy should have caching on by default, but it seems to be an afterthought. Repeatable and reproducible parsing of cached websites is necessary, e.g. if you find additional data fields that you want to parse without downloading the entire site over again.
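For reference, Scrapy's built-in HTTP cache is off by default but only a few settings away; these go in the project's settings.py:

```python
# settings.py — enable Scrapy's file-based HTTP cache so repeated
# runs replay stored responses instead of hitting the site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'     # stored under the project's .scrapy dir
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
```

With this in place, adding a new data field to a spider and re-crawling reparses the cached copies rather than re-downloading the site.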

[+] Alex3917|8 years ago|reply
Agreed on saving the files first. Here is a code snippet that implements something similar but saves each URL response first, albeit not using WARC:

https://pastebin.com/6F962RVJ

[+] tekkk|8 years ago|reply
Yeah, this might be handy for small stuff, but it's way too naive for anything bigger than a couple of pages. I recently had to scrape some pictures and metadata from a website, and while scripts like these seemed cool, they really didn't scale up at all. Consider navigation, following URLs, and downloading pictures, all while staying within the limits of what's considered non-intrusive.

My first attempt, similar to this, failed miserably as the site employed some kind of cookie check that immediately blocked my requests by returning 403.

As mentioned in the article, I then moved on to Scrapy (https://scrapy.org/). While it seems a bit overkill at first, once you've created your scraper it's easy to expand, and you can reuse the same scaffold on other sites too. It also gives you a lot more control over how gently you scrape, and it nicely outputs the data you want as JSON/JL/CSV.

Most problems I had were with the Scrapy pipelines and getting it to properly output two JSON files plus the images. I could write a very short tutorial on my setup if I weren't at work and otherwise busy right now.

And yes, it's a bit of a grey area, but for my project (training a simple CNN on the images) I think it was acceptable, considering that I could have done the same thing manually (and spent less time, too).

[+] dorfsmay|8 years ago|reply
Python Requests has a notion of a "session" which takes care of cookies etc. I use it all the time when I need to automate tasks that require signing in.
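A quick sketch of the idea (the login URL and form field names are placeholders):

```python
import requests

# A Session keeps one cookie jar and one set of default headers
# across every request made through it.
session = requests.Session()
session.headers.update({'User-Agent': 'my-scraper/0.1'})

# Hypothetical sign-in flow: any Set-Cookie from the response is
# stored on the session and replayed on every later request.
#   session.post('https://example.com/login',
#                data={'username': 'me', 'password': 'secret'})
#   page = session.get('https://example.com/members-only')

# The shared cookie jar can also be inspected or seeded directly:
session.cookies.set('sessionid', 'abc123')
print(session.cookies.get('sessionid'))  # → abc123
```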
[+] qrybam|8 years ago|reply
I've been through the rigmarole of writing my own crawlers and I find Scrapy very powerful. I've run into roadblocks with dynamic/JavaScript-heavy sites; for those parts, selenium + chromedriver works really well.

As parent and others have said: this is a grey area so make sure to read the terms of use and/or gain permission before scraping.

[+] drej|8 years ago|reply
I love requests+lxml, use it fairly regularly, just a few quick notes:

1. lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

2. Don't forget to check the status code of r (r.status_code or less generally r.ok)

3. Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That's obviously a tad slower than find/findall/xpath, but it's oftentimes too convenient to pass upon.

4. Kind of automatic, but I'll say it anyway - scraping is a gray area, always make sure that what you're doing is legitimate.
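To illustrate points 1 and 3 above, a small lxml sketch (the HTML is a made-up example; `.cssselect` additionally requires the cssselect package):

```python
from lxml import html

# Parse an HTML fragment (in practice this would be r.text or a saved file).
doc = html.fromstring(
    '<ul><li class="item">first</li><li class="item">second</li></ul>')

# XPath directly:
items = doc.xpath('//li[@class="item"]/text()')
print(items)  # → ['first', 'second']

# Equivalent CSS selector (needs the cssselect package installed):
# items = [li.text for li in doc.cssselect('li.item')]
```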

[+] masklinn|8 years ago|reply
> 1. lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

Caveat: lxml's HTML parser is garbage, so is BS's, they will parse pages in non-obvious ways which do not reflect what you see in your browser, because your browser follows HTML5 tree building.

html5lib fixes that (and can construct both lxml and bs trees, and both libraries have html5lib integration), however it's slow. I don't know that there is a native compatible parser (there are plenty of native HTML5 parsers e.g. gumbo or html5ever but I don't remember them being able to generate lxml or bs trees).

> 2. Don't forget to check the status code of r (r.status_code or less generally r.ok)

Alternatively (depending on use case) `r.raise_for_status()`. I'm still annoyed that there's no way to ask requests to just check it outright.

> Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That's obviously a tad slower than find/findall/xpath, but it's oftentimes too convenient to pass upon.

FWIW cssselect simply translates CSS selectors to XPath, and while I don't know for sure I'm guessing it has an expression cache, so it should not be noticeably slower than XPath (CSS selectors are not a hugely complex language anyway)

[+] jacobush|8 years ago|reply
I had reason to gather news articles and extract keywords and authors. I can't remember why I didn't use BeautifulSoup, because that was my first choice. In the end I used lxml with its html5parser:

https://github.com/vonj/scraping

Regarding legality: something frowned upon is putting load on servers. You may get blocked, rate limited, or worse if you put too much strain on them. Especially when, as I did, you experiment with different query options to get the data you need and have to rerun the scraping a number of times.

A neat trick I found was to configure a local web proxy (squid, in my case) to aggressively cache EVERYTHING. This way, new runs only went out to the news sites for queries I had never run before. Very helpful; it also sped up development to access files locally (cached in squid) instead of having to go out to the internet all the time.
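The aggressive-caching part of such a squid.conf looks roughly like this (a sketch from memory; the `refresh_pattern` fields are min-age in minutes, percent of object age, and max-age in minutes, and the exact option names should be checked against the squid documentation before copying):

```
# squid.conf: cache everything for up to a week,
# even when the origin server says not to
refresh_pattern . 1440 100% 10080 override-expire override-lastmod ignore-reload ignore-private
```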

[+] sonofgod|8 years ago|reply
Yep. requests.Session fixes the vast majority of cookie/login/session problems. Replaying the same headers is also a powerful technique, as long as they aren't constantly changing.

ASPX is still horrid, but it's just about doable if you pull all the hidden form variables out of the HTML and send them back verbatim in your next POST. But you're probably best off going down a headless-browser route at that point.
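A stdlib sketch of harvesting those hidden fields (the field values here are fake placeholders; real `__VIEWSTATE` blobs are long base64 strings):

```python
from html.parser import HTMLParser

class HiddenFields(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden':
            self.fields[a.get('name')] = a.get('value', '')

page = ('<form><input type="hidden" name="__VIEWSTATE" value="fakestate">'
        '<input type="hidden" name="__EVENTVALIDATION" value="fakecheck">'
        '</form>')
p = HiddenFields()
p.feed(page)
print(p.fields)
# these pairs then go back verbatim in the data= of the next POST
```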

If anyone's wanting someone with scraping experience (UK/remote), I'm currently available... ([email protected])

[+] austincheney|8 years ago|reply
This is perhaps the fastest way to screenscrape a dynamically executed website.

1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: https://github.com/prettydiff/getNodesByType/blob/master/get...

2. Extract the text content from the text nodes and ignore nodes that contain only white space:

    let text = document.getNodesByType(3),
        a = 0,
        b = text.length,
        output = [];
    do {
        if ((/^(\s+)$/).test(text[a].textContent) === false) {
            output.push(text[a].textContent);
        }
        a = a + 1;
    } while (a < b);
    output;

That will gather ALL text from the page. Since you are working from the DOM directly you can filter your results by various contextual and stylistic factors. Since this code is small and executes stupid fast it can be executed by bots easily.

[+] chinathrow|8 years ago|reply
I wonder how many folks using this will obey robots.txt, as explained nicely in the article:

"Robots

Web scraping is powerful, but with great power comes great responsibility. When you are scraping somebody’s website, you should be mindful of not sending too many requests. Most websites have a “robots.txt” which shows the rules that your web scraper should obey (which URLs are allowed to be scraped, which ones are not, the rate of requests you can send, etc.)."
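Python's standard library can check robots.txt rules directly. A small sketch, with the robots.txt content supplied inline for illustration (normally you'd use `rp.set_url(...)` and `rp.read()`):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# Check a URL against the rules before fetching it:
print(rp.can_fetch('my-bot', 'https://example.com/private/page'))  # → False
print(rp.can_fetch('my-bot', 'https://example.com/public'))        # → True
```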

[+] robattila128|8 years ago|reply
Not many. Browser testing libraries are widely repurposed as automation tools by black hats.
[+] laktek|8 years ago|reply
I've found that a lot of web-scraping use cases are kind of ad hoc and usually occur as part of another task (e.g. a research project or enhancing a record). I ended up releasing a simple hosted API service called Page.REST (https://page.rest) for people who would like to save that extra dev effort and infrastructure cost.
[+] gmac|8 years ago|reply
I agree, and I find scripting a web browser via the developer console a really productive approach.

First, it's completely interactive.

Second, it's the browser, so absolutely everything works. It doesn't matter if the data you want is only loaded by an obscure JS function when a hidden form is submitted on a button click. Just find the button, .click() it, and wait for a mutation event.

I have a write up on this[1], but I need to extend it with some more advanced examples.

[1] https://github.com/jawj/web-scraping-for-researchers

[+] martinald|8 years ago|reply
I've found .NET great for scraping, more so than Python, as I find LINQ can be really useful for weird cases.

My usual setup on OSX is .NET Core + HTMLAgilityPack + Selenium.

[+] WhitneyLand|8 years ago|reply
Could CSS selectors, with a few minor extensions, be just as good as XPath for this kind of thing?

I guess a lot of the reason I find xpath frustrating is my usage frequency corresponds exactly to the time needed to forget the syntax and have to relearn/refresh it in my head.

If CSS selectors needed only a few enhancements to compete with XPath, it might be worth enhancing a selector library to give web people a quicker ramp-up.

[+] TACIXAT|8 years ago|reply
In the Chrome console you can right click elements in the sources tab and select Copy > Copy XPath.

For example, your comment:

//*[@id="15541111"]/td/table/tbody/tr/td[3]/div[2]

[+] tycho01|8 years ago|reply
> If CSS selectors needed only a few enhancements to compete with XPath

You may want to try parslepy, which combines CSS and XPath functionality, allowing you to declaratively specify the selector paths in a JSON file. I just made a PR to allow YAML in addition to JSON, but I'm not sure if it has been picked up in the PyPI release yet.

[+] staticautomatic|8 years ago|reply
I can't stress enough what a bad idea it usually is to copy XPath expressions generated by dev tools. They tend to be super inefficient for traversing the tree (e.g. beginning with "*" for no reason), and don't make good use of tag attributes.
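A quick lxml illustration of the difference (made-up HTML):

```python
from lxml import html

doc = html.fromstring(
    '<div id="page"><div><div class="comment">hello</div></div></div>')

# Dev-tools style: positional, breaks as soon as the layout shifts
brittle = doc.xpath('//*[@id="page"]/div/div')

# Attribute-based: targets what the element *is*, not where it sits
robust = doc.xpath('//div[@class="comment"]')

print(brittle[0].text, robust[0].text)  # → hello hello
```

Both find the same node today, but only the second survives a redesign that wraps the comment in one more container.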
[+] victor106|8 years ago|reply
How do you guys manage masking the IP address when you want to scrape using your python script?
[+] dmn001|8 years ago|reply
I find there is really no need to hide or mask your IP address when web scraping. Using proxies or Tor to do so is usually unnecessary and may even be prohibitive (e.g. try using Google over Tor).
[+] vectorEQ|8 years ago|reply
lxml is nice. As suggested, I would parse and scrape in different threads so you can speed things up a bit, but it's not required per se. If you can't get the data you see on the website using lxml, there might be AJAX or other dynamic loading involved; to capture those streams, use a headless browser like PhantomJS. The article looks good to me for 'simple' scraping and is a good base to start playing with the concepts.

The nice thing about making a scraper from scratch like this is that you get to decide its behaviour and fingerprint, and you won't get blocked as some known scraper. That being said, most people would appreciate it if you parse their robots.txt, though depending on your geographical location this might be an 'extra' step which isn't required (I'd advise doing it anyway if you are friendly ;) and maybe putting something like 'i don't bite' in the user agent to let people know you are benign). If you get blocked while trying to scrape, you can try to make the site think you are a browser just by setting the user agent and other headers appropriately. If you don't know which headers those are, run `nc -nlvp 80` on your local machine and point wget or Firefox at it to see them.

Deciding on good XPath expressions or 'markers' to scrape can be automated, but if you need accurate data from a single source, it's often a good idea to manually go through the HTML and look for good markers.

An alternate method of scraping is automating wget --recursive plus links -dump to render HTML pages to text output, then grepping (or whatever) that for the data you need. Tons of methods can be devised; depending on your needs, some will be more practical and stable than others.

Saving files is only useful if you need assurance of data quality and want to be able to tweak the results without having to re-request the data from the server (just point at a local data directory instead). This way you can set up a harvester and parsers for this data.

If you want to scrape or harvest LARGE data sets, consider a proxy network or something like a Tor-connection-juggling Docker instance to ensure rate limiting doesn't kill your harvesters.

Good luck, have fun, and don't kill people's servers with your traffic spam; that's a dick move (throttle/humanise your scraping).
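A minimal per-host throttle along those lines (a sketch, not a library):

```python
import time

class Throttle:
    """Sleep as needed so that requests to the same host stay at
    least `delay` seconds apart."""
    def __init__(self, delay):
        self.delay = delay
        self.last = {}  # host -> monotonic timestamp of last request

    def wait(self, host):
        elapsed = time.monotonic() - self.last.get(host, float('-inf'))
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last[host] = time.monotonic()

# Usage: call wait(host) before every request to that host.
throttle = Throttle(delay=0.5)
throttle.wait('example.com')   # first call returns immediately
```

Tracking timestamps per host means one slow site doesn't stall requests to every other site in the frontier.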