top | item 41711835

Web scraping with your web browser: Why not?

150 points | 8chanAnon | 1 year ago | 8chananon.github.io

Includes working code. First article in a planned series.

73 comments


joshdavham|1 year ago

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

Completely agree with this sentiment.

I just spent the last couple of months developing a Chrome extension, but recently also did an unrelated web scraping project where I looked into all the common tools like Beautiful Soup, Selenium, Playwright, Puppeteer, etc.

All of these tools were needlessly complicated and I was having a ton of trouble with sites that required authentication. I then realized it would be way easier to write some javascript and paste it in my browser to do the scraping. Worked like a charm!
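A minimal sketch of what "paste some JavaScript into the browser" can look like. The HTML and the `extractHrefs` helper are illustrative; a regex stands in for DOM queries only so the snippet is self-contained, and in a real console session you would typically use `[...document.querySelectorAll('a')].map(a => a.href)` instead:

```javascript
// Quick-and-dirty scraping sketch: pull all href values out of a
// page's HTML string. In the browser console you would usually use
// document.querySelectorAll rather than a regex.
function extractHrefs(html) {
  return [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);
}

// Example input (illustrative):
const html = '<a href="/page1">One</a> <a href="https://example.com/2">Two</a>';
console.log(extractHrefs(html)); // → [ '/page1', 'https://example.com/2' ]
```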

moritzwarhier|1 year ago

Is Playwright really that complicated?

I feel that when it has been set up, it's very straightforward to use.

Maybe in contrast to other solutions you posted? Not sure about that though; having only brief experiences with both, Playwright seems like an improved Cypress to me.

metadat|1 year ago

You might like Tampermonkey. You can add a button to kick it off or whatever your heart desires.

Tampermonkey also works around CORS issues with relative ease.

smallerfish|1 year ago

I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. Maybe it would be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or WASM SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled garden sites seem completely unscrapable (even in the browser) - e.g. Linkedin.
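For point (c), the rate limiting and back-off can be sketched roughly like this. `crawlPolitely` and `fetchPage` are hypothetical names, not from the extension described above:

```javascript
// Sketch of polite crawling: a fixed delay between pages plus
// exponential back-off on failure. fetchPage is whatever function
// actually retrieves a page (a stand-in here).
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawlPolitely(urls, fetchPage, { delayMs = 1000, maxRetries = 3 } = {}) {
  const results = [];
  for (const url of urls) {
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await fetchPage(url));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err; // give up on this URL
        await sleep(delayMs * 2 ** attempt);  // back off: 1x, 2x, 4x...
      }
    }
    await sleep(delayMs); // base delay between pages
  }
  return results;
}
```

Even with something like this, as the comment notes, some sites will serve captchas regardless.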

changing1999|1 year ago

In my experience building a browser-based scraper, I preferred scraping pages by a direct in-browser visit rather than a fetch request. A direct visit from a real browser is basically undetectable by anti-bot software (unless you try to do something funny like automated deep crawling and scraping). Applied to your use case, it would have to go through every bookmark + 1 degree to index it. Maybe even in an offscreen canvas (haven't tried that though; could be detectable).

8chanAnon|1 year ago

>Some walled garden sites seem completely unscrapable

Any examples besides Linkedin? Tell me what sites you're trying to target and I'll have a look to see what can be done with them. It takes some pretty evil Javascript obfuscation to block me and only one site has been able to do that. I doubt that the sites you're hitting are anywhere near that evil, lol. I would appreciate it if you have a good example that I could use in a future article.

paulryanrogers|1 year ago

How often did it crawl? Once per day shouldn't trigger any blockers.

gmac|1 year ago

Yes: I find it surprising that this isn't a more widespread approach. It's how I've taught web scraping to my PhD students for some years.

https://github.com/jawj/web-scraping-for-researchers

hombre_fatal|1 year ago

It’s not widespread because it’s much more complicated than making an http request and reading the results from the body. You don’t spin up a browser, much less the full GUI, unless it’s a last resort.

hildenae|1 year ago

I understand that "with/in your web browser" implies an extension or similar, but I have good experience using Selenium and Python to scrape websites. Some sites are trickier than others, and when you are instrumenting a browser it easily triggers bot prevention, but you are also able to easily scrape pages that build the DOM using JS and the like. I have considered, but not looked into, compiling my own Firefox to disable e.g. navigator.webdriver, but it feels like a bit too much work.

This is my project for extracting my (your) webshop order & item data https://gitlab.com/Kagee/webshop-order-scraper

simlan|1 year ago

I also did something similar for my spring project. The idea was to buy a used car and I was frustrated with the BS the listing sites claimed as fair price etc..

I went the browser extension route and used Greasemonkey to inject custom JavaScript. I patched window.fetch, and because it was a React page it did most of the work for me, providing me with a slightly convoluted JSON doc every time I scrolled. Getting the data extracted was only a question of getting a Flask API with correct CORS settings running.

Thanks for posting. Using a local proxy for even more control could be helpful in the future.

throwaway48476|1 year ago

Apparently there is no web-extension API to inspect the body of a fetch response, so you have to override window.fetch.

Seems like an omission in the spec.
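The monkey-patch both comments describe can be sketched like this. `patchFetch` is a hypothetical helper; in a userscript you would pass `window` as the target:

```javascript
// Sketch of overriding fetch to observe response bodies (which
// extension APIs don't hand you, per the comment above). The page
// still receives an untouched response; we read from a clone.
function patchFetch(target, onBody) {
  const realFetch = target.fetch.bind(target);
  target.fetch = async (...args) => {
    const res = await realFetch(...args);
    res.clone().text().then(body => onBody(String(args[0]), body)); // don't block the page
    return res;
  };
}

// In a userscript: patchFetch(window, (url, body) => console.log(url, body));
```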

linsomniac|1 year ago

There is an extension called "Amazon Order History Reporter" that will scrape Amazon to download your order history. I've used it a couple times and it works brilliantly.

seanwilson|1 year ago

> So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.

I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that does scraping within the browser I think, and other automation.

your_friend|1 year ago

Interesting, I’ve tried Checkbot recently and it failed on any Cloudflare-gated website, even a single page. But maybe I’m on an old version.

ggorlen|1 year ago

I wrote a similar post on in-browser scraping: https://serpapi.com/blog/dynamic-scraping-without-libraries/

My approach is a step or two more automated (optionally using a userscript and a backend) and runs in the console on the site under automation rather than cross-origin, as shown in OP.

In addition to being simple for one-off scripts and avoiding the learning curve of Selenium, Playwright, or Puppeteer, scraping in-browser avoids a good deal of potential bot-detection issues, and is useful for constantly polling a site to wait for something to happen (for example, a specific message or article to appear).
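That polling use case can be sketched as a small helper (`waitFor` is a hypothetical name; the interval and timeout defaults are illustrative):

```javascript
// Poll until check() returns something truthy, e.g. a DOM query for
// the message or article you're waiting to appear.
async function waitFor(check, { intervalMs = 5000, timeoutMs = 10 * 60 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result) return result;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('waitFor: timed out');
}

// In the console, for example:
// waitFor(() => document.querySelector('.new-article')).then(el => alert('it appeared!'));
```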

You can still use a backend and write to file, trigger an email or SMS, etc. Just have your userscript make requests to a server you're running.

gabrielsroka|1 year ago

Why do you need a proxy or to worry about CORS? Why not just point your browser to rumble.com and start from there?

I've posted here about scraping for example HN with JavaScript. It's certainly not a new idea.

2020: https://news.ycombinator.com/item?id=22788236

CharlieDigital|1 year ago

> Why do you need a proxy or to worry about CORS?

Not sure about OP, but you might want to point to a proxy depending on the site/content you are scraping and your location. For example, if you are in Canada but you want to scrape prices in USD, you might need a proxy located in the US to get US prices.

> Why not just point your browser to rumble.com and start from there?

Some endpoints use simple web application firewall rules that will block IPs. In this case, a rotating proxy can help evade the blocks (and prevent your legitimate traffic from being blocked). Some domains use more sophisticated WAFs like Imperva and will do browser fingerprinting, so you'll need even more advanced techniques to scrape successfully.

Source: work at a startup that does a lot of scraping and these are issues we've run into. Our entire office network is blocked from some sites due to some early testing without a proxy.

ljw1004|1 year ago

In my web-scraping I've gravitated towards the "cheerio" library for javascript.

I kind of don't want to use DOMParser because it's browser-only... my web scrapers have to evolve every few years as the underlying web pages change, so I really want CI tests, so it's easiest to have something that works in Node.

datadrivenangel|1 year ago

I've been playing around with this idea lately as well! There are a lot of web interfaces that are hostile to scraping, and I see no reason why we shouldn't be able to use the data we have access to for our own purposes. CUSTOMIZE YOUR INTERFACES

flashgordon|1 year ago

Ah, I remember doing this almost 20 years ago, even rotating through 1500 proxies to not get tripped up by DDoS detectors :). A plugin is one of the ways to scrape that also looks like a human (i.e. more JS runs, more divs load, and so on).

turingfeel|1 year ago

If you want to get your personal IP and fingerprint blacklisted across major providers and large ranges, unfortunately this is how you do it. Just keep the rates low.

zarzavat|1 year ago

Obviously anyone scraping on their home IP is being foolish. Getting blacklisted is the least bad thing that can happen.

As for fingerprinting, you can just use a different computer. Most people probably have a bunch of old computers lying around, right? If not, computers are cheap.

acheong08|1 year ago

I actually did that with a Firefox extension + containers to scrape ChatGPT a long while back (before the APIs).

https://github.com/acheong08/ChatGPT-API-agent

Worked pretty well, but browsers took up too much memory per tab, so automating thousands of accounts (what I wanted) was infeasible.

pimlottc|1 year ago

When I have to do some really quick ad-hoc webscraping, I often just select all text on the page, copy it, and then switch to a terminal window where I build a pipeline that extracts the part I need (using pbpaste to access the clipboard). Very quick and dirty for when you just need to hit a few pages.

ricardo81|1 year ago

I've found a local proxy helps when using Puppeteer with a rotating proxy. The way Chrome authenticates to a proxy keeps the connection open, which can sometimes mess up rotating proxy endpoints, and having to close and re-open browsers per page is just too inefficient.

changing1999|1 year ago

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

My guess would be that some companies are doing it (I worked at a major tech company that is/was), just not publicizing this fact as crawling/scraping is such a gray legal area.

chaosharmonic|1 year ago

> You can find plenty of tutorials on the Internet about the art of web scraping... and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with Javascript in a web browser...

Um... [0]

[0] https://bhmt.dev/blog/scraping

8chanAnon|1 year ago

Rather long so I'll read it later. Thanks for the tip. Got more or is that it?

dewey|1 year ago

I've read through that (hard to read, because of the bad formatting) but I still don't understand why you would do that instead of Playwright, Puppeteer etc. - The only reason seems to be "This technique certainly has its limits.".

8chanAnon|1 year ago

>bad formatting

If you can elaborate, I would very much appreciate it. I'm always interested in doing better.

Why use Puppeteer etc. when you don't have to? What is the argument for using these additional tools versus not using them?

bdcravens|1 year ago

Solutions that want to automate in the context of their customers' browser. For example, ListPerfectly, a solution for cross-listing to eBay, Poshmark, etc, does this in their browser extension.

nsonha|1 year ago

Sorry, the format of this site is just too annoying for me to bother reading it. If this is about the shocking revelation that you can paste some code into the browser console, i.e. manually extract information and then manually put it into whatever workflow you need it for, then I don't think that's called web scraping; it's just browsing the web with code.

micahdeath|1 year ago

Excel/Word macros using a WebBrowser object in a form (old IE did this nicely; haven't done that since Edge came out).

deisteve|1 year ago

Is there anything that runs on WASM for scraping? The issue is that you need to enable flags and turn off other security features to scrape in your web browser, and this is why it's not popular, but with WASM that might change.

8chanAnon|1 year ago

WASM runs in a sandbox. It can only talk to the outside world via Javascript so you can forget the idea that it might be a way to crack through browser security.

Maybe somebody will make a web browser with all of the security locks disabled. Sort of like the Russian commander in "The Hunt for Red October" who disabled his missiles' security features in order to more effectively target the American sub, but then got blown up by his own missile.

ttshaw1|1 year ago

How is this different from scraping in, say, Selenium in non-headless mode?

twelve40|1 year ago

I think Selenium's killer use case is (aside from legacy/inertia) cross-browser and cross-language. In exchange, it comes with a ton of its own baggage, since it's an additional layer in between you and your task, with its own Selenium-specific bugs, behavior limitations and edge cases.

If you don't need cross-browser and Chrome is all you need, then something like a simple Chrome extension and/or Chrome DevTools Protocol cuts out a lot of middle-man baggage and at least you will be wrangling the browser behavior directly, without any extra idiosyncrasies of middle layers.

squigz|1 year ago

This horrendous color scheme makes it impossible for me to read this.