For anyone who might not be aware, Chrome also has the ability to save screenshots from the command line using:
chrome --headless --screenshot="path/to/save/screenshot.png" --disable-gpu --window-size=1280,720 "https://www.example.com"
Quick note: when taking full-page screenshots, Chrome captures the current viewport, then scrolls and captures again, stitching the results together. This can cause some interesting artifacts on pages with scroll-triggered behaviors.
Firefox does a proper full-page screenshot and even allows you to set a higher DPR value. I use this a lot when making video content.
Check out some of the args in FF using: `:screenshot --help`
Does anyone know whether this would also be possible with Firefox, including specific extensions (e.g. uBlock) and configured block lists or other settings for those extensions?
If you’re worried about the security risks, edge cases, maintenance pain, and scaling challenges of self-hosting, there are various solid hosted alternatives:
Looking at your urlbox - pretty funny language around the quota system.
>What happens if I go over my quota?
>No need to worry - we won't cut off your service. We automatically upgrade you to the next tier so you benefit from volume discounts. See the pricing page for more details.
So... If I go over the quota you automatically charge me more? Hmm. I would expect to be rejected in this case.
https://www.scraperapi.com/ is good too. I've been using them to scrape, via their API, websites that have a lot of captchas or anti-scraping tech like DataDome.
There's also our product, Airtop (https://www.airtop.ai/), which falls under the scraping-specialist / browser-automation category and can generate screenshots too.
One thing to be cognizant of: if you're planning to run this sort of thing against potentially untrusted URLs, the browser may be able to make requests to internal hosts on whatever network it's on. On Linux, it would be wise to use network namespaces and block all local IP ranges inside the namespace, or to use a network namespace to confine the browser to a WireGuard VPN tunnel into some other network.
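The namespace approach is the robust fix, since it contains the browser itself. As a lighter, complementary guard at the application layer, a URL pre-check can reject obviously internal targets before a fetch is ever queued. A sketch using only the Python standard library (the helper name is made up, and note this alone does not defend against DNS rebinding between check and fetch):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_internal(url: str) -> bool:
    """Return True if the URL's host resolves to a private, loopback,
    link-local, or reserved address and should not be fetched."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable URL: fail closed
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return True  # resolution failure: fail closed
    for addr in addrs:
        ip = ipaddress.ip_address(addr.split("%")[0])  # drop IPv6 zone index
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return True
    return False
```

A check like this catches 127.0.0.1, the RFC 1918 ranges, and cloud metadata endpoints such as 169.254.169.254, but the namespace/firewall layer is still what stops the browser itself from following redirects or rebound DNS to internal hosts.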
A really cool tool I recently discovered. Besides scraping and taking screenshots of websites and saving them in multiple formats (including sqlite3), it can grab and save the headers, console logs, and cookies, and it has a super cool web GUI to access all the data and compare, e.g., the different records.
I'm planning to build my personal archive.org/waybackmachine-like web-log tool via gowitness in the not-so-distant future.
It'd be nice if it produced a list of bounding boxes plus the URLs you'd get if you clicked on each bounding box.
Then it'd be close to my dream of a serverless web browser service, where the client just renders a clickmap .png or .webp, and the requests go to a farm of "one request per page load" ephemeral web browser instances. The web browsers could cache the images + clickmaps they return in an S3 bucket.
Assuming the farm of browsers had a large number of users, this would completely defeat fingerprinting + cookies. It'd also provide an archive (as in durable, not as in high quality) of the browsed static content.
Similar one I wrote a while ago using Puppeteer, for IoT low-power display purposes. A neat trick is that it learns the refresh interval, so that it takes a snapshot just before one is requested :) https://github.com/SmilyOrg/website-image-proxy
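The linked repo has its own implementation, but the interval-learning trick can be sketched as an exponential moving average over the gaps between incoming requests, scheduling the next render slightly ahead of the predicted request (class and parameter names here are made up for illustration):

```python
import time

class RefreshPredictor:
    """Learn a client's polling interval from request timestamps so the
    next snapshot can be rendered just before the request arrives."""

    def __init__(self, alpha=0.3, lead=2.0):
        self.alpha = alpha    # EMA smoothing factor for observed gaps
        self.lead = lead      # seconds of rendering head start
        self.interval = None  # current interval estimate
        self.last = None      # timestamp of the most recent request

    def observe(self, t=None):
        """Record a request arrival and update the interval estimate."""
        t = time.monotonic() if t is None else t
        if self.last is not None:
            gap = t - self.last
            if self.interval is None:
                self.interval = gap
            else:
                self.interval = self.alpha * gap + (1 - self.alpha) * self.interval
        self.last = t

    def next_snapshot_at(self):
        """Time at which to start rendering the next snapshot, if known."""
        if self.interval is None or self.last is None:
            return None
        return self.last + self.interval - self.lead
```

With steady 60-second polling, the estimate converges on 60 and the snapshot is scheduled a couple of seconds before each expected hit.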
Being a bit frustrated with Linkwarden's resource usage, I've thought about making my own self-hosted bookmarking service. This could be a low-effort way of loading screenshots for those links, very cool! It'll be interesting to see how many concurrent requests this can process.
Yes, sort of - that and scaling reasons. It's actually in that same repo now but in a different service. I'd like to remove it from the Abbey repo entirely eventually.
I'm looking for something similar that can also extract the diff of content on the page over time, in addition to screenshots. Any suggestions?
I have a homegrown solution using an LLM and scrapegraphai for https://getchangelog.com but would rather offload that to a service that does a better job rendering websites. There are some websites that give me error pages when using Playwright, but they load fine in my usual Chrome browser.
Good point on offloading it. Given the amount of work required to set up a wrapper for something like Puppeteer or Playwright that also works with a probably quite specific setup, I've found the best way to get a quality image consistently is just to subscribe to one of the many SaaS products out there that already do this well. Some of the comments above suggest some decent screenshot-as-a-service products.
Really depends on how valuable your time is relative to your (or your company's) money. I prefer going for the quality (and more $) solution rather than the one that boasts cheap prices, as I tend to avoid the headaches of unreliable services. Sam Vimes' Boots theory and all that.
For image comparison I've always found that pixelmatch by Mapbox works well for PNGs.
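pixelmatch itself is JavaScript, but the core of any such comparison is counting pixels whose channel difference exceeds a threshold (pixelmatch adds anti-aliasing detection and perceptual color distance on top). A toy version of just that core, operating on in-memory pixel tuples rather than decoded PNGs:

```python
def diff_ratio(img_a, img_b, threshold=0):
    """Fraction of pixels whose largest channel delta exceeds `threshold`.

    Images are equal-sized 2-D lists of (r, g, b) tuples - a stand-in
    for real decoded image buffers.
    """
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have identical dimensions")
    total = 0
    changed = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            delta = max(abs(c1 - c2) for c1, c2 in zip(px_a, px_b))
            if delta > threshold:
                changed += 1
    return changed / total
```

For a 2x2 image with one changed pixel this reports 0.25; in practice you would gate a "page changed" alert on the ratio exceeding some small tolerance.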
The easiest solution to this is probably extracting/formatting the content, then running a diff on that. Otherwise you could use snapshot-testing algorithms as a diffing method. We use Browserbase and Olostep, which both have strong proxies (the first gives you a Playwright instance, the second just screenshots + raw HTML).
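The "extract the content, then diff" idea can be sketched with just the standard library: strip markup (ignoring script/style), then run a unified diff over the visible text. A real pipeline would want a proper content extractor, but the shape is:

```python
import difflib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.parts

def content_diff(old_html, new_html):
    """Unified diff of the visible text of two HTML snapshots."""
    return list(difflib.unified_diff(page_text(old_html),
                                     page_text(new_html), lineterm=""))
```

Diffing extracted text rather than raw HTML keeps markup churn (rotated script hashes, reordered attributes) out of the change report.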
This is cool, but at this point MCP is the clear choice for exposing tools to LLMs; I'm sure someone will write a wrapper around this to provide the same functionality as an MCP SSE server.
I want to try this out though and see how I like it compared to the MCP Puppeteer I'm using now (which does a great job of visiting pages, taking screenshots, interacting with the page, etc).
The website [1] is very strange. What does U.S. stand for? If I were to stumble on this I'd assume it was a phishing/scam website trying to impersonate the government. Bad vibes all around.
It's one guy running his little AI startup fresh out of college. He claims to be a former national security analyst but makes no such claim on his LinkedIn.
I'm working on a project that requires automated website screenshots, and I've hit the cookie-banner problem. I initially tried a brute-force approach, cataloging common button classes and texts to simulate clicks, but the sheer variety of implementations (so many different classes, button texts, etc.) makes it unmanageable. I've resorted to https://screenshotone.com because it takes a perfect screenshot every time; I've never had a single cookie banner visible in the screenshots.
I would really like to know how this is handled. Maybe there is someone here that can share some knowledge.
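For what it's worth, screenshot services commonly handle this by injecting adblock-style cookie-banner filter lists (e.g. EasyList Cookie) so the banner is hidden via CSS rather than clicked. If you do stick with clicking, matching on button text tends to be more robust than cataloging classes; a sketch of such a heuristic (the phrase list is illustrative, not exhaustive):

```python
import re

# Phrases that consent-accept buttons commonly carry; illustrative only.
ACCEPT_PATTERNS = [
    r"\baccept( all)?\b", r"\ballow( all)?\b", r"\bagree\b",
    r"\bgot it\b", r"\bok(ay)?\b", r"\bconsent\b",
]

def looks_like_accept_button(text):
    """Heuristic: does this element's visible text read like an
    accept/consent button rather than a link or explanatory copy?"""
    t = text.strip().lower()
    if len(t) > 40:  # real accept buttons have short labels
        return False
    return any(re.search(pattern, t) for pattern in ACCEPT_PATTERNS)
```

You would run this over the visible text of candidate buttons/links on the page and click the best match, instead of maintaining a catalog of per-site class names.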
martinbaun|1 year ago
Works in chromium as well.
hulitu|1 year ago
Too bad that no browser is able to print a web page.
jot|1 year ago
- https://browserless.io - low level browser control
- https://scrapingbee.com - scraping specialists
- https://urlbox.com - screenshot specialists*
They’re all profitable and have been around for years so you can depend on the businesses and the tech.
* Disclosure: I work on this one and was a customer before I joined the team.
jot|1 year ago
It’s one of the top reasons larger organisations prefer to use hosted services rather than doing it themselves.
[1] https://github.com/sensepost/gowitness
quink|1 year ago
Not two words that should be near each other, and JPEG is the only option.
Almost like it’s designed to nerd-snipe someone into a PR to change the format based on Accept headers.
gkamer8|1 year ago
pls
westurner|1 year ago
From https://news.ycombinator.com/item?id=30681242 :
> Awesome Visual Regression Testing
> lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing
> "visual-regression": https://github.com/topics/visual-regression
rpastuszak|1 year ago
https://untested.sonnet.io/notes/xitterpng-privacy-friendly-...
codenote|1 year ago
Was the motivation for separating it based on security considerations, as stated in the "Security Considerations"? https://github.com/US-Artificial-Intelligence/ScrapeServ?tab...
https://github.com/mapbox/pixelmatch
mpetrovich|1 year ago
It uses puppeteer and chrome headless behind the scenes.
[1] - https://us.ai/