For anyone who might not be aware, Chrome also has the ability to save screenshots from the command line using:
chrome --headless --screenshot="path/to/save/screenshot.png" --disable-gpu --window-size=1280,720 "https://www.example.com"
Quick note: when taking full-page screenshots, Chrome captures the current viewport, then scrolls and captures again, stitching the results together. This can cause some interesting artifacts on pages with scroll-triggered behaviors.
Firefox does a proper full-page screenshot and even allows you to set a higher DPR value. I use this a lot when making video content.
Check out some of the args in FF using: `:screenshot --help`
Does anyone know whether this would also be possible with Firefox, including specific extensions (e.g. uBlock) and configured block lists or other settings for those extensions?
If you’re worried about the security risks, edge cases, maintenance pain, and scaling challenges of self-hosting, there are various solid hosted alternatives:
Looking at your urlbox - pretty funny language around the quota system.
>What happens if I go over my quota?
>No need to worry - we won't cut off your service. We automatically upgrade you to the next tier so you benefit from volume discounts. See the pricing page for more details.
So... If I go over the quota you automatically charge me more? Hmm. I would expect to be rejected in this case.
https://www.scraperapi.com/ is good too. I've been using them to scrape, via their API, websites that have a lot of captchas or anti-scraping tech like DataDome.
There's also our product, Airtop (https://www.airtop.ai/), which falls under the scraping-specialist / browser-automation category and can generate screenshots too.
One thing to be cognizant of: if you're planning to run this sort of thing against potentially untrusted URLs, the browser may be able to make requests to internal hosts on whatever network it's on. On Linux, it would be wise to use network namespaces and block all local IP ranges inside the namespace, or to use a network namespace to confine the browser to a WireGuard VPN tunnel into some other network.
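The namespace approach is the robust fix, since it contains the browser itself. As a lighter, complementary guard at the application layer, a URL pre-check can reject obviously internal targets before a fetch is ever queued. A sketch using only the Python standard library (the helper name is made up, and note this alone does not defend against DNS rebinding between check and fetch):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_internal(url: str) -> bool:
    """Return True if the URL's host resolves to a private, loopback,
    link-local, or reserved address and should not be fetched."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable URL: fail closed
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return True  # resolution failure: fail closed
    for addr in addrs:
        ip = ipaddress.ip_address(addr.split("%")[0])  # drop IPv6 zone index
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return True
    return False
```

A check like this catches 127.0.0.1, the RFC 1918 ranges, and cloud metadata endpoints such as 169.254.169.254, but the namespace/firewall layer is still what stops the browser itself from following redirects or rebound DNS to internal hosts.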
A really cool tool I recently discovered. Besides scraping and taking screenshots of websites and saving them in multiple formats (including sqlite3), it can grab and save the headers, console logs, and cookies, and it has a super cool web GUI to access all the data and compare, e.g., the different records.
I'm planning to build my personal archive.org/waybackmachine-like web-log tool via gowitness in the not-so-distant future.
It'd be nice if it produced a list of bounding boxes plus the URLs you'd get if you clicked on each bounding box.
Then it'd be close to my dream of a serverless web browser service, where the client just renders a clickmap .png or .webp, and the requests go to a farm of "one request per page load" ephemeral web browser instances. The web browsers could cache the images + clickmaps they return in an S3 bucket.
Assuming the farm of browsers had a large number of users, this would completely defeat fingerprinting + cookies. It'd also provide an archive (as in durable, not as in high quality) of the browsed static content.
Similar one I wrote a while ago using Puppeteer, for IoT low-power display purposes. A neat trick is that it learns the refresh interval, so that it takes a snapshot just before one is requested :) https://github.com/SmilyOrg/website-image-proxy
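The linked repo has its own implementation, but the interval-learning trick can be sketched as an exponential moving average over the gaps between incoming requests, scheduling the next render slightly ahead of the predicted request (class and parameter names here are made up for illustration):

```python
import time

class RefreshPredictor:
    """Learn a client's polling interval from request timestamps so the
    next snapshot can be rendered just before the request arrives."""

    def __init__(self, alpha=0.3, lead=2.0):
        self.alpha = alpha    # EMA smoothing factor for observed gaps
        self.lead = lead      # seconds of rendering head start
        self.interval = None  # current interval estimate
        self.last = None      # timestamp of the most recent request

    def observe(self, t=None):
        """Record a request arrival and update the interval estimate."""
        t = time.monotonic() if t is None else t
        if self.last is not None:
            gap = t - self.last
            if self.interval is None:
                self.interval = gap
            else:
                self.interval = self.alpha * gap + (1 - self.alpha) * self.interval
        self.last = t

    def next_snapshot_at(self):
        """Time at which to start rendering the next snapshot, if known."""
        if self.interval is None or self.last is None:
            return None
        return self.last + self.interval - self.lead
```

With steady 60-second polling, the estimate converges on 60 and the snapshot is scheduled a couple of seconds before each expected hit.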
Being a bit frustrated with Linkwarden's resource usage, I've thought about making my own self-hosted bookmarking service. This could be a low-effort way of loading screenshots for those links, very cool! It'll be interesting to see how many concurrent requests this can process.
Yes, sort of - that and scaling reasons. It's actually in that same repo now but in a different service. I'd like to remove it from the Abbey repo entirely eventually.
I'm looking for something similar that can also extract the diff of content on the page over time, in addition to screenshots. Any suggestions?
I have a homegrown solution using an LLM and scrapegraphai for https://getchangelog.com but would rather offload that to a service that does a better job rendering websites. There are some websites that give me error pages when using Playwright, but they load fine in my usual Chrome browser.
Good point on offloading it. Given the amount of work required to set up a wrapper for something like Puppeteer or Playwright that also works with a probably quite specific setup, I've found the best way to get a quality image consistently is just to subscribe to one of the many SaaS products out there that already do this well. Some of the comments above suggest some decent screenshot-as-a-service products.
Really depends on how valuable your time is relative to your (or your company's) money. I prefer going for the quality (and more $) solution rather than the one that boasts cheap prices, as I tend to avoid the headaches of unreliable services. Sam Vimes' Boots theory and all that.
For image comparison I've always found that pixelmatch by Mapbox works well for PNGs.
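pixelmatch itself is JavaScript, but the core of any such comparison is counting pixels whose channel difference exceeds a threshold (pixelmatch adds anti-aliasing detection and perceptual color distance on top). A toy version of just that core, operating on in-memory pixel tuples rather than decoded PNGs:

```python
def diff_ratio(img_a, img_b, threshold=0):
    """Fraction of pixels whose largest channel delta exceeds `threshold`.

    Images are equal-sized 2-D lists of (r, g, b) tuples - a stand-in
    for real decoded image buffers.
    """
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have identical dimensions")
    total = 0
    changed = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            delta = max(abs(c1 - c2) for c1, c2 in zip(px_a, px_b))
            if delta > threshold:
                changed += 1
    return changed / total
```

For a 2x2 image with one changed pixel this reports 0.25; in practice you would gate a "page changed" alert on the ratio exceeding some small tolerance.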
The easiest solution to this is probably extracting/formatting the content, then running a diff on that. Otherwise you could use snapshot-testing algorithms as a diffing method. We use Browserbase and Olostep, which both have strong proxies (the first gives you a Playwright instance, the second just screenshots + raw HTML).
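The "extract the content, then diff" idea can be sketched with just the standard library: strip markup (ignoring script/style), then run a unified diff over the visible text. A real pipeline would want a proper content extractor, but the shape is:

```python
import difflib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.parts

def content_diff(old_html, new_html):
    """Unified diff of the visible text of two HTML snapshots."""
    return list(difflib.unified_diff(page_text(old_html),
                                     page_text(new_html), lineterm=""))
```

Diffing extracted text rather than raw HTML keeps markup churn (rotated script hashes, reordered attributes) out of the change report.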
This is cool, but at this point MCP is the clear choice for exposing tools to LLMs; I'm sure someone will write a wrapper around this to provide the same functionality as an MCP SSE server.
I want to try this out though and see how I like it compared to the MCP Puppeteer I'm using now (which does a great job of visiting pages, taking screenshots, interacting with the page, etc).
The website [1] is very strange. What does U.S. stand for? If I were to stumble on this I'd assume it was a phishing/scam website trying to impersonate the government. Bad vibes all around.
It's one guy running his little AI startup fresh out of college. He claims to be a former national security analyst but makes no such claim on his LinkedIn.
I'm working on a project that requires automated website screenshots, and I've hit the cookie-banner problem. I initially tried a brute-force approach, cataloging common button classes and texts to simulate clicks, but the sheer variety of implementations (so many different classes, button texts, etc.) makes it unmanageable. I've resorted to https://screenshotone.com because it takes a perfect screenshot every time; I've never had a single cookie banner visible in the screenshots.
I would really like to know how this is handled. Maybe there is someone here that can share some knowledge.
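For what it's worth, screenshot services commonly handle this by injecting adblock-style cookie-banner filter lists (e.g. EasyList Cookie) so the banner is hidden via CSS rather than clicked. If you do stick with clicking, matching on button text tends to be more robust than cataloging classes; a sketch of such a heuristic (the phrase list is illustrative, not exhaustive):

```python
import re

# Phrases that consent-accept buttons commonly carry; illustrative only.
ACCEPT_PATTERNS = [
    r"\baccept( all)?\b", r"\ballow( all)?\b", r"\bagree\b",
    r"\bgot it\b", r"\bok(ay)?\b", r"\bconsent\b",
]

def looks_like_accept_button(text):
    """Heuristic: does this element's visible text read like an
    accept/consent button rather than a link or explanatory copy?"""
    t = text.strip().lower()
    if len(t) > 40:  # real accept buttons have short labels
        return False
    return any(re.search(pattern, t) for pattern in ACCEPT_PATTERNS)
```

You would run this over the visible text of candidate buttons/links on the page and click the best match, instead of maintaining a catalog of per-site class names.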
martinbaun|1 year ago
Works in chromium as well.
hulitu|1 year ago
Too bad that no browser is able to print a web page.
jot|1 year ago
- https://browserless.io - low level browser control
- https://scrapingbee.com - scraping specialists
- https://urlbox.com - screenshot specialists*
They’re all profitable and have been around for years so you can depend on the businesses and the tech.
* Disclosure: I work on this one and was a customer before I joined the team.
jot|1 year ago
It’s one of the top reasons larger organisations prefer to use hosted services rather than doing it themselves.
[1] https://github.com/sensepost/gowitness
quink|1 year ago
Not two words that should be near each other, and JPEG is the only option.
Almost like it’s designed to nerd-snipe someone into a PR to change the format based on Accept headers.
gkamer8|1 year ago
pls
westurner|1 year ago
From https://news.ycombinator.com/item?id=30681242 :
> Awesome Visual Regression Testing
> lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing
> "visual-regression": https://github.com/topics/visual-regression
rpastuszak|1 year ago
https://untested.sonnet.io/notes/xitterpng-privacy-friendly-...
codenote|1 year ago
Was the motivation for separating it based on security considerations, as stated in the "Security Considerations"? https://github.com/US-Artificial-Intelligence/ScrapeServ?tab...
https://github.com/mapbox/pixelmatch
mpetrovich|1 year ago
It uses puppeteer and chrome headless behind the scenes.
[1] - https://us.ai/