item 46681774

Defeating AI scraping by rethinking webpage rendering

2 points | exodys | 1 month ago

Consider someone creating a project that rendered webpages as images and sent those over the web, updating them whenever an input is received, much like a video game's input loop. If everything were server-side rendered, how difficult would scraping be? The idea of an un-copyable webpage is enticing, assuming you would not like your data scraped. I know computer vision is a thing, but might the error rate be high enough to deter it?
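The loop the poster describes can be sketched in a few lines. This is a toy illustration, not a real implementation: the state shape, event format, and the use of a raw PPM frame in place of a proper PNG encoder are all assumptions made for the sketch.

```python
# Minimal sketch of the proposed architecture: the server holds all page
# state, and every client input triggers a full re-render to a flat image
# that is sent back over the wire. A binary PPM stands in for a real
# image encoder; all names here are illustrative.

WIDTH, HEIGHT = 640, 480

def render(state: dict) -> bytes:
    """Rasterize the page state into one flat image (binary PPM, P6)."""
    shade = state["scroll"] % 256  # toy "rendering": scroll offset as a gray level
    header = f"P6 {WIDTH} {HEIGHT} 255\n".encode()
    return header + bytes([shade, shade, shade]) * (WIDTH * HEIGHT)

def handle_input(state: dict, event: dict) -> dict:
    """Apply a client event (e.g. a scroll) and return the new page state."""
    if event["type"] == "scroll":
        return {**state, "scroll": state["scroll"] + event["delta"]}
    return state

# One turn of the loop: an input arrives, state updates, a fresh frame goes out.
state = {"scroll": 0}
state = handle_input(state, {"type": "scroll", "delta": 40})
frame = render(state)  # this entire image is what the client receives
```

Note that the client never sees text or markup, only pixels, which is the whole anti-scraping premise; it is also exactly why every interaction costs a full frame of bandwidth, which the comments below pick apart.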

6 comments


bryanrasmussen|1 month ago

Now, the second question: what do you think the performance would be of a page rendered to one big image and downloaded for display?

Web pages can render in pieces; images not so much, at least not the way web pages can. And what is the resolution of a web page? It really depends on the browser and the OS: some pages render at very high definition because that is what the OS allows (Macs, for example), and many browsers nowadays support color spaces beyond plain RGB. So if your site uses a wider color space, are you going to render to an RGB image, meaning your customers get less vibrant designs with your solution than with the browser? Or are you going to render to the widest color space possible, meaning the images will be even bigger and even harder to download?

Are you going to render at multiple resolutions so you can serve the correct one to each user agent and save on bandwidth, at the cost of doing more renders on the server and having your customer pay for them?
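Some back-of-envelope arithmetic makes the bandwidth objection concrete. Every number below is an assumption chosen for illustration (page size, compression ratio, HTML weight), not a measurement.

```python
# Rough cost of shipping a full-page screenshot vs. the HTML it replaces.
# All figures are illustrative assumptions, not benchmarks.

width, height = 1920, 4000     # a long article page at desktop width
bytes_per_pixel = 3            # 8-bit RGB; wide-gamut or HDR would need more
raw = width * height * bytes_per_pixel

png_estimate = raw // 10       # assume ~10:1 PNG compression for UI-like content
html_estimate = 100 * 1024     # assume ~100 KB of HTML+CSS for the same page

print(f"raw frame:   {raw / 1e6:.1f} MB")     # ~23 MB uncompressed
print(f"PNG (est.):  {png_estimate / 1e6:.1f} MB")
print(f"HTML (est.): {html_estimate / 1024:.0f} KB")
```

Even with generous compression the image is an order of magnitude or two heavier than the markup, and that cost repeats for every resolution you render and every interaction that forces a new frame.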

What is the caching behavior here?

I believe the performance of this solution would by necessity be sub-optimal. Nobody likes sub-optimal performance on the web, because almost all of the web is entertainment development, and people won't accept poor performance in their entertainment.

https://medium.com/luminasticity/on-premature-optimization-i...

bryanrasmussen|1 month ago

I believe you would run into accessibility laws that would make your project extremely illegal.

on edit: actually, probably not illegal; that would be the wrong word. "Extremely open to financially ruinous lawsuits" would be the correct phrasing.

exodys|1 month ago

I guess it would have to provide audio in order to be accessible, and if a user is both blind and deaf... well, do most people prepare applications for users who are blind and deaf?

bryanrasmussen|1 month ago

sorry, I don't normally go around poking holes in people's hopeful business plans, but I have thought about the problem of keeping sites from being scraped before, and there are really two requirements that everybody has to meet which work against a decent anti-scraping tool:

automated testing, which makes it difficult to keep things from being scraped, because scraping is an automation process and automated testing is obviously an automation process;

and accessibility, which, even if you didn't care about the moral requirement of being accessible, the legal requirements force on you.

And I have also given some thought to automated generation of images for a sort of graphing application, so I am familiar with the performance issues as well; your question of course hit all the points I knew something about.

bryanrasmussen|1 month ago

The real battle in anti-scraping is human-heuristic identification: to get around it, a scraper has to make its automated process behave more and more like a human, which makes the process less and less financially rewarding.
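The economics of that last point can be shown with a toy throughput model. The request rates below are assumptions picked for illustration, not data about any real scraper or detector.

```python
# Toy model: forcing a scraper to pace itself like a human slashes its
# throughput, which is the financial pressure the comment describes.
# Both rates are illustrative assumptions.

def pages_per_hour(mean_delay_s: float) -> float:
    """Expected pages fetched per hour given a mean per-page delay."""
    return 3600 / mean_delay_s

bot_rate = pages_per_hour(0.2)     # an unthrottled bot: ~5 requests/sec
human_rate = pages_per_hour(12.0)  # human-like browsing: one page every ~12 s

print(f"bot:             {bot_rate:.0f} pages/hour")
print(f"human-like:      {human_rate:.0f} pages/hour")
print(f"slowdown factor: {bot_rate / human_rate:.0f}x")
```

Under these assumptions the scraper that must look human fetches 60x fewer pages for the same machine time, so the cost per scraped page rises accordingly.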