shanehoban | 1 year ago
However, if you have a headless browser set up for scraping, you can simply fetch the current URL while on the page[0] to get the plain text, then run a regex search over it to pull out the email address. I admit this is a strange approach to take.
[0]: fetch('./').then((res) => res.text()).then((text) => console.log(text))
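To make the footnote concrete, here is a minimal sketch of the fetch-then-regex idea. The names `EMAIL_PATTERN`, `extractEmails`, and `emailsFromCurrentPage` are made up for illustration, and the pattern is a deliberately loose approximation of an email address, not a complete validator.

```javascript
// Hypothetical helper: pull anything that looks like an email address out of plain text.
// The pattern is intentionally simple; it will miss or over-match some valid addresses.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function extractEmails(text) {
  const matches = text.match(EMAIL_PATTERN) || [];
  // De-duplicate while keeping first-seen order.
  return [...new Set(matches)];
}

// Meant to run inside the page context of a headless browser:
// re-fetch the current URL as raw text and search it, as in footnote [0].
async function emailsFromCurrentPage() {
  const res = await fetch('./');
  return extractEmails(await res.text());
}
```

In a page context you would call something like `emailsFromCurrentPage().then(console.log)`. This only finds addresses that appear literally in the response body.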
nolok|1 year ago
Most basic scrapers, the ones that are not built for your testing, devtools, automation, and so on, actually work on the raw text without any interpretation. They grep the source code; they don't run a DOM and JavaScript engine, because that makes a major difference in computing needs and speed.
I am not saying there are no evil scrapers doing DOM evaluation; there are tons. I am reacting to your "FIRST line of defense": the first line is scrambling the raw text, which is why we got there.
What the parent is saying is that this technique tries to upgrade the defense we built against the threat as it evolved, but it forgot why we got there and thus leaves itself vulnerable to the original threat.
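A rough sketch of the distinction, assuming the "scrambling" is done with numeric HTML character references. That is only one common choice, picked here for illustration, and not necessarily what the technique under discussion does. `EMAIL_RE`, `grepEmails`, and `decodeNumericEntities` are hypothetical names.

```javascript
// Simple, imperfect email pattern shared by both hypothetical scrapers below.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// A "basic" scraper: greps the raw source, no DOM, no JavaScript.
function grepEmails(source) {
  return source.match(EMAIL_RE) || [];
}

// Stand-in for the interpretation a browser performs when rendering:
// decode numeric character references such as &#64; (decimal) or &#x40; (hex).
function decodeNumericEntities(source) {
  return source
    .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

// Assumed example of scrambled markup: "@" and "." written as entities.
const scrambled = 'Contact: jane&#64;example&#46;com';

console.log(grepEmails(scrambled));                        // raw grep finds nothing
console.log(grepEmails(decodeNumericEntities(scrambled))); // decoded text exposes the address
```

The raw grep comes back empty because the source never contains a literal "@"; only a scraper that pays for some interpretation sees the address. Any defense that puts the address back into the source as plain text gives up that first line.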
animuchan|1 year ago
This technique protects against a "neither here nor there" subset of programs. I wonder how large that set is in practice.
cqqxo4zV46cp|1 year ago
nkozyra|1 year ago
This is trivial to overcome for most basic scrapers, and not much harder for more sophisticated ones even if you try to obfuscate with paths.