top | item 45774927

(no title)

When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
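To illustrate the point about commented-out URLs, here is a minimal sketch in Python (not the Perl mentioned above). The sample page, the URL pattern, and the names `regex_urls` and `parser_urls` are all assumptions made up for this example. It shows a plain regex scan picking up a link inside an HTML comment, while a tag-based parser skips it.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the URLs are placeholders, not from the thread.
SAMPLE_HTML = """
<html><body>
<a href="https://example.com/live">live link</a>
<!-- <a href="https://example.com/commented">old link</a> -->
</body></html>
"""

# Deliberately simple pattern (an assumption, not a full URL grammar):
# a scheme, then everything up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def regex_urls(html):
    """Return every URL-looking string in the raw text, comments included."""
    return URL_RE.findall(html)


class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parser_urls(html):
    """Return hrefs seen by the parser; comment bodies are not parsed as tags."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


if __name__ == "__main__":
    print(regex_urls(SAMPLE_HTML))
    print(parser_urls(SAMPLE_HTML))
```

On this sample the regex returns both URLs and the parser returns only the live one, which is the behaviour described above: a crawler queueing regex matches would also queue the commented-out link.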

rightbyte|4 months ago

DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and is more robust against things moving around on the page.

chaps|4 months ago

Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
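A rough sketch of the "regex first, then harden with DOM iteration" idea, in Python. Everything here is hypothetical: the two page variants, the `price` class, and the helper names are invented for illustration, and the standard-library `html.parser` walk is only a stand-in for real DOM iteration.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the same data before and after a small page redesign.
PAGE_V1 = '<div class="price">42 USD</div>'
PAGE_V2 = '<div id="p1" class="price big"><span>42 USD</span></div>'

# First pass: a regex tuned to the exact markup seen in PAGE_V1.
PRICE_RE = re.compile(r'<div class="price">([^<]+)</div>')


def price_by_regex(html):
    match = PRICE_RE.search(html)
    return match.group(1).strip() if match else None


class PriceFinder(HTMLParser):
    """Collect text inside the first div whose class list contains 'price'.

    Depth counting assumes every nested tag is closed (no void tags such
    as <br> inside the target), which holds for the samples here.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and not self.chunks:
            classes = (dict(attrs).get("class") or "").split()
            if "price" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def price_by_dom(html):
    finder = PriceFinder()
    finder.feed(html)
    text = "".join(finder.chunks).strip()
    return text or None


if __name__ == "__main__":
    for page in (PAGE_V1, PAGE_V2):
        print(price_by_regex(page), price_by_dom(page))
```

The regex finds the value in the first variant and misses it once an extra attribute and a nested span appear; the structural walk handles both. Whether that counts as more or less robust depends on which kind of page change you expect.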

horseradish7k|4 months ago

But not when crawling. You don't know the page format in advance; you don't even know what the page contains!