kmike84 | 1 year ago

A great initiative!

We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. do the same as web browsers) are the main things.

It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what https://github.com/scrapy/w3lib does, which Scrapy currently uses. But it's still not quite compliant.
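To make the gap concrete, here is a minimal sketch of the kind of hack involved. It is not w3lib's actual code: `browser_like`, `DEFAULT_PORTS` and `PATH_SAFE` are hypothetical names, and it only covers three things browsers do for http/https URLs (lowercase the host, drop the default port, percent-encode spaces in the path).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Default ports that browsers omit when serializing http/https URLs.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Characters left as-is in the path. "%" is kept so existing escapes are
# not double-encoded. This set is an approximation, not the spec's exact one.
PATH_SAFE = "/%:@!$&'()*+,;=~"


def browser_like(url: str) -> str:
    """Hypothetical helper: nudge urllib's output toward browser behavior.

    Assumes an http/https URL; ignores userinfo and IPv6 hosts.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # .hostname is already lowercased
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"
    # Percent-encode unsafe characters; an empty path becomes "/".
    path = quote(parts.path, safe=PATH_SAFE) or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


raw = "HTTP://Example.COM:80/a b"
print(urlsplit(raw).netloc)  # urllib keeps the host case and the port
print(urlsplit(raw).path)    # urllib keeps the raw space
print(browser_like(raw))
```

Every rule like this has to be bolted on by hand, which is why this approach only gets you closer and never fully compliant.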

Also, surprisingly, on some crawls URL parsing can take about as much CPU time as HTML parsing.

Ada / can_ada look very promising!

TkTech | 1 year ago

can_ada dev here. Scrapy is a fantastic project; we used it extensively at 360pi (now Numerator), making trillions of requests. Let me know if I can help :)