mnmkng | 1 year ago
But I personally think it does some things a little easier, a little faster, and a little more conveniently than the other libraries and tools out there.
There's one thing the JS version of Crawlee has which unfortunately isn't in Python yet, but it will be there soon. AFAIK it's unique among all the libraries: it automatically detects whether a headless browser is needed or whether plain HTTP will suffice, and uses the more performant option.
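The idea can be sketched roughly like this (an illustrative sketch, not Crawlee's actual implementation; the fetch callables and marker check are assumptions for the example):

```python
# Sketch: decide whether plain HTTP is enough by checking if the data we
# need is already present in the raw HTML, and escalate to a headless
# browser only when it is not (i.e. the content is rendered client-side).

def needs_browser(raw_html: str, required_marker: str) -> bool:
    """True when the target content is absent from the raw HTML,
    suggesting it is rendered by JavaScript on the client."""
    return required_marker not in raw_html

def scrape(url: str, required_marker: str, http_get, browser_get) -> str:
    """Try the cheap HTTP client first; fall back to a browser if needed.

    `http_get` and `browser_get` are hypothetical fetch callables supplied
    by the caller, so the sketch stays self-contained.
    """
    html = http_get(url)
    if needs_browser(html, required_marker):
        html = browser_get(url)  # slower, but executes JavaScript
    return html

# Fake fetchers standing in for real HTTP and browser clients:
static_page = lambda url: "<div class='price'>42</div>"  # server-rendered
js_shell = lambda url: "<div id='app'></div>"            # empty SPA shell
rendered = lambda url: "<div class='price'>42</div>"     # after JS ran

print(scrape("https://example.com", "class='price'", static_page, rendered))
print(scrape("https://example.com", "class='price'", js_shell, rendered))
```

A real implementation would of course need smarter heuristics than a single marker string, but the shape of the decision is the same: pay the browser cost only when the HTTP response turns out to be insufficient.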
localfirst | 1 year ago
I find some dynamic sites purposefully make it extremely difficult to parse, and they obfuscate the XHR calls to their API.
I've also seen websites pollute the data when they detect scraping, which results in garbage data that you don't notice until it's verified.
mnmkng | 1 year ago
Data pollution is real. Location-specific results, personalized results, A/B testing, and my favorite, badly implemented websites, are real as well.
When you encounter this, you can try scraping the data from different locations, with various tokens, cookies, referrers etc. and often you can find a pattern to make the data consistent. Websites hate scraping, but they hate showing wrong data to human users even more. So if you resemble a legit user, you’ll most likely get correct data. But of course, there are exceptions.
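One cheap way to spot polluted responses is to fetch the same resource under several variants and trust only the value the majority agrees on. A minimal sketch, assuming a caller-supplied `fetch` callable and hypothetical header variants:

```python
# Sketch: request the same URL with different referrers/cookies/locales
# and keep the value most variants agree on, treating outliers as
# likely polluted or personalized responses.
from collections import Counter

def consistent_value(fetch, url: str, variants: list) -> str:
    """`fetch(url, headers)` is a caller-supplied callable; `variants`
    is a list of header dicts (different referrers, cookies, ...)."""
    values = [fetch(url, headers) for headers in variants]
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        raise ValueError("no majority; data may be polluted or personalized")
    return value

# Fake fetcher for the demo: this pretend site returns a garbage price
# when the request arrives without a plausible referrer.
def fake_fetch(url, headers):
    return "$9.99" if headers.get("Referer") else "$0.01"

variants = [
    {"Referer": "https://google.com"},
    {"Referer": "https://bing.com"},
    {},  # suspicious bare request; gets polluted data
]
print(consistent_value(fake_fetch, "https://example.com/item", variants))
```

This only works when a majority of your variants resemble legit users closely enough to get correct data, which matches the point above: sites would rather serve scrapers than serve wrong data to humans.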