zepearl | 1 year ago

My recommendation specifically about Amazon: don't webcrawl it (to get their data some other way you'd need direct help from Amazon itself, i.e. some direct and consistent data interface).

Reason:

About 10 years ago I wrote a crawler to discover books and to scan all of the books' ratings (the "stars" given to each book by a reviewer). It was just a hobby project.

At the beginning everything worked well, but after some weeks the layout of the pages and/or the technical IDs behind them started changing in an inconsistent way (some pages were still ok, others were slightly different, others were completely different).

My initial code was (surprisingly, hehe) excellent (nicely structured, flow & tasks of each section easy to understand, etc.). Then I kept adding if/else conditions to the crawler in multiple places so it could cope with the new layouts/changes, and after a couple of months I could hardly understand any of it. Main point: I never knew if I could delete some portions, because 1) the code was garbled and 2) I didn't know if Amazon would serve the old pages again, which would make the old code relevant again.
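In hindsight, a rough sketch of how the "can I delete this branch?" problem could be made more manageable: one small parser per layout, tried in order, with the name of the parser that matched returned so it can be logged. This is not what my crawler did, and the HTML snippets and function names below are made up for illustration (they are not Amazon's real markup).

```python
import re

# Each parser handles exactly one (hypothetical) page layout and
# returns None when the page does not look like that layout.

def parse_layout_a(html):
    # Hypothetical layout: <span class="rating">4.5 stars</span>
    m = re.search(r'<span class="rating">(\d(?:\.\d)?) stars</span>', html)
    return float(m.group(1)) if m else None

def parse_layout_b(html):
    # Hypothetical layout: <div data-stars="3"></div>
    m = re.search(r'data-stars="(\d(?:\.\d)?)"', html)
    return float(m.group(1)) if m else None

# Order matters: the first parser that matches wins.
PARSERS = [
    ("layout_a", parse_layout_a),
    ("layout_b", parse_layout_b),
]

def extract_rating(html):
    """Return (parser_name, rating), or (None, None) if no layout matched."""
    for name, parser in PARSERS:
        rating = parser(html)
        if rating is not None:
            return name, rating
    return None, None

print(extract_rating('<span class="rating">4.5 stars</span>'))
print(extract_rating('<div data-stars="3"></div>'))
print(extract_rating('<p>unknown layout</p>'))
```

Counting how often each parser name comes back over a few weeks would at least show which layouts are still being served before deleting anything.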

Btw. (not directly relevant for the question) the organization of books was (is?) a mess as well:

The same book can be sold under multiple different titles (often slightly different, sometimes hugely) and/or authors (if more than 1 author wrote it) => ultimate confusion. At that time I fixed that by comparing the names of the reviewers and their "stars": if book X had about 90% of the same reviewers AND the same "stars" as book Y (the data presented by Amazon can change slightly from query to query), then they were most probably the same thing. I didn't compare the title or ISBN at all: from time to time the titles of different versions of the same book were very different, and even a human would have been very challenged to identify them as the same thing. But based on what I saw, Amazon knows very well what-is-what, so even a book version that sold 0 copies gets all the reviews of its twin book version that previously sold 10000000 copies.
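The reviewer/"stars" comparison can be sketched roughly like this. It is a reconstruction of the idea, not my original code: the function names, the `min_reviews` guard and the choice to divide by the smaller review set are assumptions made for this example; only the ~90% threshold comes from what I described above.

```python
def overlap_ratio(reviews_a, reviews_b):
    """Share of (reviewer, stars) pairs that two books have in common.

    Each argument is an iterable of (reviewer_name, stars) tuples.
    The ratio is relative to the smaller set (an assumption), so a
    version with a few missing reviews can still match its twin.
    """
    a, b = frozenset(reviews_a), frozenset(reviews_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def probably_same_book(reviews_a, reviews_b, threshold=0.9, min_reviews=5):
    """Guess whether two listings are the same book, ignoring title and ISBN."""
    # Hypothetical guard: too few reviews makes the overlap meaningless.
    if min(len(set(reviews_a)), len(set(reviews_b))) < min_reviews:
        return False
    return overlap_ratio(reviews_a, reviews_b) >= threshold

book_x = [("alice", 5), ("bob", 4), ("carol", 3), ("dave", 5), ("erin", 2),
          ("frank", 4), ("grace", 1), ("heidi", 5), ("ivan", 3), ("judy", 4)]
# Same reviews except one, as if the data changed slightly between queries.
book_y = book_x[:9] + [("mallory", 2)]
# A different book that happens to share one reviewer with book_x.
book_z = [("alice", 1), ("oscar", 4), ("peggy", 3), ("trent", 5), ("victor", 2)]

print(probably_same_book(book_x, book_y))
print(probably_same_book(book_x, book_z))
```

Pairing the reviewer name with the stars means a reviewer who rated two different books differently does not count as overlap, which is the "AND same stars" part.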

An honest "Good luck!" with your startup :o)

mateozaratefw|1 year ago

Thanks! Sounds hard asf. I believe we can handle the inconsistency problem because we want <all> the products (so it'll be really general). However, I don't think Amazon can list all their products on the webpage because of the huge amount. It'll be harder to reach the product URLs than to scrape them :/