top | item 19249723


mozillas | 7 years ago

A couple of years ago I wrote a feed reader that checked a few hundred feeds for new items every hour. The script ran on a $5/mo server (initially it ran on an old laptop I had, for easier debugging) and posted the new data to the database on the website server. So I was using two machines, one for the crawler and one for the website, and two databases too, I think. The one for the feed crawler was very simple, holding only the list of feed URLs and the latest item URL for each, so I won't show it again. That was the theory, at least; feeds are a bit more complicated in real life.
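The crawler-side bookkeeping described above (a list of feed URLs plus the latest item URL seen for each) could be sketched roughly like this. This is a minimal illustration, not the original script; the function and variable names are hypothetical:

```python
# Hypothetical sketch of "list of URLs and the latest item URL" bookkeeping.
# `entries` is a newest-first list of item URLs, as most feeds are ordered;
# `latest_seen` maps each feed URL to the most recent item already posted.

def detect_new_items(feed_url, entries, latest_seen):
    """Return entries newer than the last item recorded for this feed."""
    last = latest_seen.get(feed_url)
    new_items = []
    for item_url in entries:
        if item_url == last:
            break  # everything from here on was already seen last poll
        new_items.append(item_url)
    if new_items:
        latest_seen[feed_url] = new_items[0]  # remember the newest item
    return new_items

# Example: two hourly polls of the same feed.
seen = {}
first = detect_new_items("https://example.com/feed", ["a", "b"], seen)
second = detect_new_items("https://example.com/feed", ["c", "a", "b"], seen)
```

On the first poll everything is new; on the second, only the item that appeared since the stored latest-item URL is returned, which is what the crawler would post to the website's database.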

That's what I did, but I might have had different requirements. If you don't have a lot to crawl and you don't have to do it very often (once a week or less), you can probably space out the requests enough that the server doesn't feel it. In that case it also helps a lot to use some caching for the website itself. I think it depends a lot on the requirements of the project. Using two machines is safer, I think, although it might complicate things a bit.

Keep in mind that there's probably better technical advice out there than mine. I'm a hobbyist developer.
