item 19245069


mozillas | 7 years ago

You could try making something like BuiltWith https://builtwith.com/ycombinator.com. They have a fairly simple front-end (nothing happening in real time), and you could offer stats about the top 10K Alexa websites, which makes a project look better than having some dummy data, I think.
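The core of a BuiltWith-style tool is matching fingerprints against a page's HTML and response headers. A minimal sketch, with a made-up fingerprint table (a real tool has thousands of signatures covering scripts, cookies, headers, and DOM patterns):

```python
# Hypothetical fingerprint table: substring found in the page or headers -> technology.
FINGERPRINTS = {
    "wp-content": "WordPress",      # WordPress serves assets from /wp-content/
    "cdn.shopify.com": "Shopify",   # Shopify stores load assets from this CDN
    "x-powered-by: php": "PHP",     # common response header on PHP hosts
}

def detect_technologies(html, headers):
    """Return the set of technologies whose fingerprint appears in the
    page body or the (lower-cased) response headers."""
    haystack = html.lower() + " " + " ".join(
        f"{k}: {v}" for k, v in headers.items()
    ).lower()
    return {tech for needle, tech in FINGERPRINTS.items() if needle in haystack}
```

You'd feed it the body and headers from whatever HTTP client you use, then aggregate the results per site to get the stats page.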

I also remember a side project that involved crawling HN and displaying the most-mentioned books in the comments. I think the author was making a bit of money by linking each book to Amazon (with an affiliate code). You could do something similar but for popular Wikipedia articles, for example. Or you could use Reddit as a source and, instead of popular books, search for sneakers or music.
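The counting side of that project is simple once you have the comment texts (HN comments are available via the Algolia search API at hn.algolia.com). A sketch of the tallying step, using naive case-insensitive substring matching:

```python
from collections import Counter

def count_mentions(comments, titles):
    """Count, case-insensitively, how many comments mention each title.
    `comments` is any iterable of comment text strings, e.g. pulled from
    the HN Algolia API; `titles` is the list of books (or sneakers, or
    albums) you are tracking."""
    counts = Counter()
    for text in comments:
        low = text.lower()
        for title in titles:
            if title.lower() in low:
                counts[title] += 1
    return counts
```

Substring matching is crude (short or generic titles produce false positives), but it's enough to get a ranked list going; you can refine the matching later.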

Of course, I don't know whether these ideas are too simple or too complicated, or interesting or not, for you and the person who will give you a grade.



maceurt|7 years ago

Yes, this could be cool; it does not seem too simple or too hard.

On a related note, a lot of the comments here mention scraping some sort of data from a website on a continual basis. Would I just create a script, running as an extra worker, that sends its data to the actual database, which is in turn read by the web server? Or would I want the web server itself to fetch the data and write it to the database?

mozillas|7 years ago

A couple of years ago I wrote a feed reader that checked a few hundred feeds every hour for new items. The script ran on a $5/mo server (initially it ran on an old laptop, for easier debugging) and posted the new data to the database on the website server. So I was using two machines, one for the crawler and one for the website, and two databases too, I think. The crawler's database was very simple: just the list of feed URLs and the latest item URL seen for each, so the same item wouldn't be shown again. That was the theory, at least; feeds are a bit more complicated in real life.
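The crawler-side bookkeeping I described can be sketched like this (a simplified version, assuming the feed gives you item URLs newest-first; the table and function names are mine):

```python
import sqlite3

def init_db(conn):
    # One row per feed: its URL and the newest item URL we have already seen.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feeds (url TEXT PRIMARY KEY, latest_item TEXT)"
    )

def new_items(conn, feed_url, items):
    """Given a feed's item URLs newest-first, return only the ones we
    haven't seen before, and remember the newest for next time."""
    row = conn.execute(
        "SELECT latest_item FROM feeds WHERE url = ?", (feed_url,)
    ).fetchone()
    latest = row[0] if row else None
    fresh = []
    for item in items:
        if item == latest:
            break           # everything older than this was seen last run
        fresh.append(item)
    if items:
        conn.execute(
            "INSERT INTO feeds(url, latest_item) VALUES(?, ?) "
            "ON CONFLICT(url) DO UPDATE SET latest_item = excluded.latest_item",
            (feed_url, items[0]),
        )
    return fresh
```

The crawler then runs this for each feed on a schedule (cron, or a sleep loop) and posts whatever `new_items` returns to the website's database over HTTP.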

That's what I did, but I might have had different requirements. If you don't have a lot to crawl and you don't crawl very often (once a week or less), you can probably space out the requests enough that the server doesn't feel it. It also helps a lot to add some caching on the website itself in this case. It depends a lot on the requirements of the project, but using two machines is safer, I think, although it complicates things a bit.
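Spacing out the requests is just a pause between fetches. A small sketch (the function is mine; `fetch` and `sleep` are injectable so the loop is easy to test and to wire up to any HTTP client):

```python
import time

def polite_fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing `delay` seconds between requests
    so the target server never sees a burst. `fetch` is whatever function
    downloads one URL; `sleep` defaults to time.sleep."""
    results = []
    for i, url in enumerate(urls):
        if i:                 # no pause before the very first request
            sleep(delay)
        results.append(fetch(url))
    return results
```

For a weekly crawl of a few hundred pages, even a couple of seconds per request finishes in minutes while staying well under anyone's radar.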

Keep in mind that there's probably better technical advice out there than mine. I'm a hobbyist developer.