Also, a word count on patio11's submissions: 1,052,351. For comparison, all 7 Harry Potter books total 1,084,170 words. patio11 has written the entire Harry Potter series worth of content on HN. Just... wow.
Thanks, I had been curious about that number for a while. The last time I checked it was 500k or so.
For folks who want to do interesting things with the API but don't want to be abusive to Firebase's servers, I whipped up a quick ruby script to cache a particular user's comments/submissions on disk: https://gist.github.com/patio11/1550cad3a02edd175049
It tries to rate limit itself by putting 200ms of sleep between requests, so downloading all of my comments would take ~30 minutes.
"I release this work unto the public domain." -- feel free to adapt it to your needs.
Usage is "ruby slurper.rb $USERNAME $MAX_COMMENTS_TO_FETCH."
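For anyone who'd rather not fetch the gist, the shape of such a slurper is easy to sketch. This is not patio11's script, just a minimal illustration of the same idea (read the user's submitted item ids from the user endpoint, fetch each item with a sleep in between, and cache results to disk):

```ruby
require 'json'
require 'net/http'
require 'fileutils'

BASE = 'https://hacker-news.firebaseio.com/v0'

# Fetch one JSON document from the v0 API, reusing a cached copy on disk
# if one exists, and sleeping 200ms after each real network request.
def fetch_cached(path, cache_dir = 'hn_cache')
  FileUtils.mkdir_p(cache_dir)
  cache_file = File.join(cache_dir, path.tr('/', '_') + '.json')
  return JSON.parse(File.read(cache_file)) if File.exist?(cache_file)

  body = Net::HTTP.get(URI("#{BASE}/#{path}.json"))
  File.write(cache_file, body)
  sleep 0.2 # be polite: roughly five requests per second
  JSON.parse(body)
end

# Cache up to max_items of a user's submitted items (comments and stories).
def slurp(username, max_items)
  user = fetch_cached("user/#{username}")
  (user['submitted'] || []).first(max_items).map { |id| fetch_cached("item/#{id}") }
end
```

Something like `slurp('patio11', 100)` would then pull down the hundred most recent items, and reruns are free because everything already fetched is served from the cache.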
This... is cool, but also kinda sucks for me. I've invested dozens of hours into writing an extremely complicated scraper for my Android version of HN (https://play.google.com/store/apps/details?id=com.airlocksof...).
The newest version (still under development, probably a month or two from release) adds support for displaying polls, linking to subthreads, and full write support (voting, commenting, submitting, etc). I'm fine with switching to a new API (Square's Retrofit will make it super easy to switch), but without submitting, commenting, and upvote support I have to disable a bunch of features I worked really hard on. Also it would've been cool to know this was coming about 3 months ago so I didn't waste my time.
Anyways, quick question on how it works -- when I query for the list of top stories (https://hacker-news.firebaseio.com/v0/topstories.json?print=...) it just returns a list of ids. Do I have to make a separate request for each story (https://hacker-news.firebaseio.com/v0/item/8863.json?print=p...) to assemble them into a list for the front page, or am I missing something?
I'm sorry you just invested a lot of time in scraping. I know from experience what a pain that is. We said several times that the API was coming, and I've made it clear to anyone who asked, but there's just no way to reach everybody. All: in the future, please get answers to questions like this by emailing [email protected].
Re write access and logged-in access, if that turns out to be how people want to use the API, that's the direction we'll go. But we think it's important to launch an initial release and develop it based on feedback. There are many other use cases for this data besides building a full-featured client: analyzing history, providing notifications, and so on. It will be fascinating to see what people build!
> This... is cool, but also kinda sucks for me. I've invested dozens of hours into writing an extremely complicated scraper for my Android version of HN.
This definitely does suck. I feel your pain. But it's also part of the package of scraping websites. You go in knowing that it could break at any time.
Yes, you will need to make an HTTP request for each item you want, although with HTTP pipelining you can send them all over a single TCP connection using a single SSL session.
If you're on a supported platform, the Firebase SDKs handle all this efficiently and can even provide real-time change notifications.
I'm also currently writing a scraper[1] for the HN frontpage (for my WIP Hacker News redesign), and while there's a limited Algolia API available, it doesn't do much good if users can't post comments, upvote etc. Same goes for the official one now.
So, @anyone involved with the API project, can you give us an estimate of when the OAuth-based user-specific API will be rolled out? I'm fine with pausing my efforts until then, if it's going to be soon, in order to take a less complex and error-prone path.
[1]: https://github.com/geomaster/hnop/blob/master/backend/src/hn...
[Firebase Dev Advocate]
@airlocksoftware - Yes, you should make separate requests for each story. You can attach a listener to the topstories node (https://www.firebase.com/docs/web/guide/retrieving-data.html...) and when that’s triggered, you can make a request for the data on each story. Using the Firebase SDK, each request will get made using the same connection. I'd recommend using our SDK instead of the REST API so you don't have to worry about managing your own connections and retries.
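Over the plain REST endpoints (if you skip the SDK), the two-step pattern described above looks roughly like this; a sketch, not official client code, and the 30-story limit is just an example:

```ruby
require 'json'
require 'net/http'

API = 'https://hacker-news.firebaseio.com/v0'

def get_json(path)
  JSON.parse(Net::HTTP.get(URI("#{API}/#{path}.json")))
end

# topstories.json returns only ids, so each story on the front page
# costs one additional item request.
def front_page(limit = 30)
  get_json('topstories').first(limit).map { |id| get_json("item/#{id}") }
end
```

For example, `front_page(30).each { |s| puts "#{s['score']} #{s['title']}" }` would print a plain-text front page; this is exactly the 1 + N request pattern several commenters below object to.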
Just wanted to drop a comment on the awesomeness of your app.
Hacker News 2 is by far the best Hacker News app, not just on Android, but on all mobile platforms I've tried (so iOS, Android and Windows Phone).
Awesome work you are doing.
Yeah, I like it a lot, but I've put tons of time into my scraper for Reader YC (https://github.com/krruzic/Reader-YC). I support everything but polls currently. This API is nice, but my scraper actually supports more... no option to get Show HN, Ask HN or New AFAIK. Still glad this is out!
So why, in the first place, would I want another mobile app rather than just opening the fully functional website (which is pretty simple & basic already) on my mobile browser?
I've been working on a Hacker News client for Windows Phone over the past several weeks and am very close to an initial release, so I feel somewhat ambivalent about this.
On the one hand, of course it's great that HN is finally getting a proper API and also modernizing its markup (which is a mess even if you ignore all the tables – for example, the first paragraph in a comment usually isn't wrapped in <p> tags), but on the other hand this current v0 version is very lacking and impractical for a regular client application.
Since the top stories (limited to 100) and child comments are only available as a list of IDs a client app would have to make a separate HTTP request for every single item, which is obviously not something you'd want to do especially in a mobile environment. Other lists apart from the top stories (new, show, ask, best, active etc.) don't seem to be available at all right now.
Of course this is just the first version, and the documentation promises improvements over time – which I don't doubt at all – but there's no clear indication that the API will be at feature-parity with the current website, even excluding anything that requires authentication, by October 28. So this means that I – and other developers of client apps or unofficial APIs – will probably have to write new scraping code once the new rendering engine (which I assume refers to the website) arrives instead of being able to switch to the new API immediately.
Now I guess I might just be needlessly worried, especially since the blog post explicitly says that the new API "should hopefully making switching your apps fairly painless", but then why not wait until it's actually ready for that before making the announcement? Putting a half-baked API out there a few days/weeks (?) in advance before it's fully fleshed out doesn't seem all that helpful, at least to me.
(As I find myself pondering the idea of standing something up like this on a dual-stacked server purely so that I could access HN from my IPv6-only test network... hmmm...)
To everyone asking about logged-in access and write access: this is just a first release! Where it goes from here will depend, in good iterative fashion, on what people want.
How does this differ from the Algolia HN API in terms of data access? (https://hn.algolia.com/api) I was able to download all HN data recently with ease using that endpoint. Authentication?
EDIT: After looking at the documentation there are two new aspects of the Firebase API not in the Algolia API:
1) Ability to see deleted/dead stories.
2) Endpoint for user data.
Question to kogir/dang: Has the "delay" field (Delay in minutes between a comment's creation and its visibility to other users) always been there?
[Firebase founder here] This is pretty exciting for us; we're glad kogir, dang, kevin and sctb chose to expose HN's data through Firebase. We've seen quite a few startups (and big companies like Nest) do this, since building, maintaining, and documenting a public API often isn't an easy task.
This makes it really easy to add average karma to the comment section for every user. For instance, you can paste the below into the console, and it should add average karma data for each user.
Here is something I built with the Algolia API a while back; I just haven't gotten around to cleaning it up to post here: http://hnuser.herokuapp.com/ (example: http://hnuser.herokuapp.com/user/tptacek/)
It lets you download all comments/stories for a user as a JSON or CSV file, breaks down karma between comments and stories, and plots comment/story counts, karma, etc. over time on a line chart (clicking will show you the details via an hnsearch).
Also I built some npm modules so you can get this information via the command line.
The Chrome extension hasn't been updated for a while (it just superimposes a small amount of this information on the user page).
I really appreciate the 3-week heads up before moving to a new frontend structure. It's a nice gesture, but I have this horrible feeling that there's only about a 10% chance that my Hacker News app gets updated in time.
I know you can't avoid iterating just because people are scraping, but it does stink. At least this will make everything more future-proof going forward.
However, it may be nice to give a bit more heads up than 3 weeks. I know a lot of apps can take ~2 weeks to get through the review process for iOS.
I've built a library for iOS (https://github.com/bennyguitar/libHN) that handles scraping, commenting, submitting, voting, etc pretty well and allows me to make as few web calls as necessary to use HN. It looks like I'd have to drop functionality and completely change the networking scheme to match this API - something I'm not willing to do yet.
Correct me if I'm wrong here, but to get every comment on a post, I'd have to recursively get each item for each child. Instead, right now, I can make one network request and get all comments for a story. Granted, I have to parse the HTML (which I hate), but it's a much cleaner solution than going through every item, checking the children and then getting those items ad infinitum. Again, I just glanced over the documentation, but that seems untenable to me.
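That reading matches the docs as far as I can tell: items expose only `kids` ids, so assembling a full thread means one request per node. A rough sketch of the recursive walk (illustrative only; a real client would want batching, caching and error handling):

```ruby
require 'json'
require 'net/http'

def fetch_item(id)
  JSON.parse(Net::HTTP.get(URI("https://hacker-news.firebaseio.com/v0/item/#{id}.json")))
end

# Recursively collect every comment under an item: one request per node,
# walking the `kids` ids depth-first.
def all_comments(id)
  item = fetch_item(id)
  return [] if item.nil? || item['deleted']
  below = (item['kids'] || []).flat_map { |kid| all_comments(kid) }
  item['type'] == 'comment' ? [item] + below : below
end
```

For a story with N comments this issues N+1 HTTP requests, which is exactly the cost being objected to versus one HTML page fetch.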
I welcome the idea, but this barely qualifies as an API. The most useful part is the "current top stories" - but what timeframe exactly? Seems to be over 3 days at least and can't be customized. And even my test parsing of the 100 top stories took a good minute.
And that returns only the ids, nothing else. To get basic information like the score, title or url you have to look up the ids individually. And even the story items do not contain such basic information as the number of comments. And you can't calculate it yourself, since only the top-level comments are returned (as ids, of course), so you have to recursively dig through the comments to get the number.
This is even more curious as there is a very solid Algolia API where you can filter by submission time, story score, and number of comments, and even return a greater number of results plus access page numbers to get even more.
To get the information from a single Algolia API call you will need hundreds or thousands (in the case of nested comments) of "official" API calls. Hoping for updates.
If up/down vote data were included in the API, much needed experimentation on collaborative filtering would be made possible! This is Hacker News after all.
Right now one team, Y Combinator, is trying to fix important issues in the ranking and moderation of posts and comments. Many of us are frustrated by the increasing domination of popularity (and hatred) over quality and relevance. A lot of good submissions and comments are simply buried, never to be found. There is too much muck to have to wade through. The timing of posts and comments plays a much larger role than quality. I could go on and on.
Imagine a Netflix Prize-like flowering of experiments and collaboration, leveraging the hacker community's collective smarts and enthusiasm. Many of us have ideas, but right now are unable to test them. What a shame if a great idea dies on a notepad.
There are two possible issues with opening up voting data: gaming and privacy. If having vote data allows someone to game the front page, then only include it with some delay (2 days?) so that it couldn't be used that way. This would still allow experimentation with collaborative filtering algorithms and the like.
My take on the privacy issue is that anonymity isn’t that important for a site like Hacker News:
1. Startup culture is about straight talk, putting your money where your mouth is, and open critical feedback, both in the giving and receiving. There are precedents for exposing voting data (e.g. Quora, Facebook, Stack Exchange).
2. HN is not aimed at political discussions or other topics where anonymity can be paramount.
3. Pseudonymity is sufficient for those who don’t want their votes and comments tied back to their actual identity.
Thoughts?
I would love to hear from others who yearn to experiment with alternate algorithms and strategies for improving Hacker News.
There are many legitimate views on this, but FWIW mine differs from yours. I believe that anonymity actually is important for a site like Hacker News, and the odds of us ever publishing the vote data—even pseudo-anonymized—are small. Sorry to disappoint.
I built a scraper around 3 years ago (been through a few usernames since then), and I've had to change it once 3 months ago because the HTML output added quotes around HTML attributes.
Even though it's read-only, I'll continue to use my scraper rather than the API, simply because it's one request; the API would require one request for the top IDs and then one call per story, so it would be 31 calls instead of just 1.
Unless I'm missing something, it seems fairly poorly designed for top stories, and nonexistent for new stories.
------
EDIT: Looks like I missed the text about updating to a new rendering system in 3 weeks' time, and iterating designs faster to allow mobile-friendly theming. Looks like I WILL be updating to use the API.
Yeah, I have basically the same problem here, and basically the same question as someone mentioned below: new stories through the API? Do we have to get the max-id and then fetch everything below it and check whether it's a story? Any other ideas?
Yay! I've been wanting something like this to come out. I've been playing around with some new tech stacks and built a CSS replacement for Hacker News (http://jmaat.me/hn), but always wanted an actual API to make it easier.
There are a bunch of CSS restyles out there for Hacker News, but I couldn't find anything that aggregates them. This will make it a lot easier to extend and customize the site.
I'm not seeing any APIs for the jobs or show sections, though. Hopefully this might come in the future?
The Firebase JavaScript library makes this impressively straightforward to use. I built a clone using React.js and Firebase's library: https://github.com/ssorallen/hackernews-react. Because v0 of the API requires a request for each news story, it's not possible to use Firebase's React mixin yet.
I'm definitely excited about the API and the future possibilities with it. Looks like a great start. I do have a few questions and suggestions, though.
Is there any chance of getting more than just the top 100 stories returned? I think it will be a lot more useful for API consumers if you can use a query parameter to set the limit (within reason, usually 1,000) and a number of results to skip. For now, scraping is still more desirable to me since I can retrieve any number of results in their current order.
Better yet, but more complex: a number to skip and a certain timestamp so I don't see the same article on two pages due to natural upvoting, downvoting, or rank decay.
Also, if there's any flexibility still with property names, I'd suggest these changes for clearer semantics:
"deleted" -> "hidden" (since they're obviously not deleted)
"by" -> "author" (for more clarity)
"kids" -> "children" (the common convention)
Please do allow other sites to use HN logins. Then the community could develop useful sister services.
For example, a site where HN members can upvote and rate different development tools, libraries, IDEs, management tools, etc. All with backlinks to HN discussions. It's a great community and there are many ways we could share knowledge and experience.
christiangenco | 11 years ago
I count 8,483 submissions. I'm sure there's something interesting to be done with all of this data. A word frequency chart?
---
Edit: So apparently there's a Ruby gem that you can feed a body of text, and it generates pseudo-random phrases based on it.
I present to you the patio11 impersonator: https://gist.github.com/christiangenco/e8d085e47479be0131e1
One of my favorites: "A nice set of challenges -- kitty at a school with tens of thousands of bucks a year or less immediately."
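The gem isn't named above, but the technique is an ordinary Markov chain over word pairs, which is small enough to sketch directly (an illustration of the idea, not the actual script behind the gist):

```ruby
# Build a word-level Markov chain: map each word to the list of words
# that follow it in the source text.
def build_chain(text)
  chain = Hash.new { |h, k| h[k] = [] }
  text.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

# Walk the chain from a starting word, picking each successor at random.
def generate(chain, start, length = 15, rng = Random.new)
  out = [start]
  (length - 1).times do
    nexts = chain[out.last]
    break if nexts.empty?
    out << nexts.sample(random: rng)
  end
  out.join(' ')
end
```

Fed a million words of one author's comments, even this toy version produces plausible-sounding nonsense, because common word pairs dominate the transition lists.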
minimaxir | 11 years ago
Yesterday, I published an analysis on all Hacker News comments: http://minimaxir.com/2014/10/hn-comments-about-comments/
There are a lot of interesting trends in the data. Let me know if you want to know anything in particular and I'll get back to you. :)
scott_karana | 11 years ago
https://hacker-news.firebaseio.com/v0/user/tptacek.json?prin...
jaredsohn | 11 years ago
Click on the line chart to do an hnsearch for the time period.
Update: Site should be back up. It crashes occasionally (that's part of why I hadn't posted it yet).
cJ0th | 11 years ago
Thanks very much guys!
andrewstuart2 | 11 years ago
Good work.
nacs | 11 years ago
Is HN data already in Firebase (as its primary data store) or is content from HN's DB getting 'mirrored/cloned' on-demand to Firebase for the API?