I wrote http://www.arxiv-sanity.com/ (the code is open source on GitHub: https://github.com/karpathy/arxiv-sanity-preserver) as a side project to mitigate the problem of finding the newest relevant work in an area, among many related problems such as finding similar papers or seeing what others are reading. It sees a steady few hundred users every day and has a few thousand accounts. It's designed around modular views of lists of arXiv papers, each view supporting a use case. I'm always eager to hear feedback on how people use the site, what could be improved, or what other use cases could be added.
Andrej, thank you very much for making this site. I use it every day.
A problem: I think one of the most necessary things that are missing from arXiv.org is comments. People just come, read, and then take their discussions somewhere else, fragmented all around the net. Arxiv-Sanity already filters just the ML articles and does personalized feeds, maybe it could also be a place of discussion. I know it potentially leads to other complications (like moderation), but I really think readers would benefit from reviews, questions and answers.
The current ML related discussion sites (blogs, /r/machinelearning, G+, Twitter, StackExchange and YC) are often mixed with lots of noise. I'd like to read what researchers think.
Another suggestion: add links to code repositories, where they are available. Maybe some of your trusted users could be empowered with the right to add such links, if it's too much work for a single person. If interesting discussions are reported on other pages on the internet, they could also be added to the article, to make them easier to find.
For me, getting alerted when new papers cite papers that are relevant to my current research topic would be ideal. Google Scholar has alerts on authors and search queries, but for me they don't have enough recall.
It's much easier to tell when a paper is relevant for me if it happens to cite 3 of the commonly used datasets for my particular task.
btw I use arxiv-sanity, it's pretty great, thanks a lot!
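The "cites 3 of the commonly used datasets" rule of thumb from this comment could be sketched roughly as below. This is only an illustration: the function names, the `min_hits` parameter, and the identifiers in the watchlist are all made up, and it assumes you already have each new paper's reference list as plain strings from some citation source.

```python
def cited_watchlist_items(references, watchlist):
    """Return the watchlist entries that appear in a paper's reference list."""
    # Normalize case and surrounding whitespace so trivial differences don't matter.
    normalized = {ref.strip().lower() for ref in references}
    return [item for item in watchlist if item.strip().lower() in normalized]


def is_probably_relevant(references, watchlist, min_hits=3):
    """Flag a paper when it cites at least `min_hits` watchlist entries."""
    return len(cited_watchlist_items(references, watchlist)) >= min_hits


# Hypothetical identifiers standing in for the datasets commonly used in a task.
watchlist = ["dataset-a", "dataset-b", "dataset-c", "dataset-d"]

# Hypothetical reference list of a newly published paper.
new_paper_refs = ["Dataset-A", "dataset-c", "some-other-paper", "dataset-d "]

hits = cited_watchlist_items(new_paper_refs, watchlist)
relevant = is_probably_relevant(new_paper_refs, watchlist)
```

Exact string matching is the weak point here; in practice references would probably need to be matched by DOI or another stable identifier.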
(1) I manually check the proceedings of the important conferences in my subfield when they come out.
(2) I check my field's arXiv every other day or so.
(3) Google Scholar alerts me of papers that it thinks will interest me, based on my own papers, and it's very useful. Most of what it shows me is in fact interesting for me, and it sometimes catches papers from obscure venues that I wouldn't see otherwise. The problem is that you need to have papers published for this to work, and also, it's only good for stuff close to your own work, not that much for expanding horizons - (1), (2) and Google Scholar search are better for that.
Yep, this is what I do, except that security papers don't make it to arXiv so I also keep an eye on twitter (I have followed a bunch of academic security people) and a couple subreddits (/r/ReverseEngineering, /r/REMath, and /r/systems). It's not ideal, but it works out okay.
None of them are a substitute for a proper related work search when I'm writing up a paper though, this is just to keep current on what the trends and interests of the community are.
Write one influential paper. Then all the later papers in the same sub-subfield probably cite your paper. Go to Google Scholar and check the latest citations to your paper.
Ok, it doesn't need to be your paper. Just find a paper that was so influential that others working on the same problem probably will cite it, and monitor the new citations.
So, that's close to how I operate (basically bibliography-surfing), though with one handicap: what do you use to track citations?
Particularly something that's generally open.
The best tool I've got ready access to is Google Scholar. There are citation indices I can get access to by going on-site to a specific facility, but that's pretty limiting when the rest of my work can be done in my office, which is also where the bulk of my materials are.
(And yes, I'm aware that having to go to where the indices are is how it Used to Be Done, and in fact, I Did That. Technology has moved on.)
Huh, that seems so obvious in retrospect. This is basically how I've grown into jazz. I find someone I like, find out who they played with, who those folks played with, and so on.
Just FYI, you should know about SHARE. It's an effort to create a free, open dataset of research activity across the research lifecycle. You can read more at
So, if you want to see a reddit for research, better news feeds, etc., it is the SHARE dataset that can provide that data. SHARE won't build all those things--we want to facilitate others in doing so. You can contribute at
The tooling is all free open source, and we're just finishing up work on v2. You can see an example search page http://osf.io/share, currently using v1. Some more info on the problem and our approach....
What is SHARE doing?
SHARE is harvesting, (legally) scraping, and accepting data to aggregate into a free, open dataset. This is metadata about activity across the research lifecycle: publications and citations, funding information, data, materials, etc. We are using both automatic and manual, crowd-sourced curation interfaces to clean and enhance what is usually highly variable and inconsistent data. This dataset will facilitate metascience (science of science) and innovation in technology that currently can't take place because the data does not exist. To help foster the use of this data, SHARE is creating example interfaces (e.g., search, curation, dashboards) to demonstrate how this data can be used.
Why is SHARE doing it?
The metadata that SHARE is interested in is typically locked behind paywalls, licensing fees, restrictive terms of service and licenses, or a lack of APIs. This is the metadata that powers sites like Google Scholar, Web of Science, and Scopus--literature search and discovery tools that are critical to the research process but that are incredibly closed (and often incredibly expensive to access). This means that innovation is exclusive to major publishers or groups like Google but is otherwise stifled for everyone else. We don't see theses, dissertations, or startups proposing novel algorithms or interfaces for search and discovery because the barrier of entry in acquiring the data is too high.
Hi. This looks really interesting. Unfortunately the results page after a search freezes the stock browser on my LG G3.
I've also read the front page, the about page, and your post several times, and I'm not exactly clear what you provide. I thought I'd do some searches to see if the product made sense. A search for a field I'm interested in, arthritis, yielded zero results. Okay, so... no medical research? A search for "reddit" yielded results, and mentions of "providers". I'm not clear what providers are... is reddit a provider, or the research papers, or the publishers, or the researchers...?
I'll read more later when I'm not on mobile, maybe it will be clearer.
I'm starting a project related to analysing published research, so this is a field I'm very interested in. I hope SHARE can help in some way, and I'll definitely be keeping tabs on your work. Thanks for posting.
I know this question is probably a little off topic for this post but I'm very eager to get some kind of answer.
What should I be reading? I'm a computer science student, and I want to go into a "Software Engineering" line of work. Are there any places to read up on related topics? I have yet to find something that covers my direct field of choice. Is there anyone in academia writing about software?
I also like NLP and other interesting parts. Basically all practical software and their applications are things that interest me.
ICSE [1] and FSE [2] are the top software engineering research conferences. Skimming the titles/abstracts of their papers each year doesn't take long.
Also, they generally have industry or "in practice" tracks that have postmortems from the big software companies in case you want something more applied.
I'll suggest a minority position: If you feel the need to keep up at the bleeding edge of your field, your work is probably replaceable, i.e., if you didn't do it then someone else would do it a year later.
Instead, read more review papers and seminal papers in your field.
There are a lot of papers on sentiment analysis if I recall correctly. I would look into literature on parsing and statistical analysis, a lot of big data stuff is related to that and there are a lot of books on big data. Very popular field to hire people for as well, a lot of big companies want people to massage their data into giving them useful avenues for money-making.
Tossing out a contrarian view: I'm finding there's a tremendous amount of good information and publishing that's old. Keeping up with the cutting-edge can be interesting, but you have to do a lot of the filtering yourself.
Finding out how to identify the relevant older work in your field, finding it, reading it, and seeing for yourself how it's aged, been correctly -- or quite often incorrectly -- presented and interpreted, and what stray gems are hidden within it can be highly interesting.
I've been focusing on economics as well as several other related fields. The classic story is that Pareto optimisation lay buried for most of three decades before being rediscovered in the 1920s (I think I've got dates and timespans roughly right). The irony of economics itself having an inefficient and lossy information propagation system, and a notoriously poor grip on its own history, is not minor.
The Internet Archive, Sci-Hub, and various archives across the Web (some quite highly ideological in their foundation, though the content included is often quite good) are among my most utilised tools.
Libraries as well -- ILL can deliver virtually anything to you in a few days, weeks at the outside. It's quite possible to scan 500+ page books in an hour for transfer to a tablet -- either I'm getting stronger or technology's improving, as I can carry 1,500 books with one hand.
I made a simple service for myself (http://paperfeed.io) which is a feed of all the new papers in journals I care about. I can "star" papers for reading later. Works extremely well for my habits.
You're welcome to try it (not sure if the signup workflow still works; let me know). I'll be happy to hear your feedback.
Edit: you can upvote papers, and they'll float to the top just like on HN.
This might be off topic, but would you mind sharing how you wrote the website, and whether there is any tutorial you can recommend? I want to design something extremely similar for a different application, but I do not have much knowledge of web development (I am more experienced in programming for numerical and data analysis). I figure this might be a good project to get my feet wet. Thanks!
I actually just manually check arxiv every morning for the new submissions in my field. It's like getting in the habit of browsing reddit except with a lot less cute animal pictures (maybe because I'm not in biology).
ArXiv has email search alerts. I subscribe to a few topics, they are well formatted plain text digests.
I also have a few ScienceDirect search alerts set up, that come in once every few weeks typically with 1-5 papers.
And Google Scholar, if you use it and you are logged in with an account, learns from your search history and suggests new papers for you to read. It's relatively good.
I don't. If I'm working on something and need (or want) the latest cutting edge algorithms then I search for papers in that area as I need it. Otherwise, there's simply too much stuff going on to try reading through everything, or even a filtered down subset. Only a very small portion of it will be remotely relevant to my work or my interests.
If there's a fundamental new result in basic CS or something like that, I figure I'll hear about it on HN or another news site.
I can imagine it's different for people actively working on new research, though.
For programming language research, 1) the RSS feed of http://lambda-the-ultimate.org/ (Lambda the Ultimate), and 2) my old-school paper subscription to ACM SIGPLAN, which includes printed proceedings for most of the relevant ACM conferences (POPL, PLDI, OOPSLA etc.)
In addition to the important conference proceedings, it's common for researchers to work in a very narrow subfield where everybody knows everybody. They keep seeing each other at various events where they discuss their ongoing work.
Surprising that feed.ly hasn't been mentioned. It's like gmail for feeds, and it has all the arxiv categories prepopulated. My workflow is as follows: (i) check feedly every day, see ~20-30 new articles, (ii) skim all the abstracts in 5-10 minutes, (iii) mark 0-2 to read later in the day, (iv) mark rest as read, and repeat.
Others have tried and they don't get enough traffic to get it to take off but since low levels of hosting are free, I could just keep it out there for a long time.
There should be something like reddit for academic papers. With upvotes and what not. But I guess it takes people longer to read a paper than to read reddit content.
It's a neat idea, but I would want identity verification - only upvotes from people well-versed in the field should "count", precisely so it doesn't become Reddit. Which means you would have a chicken-and-egg problem when the service got started and few experts were on it yet.
I've been using http://www.sparrho.com throughout my PhD (in Biochemistry) and I was so impressed with its recommendation engine that I joined their team last year. We've been making a lot of changes to the Sparrho platform lately, including adding a pinboard feature to help lab groups and journal clubs coordinate their reading and keep their comments in a single place. Our database is updated hourly with papers from 45,000+ sources from all scientific and engineering fields, including arXiv. Most of our users set up Sparrho email alerts to replace journal eTOCs/newsletters, RSS feeds and Google Scholar alerts. I'd love to hear what you think! Free sign up here: http://www.sparrho.com
Yes. I have an account there. Saw either in their newsletter or on their site recently, that they say some X0 million people (researchers) are using it.
[+] [-] karpathy|9 years ago|reply
[+] [-] visarga|9 years ago|reply
[+] [-] dspoka|9 years ago|reply
[+] [-] pfd1986|9 years ago|reply
I feed in a .bib file with papers I like and use a Naive Bayes classifier to find papers I might like in news feeds (Science, Nature, PNAS, etc).
It works pretty well. As a bonus, you can post highly ranked papers to Slack, or use papers sent to you by other people to repopulate the .bib file.
Always welcoming suggestions: https://github.com/pfdamasceno/shakespeare
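To make the idea in this comment concrete, here is a minimal standard-library sketch of a multinomial Naive Bayes classifier with add-one smoothing that ranks feed titles by how much they resemble liked papers. It is not the code from the linked repository. The class and function names, the example titles, and the use of an explicit "skip" set as the negative class are all assumptions for illustration; a real setup would parse titles and abstracts out of the .bib file and the feeds.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title or abstract and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, text, label):
        """Unnormalized log P(label) plus the sum of log P(token | label)."""
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / total)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Made-up titles: "liked" stands in for the .bib file, "other" for papers to skip.
liked = ["Self-assembly of colloidal nanocrystals",
         "Entropy driven crystal packing of polyhedra"]
other = ["Gut microbiome diversity in mice",
         "Tax policy and household savings"]
clf = NaiveBayes().fit(liked + other, ["like"] * 2 + ["skip"] * 2)

# Rank a (made-up) feed so the most "like"-looking titles come first.
feed = ["Household tax savings", "Packing of colloidal polyhedra"]
ranked = sorted(
    feed,
    key=lambda t: clf.log_score(t, "like") - clf.log_score(t, "skip"),
    reverse=True,
)
```

Sorting by the difference of log scores gives a ranking rather than a hard yes/no, which fits the "post the highly ranked papers" workflow; newly liked papers can be appended to the training set and the model refit.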
[+] [-] bchjam|9 years ago|reply
[+] [-] the_duke|9 years ago|reply
[+] [-] harry_puttar|9 years ago|reply
[+] [-] ztianjin|9 years ago|reply
[+] [-] Al-Khwarizmi|9 years ago|reply
[+] [-] moyix|9 years ago|reply
[+] [-] copperx|9 years ago|reply
For example, I usually log in to the ACM site and go to my SIGs and see what's new there. I've never thought about visiting arXiv.
[+] [-] Havoc|9 years ago|reply
The one place where one could actually use a "Follow" button for other people...there isn't one. Classic.
[+] [-] ChuckMcM|9 years ago|reply
[+] [-] mbjorling|9 years ago|reply
https://blog.acolyer.org/
[+] [-] sampo|9 years ago|reply
[+] [-] chrisamiller|9 years ago|reply
[+] [-] dredmorbius|9 years ago|reply
[+] [-] devin|9 years ago|reply
[+] [-] jeffspies|9 years ago|reply
http://share-research.org
So, if you want to see a reddit for research, better news feeds, etc., it is the SHARE dataset that can provide that data. SHARE won't build all those things--we want to facilitate others in doing so. You can contribute at
https://github.com/CenterForOpenScience/share
[+] [-] austinjp|9 years ago|reply
[+] [-] the_duke|9 years ago|reply
[+] [-] gravypod|9 years ago|reply
[+] [-] azhenley|9 years ago|reply
[1] http://2016.icse.cs.txstate.edu/
[2] http://www.cs.ucdavis.edu/fse2016/
[+] [-] jessriedel|9 years ago|reply
[+] [-] hood_syntax|9 years ago|reply
[+] [-] kirang1989|9 years ago|reply
[+] [-] dredmorbius|9 years ago|reply
[+] [-] stenl|9 years ago|reply
[+] [-] syntaxing|9 years ago|reply
[+] [-] semaphoreP|9 years ago|reply
[+] [-] semi-extrinsic|9 years ago|reply
[+] [-] iandanforth|9 years ago|reply
[+] [-] jlarocco|9 years ago|reply
[+] [-] housel|9 years ago|reply
[+] [-] eatbitseveryday|9 years ago|reply
- OSDI - SOSP - FAST - EuroSys - APSys - NSDI - SIGCOMM - ATC - ISMM - PLDI - VLDB
These days, accepted papers in specialized conferences are actually on mixed topics. For example, you'll see security and file systems papers in SOSP.
[+] [-] yodsanklai|9 years ago|reply
[+] [-] tachim|9 years ago|reply
[+] [-] dredmorbius|9 years ago|reply
Yes, this is precisely the sort of application RSS is excellent for.
[+] [-] inputcoffee|9 years ago|reply
http://www.ivoryturret.com/
I hope it catches on.
[+] [-] otaviogood|9 years ago|reply
[+] [-] adamnemecek|9 years ago|reply
[+] [-] Analemma_|9 years ago|reply
[+] [-] trurl42|9 years ago|reply
Is something like that for papers on the arXiv.
[+] [-] wodenokoto|9 years ago|reply
[+] [-] sitkack|9 years ago|reply
[+] [-] azuajef|9 years ago|reply
[+] [-] roadnottaken|9 years ago|reply
[+] [-] sybilckw|9 years ago|reply
[+] [-] outerspace|9 years ago|reply
[+] [-] vram22|9 years ago|reply