
Ask HN: What's the best paper you've read in 2020?

705 points | luizfelberti | 5 years ago

I know there are classics that get posted every time this question comes around, so bias your picks towards more recent ones :)

192 comments

[+] tgb|5 years ago|reply
Here's a wonderful one I read a little over a year ago:

"Estimating the number of unseen species: A bird in the hand is worth log(n) in the bush" https://arxiv.org/abs/1511.07428 https://www.pnas.org/content/113/47/13283

It deals with the classic, and wonderful, question of "If I go and catch 100 birds, and they're from 20 different species, how many species are left uncaught?" There's more one can say about that than it might first appear and it has plenty of applications. But mostly I just love the name. Apparently PNAS had them change it for the final publication, sadly.
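The paper develops a more sophisticated estimator, but the classic Chao1 lower bound gives a feel for the question: rare species in the sample hint at species you haven't seen at all. A minimal sketch (species labels and counts below are made up):

```python
from collections import Counter

def chao1(observations):
    """Chao1 lower-bound estimate of total species richness."""
    counts = Counter(observations)                   # individuals per species
    s_obs = len(counts)                              # species seen at least once
    f1 = sum(1 for c in counts.values() if c == 1)   # singletons
    f2 = sum(1 for c in counts.values() if c == 2)   # doubletons
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2             # bias-corrected variant
    return s_obs + f1 * f1 / (2 * f2)

# 100 birds from 9 species: the species seen only once or twice
# suggest how many more are still "in the bush".
sample = list("A" * 30 + "B" * 25 + "C" * 20 + "D" * 12 + "E" * 8 + "FGH" + "II")
estimate = chao1(sample)   # 9 species observed, ~13.5 estimated in total
```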

[+] bordercases|5 years ago|reply
> But mostly I just love the name. Apparently PNAS had them change it for the final publication, sadly.

Big game from an organization with that acronym.

[+] eigenvalue|5 years ago|reply
This sounds like a similar problem to an exercise in the famous probability textbook by William Feller about estimating the total number of fish in a lake by catching N of them and tagging them, and then throwing them back in the lake and coming back later to catch another N fish. You check how many of those fish are tagged and derive your estimate from that using the hypergeometric distribution. See the pages here:

https://imgur.com/oRF0OD3 https://imgur.com/UtCZL5A
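A minimal sketch of the Lincoln-Petersen point estimate the exercise is getting at (the hypergeometric MLE is essentially the floor of this ratio); the numbers are made up:

```python
def lincoln_petersen(tagged, second_catch, recaptured):
    """Point estimate of population size from a two-round mark-recapture study."""
    # The tagged fraction of the second catch should match the tagged
    # fraction of the whole lake: tagged / N ~= recaptured / second_catch.
    return tagged * second_catch / recaptured

# Tag 100 fish; later catch 100 and find 4 tagged -> roughly 2500 fish in the lake.
estimate = lincoln_petersen(100, 100, 4)
```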

[+] specialist|5 years ago|reply
Nice. My math-fu is very weak. I dimly recall a notion for estimating the number of unfound bugs for a code base. Is this similar?
[+] eeZah7Ux|5 years ago|reply
Is it similar to the German tank problem?
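For comparison, a sketch of the classic minimum-variance unbiased estimator for that problem (the sample serial numbers are made up):

```python
def german_tank_estimate(serials):
    """MVUE for the maximum serial N, given serials sampled without replacement from 1..N."""
    m, k = max(serials), len(serials)   # sample maximum and sample size
    return m * (1 + 1 / k) - 1          # stretch the max upward by the average gap

# Captured tanks numbered 19, 40, 42 and 60 -> about 74 tanks produced.
estimate = german_tank_estimate([19, 40, 42, 60])
```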
[+] loup-vaillant|5 years ago|reply
Build systems à la carte: Theory and practice

https://www.cambridge.org/core/services/aop-cambridge-core/c...

I've always hated build systems. Stuff cobbled together that barely works, yet a necessary step towards working software. This paper showed me there's hope. If we take build systems seriously, we can come up with something much better than most systems out there.

[+] yellowcake0|5 years ago|reply
This Nature paper,

Non-invasive early detection of cancer four years before conventional diagnosis using a blood test

https://www.nature.com/articles/s41467-020-17316-z

Major breakthrough in cell-free diagnostics. The methylation pattern of DNA can be used to identify early-stage cancer, i.e. circulating tumor DNA (ctDNA) has a distinct methylation pattern.

The results are based on data from a ten-year study, which must have cost a fortune to run.

[+] benrapscallion|5 years ago|reply
Small correction: the paper is in Nature Communications and not Nature.
[+] eurasiantiger|5 years ago|reply
So we could, in principle, selectively demethylate those regions of DNA and cure some cancers?
[+] petra|5 years ago|reply
That's great news!

So now that we know that, what are the treatment options at that stage?

[+] phobosanomaly|5 years ago|reply
Combine that with the breakthrough in protein folding and mRNA vaccines and we could have a rapid pipeline for custom, targeted immunotherapies for not just new bugs, but new cancers.
[+] screye|5 years ago|reply
The pair of these papers (don't read them in full):

1. Attention is not Explanation (https://arxiv.org/abs/1902.10186)

2. Attention is not not Explanation (https://arxiv.org/abs/1908.04626)

They go to show the complete lack of agreement among researchers in the explainability space. The most popular packages (AllenNLP, Google LIT, Captum) use saliency-based methods (e.g. Integrated Gradients) or attention. The community has fundamental disagreements about whether either captures anything equivalent to importance as humans would understand it.

An entire community of fairness, ethics and computational social science research is built on top of conclusions drawn using these methods. It is a shame that so much money is poured into these fields while there does not seem to be as strong a thrust to explore the most fundamental questions themselves.

(My 2 cents: I like SHAP and the stuff coming out of Bin Yu's and Tommi Jaakkola's labs better... but my opinion too is based on intuition without any real rigor.)

[+] checkyoursudo|5 years ago|reply
I think one of the most interesting papers I read this year was Hartshorne & Germine, 2015, "When does cognitive functioning peak? The asynchronous rise and fall of different cognitive abilities across the life span": https://doi.org/10.1177/0956797614567339

There are lots of good bits, such as: 'On the practical side, not only is there no age at which humans are performing at peak on all cognitive tasks, there may not be an age at which humans perform at peak on most cognitive tasks. Studies that compare the young or elderly to “normal adults” must carefully select the “normal” population.' (italics in original)

This seems to me to comport with the research suggesting that most or all of the variance in IQ across the life span can be accounted for by controlling for mental processing speed; i.e., you are generally faster when you are younger, but you are not more correct when you are younger.

[+] luizfelberti|5 years ago|reply
For me, it was "Erasure Coding in Windows Azure Storage" from Microsoft Research (2016) [0]

The idea that you can achieve the same practical effect as a 3x replication factor in a distributed system while increasing the cost of data storage by only 1.6x, by leveraging some clever information-theory tricks, is mind-bending to me.

If you're operating a large Ceph cluster, or you're Google/Amazon/Microsoft running GCS/S3/ABS, and you needed 50 PB of HDDs before, you only need 27 PB now (if implementing this).

The cost savings and environmental-impact reduction this allows for are truly enormous; I'm surprised how little attention this paper has gotten in the wild.

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
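A rough sketch of the arithmetic, using a generic (k, m) Reed-Solomon layout rather than the paper's actual Local Reconstruction Codes; the (10, 6) parameters are illustrative, chosen to reproduce the 1.6x / 27 PB figures above:

```python
def rs_overhead(data_frags, parity_frags):
    """Storage overhead and fault tolerance of a (k, m) Reed-Solomon layout."""
    k, m = data_frags, parity_frags
    overhead = (k + m) / k   # raw bytes stored per logical byte
    return overhead, m       # up to m lost fragments can be tolerated

# Plain 3x replication: 3.0x overhead, survives the loss of 2 copies.
# A hypothetical RS(10, 6) layout: 1.6x overhead, survives any 6 fragment losses.
overhead, tolerance = rs_overhead(10, 6)

# ~16.7 PB of logical data needs 50 PB raw at 3x, but only ~27 PB at 1.6x.
raw_pb = 16.7 * overhead
```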

[+] H8crilA|5 years ago|reply
The primary reason you should be using 3x or higher replication is read throughput (which makes it really only relevant for magnetic storage). If the data is replicated 1.6x, then there are only 1.6 magnetic disk heads per file byte. If you replicate it 6x, there are 6 magnetic disk heads per byte. At ~15x it becomes cheaper to store on SSD with ~1.5x Reed-Solomon/erasure-code overhead, since SSD has ~10x the per-byte cost of HDD.

(there are also effects on the tail latency of both read and write, because in a replicated encoding you are less likely to be affected by a single slow drive).

(also, for insane performance which is sometimes needed you can mlock() things into RAM; the per-byte cost of RAM is ~100x the cost of HDD and ~10x the cost of SSD).
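The break-even arithmetic in the parent, sketched with the assumed relative costs (cost units are arbitrary):

```python
HDD_COST = 1.0    # relative cost per raw byte (assumption)
SSD_COST = 10.0   # ~10x HDD per byte, per the estimate above

def cost_per_logical_byte(per_byte_cost, redundancy):
    """Raw storage cost to keep one logical byte at a given redundancy factor."""
    return per_byte_cost * redundancy

hdd_replicated = cost_per_logical_byte(HDD_COST, 15)   # 15x HDD replication
ssd_erasure    = cost_per_logical_byte(SSD_COST, 1.5)  # SSD + 1.5x erasure coding
# Both come out to 15.0 units: the ~15x break-even point described above.
```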

[+] laluser|5 years ago|reply
It's not even the reduction in storage costs in this paper that is groundbreaking. They describe a way not only to reduce storage costs but also to optimize for repairs. Repairs are costly at scale, so reducing the resources they consume (network, CPU, disk reads, etc.) is ideal.
[+] cmukka|5 years ago|reply
On this same note, I would also suggest some papers which show you can do much better than simple erasure coding:

[1] Clay Codes - https://www.usenix.org/conference/fast18/presentation/vajha . This code was also implemented in Ceph, and the results are shown in the paper.

and, [2] HeART: improving storage efficiency by exploiting disk-reliability heterogeneity - https://www.usenix.org/conference/fast19/presentation/kadeko... . This paper argues that just one erasure code is not enough: by employing code conversions tailored to disk-reliability heterogeneity, you can get up to 30% savings!

[+] evmar|5 years ago|reply
The Google File System (GFS) paper from 2003 mentions erasure codes. Which isn't to say they did it then, but rather that the technique of using erasure coding was known back then. (And surely before GFS too, I just picked it as an example of a large data storage system that used replication and a direct predecessor to the systems you mentioned.)

https://static.googleusercontent.com/media/research.google.c...

[+] chrisdone|5 years ago|reply
A paper that profoundly influenced my language design: “Programming with Polymorphic Variants” https://caml.inria.fr/pub/papers/garrigue-polymorphic_varian...

And the earlier paper “A Polymorphic Type System for Extensible Records and Variants” https://web.cecs.pdx.edu/~mpj/pubs/96-3.pdf

Row types are magically good: they serve either records or variants (aka sum types aka enums) equally well and both polymorphically. They’re duals. Here’s a diagram.

              Construction                Inspection

    Records   {x:1} : {x:Int}             r.x — r : {x:Int|r}
              [closed]                    [open; note the row variable r]
    
    Variants  ‘Just 1 : <Just Int|v>      case v of ‘Just 0 -> ...
              [open; note the row var v]  v : <Just Int>
                                          [closed]
Neither has to be declared ahead of time, making them a perfect fit in the balance between play and serious work on my programming language.
[+] lacker|5 years ago|reply
Attention Is All You Need

https://arxiv.org/abs/1706.03762

It's from 2017 but I first read it this year. This is the paper that defined the "transformer" architecture for deep neural nets. Over the past few years, transformers have become a more and more common architecture, most notably with GPT-3 but also in other domains besides text generation. The fundamental principle behind the transformer is that it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net.

If you are interested in GPT-3 and want to read something beyond the GPT-3 paper itself, I think this is the best paper to read to get an understanding of this transformer architecture.
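For readers who want the core computation, here is a minimal NumPy sketch of the paper's scaled dot-product attention (single head, no masking; shapes and data are arbitrary):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key/value positions
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```

Note the `scores` matrix is where the quadratic cost discussed below comes from: it has one entry per query-key pair.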

[+] tehsauce|5 years ago|reply
“it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net”

This might be misleading: the amount of computation for processing a sequence of size N with a vanilla transformer is still O(N^2). There has been recent work, however, that tries to make them scale better.

[+] abecedarius|5 years ago|reply
It's clearly important but I found that paper hard to follow. The discussion in AIMA 4th edition was clearer. (Is there an even better explanation somewhere?)
[+] mrfox321|5 years ago|reply
I would argue that input scaling is not fundamental to Transformers.

Recurrent neural network size is also independent of input sequence length.

The successful removal of inductive bias is really what differentiates this from previous sequence-to-sequence neural networks.

[+] accraze|5 years ago|reply
Three papers stick out for me in the IML / participatory machine learning space this year:

1) Michael, C. J., Acklin, D., & Scheuerman, J. (2020). On interactive machine learning and the potential of cognitive feedback. ArXiv:2003.10365 [Cs]. http://arxiv.org/abs/2003.10365

2) Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the people back in: Contesting benchmark machine learning datasets. ArXiv:2007.07399 [Cs]. http://arxiv.org/abs/2007.07399

3) Jo, E. S., & Gebru, T. (2020). Lessons from archives: Strategies for collecting sociocultural data in machine learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829

Also a great read related to IML tooling for audio recognition:

1) Ishibashi, T., Nakao, Y., & Sugano, Y. (2020). Investigating audio data visualization for interactive sound recognition. Proceedings of the 25th International Conference on Intelligent User Interfaces, 67–77. https://doi.org/10.1145/3377325.3377483

[+] groceryheist|5 years ago|reply
Measuring the predictability of life outcomes with a scientific mass collaboration.

http://www.pnas.org/lookup/doi/10.1073/pnas.1915006117

You might think that it's possible to use machine learning to predict whether people will be successful from established socio-demographic, psychological, and educational metrics. It turns out that it's very hard, and simple regression models outperform the fanciest machine-learning approaches for this problem.

The way this study was done is also interesting and paves the way for new kinds of collaborative scientific projects that take on big questions. It draws on communities like Kaggle, but applies the format to scientific questions, not just pure prediction problems.

[+] nhf|5 years ago|reply
> simple regression models outperform the fanciest machine learning ideas for this problem

This reminds me of a classic paper: "Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors."

Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American psychologist, 34(7), 571.

https://core.ac.uk/download/pdf/190386677.pdf
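A small synthetic sketch of Dawes's point: equal ("improper") weights on standardized predictors come close to ordinary least squares in-sample. The data, coefficients, and sample size below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))                              # three predictors
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)   # noisy criterion

# "Improper" model: equal weights on standardized predictors.
z = (X - X.mean(axis=0)) / X.std(axis=0)
pred_equal = z.sum(axis=1)

# "Proper" model: ordinary least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_ols = X @ beta

r_equal = np.corrcoef(pred_equal, y)[0, 1]
r_ols = np.corrcoef(pred_ols, y)[0, 1]
# r_equal lands close to r_ols: equal weights give up surprisingly little.
```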

[+] thanksforthe42|5 years ago|reply
Luck/Fortune is so critical.

Genetics and net worth can be blown away by a good or bad group of friends. And unfortunately you start having friends before you are conscious enough to realize the impact.

[+] SurfingToad|5 years ago|reply
One of my favorites is definitely A Unified Framework for Dopamine Signals across Timescales (https://doi.org/10.1016/j.cell.2020.11.013), simply because of its experimental design. They 'teleported' rats in VR to see how their dopamine neurons responded, to determine whether TD learning explains dopamine signals on both short and long timescales. Short answer: it does.
[+] valbaca|5 years ago|reply
fyi, TD = Temporal Difference

"Temporal difference (TD) error is a powerful teaching signal in machine learning"
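A minimal tabular TD(0) sketch of that teaching signal (the toy cue-then-reward environment and parameters are invented, not the paper's model):

```python
def td0_value_estimates(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) += alpha * (r + gamma * V(s') - V(s))."""
    V = {}
    for episode in episodes:          # episode = [(state, reward, next_state), ...]
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_error = r + gamma * v_next - V.get(s, 0.0)   # the teaching signal
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# A cue reliably followed by a reward: after training, value (and the TD error
# that drives dopamine in the model) shifts from the reward back to the cue.
episodes = [[("cue", 0.0, "reward"), ("reward", 1.0, None)]] * 50
V = td0_value_estimates(episodes)     # V["reward"] -> ~1, V["cue"] -> ~gamma
```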

[+] jonnycomputer|5 years ago|reply
Interesting I'll have to take a look at this paper.
[+] m3at|5 years ago|reply
On the Measure of Intelligence, François Chollet [1]

Fellow HNers seem to have liked a lot of ML papers; this one doesn't break the trend. It's a great meta-paper questioning the goal of the field itself, and proposing ways to formally evaluate intelligence in a computational sense. Chollet is even ambitious enough to propose a proof-of-concept benchmark! [2] I also like some out-of-the-box methods people tried to get closer to a solution, like this one combining cellular automata and ML [3]

[1] https://arxiv.org/abs/1911.01547 [2] https://github.com/fchollet/ARC [3] https://www.kaggle.com/arsenynerinovsky/cellular-automata-as...

[+] nojvek|5 years ago|reply
Big fan of Chollet. Really enjoyed the paper.

Also a big fan of Hutter prize. Good AGI is lossless compression.

[+] louis-paul|5 years ago|reply
Meaningful Availability, Hauer et al.: https://www.usenix.org/system/files/nsdi20spring_hauer_prepu...

A good incremental improvement in service level indicator measurements for large-scale cloud services.

Obligatory The Morning Paper post: https://blog.acolyer.org/2020/02/26/meaningful-availability/

[+] hobofan|5 years ago|reply
Even if not implemented in such a sophisticated manner, "meaningful availability" is a better metric than pure uptime/downtime for most websites.

At one startup we worked at, we had availability problems for some time, with the service going down in a semi-predictable manner ~2 times a day (and the proper bugfix a few weeks away). Because one of the daily outages happened in the middle of the night with no one on call, pure availability was 80-90%. Given that it was a single-country app with nobody trying to do any business during the night, meaningful availability was ~99%. Knowing that gave us peace of mind and made tackling the problem a much more relaxed ordeal than the weeks of crunch time I've seen at other companies in similar situations.
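A toy sketch of the distinction (not the paper's exact windowed user-uptime definition; the traffic numbers are invented to mirror a night-outage scenario like the one above):

```python
def time_availability(windows):
    """Plain uptime: fraction of wall-clock time the service was up."""
    total = sum(minutes for minutes, _, _ in windows)
    up = sum(minutes for minutes, was_up, _ in windows if was_up)
    return up / total

def meaningful_availability(windows):
    """User-weighted uptime: fraction of attempted requests that succeeded."""
    total = sum(requests for _, _, requests in windows)
    served = sum(requests for _, was_up, requests in windows if was_up)
    return served / total

# (minutes, service_up, requests_attempted) over one day -- numbers invented
day = [
    (720, True, 95_000),   # daytime: up, carries nearly all traffic
    (144, False, 1_000),   # overnight outage: few users affected
    (576, True, 4_000),    # rest of the night and early morning: up
]
# time_availability(day) -> 0.90, but meaningful_availability(day) -> 0.99
```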

[+] imglorp|5 years ago|reply
Keeping CALM: When Distributed Consistency Is Easy

In computing theory, when do you actually need coordination to get consistency? They partition the space into two kinds of algorithm, and show that only one kind needs coordination.

CACM, 9/2020. https://cacm.acm.org/magazines/2020/9/246941-keeping-calm/fu...
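A tiny illustration of the monotone case (example is mine, not from the paper): a set-union "program" produces the same answer under every message delivery order, so it needs no coordination.

```python
from itertools import permutations

# A monotone program: each message only adds facts (set union), so the
# output can only grow. Every delivery order yields the same final answer.
messages = [{"a"}, {"b"}, {"a", "c"}]

finals = set()
for order in permutations(messages):
    state = set()
    for msg in order:
        state |= msg          # union is monotone: facts are never retracted
    finals.add(frozenset(state))

# len(finals) == 1: one unique outcome across all 6 delivery orders.
# By contrast, a non-monotone query like "is 'b' absent?" can flip from True
# to False as messages arrive -- that's the kind that needs coordination.
```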

[+] sriku|5 years ago|reply
Not a paper, but a fantastic talk by Samy Bengio "Towards better understanding generalisation in deep learning" at ASRU2019.

Some pretty mind-blowing insights. For example: if you replace one layer's weights in a trained classification network with that layer's initialisation weights (or some intermediate checkpoint), many networks show relatively unaffected performance for certain layers... which is seen as generalisation, since it amounts to parameter reduction. However, if you replace them with fresh random weights (even though the initialisation state is itself just another set of random weights), the loss is high! Some layers are more sensitive to this than others in different network architectures.

I recently summarised this to a friend who asked "what's the most important insight in deep learning?" - to which I said - "in a sufficiently high dimensional parameter space, there is always a direction in which you can move to reduce loss". I'm eager to hear other answers to that question here.

[+] legel|5 years ago|reply
Nice. Love the Bengio Bros. Yoshua especially was right there with Geoffrey Hinton, Yann LeCun, Andrew Ng as the earliest pioneers of successful deep learning, for over 20 years. (While most technologists were crazy about this thing called the World Wide Web in the late 1990's, these guys were shaping brain-inspired AI algorithms and representations.)

Anyways, one of the papers by Yoshua that was really influential on my master's thesis is "Learning Deep Architectures for AI", published in 2009 and cited 8956 times to date on Google Scholar. Even though it pre-dates much of the hyped architectures of the current era, I would still recommend it to young researchers for its timeless views on deep representations as architectural units of learning and knowledge, including its breakdown of deep networks as compositions of functions.

"Learning Deep Architectures for AI" by Yoshua Bengio (2009) - https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf

[+] DoctorOetker|5 years ago|reply
That's not surprising to me: if we view the weights in the whole network as individual cells in a population, and pretend that at each update of the network weights each weight undergoes cell division, such that one daughter cell/weight is an increase in the weight and the other a decrease, then each component of the gradient-descent vector can be viewed as the fitness function for that specific cell/weight: an increase or a decrease. From this perspective each cell forms its own niche in the ecosystem, and it's no surprise that replacing a cell with its ancestor is roughly compatible with the final network's cells: the symbiosis goes both ways.

The reason Bengio demonstrates this on a complete layer is obviously to show that the effect is NOT due to redundantly routing information to the next layer (think holographically, for robustness). And using non-ancestor random weights illustrates that the ecosystem's fitness suffers if redundant/holographic routing is prevented while also using non-ancestral cells/weights...

[+] ggleason|5 years ago|reply
I'm a big fan of the various "gradual" approaches so this paper really caught my eye.

Gradualizing the Calculus of Inductive Constructions (https://hal.archives-ouvertes.fr/hal-02896776/)

I'm not sure if this is precisely the direction things should go in order to improve the utilisation of specification within software development, but it's a very important contribution. So far my favourite development style has been with F-star, but F-star also leaves me a bit in the lurch when the automatic system isn't able to find the answer. Too much hinting in the case of hard proofs.

Eventually there will be a system that lets you turn the crank up on specification late in the game, allows lots of the assertions to be discharged automatically, and then finally saddles you with the remaining proof obligations in a powerful proof assistant.

[+] ricksunny|5 years ago|reply
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7062204/

Chen, Y.W.; Yiu, C.B.; Wong, K.Y. Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CL (pro)) structure: Virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates. F1000Research 2020, 9, 129.

This paper (based on a machine learning-driven open source drug docking tool from Scripps Institute) from Feb/Mar formed the basis for the agriceutical venture I started for supporting pandemic management in Africa. We’re in late stage trialing talks with research institutes here in East Africa.

https://www.emske-phytochem.com