Show HN: Arxiv Vanity – Read academic papers from Arxiv as responsive web pages

[+] bfirsh|8 years ago|reply

We were frustrated by the experience of reading machine learning papers on screens (particularly phones/tablets). There are lots of good tools for authoring HTML papers (Distill, Authorea, etc) but nothing that deals with the vast number of PDF papers that already exist.

So, we built Arxiv Vanity: a site that renders Arxiv papers as web pages. It’s still pretty janky, but for the papers that do render correctly, the experience is so much better than reading a PDF. For example:

https://www.arxiv-vanity.com/papers/1705.04085v3/

https://www.arxiv-vanity.com/papers/1708.00884/

https://www.arxiv-vanity.com/papers/1705.06031v2/

The source for the LaTeX to HTML renderer is on GitHub[0]. It’s built on Pandoc[1] and Distill.pub’s template[2].

[0] https://github.com/arxiv-vanity/engrafo

[1] https://pandoc.org

[2] https://github.com/distillpub/template

[+] JBorrow|8 years ago|reply

One of the things that I came across when writing my own janky PDF/LaTeX->HTML converter for lecture notes[0] is that Pandoc doesn't handle references and subfigures correctly, even with pandoc-crossref and pandoc-citeproc enabled. I had to write a little Python module[1] that used regex to extract those and then handle them separately. This is definitely something you should look at.

[0] https://dmaitre.phyip3.dur.ac.uk/NPP/notes/

[1] https://github.com/JBorrow/latex-pandoc-preprocessor
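
A minimal sketch of the regex approach described above, assuming every \label is numbered in document order and every \ref points at one of those labels. The function name and the numbering scheme are illustrative assumptions and are not taken from latex-pandoc-preprocessor.

```python
import re

LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
REF_RE = re.compile(r"\\ref\{([^}]+)\}")


def resolve_refs(tex: str) -> str:
    """Replace \\ref{key} with the 1-based position of the matching \\label{key}.

    Hypothetical helper: real documents number figures, sections and
    equations separately, which this sketch ignores.
    """
    numbers = {}
    for match in LABEL_RE.finditer(tex):
        # First occurrence of a key wins; duplicates keep their number.
        numbers.setdefault(match.group(1), len(numbers) + 1)

    def substitute(match):
        # Leave a visible marker for unknown keys, like LaTeX's "??".
        return str(numbers.get(match.group(1), "??"))

    # Strip the labels, then substitute the references.
    return REF_RE.sub(substitute, LABEL_RE.sub("", tex))
```

Running something like this before Pandoc means the converter only ever sees plain numbers, at the cost of losing the link target unless anchors are added separately.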

[+] Drup|8 years ago|reply

I was looking for a way to turn my (soon-to-be-defended) PhD thesis into an epub, and investigated the various LaTeX-to-HTML converters. I was pretty disappointed when I realized that all of them are terrible and have no hope of handling my manuscript. My current solution is to create a rendering of my thesis in A5 format. :/

This looks quite a bit better, so here is the question: what do you not support at the moment?

[+] KGIII|8 years ago|reply

I'm not sure I understand. Well, I understand what you're doing but I'm not sure why you'd dislike PDF.

PDF has the great benefit of rendering the same on every system. With very few exceptions, PDF will look exactly the same on every system and will print the same on every system.

HTML doesn't really have that same benefit.

Don't get me wrong, I think your service is a great idea for those who would like HTML formatted results, but I'm not understanding the complaint about PDF.

Could you expand on why you don't like PDF?

[+] Myrmornis|8 years ago|reply

> particularly phones/tablets

I understand the problem with a phone, but PDFs on an iPad/tablet are beautiful and a joy to read. Much better to read the text as originally typeset than to put it through a process such as this, which risks corrupting minor but important details in the mathematical content.

On my phone I put it in landscape mode and that allows me to read a PDF OK, but I don't really get why one would read academic papers on a phone. Why not use a tablet?

However I'm very interested in engrafo. It sounds like it will allow me to automatically publish blog style content from my LaTeX sources without having to fork the LaTeX content into a markdown / HTML version.

I just don't understand why you don't like reading academic papers as PDFs on tablets!

[+] Eridrus|8 years ago|reply

This is cool. It would be nice to have a Chrome extension to take me directly to this from the page/PDF.
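
A rough sketch of the URL mapping such an extension would need, based only on the URL shapes that appear in this thread (arxiv.org/abs/<id> and arxiv-vanity.com/papers/<id>/). The function name is hypothetical, the logic is shown in Python rather than extension JavaScript, and old-style identifiers such as hep-th/9901001 are not handled.

```python
import re

# Matches identifiers such as 1705.04085 or 1705.04085v3 in abs/ or pdf/ URLs.
ARXIV_ID_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5}(?:v\d+)?)")


def vanity_url(arxiv_url: str):
    """Return the Arxiv Vanity URL for an arXiv abstract or PDF URL, or None."""
    match = ARXIV_ID_RE.search(arxiv_url)
    if match is None:
        return None
    return f"https://www.arxiv-vanity.com/papers/{match.group(1)}/"
```

An extension would run the same match against the current tab's address and redirect only when the result is not None.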

[+] popcorncolonel|8 years ago|reply

It would be amazing if we could browse "Latest" by category, and for a certain day, much like: https://arxiv.org/list/math.NT/recent

[+] nextos|8 years ago|reply

It's very nice. You should expand to cover bioRxiv (biology) too.

[+] apepe|8 years ago|reply

Here is that first article as it looks when imported into Authorea in one click: https://www.authorea.com/users/3/articles/208068-automatic-e... (just a couple of labels and SI units do not render). Note: it is forkable and can be commented upon.

[+] leephillips|8 years ago|reply

In all three cases I find the original PDFs more pleasant to read. HTML typography is not up to snuff. I read them on a laptop, however, and I can see that this would be useful if one is forced to read on a phone.

(One thing that is very ugly in the PDFs, and most scholarly papers, is the use of different-colored boxes for hyperlinks. Authors, please consider putting

\usepackage[colorlinks]{hyperref}

in your LaTeX preambles.)

[+] jpeloquin|8 years ago|reply

This is really cool and has a lot of potential. Academic papers are dense and heavily cross-referenced, so experimenting with new display formats that do more to help the reader could make researchers a lot more productive. For example, citation tooltips are a big time saver compared to cross-referencing the bibliography. However, it's also beneficial for every paper to look the same because this makes skimming easier. The way to get both innovation and consistency is to develop tools, like Arxiv Vanity, that automatically transform the source document. This example makes me hopeful that we'll someday have similar tools for the commercial publishers' papers.

As for immediate tweaks, I tentatively suggest making the text 100% black (like the original PDF) instead of rgba(0, 0, 0, 0.8). The higher contrast will help those of us with less-than-great eyes.
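
To put a number on the contrast difference, here is a small sketch using the WCAG 2.x relative-luminance and contrast-ratio formulas. It assumes the text sits on a pure white background, which is an assumption about the page and not something stated in this thread; the function names are illustrative.

```python
def blend_on_white(channel: float, alpha: float) -> float:
    """Alpha-composite one 0-255 channel over a white background."""
    return alpha * channel + (1 - alpha) * 255


def relative_luminance(r: float, g: float, b: float) -> float:
    """WCAG 2.x relative luminance of an sRGB colour with 0-255 channels."""
    def linearize(c: float) -> float:
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


def contrast_on_white(r: int, g: int, b: int, alpha: float = 1.0) -> float:
    """Contrast ratio of (possibly translucent) text against white."""
    blended = [blend_on_white(c, alpha) for c in (r, g, b)]
    # White has luminance 1.0, so the numerator is 1.0 + 0.05.
    return 1.05 / (relative_luminance(*blended) + 0.05)
```

Under that assumption, rgba(0, 0, 0, 0.8) blends to roughly #333333 and gives a ratio of about 12.6:1, against 21:1 for pure black. Both clear the WCAG AAA threshold of 7:1, so the suggestion is about comfort rather than a compliance failure.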

[+] dang|8 years ago|reply

This is amazing. I hope you'll keep working on it. There's always a long tail of details that need taking care of when trying to cover a large corpus, and ploughing through successive 80%'s is (as you are no doubt acutely aware) serious grunt work. But you've made a fabulous start, so I hope you find the stamina to do it!

[+] bfirsh|8 years ago|reply

Yeah, even building upon Pandoc's LaTeX parsing, 3 months of grunt work got us this 20% working. Over the next 12 months we'll get the other 80% working. :)

[+] jknz|8 years ago|reply

A big challenge is to get references working correctly. LaTeXML is quite good at converting LaTeX documents to HTML [1], including references such as Theorem 2.1, equation (8.1) etc.

For instance, the paper [2] appears to be quite readable on mobile, and clicking/tapping on a reference such as (8.1) leads you to equation (8.1) as you would expect.

The auto-generation of Arxiv-Vanity is really nice. Maybe it would be easy to add the LaTeXML output too?

[1]: http://www.albany.edu/~hammond/demos/Html5/arXiv/lxmlexample...

[2]: http://www.albany.edu/~hammond/demos/Html5/arXiv/LaTeXML/110...

[+] ddinh|8 years ago|reply

This is an awesome tool. Thanks!

Only issue I've run into so far is that cross-references to theorem numbers don't seem to always work correctly, e.g. you'll see a lot of "Theorem ?" in https://www.arxiv-vanity.com/papers/1607.06711/.

[+] bfirsh|8 years ago|reply

Ah, looks like we don't support theorems. You can track it here: https://github.com/arxiv-vanity/engrafo/issues/157

Thanks!

[+] peatmoss|8 years ago|reply

Not sure if all these are x-refs to theorems or not, but there seem to be lots of [?] links: https://www.arxiv-vanity.com/papers/1602.08927/

That said, on a cursory look, this is pretty impressive. LaTeX->web converters have existed for a long time, and this appears to have navigated some aspects quite well!

[+] beezle|8 years ago|reply

It failed on all three papers I tried it on:

https://arxiv.org/abs/1710.05313

https://arxiv.org/abs/1710.06689

https://arxiv.org/abs/1710.07508

[+] bfirsh|8 years ago|reply

We haven't implemented many of the LaTeX packages used in papers that aren't machine learning papers yet - sorry. :(

[+] tothrow2017|8 years ago|reply

This looks really cool. (The program had some issues with the bibliography and with custom layout, but other than that, was great.)

It would be nice if an option to output MathML existed.

(Why MathML?

In brief, it allows treating Maths as a first-class citizen on the web.

For instance, with MathML the reader can choose what font the equations will be rendered in — if you prefer STIX or Latin Modern Math, then you can specify it with CSS, and the browser will correctly render it. With the mash of spans within spans that arXiv-vanity uses, you couldn't change the font, as then the pre-calculated spacings would be wrong. (Alternatively, the publisher could easily offer several styles, without having to re-render everything, just by changing the CSS.)

Arguably, client-side MathJax offers the same flexibility as MathML, but it's much, much slower, while rendering MathML in Firefox is as fast as rendering standard, static HTML.

Another application of MathML is embedding it in SVGs for beautiful graphs.

MathML can also be pasted into other applications that support it, such as Thunderbird and Mathematica.)

[+] sturmen|8 years ago|reply

This is awesome! I was literally rolling my eyes this morning about trying to read an arXiv paper on my phone.

[+] ldenoue|8 years ago|reply

Shameless plug: give https://docushow.com a try!

[+] strin|8 years ago|reply

Great work.

I've also been working on a similar open-source project "Sharead".

https://github.com/strin/sharead

It has a Chrome extension that uploads Arxiv papers, and you can manage papers with tags.

It also automatically converts PDF to HTML using a library called pdf2htmlEX:

https://github.com/coolwanglu/pdf2htmlEX

[+] j2kun|8 years ago|reply

Looks like it's failing to process some standard TeX commands (e.g. \textup) as well as some user-defined macros. See the many display errors in https://www.arxiv-vanity.com/papers/1710.07406/

Of course, it goes without saying that I want this.

[+] mastazi|8 years ago|reply

When the render fails, why are you redirecting to the PDF file instead of redirecting to the abstract? E.g. here (link stolen from another comment in this page) https://www.arxiv-vanity.com/papers/1608.04012/

[+] jxramos|8 years ago|reply

Noob question, but how far does Calibre take PDF to EPUB conversion? I've been really interested in learning more about the EPUB file format and was greatly intrigued to discover it extends XHTML and is essentially a ZIP folder, if I've gathered that much correctly.
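
On the container side of that question: an EPUB is a ZIP archive whose first entry is an uncompressed `mimetype` file containing `application/epub+zip`, with `META-INF/container.xml` pointing at the package document. The sketch below only inspects that layout using Python's standard `zipfile` module; the helper names and the `OEBPS/` directory are illustrative, and the toy archive is not a valid book. It says nothing about how well Calibre converts PDFs.

```python
import io
import zipfile


def epub_entries(data: bytes):
    """Return (mimetype, member names) for an EPUB held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        mimetype = archive.read("mimetype").decode("ascii").strip()
        return mimetype, archive.namelist()


def tiny_epub_skeleton() -> bytes:
    """Build a toy archive with the EPUB container layout (not a valid book)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html/>")
    return buffer.getvalue()
```

Passing the bytes of a real .epub file to `epub_entries` lists the XHTML chapters, stylesheets and images it contains.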

[+] dbranes|8 years ago|reply

Unfortunately it's failing on the first things I tried, with a not-so-helpful error message:

https://www.arxiv-vanity.com/papers/1608.04012/

https://www.arxiv-vanity.com/papers/0903.3065/

Also a lot of MathJax failures (maybe LaTeX variable names?) https://www.arxiv-vanity.com/papers/1709.09439/

[+] bfirsh|8 years ago|reply

Those problems are normally Pandoc parsing errors. Considering it's open source, perhaps we should print the error message so people can actually help fix it...

The MathJax failures are either things that MathJax doesn't support, or use of \DeclareMathOperator which we haven't added support for yet.

Edit: Added a more useful error message. :) https://www.arxiv-vanity.com/papers/1608.04012/
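
One possible workaround for the \DeclareMathOperator gap, sketched as a preprocessing pass that inlines each declaration as \operatorname before the math reaches the renderer. This is an assumption about how it could be done, not how engrafo handles it; the function name is hypothetical, it assumes the renderer accepts \operatorname (and the starred form), and operator bodies with nested braces are not handled.

```python
import re

DECL_RE = re.compile(
    r"\\DeclareMathOperator(\*?)\s*\{\s*(\\[A-Za-z]+)\s*\}\s*\{([^{}]*)\}"
)


def expand_math_operators(tex: str) -> str:
    """Inline \\DeclareMathOperator definitions as \\operatorname calls.

    Hypothetical preprocessing step; operator bodies containing nested
    braces are not handled.
    """
    operators = {}
    for star, name, body in DECL_RE.findall(tex):
        operators[name] = f"\\operatorname{star}{{{body}}}"

    # Drop the declarations, then rewrite every use of each operator.
    tex = DECL_RE.sub("", tex)
    for name, replacement in operators.items():
        # (?![A-Za-z]) stops \arg from matching inside \argmax.
        pattern = re.compile(re.escape(name) + r"(?![A-Za-z])")
        tex = pattern.sub(lambda m, r=replacement: r, tex)
    return tex
```

The replacement is passed as a function so that backslashes in the operator body are inserted literally instead of being read as regex escapes.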

[+] gumby|8 years ago|reply

Thank you thank you thank you! I detest the reader-hostile PDF (and WTF? why would you write something and then make it inconvenient to read???)

Unfortunately, among its sins, PDF discards a lot of the presentation semantics (headers, footnotes etc). Congrats on doing a credible job trying to reconstruct some of that! It's a tough, thankless job.

I was horrified when Adobe introduced PDF and indeed it has turned out at least as badly as I had feared.

[+] Analog24|8 years ago|reply

I believe it's reconstructed from the LaTeX source, which is how every paper is submitted to the ArXiv. Not to diminish this site, but I'm guessing that generating HTML from LaTeX is a lot easier than doing it from PDF format.

[+] impendia|8 years ago|reply

I am curious, would you mind explaining why you dislike PDF?

(I'm an academic and I'm used to PDFs and I like them myself.)

[+] zitterbewegung|8 years ago|reply

This is awesome! Going on my home screen now. Love the design. Maybe you could ask Arxiv to have a button on their site that opens the paper on your site.

[+] vimarshk|8 years ago|reply

This is so good. I do prefer the HTML over PDF in this scenario.

[+] auggierose|8 years ago|reply

Not really my use case as I read PDF papers on an iPad Pro 12.9 inch, which is just fine, but very neat work!

I tried it on this one: https://www.arxiv-vanity.com/papers/1702.03277/

Some commands don't work (\textsl, \rotatebox, ...) and the thank you footnote is incorporated into the title, but otherwise very readable!

[+] captn3m0|8 years ago|reply

This is so much awesome. Thanks for building this.

[+] maxxxxx|8 years ago|reply

Nice! PDF is the worst format I can think of to present papers. Especially for reading on mobile this will be of great help.

[+] leephillips|8 years ago|reply

PDF is the only format that will preserve the typographical details that are important in many technical papers; it also avoids the relatively bad rendering created by the browsers.

PDF is usually bad, of course, on small screens, unless the publisher makes special versions.

[+] lvs|8 years ago|reply

I think this is a dramatic difference of opinion between coders and everyone else in science, if you excuse the generalization. I say that because of the large number of tools to try to one-up the PDF for academic scholarship (e.g. Readcube). Publishers looking to invent ways to define their own value have been slowly trying to force these things onto readers, but almost everyone in my field hates them and wishes they would die. Formatting and typography are critical in many fields, and a PDF is the canonical way to maintain these aspects of a paper.

[+] huangc10|8 years ago|reply

Definitely useful in certain situations. Can't comment on how well the conversion works (yet) but I can see how this might be useful to a lot of people.

Me? I still mostly prefer reading physical academic papers because of needing to flip back and forth for re-reading (clarification) and adding personal notes/graphs/calculations.

Good job guys.

134 comments