PDF Is the World's Most Important File Format

[+] crazygringo|7 years ago|reply

I didn't "get" PDF's until I started doing academic research.

But the ability to collect papers, books, documents, etc. all in a single format that I can read on any device, and mark up with highlights and notes, has been a game-changer.

Yes it's a lowest-common-denominator format. And it's designed for human reading and manual office tasks, not computer processing of data. But it works. And it's supported everywhere.

Doesn't matter if I'm on the Apple or Google or Microsoft or Adobe stack. Doesn't matter if the PDF is 20 years old. It just works.

[+] timw4mail|7 years ago|reply

PDFs for reading is one thing. Fillable PDFs are evil incarnate. Apparently there are multiple ways to do it, and only Acrobat Reader is capable of making them all work.

Then you have abominations like embedded flash...

As a standalone file representing the format of a book, PDFs are a good format. But then PDFs can unfortunately (sometimes) store much more, and then can be a security minefield.

[+] maxxxxx|7 years ago|reply

After spending time extracting data and text from PDF I would also say it’s the worst file format. We have perfectly fine structured documents, convert them to PDF to lose half of the information and then we spend insane effort to get the data back somehow.

It also doesn’t render well on different resolutions.

PDF is a perfect case study how inferior solutions can become standards.

[+] svnpenn|7 years ago|reply

PDF isn't and was never meant as a data storage format. It is meant as a presentation format. You write a file in latex or docx then convert to PDF for sharing. This means the recipient only needs a PDF reader, not whatever tools you used to create the file. This is similar to writing a file in C and compiling to an executable. You can distribute the source, but most (outside of programmers) just want the executable so they don't have a to worry about the tooling.

[+] SigmundA|7 years ago|reply

PDF is based on Postscript, it designed for print perfect output, its essentially a bunch of rendering commands for a printer its not supposed to be a word processing format that can be edited later its the final output from a word processor or whatever.

[+] ryandvm|7 years ago|reply

Agreed. PDF is intended to accurately render a specific presentation format - often desktop. There is nothing quite like the melange of frustration you get when you are trying to fix something and realizing you have to open up a PDF product manual on your phone. Dragging and zooming all over the place like it's Google fucking Maps. Bonus points when the text isn't even searchable.

[+] cgtyoder|7 years ago|reply

It's important to differentiate between the file format and software that renders that file format. Admittedly there are better and worse PDF viewers, but that shouldn't be the final (or even most important) determiner of how important it is.

[+] oever|7 years ago|reply

PDF/A-1a is a form of PDF where the text is embedded in structured way. It is a 'tagged PDF'. This is meant to make PDF files accessible. It makes it possible to extract the text well.

Some governments, e.g. Dutch government, mandate that PDFs be tagged PDFs.

[+] _emacsomancer_|7 years ago|reply

I think .doc(x) is clearly the worst file format - they're not as portable as people seem to think they are, and they're not long-lasting. PDF, for all of its failing, doesn't suffer from these issues.

A better solution, perhaps, for preserving structure, would be TeX, markdown or Org files, but PDF has the advantage of not having to be compiled and being ready for presentation/consumption (possessing platform invariance).

[+] jimjimjim|7 years ago|reply

I loved it when pdf started gaining traction.

back then there was only 1 format in the world .doc

PDF became standard because it was a SUPERIOR solution for the use cases that most people wanted.

[+] bluejay2387|7 years ago|reply

I have to agree. PDF is the biggest pain in the rear I've had to deal with in my development career. I am also not crazy about the fact that such an important standard for storing information is so heavily influenced by one company (though PDF is an open standard at least).

[+] type0|7 years ago|reply

PDF, What is it FOR?

Here's computerphile video that explains it

https://www.youtube.com/watch?v=48tFB_sjHgY

[+] interlocutor|7 years ago|reply

PDF is based on PostScript, which is a Page Description Language (PDL) also from Adobe. The first PostScript printer, the Apple LaserWriter, launched the desktop publishing revolution. The interesting thing about PostScript is that it is a full programming language, with loops and conditionals and so on. When Adobe designed PDF, they kept the same imaging model as PostScript, but stripped out the programming language. Thus they ended up with a dynamic page description language (PostScript) for media (printers) that cannot fully take advantage of a dynamic PDL, and a static PDL (PDF) for media (i.e., computer screens) that could have benefited from a full programming language!

[+] irrational|7 years ago|reply

I don't know. I personally think plain text is a more important file format. It's read and writeable by a million programs. Any first year programming student can easily build their own program to read and write plain text. It is fairly easily parseable. Etc.

I just opened a random PDF on my computer in a text editor and it starts off with "xÕ\ko‹∆˝Œ_1@ÉbD 9|ÌËáƒZçÌDJÇ¢)UZ[nıÚÆÏƒA˛Pˇeœπ"

[+] cosmie|7 years ago|reply

Plain text can be a really problematic format for data preservation.

- While it's fairly easy to read and write plain text, it's also fairly easy to inadvertently introduce unintended artifacts in the process.

- The more frequently a file gets passed around and read and written to, the more likely mojibake[1] will get introduced. This concern rises exponentially when you move to non-US audiences and introduce local-specific encodings. File storage settings, client operating system settings, server configuration settings, database settings, programming languages that touch it along the way. All of them introduce assumptions along the way of a file's encoding, and many failure cases can be subtle and easily go unnoticed at a glance while causing some irreversible damage to downstream recipients.

- Even if you solve for the encoding, you still have structural issues with tabular data. Different parsers treat escaping and quoting policies differently. This can result in data shifts as things get mis-parsed, data corruptions if literal values get interpreted as escape characters or vis versa, etc.

For preserving data, generic plain text tends to get worse and worse over time because it's such a non-opinionated format and even if you document the specifics on econdings and parsing details it's easy for those to get lost over time as things exchange hands or for intermediaries to corrupt the plaintext because they relied on defaults instead of the documented parsing details.

For better or worse, PDF tends to solve the preservation issue while introducing potential barriers on the parsing/processing side.

[1] https://en.wikipedia.org/wiki/Mojibake

[+] ip26|7 years ago|reply

Using your goalposts I'd argue a paper napkin is the most important file format. It's readable & writeable by 7 billion wetware programs, including the drunk ones, and even a first grader can make use of it.

[+] Someone|7 years ago|reply

No, it doesn’t. PDF files start with “%PDF” (https://en.wikipedia.org/wiki/Magic_number_(programming)#Mag...)

[+] SigmundA|7 years ago|reply

It also very inefficient especially if you want things like images embedded in it.

Its not like you can read text without the right program, its still binary, there just happens to be a mostly agreed upon standard and a lot of programs that can decode that standard and render to screen.

PDF viewers are built into most browsers now and allows rich page perfect print ready results.

If I had it my way we would settle on a binary data serialization format, I don't care MsgPack, Protobuf, heck maybe even sqlite and then everyone could have a viewer to snoop around in it. You would still have to understand whats being encoded but you could always "view source" so to speak.

[+] echeese|7 years ago|reply

PDF files begin with %PDF, not sure you have a valid PDF there

[+] jcranmer|7 years ago|reply

I'm not sure how you got that string, since there's no charset except full-Unicode ones that would contain both ∆ and œ, and compressed binary text isn't going to appear to be UTF-16 or UTF-8, so I don't see any obvious charset you could be using.

Hmm. Maybe your charset is Windows-1253, and unassigned characters are being supplanted with the equivalent codepoints from Windows-1252?

[+] mkl|7 years ago|reply

Actually, PDF is primarily a text format, it's just usually compressed, and often contains embedded binary content like fonts and images.

You can see the text code by doing something like this:

  pdftk input.pdf output input_uncompressed.pdf uncompress

You can also edit it in that state, in an editor that preserves binary content, but there's a hard-coded offset table at the end, so if you change the length of something, that needs updating (very fiddly to attempt by hand, but automatable, and some pdf tools automatically fix broken offset tables).

[+] freddie_mercury|7 years ago|reply

There's really no such thing as "plain-text".

Maybe you mean ASCII, which is indeed simple and also not useful for several billion people. (I live in one of the many countries that cannot use ASCII -- the American Standard Code for Information Interchange -- because don't speak American here.)

Or maybe you mean Unicode which is an extremely long spec and absolutely cannot be handled by a first year programming student.

[+] SllX|7 years ago|reply

Arguably it actually is plain text, but I think you can build a case for at least a dozen file formats that they’re the most important file format in the world, so arguably, none of them are.

[+] AnaniasAnanas|7 years ago|reply

The solution to that is postscript. I still see quite a few papers being distributed in postscript form.

[+] jonathanyc|7 years ago|reply

PDF files start with %PDF. You’re probably looking at DEFLATEd or binary data. What do you expect?

[+] jerzyt|7 years ago|reply

My gripe with PDF is that I don't understand why a standard format which is almost 30 years old, requires seemingly weekly updates of the Acrobat Reader, which in turn requires reboot of my work laptop. I upgrade the reader far more often than I actually use it.

[+] SigmundA|7 years ago|reply

Now days you don't need acrobat reader typically, Chrome, Edge and Firefox can all view/print PDF's. If your on a Mac/iOS the viewers are built as the OS uses display postscript internally (quartz) which matches up pretty closely with PDF.

[+] cromulent|7 years ago|reply

Sounds like your gripe is with Adobe, not PDF.

[+] autoexec|7 years ago|reply

Probably because Adobe's PDF reader is full of vulnerabilities and constantly used to spread viruses and compromise machines. I wouldn't be surprised if there were more malware infested PDF files on the internet than legit ones.

[+] jimjimjim|7 years ago|reply

That is acrobat's fault not pdf.

there is churn in all major applications these days regardless of if it is needed. Reader is just trying to regain mind-share back from people viewing pdfs in browsers.

[+] fghtr|6 years ago|reply

https://pdfreaders.org

[+] jcelerier|7 years ago|reply

sumatrapdf works fine

[+] nathan_f77|7 years ago|reply

I run a PDF generation service [1], so this is nice to read. When I first launched the service, I was worried that PDFs and paper forms might become obsolete in the near future, when everyone starts to go paperless and digital. Now that I'm more familiar with the market, this is no longer a concern. (I have banks who are using the software to modernize their operations.)

I also realized that there might be some pressure to turn into the next TurboTax, where the company eventually lobbies against improvements just so we can stay in business. I made a resolution that I'll never do anything like that. But I guess the founders of TurboTax never intended to do that either.

[1] https://formapi.io

[+] bluedino|7 years ago|reply

I love PDFs. The first time I used one was Acrobat 2.0. I bought a shitty SAMS computer book that came with a bonus shitty SAMS computer book on the CD. As a PDF of course.

It was cumbersome on a Pentium 75MHz with 640x480x8 graphics. I'd be blown away 20 years later viewing those same files on a Macbook Pro with Retina display.

However, it was easily the best way to distribute printed documents EXACTLY the way they were meant to be seen.

They weren't meant to be edited, modified, data extracted from...And adobe went from a quick, minimal viwer to a bloated security nightmare by adding 'features.

Luckily, 3rd party and open-source projects came to the rescue.

[+] nyfresh|7 years ago|reply

PDF is the perfect tool for maintaining document formatting but the worst tool for maintaining document data. The format is not concerned the the actual data of document. Programatically extracting data from the simplest PDF is an exercise in patience. I find it kind of odd that this was chosen as the standard all things considered

[+] djhworld|7 years ago|reply

I sometimes read papers but I find them an absolute pain to read them on my phone because the two aren't really designed for each other. So I end up with this frustrating experience of zooming in/out to read this bits I'm interested in - which is impossible to do one handed!

I know there's that arxiv vanity thing, which is cool, but most of the time I get the "sorry we can't render this" error message.

[+] jyriand|7 years ago|reply

PDF is fine if you are using a laptop/desktop/tablet. But it doesn’t fit well into phone screens. Usually I have to turn my phone horizontally and then zoom in little bit. And when I reach the end of the page I have trobles turning it. And sometimes page turning messes up my perfect zoom fit etc. Responsive PDF is what I need...

[+] anon1253|7 years ago|reply

It's also actively harming medical and scientific progress as I allude to here in this talk https://www.youtube.com/watch?v=EM61rn9Gxl4&list=PLjzcwcP9P2...

[+] bla3|7 years ago|reply

This would've been a lot more convincing before mobile happened a decade ago. Reading PDFs on my phone is a pain, and it hasn't improved at all in the last ten years. Looking at the web on mobile used to be painful too, but it's ok now.

[+] levesque|7 years ago|reply

Possibly also the most hated file format. Just the fact that no two PDF editors behave the same way goes to show how bad of a format it is.

[+] wyld_one|7 years ago|reply

It is not open data.

You still can have proprietary blocks of info inside the file.

The lack of open source tools to manipulate the format is a major hindrance IMHO.

It is also a very space wasting as well when people only do a bitmap dump into the file for scans. Forms are also an area that is not open source either.

It has so many hacks and kludges, it would be better if it was trashed and we start with postscript again.

[+] lixtra|7 years ago|reply

What happened to HTML? Nobody could read nor discuss about PDF here without it.

[+] harshreality|7 years ago|reply

The minor html variant standardized as epub is alive and well, and ubiquitous for ebooks — whereever exact page layout (like scientific textbooks) and page # references don't matter.

With very narrow exceptions, that are even narrower than PDF publishers think, epub is and should be preferred even for technical or academic writings. (Kindle's kf8 format is just a repackaged variant of epub.)

[+] SigmundA|7 years ago|reply

Let me know when browser support the full CSS print spec...no Prince doesn't count.

[+] ris|7 years ago|reply

The thing I find craziest about PDF/A is it isn't really a format in itself, just a vague promise not to use certain features in the ensuing file. Whether any reader holds the file to that promise is something I'm quite doubtful of. Instead I suspect most readers will do their best to display anything they're handed, happily passing it through any of the hundred-or-so, possibly legacy-code-powered sub-format decoders the file author wishes - leading to a massive attack surface.

From a developer's point of view, when trying to enforce that submitted files are strictly in PDF/A format, from what I can tell there isn't much more you can do than dissect the file looking for umpteen disallowed features.

Is there an ISO-compliance validator to anyone's knowledge?

[+] mkl|7 years ago|reply

Yes! https://verapdf.org/

[+] jdblair|7 years ago|reply

I remember when PDF came out, I didn't get it. I thought the problem was already solved by compressed postscript files! I was already used to downloading and sometimes printing paper documentation in this format.

It was natural to view Postscript files on NeXT and UNIX machines, and Ghostscript was already thing. What could be better than just using the "native" language of the printer? I didn't realize that this was not a common view, or even possible for most personal computers at the time.

I was also misinformed for quite some time about the internal format of PDF, assuming it was just PS wrapped up in a container. In a sense, this is true, but there's a lot more (embedded fonts, transparency, forms are just a few that come to mind).

[+] mkl|7 years ago|reply

> I was also misinformed for quite some time about the internal format of PDF, assuming it was just PS wrapped up in a container. In a sense, this is true, but there's a lot more (embedded fonts, transparency, forms are just a few that come to mind).

There's also a lot less, as PDF is not a full programming language like PS.

[+] bondolo|7 years ago|reply

PDFs are an accessibility nightmare and most production pipelines are terrible at preserving the semantic structure of the document or, in many cases, even preserving the text. A PDF is in many cases as series of page images which aren't usable for anything other than human viewing or printing. "Export as PDF" generally produces much better results than print to PDF though not from every application. Many days I wish there was an alternative solution based on SVG. While not perfect it certainly would avoid many of the problems of PDF while having all of the important capabilities.

[+] KingMachiavelli|7 years ago|reply

I really hate PDF for many reasons. It seems it's only a partially open format, a lot of features are implementation dependent (how can a file be 'locked' to prevent printing or editing?). There are very few free and FOSS clients that handle forms, highlighting, etc. Some clients do highlighting and annotations but don't save it to the PDF itself.

The failure of epub and other HTLM based formats in this use case, IMO, is that their focus on reflowing to support any display and device makes them inconsistent and therefore impractical for replacing PDF based content.

212 comments