Local PDF Tools – Powered by WebAssembly

[+] svat|5 years ago|reply

Last year I wrote a couple of similar "local" PDF tools that run in the browser with no network requests. Each is just a single HTML file that will work offline:

- https://shreevatsa.net/pdf-pages/ is for extracting pages, inserting blank pages, duplicating or reversing pages, etc.

- https://shreevatsa.net/pdf-unspread/ is for splitting a PDF's "wide" pages (consisting of two-page spreads) in the middle.

- https://shreevatsa.net/mobius-print/ is the earliest of these, and written for a niche use-case: "Möbius printing" of pages, which is printing out an article/paper two-sided in a really interesting order. (I've tried it and love it.)

These don't use WebAssembly, but just use the excellent "pdf-lib" JS library. To keep the file self-contained, I put the whole minified source into a <script> tag at the bottom of the (otherwise hand-written) HTML file.
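
The geometry behind splitting a two-page spread "in the middle" is small enough to sketch. This is not the pdf-unspread source (which uses pdf-lib in JavaScript); it is a minimal Python illustration, assuming a page box given as (x0, y0, x1, y1) in PDF points and a hypothetical helper named `split_spread`:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def split_spread(media_box: Box) -> Tuple[Box, Box]:
    """Return (left_half, right_half) of a wide page, cut at the horizontal midpoint."""
    x0, y0, x1, y1 = media_box
    mid = (x0 + x1) / 2
    # Both halves keep the full height; only the x range changes.
    return (x0, y0, mid, y1), (mid, y0, x1, y1)
```

A real tool would then emit each half as its own page, for example by copying the page twice and setting a different crop box on each copy.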

[+] skrebbel|5 years ago|reply

I hope to never again print more than 2 pages in a go in my life, but if I do, I'm definitely going to use your Möbius printing tool, it's genius.

[+] hackerbrother|5 years ago|reply

Very cool!! Thanks for sharing :)

[+] justsayinghi|5 years ago|reply

Very useful pdf tools

[+] kickbeak|5 years ago|reply

Hey, thanks for posting it here. I built this tool, hope you like it. Feel free to look at my source and contribute: https://github.com/jufabeck2202/localpdfmerger

[+] Abishek_Muthian|5 years ago|reply

Is there a PDF tool which allows compression to a defined file size? Tools like Ghostscript can compress a PDF to different levels of quality by using different settings, but not to a defined file size. I understand that this has to do with the compression algorithm itself and that data can only be compressed up to a certain limit, but what if the target file size is within that limit?

I'm asking this because a user of my problem validation platform wanted a solution for this[1]: websites requiring document upload have a file size limit, and often the compressed file is either above the prescribed limit or well below it, thereby losing out on quality unnecessarily.

[1] 'Reduce document file size to specific size' (I have added the link to it on my profile, since it's my own platform).

[+] hnick|5 years ago|reply

It's a bit more complicated than it sounds. Text streams are generally just compressed as well as they can be using whatever scheme is available; the bulk of the space usage is often fonts and images. A PDF itself is not compressed so much as each part of it is compressed individually.

There's not much to do with fonts except don't embed them unless you need to, and don't have duplicate/overlapping subsets. If you do have these, it is very tricky to untangle; I'm not aware of any good tool to do it automatically.

For images, it depends on the format. If your PDF has JPEG (DCTDecode) images then it will have to resample to the JPEG spec; if it's TIFF, likewise. You can change the number of colour bits, or downsample the DPI (a simple gs command-line option), or change the compression scheme within the JPEG itself and then replace it. There are so many avenues to approach this that I'm not sure it's something easily achieved while still obtaining a good result across all possible PDFs.

Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.
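
The "iterate and downsample until you hit your target size" idea could look roughly like the sketch below. It is only a sketch: `render` is a hypothetical callback standing in for whatever actually re-encodes the PDF at a given DPI (for instance a Ghostscript invocation), and the search assumes that output size does not shrink as DPI goes up, which is plausible for scanned pages but not guaranteed.

```python
from typing import Callable, Optional, Tuple


def fit_to_size(
    render: Callable[[int], bytes],
    max_bytes: int,
    lo_dpi: int = 50,
    hi_dpi: int = 600,
) -> Optional[Tuple[int, bytes]]:
    """Binary-search the highest DPI whose output fits within max_bytes.

    render(dpi) is assumed to return the re-encoded PDF as bytes.
    Returns (dpi, data), or None if even lo_dpi produces too large a file.
    """
    best = None
    while lo_dpi <= hi_dpi:
        mid = (lo_dpi + hi_dpi) // 2
        data = render(mid)
        if len(data) <= max_bytes:
            best = (mid, data)  # fits: remember it, then try a higher DPI
            lo_dpi = mid + 1
        else:
            hi_dpi = mid - 1    # too big: try a lower DPI
    return best


def fake_render(dpi: int) -> bytes:
    """Stand-in for a real downsampler: size grows with the square of the DPI."""
    return b"x" * (dpi * dpi)
```

Binary search keeps the number of re-encodes small (about ten for the default 50 to 600 DPI range), which matters because each real re-encode of a scanned document is slow.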

[+] brailsafe|5 years ago|reply

I'd certainly be curious why wasm ends up being 15x slower than a native binary in this case, but it's not insurmountable. All of the major commercial PDF editing suites use wasm + their own C++ based PDF engine to great effect.

The article that this is based on is here, and a good read. It seems like it's at least non-trivial to get it working, and I'd wonder how the process looks for other compiled binaries, having not tried to do that implementation from scratch. https://dev.to/wcchoi/browser-side-pdf-processing-with-go-an...

[+] sigvef|5 years ago|reply

Looks like this thread is all about sharing our own related local browser-based PDF tools. Here’s mine: https://pdftotext.github.io

[+] kc0bfv|5 years ago|reply

There are a few versions of tools like this, or similar, available. Here's mine:

https://kc0bfv.github.io/WASM-PDF-Combiner/

I used existing wasm compiles of PDF tools. This use of wasm is pretty awesome to me - I often end up working on very restricted desktop clients with little customization possible, but they always let me run a browser.

[+] redman25|5 years ago|reply

Seems to be stuck at “Loading” for me on iOS Safari.

[+] kickbeak|5 years ago|reply

Looks cool

[+] codetrotter|5 years ago|reply

Convenient if you are on a machine where you can’t install software. (Corporate computer, school computer, library computer etc.)

For Linux and macOS computers that you are allowed to install software on I recommend the pdftk command line tool.

Ubuntu family:

  sudo apt install pdftk

macOS with Homebrew:

  brew install pdftk-java

[+] stephenr|5 years ago|reply

> For Linux and macOS computers that you are allowed to install software on I recommend the pdftk command line tool.

If you're on a Mac, the built in Preview tool has had the ability to merge and manipulate PDF documents for years.

[+] yomansat|5 years ago|reply

Can one easily install such apps as a Chrome app/PWA, and deactivate access to the internet since it doesn't need it and one can merge personal PDFs?

[+] mgm__|5 years ago|reply

I created a PDF table extractor tool last year with the same idea that it should be local only. Try it here: https://pdftableutil.possiblenull.com/app/ Also as a Google Docs addon (still local only) https://workspace.google.com/marketplace/app/pdf_table_impor...

I had a bad case of scope creep, so the tool can also extract tables from scanned/image PDFs using OpenCV.js and tesseract OCR wasm build!

[+] kickbeak|5 years ago|reply

Wow, that looks awesome! What did you use to display the PDF in the browser? It all feels really responsive!

[+] redman25|5 years ago|reply

This is interesting. How accurate would you say it is?

[+] danvk|5 years ago|reply

I’d love a tool (that’s not Acrobat) to manage comments on PDFs.

[+] franga2000|5 years ago|reply

Okular has support for comments, although I don't know if they're compatible with Acrobat's.

[+] naedish|5 years ago|reply

This is helpful. Generally if I need to do any pdf manipulation when I'm away from my own machine I use an android app - PDF Utils [1].

[1] https://play.google.com/store/apps/details?id=pdf.shash.com....

[+] not_knuth|5 years ago|reply

Does anyone know some good tutorials/explanations for understanding the PDF format at the byte level?

[+] gpvos|5 years ago|reply

The official PDF Reference is very readable, see maest's link. Just start at the bit about the file structure and data types, then the basics about the commands in the Contents stream, the graphics and text states, then whatever takes your fancy. Later versions added some extra complications such as compression of the xref table and object streams; you don't need them unless you encounter them. Don't delve too deep into the bit about fonts unless you have to, as it might bend your brain. A tool like qpdf to deconstruct a PDF to an uncompressed form is very handy.

[+] hnick|5 years ago|reply

The spec is really well written, and will take you far if you just want the basics (i.e. not the scripting stuff they added after 1.4 which is around where you should usually stop if you just care about printing). My only issue with it is when it comes to fonts or images, you'll have to break out to additional specs to understand those formats since PDF is more of a container.

I do like this minimal example as a way to get started and see how a very basic PDF is built.

https://brendanzagaeski.appspot.com/0004.html
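
To make the file structure concrete (header, numbered objects, xref table with byte offsets, trailer), here is a hedged Python sketch that assembles a blank one-page PDF by hand. It is not taken from the linked page; the function name `build_minimal_pdf` and the US Letter page size are assumptions made for illustration.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a blank one-page PDF and its cross-reference table by hand."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", recorded for xref
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # every xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos  # reader starts parsing here
    return bytes(out)
```

Writing the result to a file and opening it in a text editor next to the spec's file structure chapter shows how a reader works backwards from `startxref` to the xref table and then to each object.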

[+] maest|5 years ago|reply

This: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pd... or a newer version.

[+] ithkuil|5 years ago|reply

Does anybody know a simple tool I can use to turn an academic two-column paper into a single-column PDF (so I can read it easily on e-paper like a reMarkable)?

(Ideally I'd like to be able to run such a tool from browser/phone)

[+] jerry_tk|5 years ago|reply

Maybe try k2pdfopt (https://www.willus.com/k2pdfopt/). It is a Windows/Linux application though.

[+] georgeutsin|5 years ago|reply

Looking forward to using this tool! Are there plans to make this open source?

[+] pabs3|5 years ago|reply

It is based on an existing open source project:

https://github.com/pdfcpu/pdfcpu

[+] pininja|5 years ago|reply

Here’s the open source repo: https://github.com/jufabeck2202/localpdfmerger

[+] Ciantic|5 years ago|reply

It's great to see more PDF tools.

Many times I just want to clip white margins from PDFs so that they are easier to view on tablets or phones. Most viewers don't have a way to force the clipping of pages, so when you change page the zoom is lost and suddenly all the content is squished to the center.

The last time I looked for CLI programs to do this, approx. five years ago, it was really difficult to find good tools to edit PDFs like that.

It's actually not a trivial task, as sometimes pages have different margins, e.g. odd and even pages have different margins on the folding side of the page.
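
One way to handle the odd/even difference is to compute a separate crop box for each parity, taking the union of the content bounding boxes on those pages so nothing gets clipped. The Python sketch below assumes the per-page content boxes have already been measured somehow (e.g. by rendering each page and finding the non-white area); that step, and the name `crop_boxes_by_parity`, are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def union(boxes: List[Box]) -> Box:
    """Smallest box that contains every box in a non-empty list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def crop_boxes_by_parity(content_boxes: List[Box]) -> Dict[str, Box]:
    """One crop box for odd-numbered pages and one for even-numbered pages.

    content_boxes[i] is the bounding box of the visible content on page i + 1.
    """
    odd = content_boxes[0::2]   # pages 1, 3, 5, ...
    even = content_boxes[1::2]  # pages 2, 4, 6, ...
    result: Dict[str, Box] = {}
    if odd:
        result["odd"] = union(odd)
    if even:
        result["even"] = union(even)
    return result
```

Using one box per parity rather than one per page keeps the page size constant while flipping through the document, which is what stops the viewer from re-zooming on every page.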

[+] travis729|5 years ago|reply

For something like this, how do we know that the files are not sent to a server? Am I just trusting the web app? Is there any way to be sure other than having and reading the source?

[+] boustrophedon|5 years ago|reply

You can load the website and then disable the network, either by turning off the connection in your OS or via File -> Work Offline in Firefox.

I just did this and it worked.

[+] fireattack|5 years ago|reply

Open dev tool and monitor network?

[+] grok22|5 years ago|reply

Disconnect from the network and try it? Or disconnect from the network always when using this?

[+] desmap|5 years ago|reply

Something I still miss is a free and easy PDF tool which lets you delete, reorder and add pages from multiple PDFs. On Windows there is just Xodo but its UX is unfortunately subpar and on macOS you have Preview where the UI is better but once you have multiple PDFs from where you get the pages it can get confusing.

[+] c-st|5 years ago|reply

On Linux there is https://github.com/pdfarranger/pdfarranger It has a nice UI and is pretty easy to use.

[+] karthickgururaj|5 years ago|reply

PdfTk does this. I use the CLI version, but I think there is one with a GUI as well.

[+] number6|5 years ago|reply

On Windows there is: https://www.pdf24.org/

[+] simonmales|5 years ago|reply

I have this same idea on my to-do list. Great that people are experimenting with webapps that don't send any data!

[+] redman25|5 years ago|reply

JavaScript APIs for browsers can do a lot now. It’s great! I built something similar recently with Mozilla’s PDF library. It’s for diffing PDFs but everything happens locally. https://parepdf.com

[+] pabs3|5 years ago|reply

Hmm, I think I would just compile the pdfcpu Go source to native code, that might be faster than WebAssembly?

[+] whoisjohnkid|5 years ago|reply

Definitely

[+] horst_vie|5 years ago|reply

This is a nice use case for pdfcpu. If you are a pdftk user, give the pdfcpu CLI a spin. It is multi-platform and has some nice features baked in. https://pdfcpu.io/

59 comments