While useful, it needs a big red warning for potential leakers. If they were personally served the documents (e.g., via email, or while logged in), there really isn't much that can be done to ascertain whether leaking them is safe. It's not even safe if two or more leakers "compare notes" to try to "clean" something for release.
The watermark can even be contained in the wording itself (multiple versions of sentences, word choice, etc. store the entropy). The only moderately safe thing to leak would be a plain-text full paraphrasing of the material. But that wouldn't inspire much trust in the source.
https://en.wikipedia.org/wiki/Traitor_tracing#Watermarking
https://arxiv.org/abs/1111.3597
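As a toy illustration of how word choice alone can carry a fingerprint (the sentences and the encoding scheme below are entirely hypothetical, just to show the mechanics):

```python
# Toy sketch: encode a recipient ID by choosing between equivalent
# phrasings -- one bit of entropy per sentence pair.
VARIANTS = [
    ("The figures rose sharply.", "The figures increased sharply."),
    ("We met on Monday.", "The meeting took place on Monday."),
    ("Costs were cut by 10%.", "Costs were reduced by 10%."),
]

def watermark(recipient_id: int) -> str:
    """Pick variant 0 or 1 of each sentence according to the ID's bits."""
    sentences = []
    for i, pair in enumerate(VARIANTS):
        bit = (recipient_id >> i) & 1
        sentences.append(pair[bit])
    return " ".join(sentences)

def recover(doc: str) -> int:
    """Read the bits back by checking which variant appears in the leak."""
    rid = 0
    for i, (_a, b) in enumerate(VARIANTS):
        if b in doc:
            rid |= 1 << i
    return rid
```

With three sentence pairs this distinguishes eight recipients; a long document gives far more capacity, and the marks survive screenshots, printouts, and OCR.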
This doesn't seem to be designed for leakers, i.e. people sending PDFs -- it's specifically for people receiving untrusted files, i.e. journalists.
And specifically about them not being hacked by malicious code. I'm not seeing anything that suggests it's about trying to remove traces of a file's origin.
I don't see why it would need a warning for something it's not designed for at all.
I seem to remember Yahoo Finance (I think it was them, maybe someone else) introducing benign errors into their market data feeds to prevent scraping.
This led to people doing 3 requests instead of just 1 to correct the errors, which was very expensive for them, so they turned it off.
I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.
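The cancellation intuition can be sketched with a toy model: given several independently watermarked copies, a per-position majority vote strips out most per-copy variation, and the merged text is unlikely to match any single recipient's full fingerprint.

```python
from collections import Counter

def cancel_watermark(copies: list[list[str]]) -> list[str]:
    """Majority-vote each sentence position across independently
    watermarked copies. Any variant unique to one copy is voted away;
    with enough copies the result matches no individual fingerprint."""
    merged = []
    for position in zip(*copies):
        merged.append(Counter(position).most_common(1)[0][0])
    return merged
```

This only defeats schemes where marks are independent per copy; collusion-resistant codes (e.g. Tardos codes) are designed so that even merged copies still implicate the colluders.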
Oof, that's a great point. We briefly touched on this a few weeks ago, but from the angle of canary tokens / tracking pixels [1].
Security-wise, our main concern is protecting people who read suspicious documents, such as journalists and activists, but we do have sources/leakers in our threat model as well. Our docs are lacking in this regard, but we will update them with information targeted specifically to non-technical sources/leakers about the following threats:
- Metadata (simple/deep)
- Redactions (surprisingly easy to get wrong)
- Physical watermarking (e.g., printer tracking dots)
- Digital watermarking (what you're pointing out here)
- Fingerprinting (camera, audio, stylometry)
- Canary tokens (not metadata per se, but still a de-anonymization vector)
If you come to FOSDEM next week, we plan to talk about this subject there [2].
The goal here isn't to provide a false sense of security, nor to frighten people. It's plain old harm reduction. We know (and encourage) sources to share documents that can help get a story out, but we also want to educate them about the circumstances in which those documents may contain their PII, so that they can make an informed choice.
[1]: https://social.freedom.press/@dangerzone/115859839710582670
[2]: https://fosdem.org/2026/schedule/event/JZ3F8W-dangerzone_ble...
(Dangerzone dev btw)
Why not leak a dataset of N full text paraphrasings of the material, together with a zero-knowledge proof of how to take one of the paraphrasings and specifically "adjust" it to the real document (revealed in private to trusted asking parties)? Then the leaker can prove they released "at least the one true leak" without incriminating themselves. There is a cryptographic solution to this issue.
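A full zero-knowledge construction is beyond a comment sketch, but the commit-now, reveal-privately-later half of the idea can be illustrated with a plain hash commitment (function names are mine, and a real deployment would need an actual ZK proof system on top):

```python
import hashlib
import secrets

def commit(document: bytes) -> tuple[str, bytes]:
    """Publish the digest publicly alongside the paraphrasings; keep the
    salt and the true document private. The salt prevents brute-forcing
    the commitment from candidate documents."""
    salt = secrets.token_bytes(32)
    digest = hashlib.sha256(salt + document).hexdigest()
    return digest, salt

def verify(document: bytes, salt: bytes, digest: str) -> bool:
    """Later, reveal (document, salt) in private to a trusted party to
    prove the published commitment really was to this document."""
    return hashlib.sha256(salt + document).hexdigest() == digest
```

What this does not give you is the zero-knowledge part -- proving a paraphrasing "adjusts" to the committed document without revealing it -- which is where the heavyweight cryptography comes in.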
Wouldn't comparing two downloads immediately reveal whether the files are watermarked, though? Especially sentence-level or other steganographic watermarks embedded in the text itself should show up pretty clearly in a simple comparison.
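At the text level that comparison really is a trivial positional diff; a toy sketch:

```python
def watermark_positions(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Compare two downloads line by line; any line that differs is a
    candidate watermark site. This only catches text-level marks --
    image- or layout-level watermarks need a visual diff instead."""
    diffs = []
    for a, b in zip(copy_a.splitlines(), copy_b.splitlines()):
        if a != b:
            diffs.append((a, b))
    return diffs
```

The catch is that detecting the watermark sites still doesn't tell you which variant is "clean": picking either side at every difference just produces some recipient's fingerprint (or a mix implicating both).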
Heh, I've seen this a bunch of times and it's of interest to me, but honestly? It's sooooo limiting by being an interface without a complementary command line tool. Like, I'd like to put this into some workflows but it doesn't really make sense to without using something like pyautogui. But maybe I'm missing something hidden in the documentation.
There is indeed a dangerzone-cli tool¹, and it should be made more visible. We plan on updating/consolidating our docs in the foreseeable future, to make things clearer.
Also, there are plans to make it possible to use Dangerzone as a library, which should help with use cases like the one you mention.
¹ https://github.com/freedomofpress/dangerzone/blob/main/dange...
Shameless self promotion: preview.ninja is a site I built that does this and supports 300+ file formats. I'm currently weekend coding version 2.0 which will support 500+ formats and allow direct data extraction in addition to safe viewing.
It is a passion project and will always be free, because commercial CDR [1] solutions are insanely expensive and everyone should have access to the tools to compute securely.
1. https://en.wikipedia.org/wiki/Content_Disarm_%26_Reconstruct...
For some reason, printing 1 page of an Excel or Word document to a PDF often gets up to around 4MB in size. Passing it through this compresses it quite well. Just ran a quick test:
- 1-page Excel PDF export: 3.7MB
- Processing with Dangerzone (OCR enabled): 131KB
That's something I do from time to time as well. AFAIK Google Drive renders all documents server-side (which implicitly means they don't trust the browser sandbox), so the loss of privacy is the price you pay for that safety.
Dealing with sensitive documents though is another story, you just can't upload them to a third-party service. That's where projects like Dangerzone come into play.
Is there some reason why just viewing the PDF with a FLOSS, limited PDF viewer (e.g. atril) would not accomplish the same level of safety? What can a "dangerous PDF" do inside atril?
https://github.com/mate-desktop/atril
A crafted PDF can potentially exploit a bug in atril to compromise the recipient's computer, since writing memory-safe C is difficult. This approach was famously used by a malware vendor to exploit iMessage through a compressed image format that's part of the PDF standard:
https://projectzero.google/2021/12/a-deep-dive-into-nso-zero...
(Hi, disclaimer: I'm one of the current dangerzone maintainers)
That's a good question :-)
Opening PDFs, or images, or any other document directly inside your machine, even with a limited PDF viewer, potentially exposes your environment to this document.
The reason is that vulnerabilities in image/font/document parsing and rendering libraries happen and are exploited in the wild. These exploits can give an attacker access to the host's memory and, in the worst case, allow code execution.
Actually, that's the very threat Dangerzone is designed to protect you from.
We do that by performing the document-to-pixels conversion inside a hardened container that uses gVisor to reduce the attack surface.¹
One other way to think about it is to actually consider document rendering unsafe. The approach Dangerzone is taking is to make sure the environment doing the conversion is as unprivileged as possible.
In practice, an attack is still possible, but much more costly: an attacker will be required to do a container escape or find a bug in the Linux kernel/gVisor in addition to finding an exploit in document rendering tools.
Not impossible, but multiple times more difficult.
¹ We covered that in more detail in this article: https://dangerzone.rocks/news/2024-09-23-gvisor/
Is there any benefit to this tool over opening docs in a Windows Sandbox/VM with networking disabled? Conversion can easily be done with a simple tool that screenshots each page within the sandbox (for example, with a few lines of AHK script).
(Hi, disclaimer: I'm one of the current dangerzone maintainers)
You are correct: that's basically what Dangerzone is doing!
The challenges for us are keeping the sandbox secure over time and making it possible for non-tech folks (e.g. journalists) to run this easily on their machines.
About the sandbox:
- Making sure it stays updated requires some work: testing new container images, and having a way to distribute them securely to the host machines;
- In addition to running in a container, we reduce the attack surface by using gVisor¹;
- We pass a few flags to the Docker/Podman invocation, effectively blocking network access and restricting the allowed system calls;
Also, in our case the sandbox doesn't mount the host filesystem in any way, and we stream back pixels, which are then written to a PDF by the host (we're also currently considering adding an option to write back images instead).
The other part of the work is to make that easily accessible to non-tech folks. That means packaging Podman on macOS/Windows, and providing an interface that works on all major OSes.
¹ https://dangerzone.rocks/news/2024-09-23-gvisor/
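This is not Dangerzone's actual invocation, but the kind of flags described above might look like the following sketch (the image name and seccomp profile path are placeholders of mine):

```python
def conversion_command(image: str) -> list[str]:
    """Illustrative lockdown flags for a conversion container:
    no network, immutable root, no Linux capabilities, and a
    seccomp profile to shrink the set of allowed system calls."""
    return [
        "podman", "run", "--rm",
        "--network=none",                           # no network access at all
        "--read-only",                              # immutable root filesystem
        "--cap-drop=all",                           # drop every capability
        "--security-opt", "seccomp=seccomp.json",   # syscall allowlist
        image,
    ]
```

Each flag removes a class of post-exploitation options: even if the renderer is compromised, the payload can't phone home, persist to disk, or reach most syscalls.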
I'd rather have 2 minimal (headless, no network, etc) virtual machines. One runs pandoc for the conversion and the other runs ghostscript on the result. Nowadays you can let a web browser run pretty much anything so you don't need to build a vm image anymore.
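A sketch of that two-stage pipeline (file names are placeholders; in the setup described, each command would run inside its own minimal VM):

```python
def pipeline_commands(src: str) -> list[list[str]]:
    """Two-stage sanitize: pandoc re-renders the untrusted source to a
    fresh PDF (so pandoc, not the viewer, absorbs any parser exploit),
    then Ghostscript fully re-interprets and re-emits that PDF."""
    return [
        ["pandoc", src, "-o", "stage1.pdf"],
        ["gs", "-sDEVICE=pdfwrite", "-o", "clean.pdf", "stage1.pdf"],
    ]
```

The design point is that a payload would have to exploit both independent parsers, in two separate VMs, to reach the host.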
Could we make a method to sanitize PDFs that preserves the metadata?
It would be better to strip active content like JavaScript and actions without flattening the PDF and losing all the text data; having the original text is better than sending it through OCR again.
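A deliberately crude sketch of that stripping idea, pattern-matching raw bytes for the catalog's `/OpenAction` entry and `/JavaScript` names (a real tool should walk the object tree with a proper PDF library such as qpdf or pikepdf instead -- this regex approach misses inline action dictionaries, compressed object streams, and more):

```python
import re

def strip_active_content(pdf_bytes: bytes) -> bytes:
    """Blank out the auto-run action reference and rename /JavaScript
    entries so viewers no longer resolve them, leaving the text layer
    and structure otherwise untouched."""
    # Remove indirect /OpenAction references like "/OpenAction 5 0 R".
    pdf_bytes = re.sub(rb"/OpenAction\s+\d+\s+\d+\s+R", b"", pdf_bytes)
    # Rename the JavaScript name so the entries become inert.
    pdf_bytes = pdf_bytes.replace(b"/JavaScript", b"/Disabled")
    return pdf_bytes
```

The hard part, and the reason Dangerzone flattens to pixels instead, is that "strip the dangerous parts" requires enumerating every dangerous part of a very large spec, while "rasterize everything" does not.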
To review documents received from a hostile and dishonest actor in litigation, I used disposable VMs in Qubes on a computer with a one-way (in only) network connection [1], while running the tools (e.g. evince) under valgrind and with another terminal watching for attempted network traffic (an approach that did detect attempted network callbacks from some documents, though I don't think any were PDFs).
This would have been useful-- but I think I would have layered it on top of other isolation.
([1] constructed from a media converter pair, a fiber splitter to bring the link up on the tx side, and some off the shelf software for multicast file distribution).
normie3000|1 month ago
Isn't this what newspapers do?
jevinskie|1 month ago
https://github.com/caradoc-org/caradoc
http://spw16.langsec.org/slides/guillaume-endignoux-slides.p...
tclancy|1 month ago
How hard did you look the other times?
crazygringo|1 month ago
It doesn't seem to be meant for usage at scale -- it's not for general-purpose conversion, as the resulting files are huge, will have OCR errors, etc.
crazygringo|1 month ago
The size is probably font embedding.
And then the OCR will probably not be 100% correct if you ever intend to copy-paste from it.
boston_clone|1 month ago
I imagine that folks like journalists could have that type of attack in their threat model, and EFF already do a lot of great stuff in this space :)
0. https://isc.sans.edu/diary/31998
1. https://www.cloudflare.com/cloudforce-one/research/svgs-the-...
rurban|1 month ago
The employment-readiness check: whether you can trust a company.