So happy someone is spending time on this issue. It's like a breath of fresh air and intelligence in the midst of all the usual software (security/privacy/etc., take your pick) mayhem. It's worth reading https://reproducible-builds.org/ for a brief (re-)reminder of why this project is important.
Excerpt: "With reproducible builds, multiple parties can redo this process independently and ensure they all get exactly the same result. We can thus gain confidence that a distributed binary code is indeed coming from a given source code."
Reproducible builds are extremely useful, and there are more benefits. For example, suppose you have a build server compiling software packages. If your builds are not reproducible and you want to debug a core dump but have no debug information, you are out of luck (well, you could dive into the assembly code, but that's inconvenient). If you want to keep debug information, you need to store it for every single build (what a waste of storage...) because the binary for each build is different. Not so with reproducible builds: you can simply check out the old version and compile it with debug information!
Beyond providing security, reproducible builds also provide an important ingredient for caching build artifacts (and thus accelerating build times) across CI and developer machines. They also can form the basis of a much simpler deploy and update pipeline, where the version of source code deployed is no longer as important. Instead a simple (recursive) binary diff can identify which components of a system must be updated, and which have not changed since the last deploy. This means a simpler state machine with fewer edge cases that works more quickly and reliably than the alternative.
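To make the "recursive binary diff" idea concrete, here is a minimal sketch in Python. It assumes each deploy is available as a directory tree on disk; the function names `tree_digests` and `changed_components` are made up for illustration and are not taken from any real deploy tool. With reproducible builds, an unchanged component hashes identically in both trees and can be skipped.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_components(old_root, new_root):
    """Return the sorted relative paths that were added, removed, or modified."""
    old = tree_digests(old_root)
    new = tree_digests(new_root)
    # A path counts as changed if it is missing on one side or its digest differs.
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Only the paths returned by `changed_components(previous_deploy, new_build)` would need to be shipped. This only pays off if the build is reproducible, since otherwise every file tends to differ on every build.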
I'm very grateful for the work that this project has done and continues to do. Thank you!
Amazing work. Thanks so much to everyone who's contributing. The upstream bugs filed are especially appreciated since they make the whole Linux ecosystem more solid, not just Debian.
Does anyone know if they've made the Packages file (repository metadata file, listing the packages in the repository) build reproducibly yet?
I tripped over this a couple weeks ago and was both amused and annoyed, since it seemed that packages were being listed in the file in a random order. I'm asking here because it might already be fixed; we're using a slightly old version of the package/repository tools.
Guix and Nix are input-reproducible: given the same input description (the input being the source files and any dependencies), an output comes out. Builds are then looked up in a build cache based on the hash of all the combined inputs. However, the _output_ of a Nix build is not guaranteed to be reproducible. Running the same input twice may yield a different result.
Nix does some tricks to improve output reproducibility, like building things in sandboxes with a fixed time and using tarballs without modification dates, but bit-by-bit reproducible output is not their goal. They also don't have the manpower for this.
Currently, a build is built by a trusted build server for which you have the public key. You look up the build by input hash but have no way to check whether the thing the build server is serving is legit. It's fully based on trust.
However, with Debian putting so much effort into reproducible output, Nix can benefit too. In the future, we would like to get rid of the 'trust-based' build servers and instead move to a consensus model. Say, if 3 servers give the same output hash for a given input hash, then we trust that download and avoid a compile from source. If you still don't trust it, you can build from source yourself and check if the source is trustworthy.
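A rough sketch of what such a consensus check could look like, in Python. Nothing like this exists in Nix as far as this comment says; the function name `consensus_output_hash`, the shape of `reports`, and the tie-handling policy are all assumptions made for illustration.

```python
from collections import Counter


def consensus_output_hash(reports, quorum=3):
    """Return the output hash that at least `quorum` builders agree on, else None.

    `reports` maps a builder name to the output hash that builder claims
    for one particular input hash (hypothetical data shape).
    """
    counts = Counter(reports.values())
    winners = [h for h, votes in counts.items() if votes >= quorum]
    # If two different hashes both reach the quorum, something is wrong:
    # refuse to pick one and let the caller fall back to building from source.
    return winners[0] if len(winners) == 1 else None
```

For example, `consensus_output_hash({"a": "h1", "b": "h1", "c": "h1", "d": "h2"})` returns `"h1"`, while two agreeing builders out of three return `None`. Whether a single dissenting builder should block the download is a policy question this sketch does not settle.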
Summary: Nix does not do bit-by-bit reproducibility, but we benefit greatly from the work that Debian is doing. In the future we will look at setting up infrastructure for build servers with an output-hash-based trust model instead of an input-based one. However, this will take time.
What does "reproducibility" mean? I understand and appreciate the importance of reproducibility in the context of scientific experiments, but I don't understand what it means in terms of computer programs. I am guessing it has to do with being able to build on different architectures without issue?
In the context of "reproducible builds", it means that if you compile the same source code with the same compiler and build system, the output will be completely identical, bit by bit. This is surprisingly hard to achieve in practice.
Once they have reproducible builds, they can easily prove that each binary package was built from the corresponding source code package: just have a third party compile the source code again and generate the binary package, and it should be identical (except for the signature). This reduces the need to trust that the build machines haven't been compromised.
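As a sketch of that third-party check, assuming you already have the distributed package and your own rebuild as two files on disk, with any signature detached or stripped beforehand (the helper names `sha256_of` and `rebuild_matches` are illustrative only):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large packages need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rebuild_matches(distributed_path, rebuilt_path):
    """True if the independently rebuilt artifact is bit-for-bit identical."""
    return sha256_of(distributed_path) == sha256_of(rebuilt_path)
```

When this returns `False`, a tool like diffoscope (mentioned elsewhere in the thread) is what you would reach for to see where the two artifacts differ.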
It has a similar meaning to reproducibility in research. What it means is that you can independently reproduce (compile, in most cases) the same bit-for-bit identical binary from source code. While this might sound like something that should be trivial to do, it turns out to be far from trivial (timestamps and other environment information leak into binaries all the time).
There's a website that describes this project in much more detail as well as how they worked around the various problems they found. https://reproducible-builds.org/
In addition to what the sibling comments say, this is also important in research, which increasingly uses computers. If you provide a paper, source code, dataset and description of the system(s) used, can someone reproduce your research?
It would certainly be convenient if you could point to a version/snapshot of Debian (or another distribution), and it would then be possible to take your (say C) source code and compile and run the same binary used for the research.
It's true that often getting the algorithm more-or-less right is enough - but the more research is augmented by computing devices, the more important it becomes to maintain reproducibility - and the more complex and capable these computer devices (say a top-100 supercomputer, software stack in C++ on top of MPI, some Fortran numeric libraries etc) become - the harder it becomes to maintain it.
Imagine verifying research done today by repeating experiments in 50 years.
It has taken, and continues to take, a surprising amount of work to make two builds of a program produce the same output.
There are many sources of issues. For example: the date and time being stored in the output; a program's output being affected by the order in which files are read from a directory (which has no fixed ordering); hash tables keyed on pointers iterating in a different order on different executions, because objects end up stored at different addresses; parallel programs behaving differently on different runs; and others.
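Two of those sources, timestamps and directory read order, can be shown with a small Python sketch that packs a directory into a tar archive deterministically. This is only an illustration of the general technique (sort the entries, pin the metadata), not how Debian's tooling does it. The function name `deterministic_tar` is made up, and the sketch assumes a tree of regular files, ignoring symlinks and empty directories.

```python
import io
import os
import tarfile


def deterministic_tar(src_dir, mtime=0):
    """Return tar bytes that depend only on relative file names and contents."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for dirpath, dirnames, filenames in os.walk(src_dir):
            dirnames.sort()  # fix the order in which os.walk descends
            for name in sorted(filenames):  # fix the order within a directory
                full = os.path.join(dirpath, name)
                with open(full, "rb") as handle:
                    data = handle.read()
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                # A fresh TarInfo has uid/gid 0, empty owner names and mode 0644,
                # so nothing about the build machine or user leaks in.
                info = tarfile.TarInfo(arcname)
                info.size = len(data)
                info.mtime = mtime  # pinned instead of the file's real mtime
                archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

Two trees with the same files produce identical bytes regardless of when or in what order the files were created. Note that adding gzip compression on top would need its own care, since the gzip header carries a timestamp as well.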
It's about being able to reproduce the binary from source. You might think this is pretty much impossible in the Debian context, but things like timestamps, and underspecified dependencies can end up shifting a build's result over time.
If we want to insist that open source code is secure by virtue of source code analysis, we need a verifiable build chain, so that the code and binaries an analysis uses are the same as what we get later.
It means each time you build the same code in a known setup, you get bit for bit the same binary. That allows you to assure that the code that's shipped actually matches the source code.
It sounds trivial, but the full paths and timestamps that get added at multiple points in the process are enough to screw this up, and those are the easy problems.
I think it's for security. It means that there's a deterministic relationship between the source of a program and its final compiled artifacts.
If software has reproducible builds, that means third parties can independently verify that artifacts have been built safely from sources, without any sneaky stuff snuck in.
Once we have reproducible builds, will it be possible to have verifiable builds? As in, can we cryptographically show that source + compiler = binary?
Right now we can sign source code and we can sign binaries, but we can't show that a given source produced a given binary. I would feel much happier about installing code if I knew it was from a particular source or author.
What do you mean by "cryptographically show"? With reproducible builds, anyone that has repeated a build can verify that a claimed binary matches, and could sign a statement saying so. But I don't think there are solutions that don't include someone repeating the build, or a clear way of proving that you actually did repeat the build.
Yes. The first standard for securing computer systems mandated some protections against this. They were partly devised by Paul Karger, who invented the compiler subversion that Thompson wrote about a decade later. Most people focused on just that one thing, whereas Karger et al. went on to build systems that were secure from the ground up, with some surviving NSA pentesting and analysis for 2-5 years. Independently, people started building verified compilers and whole stacks with CPUs. They were initially simple with a lot to trust but got better over time. Recently, the two schools have been merging more. Mainstream INFOSEC and IT just ignore it all, slowly reinventing it piece by piece with knock-offs. It's hard, has a performance hit, or is built in something other than C, so they don't do it. (shrugs)
No. Or at least this doesn't provide that. I think in theory you could make a crypto compiler that proves the binary is equivalent to the source, but I suspect the verification effort wouldn't be much less than recompiling.
On NixOS, Python is patched so that if the environment variable DETERMINISTIC_BUILD is present, the interpreter sets the bytecode timestamps to 0. I suppose they did something similar.
One (misguided) counterargument I've heard from otherwise fantastic devs is the notion of adding randomness to unit tests in the hope that if there's a bug, at least some builds will fail. In practice, I've seen those builds, and developers saying "yeah, sometimes you need to build it twice".
I think the solution is to give those devs who favor such techniques a separate but easy to use fuzzing tool set that they can run just like their unit tests, separate from their usual 'build' command. Give them their ability to discover new bugs, but make it separate from the real build.
Why would the randomness in unit tests affect the binary? RNGs are invoked when the tests are run, not when they're built, and anyway, test code shouldn't be part of the final binary.
My take on that is that those devs are probably fooling themselves about the state of their test suite. I'm working on a codebase like that right now. It's generally in a relatively good state, but just this week I've run into a moderately nasty bug involving global state not getting properly reset, which the tests didn't cover because their specific execution order happened to work.
Or just seed the PRNG with some digest of the relevant source code (perhaps as simple as the version number). Doesn’t solve the problem that your tests can suddenly break on unrelated changes, but does solve the problem that your tests can break without any changes.
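A minimal sketch of that seeding idea in Python, assuming the version string is available to the test suite (the helper name `seeded_rng` is made up for illustration; a digest of the source files could be fed in the same way):

```python
import hashlib
import random


def seeded_rng(version):
    """Return a PRNG whose seed is derived from a version string."""
    digest = hashlib.sha256(version.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))
```

Tests would draw their "random" inputs from `seeded_rng("1.2.3")` instead of the global `random` module, so the same version always exercises the same inputs, and a new version explores new ones.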
Compare this to Windows or OSX, where not only are you unable to build packages yourself, but they are installed from downloads you find in disparate places on the web, are not cryptographically signed by people you can trust, and often include spyware anyway.
Has anyone played with the tool they mentioned, diffoscope? It sounds interesting, and I wonder how good it is at, for example, comparing Excel files with VBA code, formulas, etc.
[+] [-] dingdingdang|8 years ago|reply
[+] [-] Kenji|8 years ago|reply
[+] [-] adamb|8 years ago|reply
[+] [-] seagreen|8 years ago|reply
[+] [-] jbergstroem|8 years ago|reply
[+] [-] cperciva|8 years ago|reply
[+] [-] lamby|8 years ago|reply
[+] [-] jwilk|8 years ago|reply
What does "build reproducibly" even mean in this context?
[+] [-] lamby|8 years ago|reply
[+] [-] phreack|8 years ago|reply
[+] [-] pmoriarty|8 years ago|reply
[+] [-] arianvanp|8 years ago|reply
Feel free to ask any other questions.
[+] [-] pen2l|8 years ago|reply
[+] [-] cesarb|8 years ago|reply
[+] [-] cyphar|8 years ago|reply
[+] [-] e12e|8 years ago|reply
[+] [-] CJefferson|8 years ago|reply
[+] [-] jldugger|8 years ago|reply
[+] [-] wongarsu|8 years ago|reply
[+] [-] richdougherty|8 years ago|reply
[+] [-] detaro|8 years ago|reply
[+] [-] morecoffee|8 years ago|reply
[+] [-] detaro|8 years ago|reply
[+] [-] nickpsecurity|8 years ago|reply
Here's several examples:
VLISP for Scheme48 whose papers are here: https://en.wikipedia.org/wiki/PreScheme
C0 compiler + whole stack correctness in Verisoft http://www.verisoft.de/VerisoftRepository.html
CompCert Compiler for C http://compcert.inria.fr/
CakeML Subset of Standard ML https://cakeml.org/
Rockwell-Collins doing crypto DSL compiled to verified CPU http://www.ccs.neu.edu/home/pete/acl206/slides/hardin.pdf
Karger's original paper with the attack from 1970's: https://www.acsac.org/2002/papers/classic-multics.pdf
Myer's landmark work on subversion in high-assurance security from 1980: http://csrc.nist.gov/publications/history/myer80.pdf
My framework I developed studying Karger back when I was building secure things:
http://pastebin.com/y3PufJ0V
[+] [-] tedunangst|8 years ago|reply
[+] [-] zcdziura|8 years ago|reply
[+] [-] gtt|8 years ago|reply
[+] [-] anonacct37|8 years ago|reply
[+] [-] rnhmjoj|8 years ago|reply
[+] [-] mapreri|8 years ago|reply
[+] [-] JoshTriplett|8 years ago|reply
https://reproducible-builds.org/specs/source-date-epoch/
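The specification linked above standardizes an environment variable, SOURCE_DATE_EPOCH, holding a Unix timestamp that build tools should use in place of the current time. A minimal sketch of honoring it from Python (the helper name `build_timestamp` and its `environ` parameter are made up for illustration):

```python
import os
import time


def build_timestamp(environ=None):
    """Return SOURCE_DATE_EPOCH as an int if it is set, else the current time."""
    environ = os.environ if environ is None else environ
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())
```

A tool that embeds a build date would then format `time.gmtime(build_timestamp())` rather than calling `time.time()` directly, using UTC so that the builder's timezone does not leak in either.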
[+] [-] mabbo|8 years ago|reply
[+] [-] closeparen|8 years ago|reply
[+] [-] regularfry|8 years ago|reply
[+] [-] vfaronov|8 years ago|reply
[+] [-] kobeya|8 years ago|reply
[+] [-] sitkack|8 years ago|reply
[+] [-] Sir_Cmpwn|8 years ago|reply
[+] [-] Cogito|8 years ago|reply