The Debian group is admirable and has positively changed the standards for OS design several times. Reminds me I should donate to their coffee fund around tax time =3
I don't get how someone achieves reproducibility of builds: what about file metadata like creation/modification timestamps? Do they forge them? Or is that data treated as not important enough (i.e., two files with different metadata but identical contents should have the same checksum when hashed)?
There's lots of info on the Debian site about their reproducibility efforts, and there's a story from 2024's DebConf that may be of interest: https://lwn.net/Articles/985739/
Since the build is reproducible, it should not matter when it was built. If you want to trace a build back to its source, there are much better ways than a timestamp.
The hard things involve things like unstable hash orderings, non-sorted filesystem listing, parallel execution, address-space randomization, ...
Yes. All archive entries and date source code macros and any other timestamps are set to a standardized date (in the past).
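For illustration, a minimal sketch of that convention (the SOURCE_DATE_EPOCH environment variable, standardized at https://reproducible-builds.org/specs/source-date-epoch/) applied to a toy tar-packing step; real packaging tools have the equivalent built in:

    import os
    import tarfile

    def deterministic_tar(output_path, input_paths):
        # Clamp all timestamps to SOURCE_DATE_EPOCH (0 here if unset, for the demo).
        epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
        # Plain "w" mode: a gzip layer would embed its own timestamp in its header.
        with tarfile.open(output_path, "w") as tar:
            for path in sorted(input_paths):   # fixed order, not filesystem order
                info = tar.gettarinfo(path, arcname=os.path.basename(path))
                info.mtime = epoch             # the "forged" timestamp
                info.uid = info.gid = 0        # drop builder-specific ownership
                info.uname = info.gname = ""
                with open(path, "rb") as f:
                    tar.addfile(info, f)

Given the same inputs and the same SOURCE_DATE_EPOCH, two runs of this produce byte-identical archives regardless of when or where they run.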
Maybe a dumb question, but why would this change the reproducibility? If you clone a git repo, do you not get the metadata as it is stored in git? Or would the files have the modification date of the cloning? I never actually checked that.
Those aren't needed to generate a hash of a file. And that metadata isn't part of the file itself (or at least doesn't need to be); it's part of the filesystem or OS.
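A quick sketch to make that concrete: hashing reads only the file's bytes, so changing the filesystem metadata leaves the digest unchanged (the filename below is just illustrative):

    import hashlib
    import os

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Assumes some file "example.bin" exists.
    before = sha256_of("example.bin")
    os.utime("example.bin", (0, 0))    # rewrite the file's atime/mtime metadata
    after = sha256_of("example.bin")
    assert before == after             # contents unchanged, so the hash is unchanged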
> ... what about files metadata like creation/modification timestamps? Do they forge them?
That's among the least difficult problems to solve for reproducible builds, but yes.
The real question is: why, in the past, was an entire ecosystem created where non-determinism was the norm and everybody thought it was somehow ok?
Instead of asking "how does one achieve reproducibility?" we may wonder "why did people go out of their way to make sure something as simple as a timestamp would screw determinism?".
For that's the anti-security mindset we have to fight. And Debian did.
It's my understanding that this is about generating the .iso file from the .deb files, not about generating the .deb files from source. Generating a .deb from source in a reproducible way is still a work in progress.
Is the build infrastructure for Debian also reproducible? It seems like if someone wants to inject malware into Debian package binaries (without injecting it into the source), they have to target the build infrastructure (compilers, linkers, and whatever wrapper code is written around them).
Also, is someone else compiling these images, so we have evidence that the Debian build servers were not compromised?
I think there's also a similar thing for the images, but I might be wrong and I definitely don't have the link handy at the moment.
There's lots of documentation about all of the things on Debian's site at the links in the brief. And LWN also had a story last year about Holger Levsen's talk on the topic from DebConf: https://lwn.net/Articles/985739/
You must ultimately root trust in some set of binaries and any hardware that you use.
Reproducible: If Alice and Bob both download and compile the same source code, Alice's binary is byte-for-byte identical to Bob's binary.
Normal: Before Debian's initiative to handle this problem, most people didn't think hard about all the ways system-specific differences might wind up in binaries. For example: __DATE__ and __TIME__ macros in C; parallel builds finishing in different order; anything that produces a tar file (or zip, etc.) usually by default asks the OS for the input files' modification times and puts them into the bytes of the archive; and filesystems may list files in a directory in different orders, which may also get preserved in tar/zip files or other places...
Why it's important: With reproducible builds, anyone can check the official binaries of Debian match the source code. This means going forward, any bad actors who want to sneak backdoors or other malware into Debian will have to find a way to put it in the source code, where it will be easier for people to spot.
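The Alice-and-Bob test above can be mechanized: build twice (or on two machines) and compare digests. A rough sketch, where the build command and artifact path are hypothetical placeholders:

    import hashlib
    import subprocess

    def build_and_hash():
        # Hypothetical build command and artifact path; substitute your own.
        subprocess.run(["make", "clean", "all"], check=True)
        h = hashlib.sha256()
        with open("out/image.iso", "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # A reproducible build yields the same digest on every run, and on Alice's
    # and Bob's machines too, given the same source and toolchain.
    assert build_and_hash() == build_and_hash(), "build is not reproducible"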
Open source means "you can see the code for what you run". Except... how do you know that your executables were actually built from that code? You either trust your distro, or you build it yourself, which can be a hassle.
Now that the build is reproducible, you don't need to trust your distro alone. It's always exactly the same binary, which means it'll have one correct sha256sum. You can have 10 other trusted entities build the same binary with the same code and publish a signature of that sha256sum, confirming they got the same thing. You can check all ten of those. The likelihood that 10 different entities are colluding to lie to you is a lot lower than just your distro lying to you.
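A sketch of what that cross-check might look like, with hypothetical attester URLs, and skipping the signature verification a real setup would also do:

    import hashlib
    import urllib.request

    # Hypothetical endpoints where independent builders publish the checksum
    # each of them computed for the same release.
    ATTESTERS = [
        "https://builder-one.example/debian-live.sha256",
        "https://builder-two.example/debian-live.sha256",
    ]

    def local_sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    mine = local_sha256("debian-live.iso")
    for url in ATTESTERS:
        published = urllib.request.urlopen(url).read().decode().strip()
        assert published == mine, f"mismatch reported by {url}"
    print("all attesters agree with the local build")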
Reproducible builds actually solve a lot of problems. (Whether these are real problems, who really knows, but people spend a lot of money to solve them.)
At my last job, some team spent forever making our software build in a special federal government build cluster for federal government customers. (Apparently a requirement for everything now? I didn't go to those meetings.) They couldn't just pull our Docker images from Docker Hub; the container had to be assembled on their infrastructure. Meanwhile, our builds were reproducible and required no external dependencies other than Bazel, so you could git checkout our release branch, "bazel build //oci" and verify that the sha256 of the containers is identical to what's on Docker Hub. No special infrastructure necessary. It even works across architectures and platforms, so while our CI machines were linux / x86_64, you can build on your darwin / aarch64 laptop and get the exact same bytes, every time.
In a world where everything is reproducible, you don't need special computers to do secure builds. You can just build on a bunch of normal computers and verify that they all generate the same bytes. That's neat!
(I'll also note that the government's requirements made no sense. The way the build ended up working was that our CI system built the binaries, and then the binaries were sent to the special cluster, where a special Dockerfile assembled the binaries into the image that the customers would use. As far as I can tell, this offers no guarantee that the code we said was in the image was in the image, but it checked their checkbox. I don't see that stuff getting any better over the next 4 years, so...)
https://wiki.debian.org/ReproducibleBuilds/About
It validates that publicly available downloads aren't different from what is claimed.
It's a link in a chain that allows you to trust programs you run.
- At the start of the chain, developers write software they claim is secure. But very few people trust the word of just one developer.
- Over time other developers look at the code and also pronounce it secure. Once enough independent developers from different countries and backgrounds do this, people start to believe it really is secure. As a measure of security this isn't perfect, but it is verifiable and measurable in the sense that more is always better, so if you set the bar very high you can be very confident.
- Somebody takes that code, goes through a complex process to produce a binary, releases it, and pronounces it is secure because it is only based on code that you trust, because of the process above. You should not believe this. That somebody could have introduced malicious code and you would never know.
- Therefore, before reproducible builds, your only way to get a binary you knew was built from code you had some level of trust in was to build it yourself. But most people can't do that, so they have to trust Debian, Google, Apple, Microsoft, or whoever, that no backdoors have been added. Maybe people do place their faith in those companies, but it is misplaced. It's misplaced because countries like Australia have laws that allow them to compel such companies to silently introduce malicious code and distribute it to you. Australia's law is called the "Assistance and Access Bill (2018)". Countries don't introduce such laws for no reason. It's almost certain it is being used now.
- But now the build can be reproducible. That means many developers can obtain the same trusted source code from the place the original builder claimed to have used, build the binary themselves, and verify it is identical to the original, publicly validating the claim. Once enough independent developers from different countries and backgrounds do this, people start to believe it really was built from the trusted sources.
- Ergo, reproducible builds allow everyone, as opposed to just software developers, to run binaries they can be very confident were built just from code that has some measurable and verifiable level of trustworthiness.
It's a remarkable achievement for other reasons too. Although the ideas behind reproducible builds are very simple, it turned out executing them was about as simple as other straightforward ideas like "let's put a man on the moon". It seems building something as complex as an entire OS was beyond any company, or capitalism/socialism/communism, or a country. It's the product of something we've only seen arise in the last 40 years, open source, and it's been built by a bunch of idealistic volunteers who weren't paid to do it. To wit: it wasn't done by commercial organisations like Red Hat or Ubuntu. It was done by Debian. That said, other similar efforts have since arisen, like F-Droid, but they aren't on this scale.
Should be trivial to put in, if not. Install the package and maybe prepare some datasource hints while reproducing the image. Depends on where you'll be using it.
The trick will be in the details, as usual. User data that both does useful work... and plays nicely with immutability.
I suspect it would be more sensible to skip the gymnastics of trying to manicure something inherently resistant, and instead, lean in on reproducibility. Make it as you want it, skip the extra work.
Want another? Great - they're freely reproducible :)
I’m a noob to this subject. How can a build be non-reproducible? By that, I mean, what part of the build process could return non-deterministic output? Are people putting timestamps into the build and stuff like that?
Does anyone have any information as to how they modified their C code such that the compiler output was deterministic? I thought one of the hardest problems with an effort like this was writing your C such that the compiler would output everything in the same order (same bytes)? And I am not just talking about timestamps etc.
Pretty wild that we’re finally nailing reproducibility in Linux images after so many years—clearly a win for stability and consistency across the board.
A live image is an operating system image which you can boot from and use, vs. an install disk which can only install (there's no usable environment available).
A reproducible build means you can get the same source code and compile it, and it will be identical to the published image. This is important because otherwise you don't know if the published image actually used some other source code. If it used some other source code, the published image might have a backdoor, or something else that you can't find by reading the source code.
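Concretely, checking a downloaded image against a published SHA256SUMS file (the standard sha256sum output format of digest, two spaces, filename) might look like this sketch; the image name is only illustrative:

    import hashlib

    def verify(image_name, sums_path="SHA256SUMS"):
        h = hashlib.sha256()
        with open(image_name, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        # Each line of SHA256SUMS is "<hex digest>  <filename>".
        with open(sums_path) as sums:
            for line in sums:
                digest, _, name = line.strip().partition("  ")
                if name.lstrip("*") == image_name:
                    return digest == h.hexdigest()
        raise KeyError(f"{image_name} not listed in {sums_path}")

    print(verify("debian-live-amd64.iso"))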
This question should be at the top. I know HN tries to stay agnostic in its reporting of news, but they definitely fall on the wrong side of feature vs. benefit (as do most open-source authors), and plenty of folks will pass up this article completely ignorant of the benefit.
I never really understood the hype around reproducible builds. It seems to mostly be a vehicle to enable tivoization[0] while keeping users sufficiently calm. With reproducible builds, a vendor can prove to users that they did build $binary from $someopensourceproject, and then digitally sign the result so that it, and only it, would load and execute on the vendor-provided and/or vendor-controlled platform. But that still kills effective software freedom as long as I, the user, cannot do the same thing with my own build (whether it is unmodified or not) of $someopensourceproject.
Therefore, I side with Tavis Ormandy on this debate: https://web.archive.org/web/20210616083816/https://blog.cmpx...
[0]: https://en.wikipedia.org/wiki/Tivoization
Let's turn this around. Why would you ever want non-reproducible builds?
Every bit of nondeterminism in your binaries, even if it's just memory layout alone, might alter the behavior, i.e. break things on some builds, which is just really not desirable.
Why would you ever want builds from the same source to have potentially different performance, different output size or otherwise different behavior?
IMO tivoization is completely unrelated, because the vendor most certainly does not need reproducible builds in order to lock down a platform.
I'll happily agree higher degrees of "freedom" are an admirable goal, but this is just rudely shitting on a hard-earned achievement.
For me as a developer, reproducible builds are a boon during debugging because I can be sure that I have reproduced the build environment corresponding to an artifact (which is not trivial, particularly for more complex things like whole OS image builds which are common in the embedded world, for example) in the real world precisely when I need to troubleshoot something.
Then I can be sure that I only make the changes I intend to do when building upon this state (instead of, for example, "fixing" something by accident because the link order of something changed which changed the memory layout which hides a bug).
Tavis makes some good arguments, but since that post I've seen a couple real-world situations where reproducible builds are valuable.
One is where the upstream software developer wants to build and sign their software so that users know it came from them, but distributors also want to be the ones to build and sign the software so they know exactly what it is they are distributing. The most public example is F-Droid[1]. Reproducible builds allow both the software developer and the distributor to sign off on a single binary, giving users additional assurance that neither is sneaking something in. This is similar to the last example that Tavis gave, but shows that it is a workable process that provides real security benefit to the user, not just a hypothetical stretch.
The second is license enforcement. Companies that distribute (A/L)GPL software are required to distribute the exact source code that the binary was created from, along with the ability to compile and replace the software with a modified version (for GPLv3). However, a lot of companies are lazy about this and publish source code that doesn't include all their changes. A reproducible build demonstrates that the source they provided is what was used to create the binary. Of course, the lazy ones aren't going to go out of their way to create reproducible builds, but the more reproducible the upstream build system is, the fewer extraneous differences downstream builds should have. And it allows greater confidence in the good guys who are following the license.
And like others have said, I don't see the Tivoization argument at all. TiVo didn't have reproducible builds, and they Tivo'd their software just fine. At worst a reproducible build might pacify some security-minded folks who would otherwise object to Tivoization, but there will still be people who object to it out of the desire to modify the system.
[1] https://f-droid.org/docs/Reproducible_Builds/