An update on rust/coreutils

[+] StefanKarpinski|4 years ago|reply

An interesting aspect of this effort is that it undoubtesly makes modifying these tools much easier. While GNU and the FSF are all about legally ensuring the ability to read and modify their code, they have, in practice, made it almost impossible to do so for entirely technical reasons. I’m a pretty good C programmer and whenever I’ve looked at even simple GNU tools like `cat` or god forbid GNU libc’s `printf` code (this one is actually kind of delightful it’s so nuts), I have despaired of understanding, let alone modifying, the code. GNU coreutils are such a morass of preprocessor defines and bonkers C code to support every legacy system GNU has ever run on, that it might as well only be distributed in binary form. I think it would literally be easier to decompile most GNU tools and modify the decompiled platform-specific source than to make the same modification to the source as shipped. Rust has a steep learning curve, but learning Rust is a picnic compared to figuring out how to modify just one GNU tool, and of course they’re all different. Rust puts the onus of portability where it belongs: in the language and compiler, instead of forcing it on every single programmer and application.

[+] Anthony-G|4 years ago|reply

Thanks for sharing this perspective. A few years ago, I started learning C with the aim of contributing to GNU coreutils and other utilities that come with a GNU/Linux system. I read most of the K&R book and thought I had a decent grasp of the C language but when I cloned the coreutils source code, I couldn’t figure it out and gave up, thinking I’d need a lot more experience before I could understand production-quality code. It’s reassuring to find out it wasn’t just me that struggled with the complexity of the code base.

Since then, I’ve discovered (thanks to Hacker News) Decoded: GNU coreutils¹, a “resource for novice programmers exploring the design of command-line utilities” but unfortunately, I no longer have the capacity or free time to spend on coding.

¹ https://www.maizure.org/projects/decoded-gnu-coreutils/index...

[+] yjftsjthsd-h|4 years ago|reply

This sounds right on the edge of the traditional situation in rewrites where everyone says how awful the old code is, so they rewrite it, but then the new code doesn't handle edge cases (doesn't run on AIX/Illumos/NetBSD, yes(1) turns out to be really slow without that weirdness[0], file semantics have really weird edge cases that you only find after losing data), so the new code is adjusted to actually be as comprehensive as the original, at which point the new code is as ugly as the original. Of course, it depends on why the old code was ugly; if, as suggested below, some of it was being intentionally obtuse to avoid claims of copying, there might be room for actual improvement, and we do have more perspective and better tooling and new algorithms - real improvement is 100% possible, I would just be very cautious in assuming that based on an incomplete comparison.

[0] https://news.ycombinator.com/item?id=14542938

[+] melissalobos|4 years ago|reply

I have seen it argued many times that the GNU codebases are deliberately written to be unintelligible to avoid claims of being derived from UNIX sources. If they were written in the most straightforward ways, then they would have to show that no-one working on it has seen the original sources. I believe there is even a section in the glibc manual discussing this idea.

[+] est31|4 years ago|reply

This is an extremely amazing project because it's actually trying to come up with a replacement that can be used as such. Nobody is going to rewrite scripts to use exa instead of ls for example. The coreutils are also somewhat the "basis" of the shell userspace.

Outside of the test failures and missing features, the only reason one might not want the coreutils is the size.

Right now, cat on my system is 44K large, while the default output size for the cat executable in release mode is 4.4 megabytes in release mode. If you enable a bunch of options in Cargo.toml to reduce the default bloaty settings (lto = true, codegen-units = 1, strip = true, debug = false), you get that down to 876K. Which is still 20 times larger than the native cat. For true, it's a similar story. 40K vs 812K.

[+] lr1970|4 years ago|reply

Rust executables are statically linked by default. This makes them oftentimes substantially larger than their dynamically linked counterparts. For me it is not a deal breaker.

[+] ris|4 years ago|reply

My guess is people will start "solving" this kind of problem through multi-call binaries a la busybox.

[+] andai|4 years ago|reply

Is that a typical ratio for software ported to Rust? Where is all the extra size coming from?

[+] cogman10|4 years ago|reply

I believe what you are seeing there is the power of shared libraries.

[+] bsdetector|4 years ago|reply

It'd also be a great time to establish a better standard for using these programs with a shell or scripting.

For example, if ls had a --shell option that would write out file information as quoted eval'able variables, or even JSON, or anything that was easily and reliably parsable it would remove a huge portion of scripting headaches and errors.

[+] sylvestre|4 years ago|reply

I added some CI task and replayed the history to graph the evolution of the size: https://github.com/uutils/coreutils-tracking#binary-size-evo...

This will help making sure that we aren't regressing (more) ;)

[+] unknown|4 years ago|reply

[deleted]

[+] Aldo_MX|4 years ago|reply

  Instead of 30 to 60 patches per month, we jumped to 400 to 472
  patches every month. Similarly, we saw an increase in the
  number of contributors (20 to 50 per month from 3 to 8).

More contributors = more hands to fix issues and tune performance. I believe this is an absolute win.

[+] kkoncevicius|4 years ago|reply

According to the graph more than half of the tests are still failing. Can the speed be (to some extent) attributed to some missing functionality ?

[+] sylvestre|4 years ago|reply

Good question!

Probably not. In general, functionalities are enabling or disabling a behavior when doing an operation. In the code, it translates most of the time by a simple if else. For example, adding new options usually looks like this PR: https://github.com/uutils/coreutils/pull/2880/files

The performance wins are usually produced by using some fancy Rust features.

[+] tialaramex|4 years ago|reply

Slightly less than half, but yes. That could certainly be relevant. On the other hand inevitably some tests will be fragile or even outright wrong and so while your solution passes a different (possibly better/ faster) solution fails because the test is bad.

For example suppose you're implementing case-insensitive sort, you write a test and tweak it slightly so that it passes as you expected. I come along and write a slightly faster case-insensitive sort, and mine fails. Upon examining the test I discover it thinks I ought to sort (rat, doG, cat, DOG, dog, DOg) into (cat, doG, dog, DOG, DOg, rat) but I get (cat, doG, DOG, dog, DOg, rat) my answer seems, if anything better and certainly not wrong but it fails your test.

[+] cedilla|4 years ago|reply

It is possible, but with benchmarks like "head -n 1000000 wikidata.xml" I doubt it. A comment in that PR says "the difference to GNU head is mostly in user time, not in system time. I suspect this is due to GNU head not using SIMD to detect newlines".

Unfortunately I couldn't find a list of failed/successful tests, if that's available I'd be happy if someone linked it

[+] jacquesm|4 years ago|reply

An old mantra that has served me well: first make it work, then make it fast.

So first get it to 100% compatibility, then, and only then concentrate on performance. Because if you don't do it that way you will end up foregoing compatibility because you'd have to say goodbye to your beautifully tuned code that unfortunately can't be 100% compatible. As long as it does not pass all the tests: it does not work. Even if it works for some selected cases, the devil is in the details, and those details can eat up performance like there is no tomorrow.

[+] notdonspaulding|4 years ago|reply

I follow this advice with one small modification. Namely, mine goes:

    - Make it once.  
    - Make it right.  
    - Make it fast.

"Make it right" as the first step can trick an unseasoned developer into never finishing a prototype. I don't mind the first iteration of something being sloppy and then pursuing correctness with an existing solution to the problem in hand.

I'm curious if you've got a sense of the interplay between prototyping and correctness similar to your sense of the interplay between performance tuning and correctness? Any thoughts?

[+] jeppester|4 years ago|reply

I've seen this advice many times before.

While I do generally agree, it can easily become an excuse for not thinking things through from the start.

If you have the wrong architecture it might be very difficult or even impossible to optimize performance later on.

Also: If working on a project for a client, performance can be difficult to sell when the feature already works. But that feature might break when it's put under load.

My advice would be to always have performance in mind. But otherwise to stay away from "optimizations" until they are needed.

[+] sva_|4 years ago|reply

The coreutils consist of many individual binaries ('utils') which are more or less stand-alone[0], so I think this mantra doesn't apply here.

[0] https://github.com/uutils/coreutils/tree/main/src/uu

[+] pjmlp|4 years ago|reply

Now that is something I fully agree with.

[+] draw_down|4 years ago|reply

[deleted]

[+] MailNerd|4 years ago|reply

Great work! Looking forward to a Linux system where the majority of user land is written in a safe and performant language!

[+] pjmlp|4 years ago|reply

It is not the first effort to do so, hopefully this one has more wind to come through.

[+] gravypod|4 years ago|reply

If you needed to do low level systems programming in Rust would you be able to use this as an OS abstraction layer by importing these sorts of things: https://github.com/uutils/coreutils/blob/main/src/uu/chmod/s...

Is this one of the intentions of this team? It sounds like it could make "scripting" in rust very nice if all of the CLI functions you're used to exist as language libraries.

[+] terts|4 years ago|reply

That is a very interesting idea! It is not something we've discussed before as far as I know. For some utils (like chmod) it might be possible, but any util that is focused on outputting information currently does so by printing directly to stdout. So, ls doesn't give you a Vec or Iterator of listed files for instance, but instead prints them to stdout. Nevertheless, it would be a cool experiment to see if we can create something like a "ulib" crate that would provide the functions some functions for which it is possible.

[+] loeg|4 years ago|reply

That would be much more elegant than something like libxo.

[+] tpoacher|4 years ago|reply

I do wonder, to what extent is this "increase in speed" the result of:

- Refactoring out buggy, convoluted, highly-backwards-compliant code for cleaner more practical code? - Reimplementing code in a manner which simplifies previously complicated bottlenecks that were there in response to bug reports? (and whose simplification potentially risks reintroducing said bugs again)

In all honesty, I would expect that "reimplementing coreutils" as above would have resulted in a speedup even if it was written in c again.

Am I wrong? Is there something about rust that inherently leads to an increase in speed which one could not ever hope to obtain with clean, performant c code?

[+] StefanKarpinski|4 years ago|reply

Sometimes I would imagine (I’m no expert in Rust but do know a bit about compilers), Rust’s ability to guarantee unshared access to memory can probably enable optimizations that are hard to coax out of a C compiler.

Many libc functions are also much slower than they could be because of POSIX requirements and being a shared library. For example, libc’s fwrite, fread, etc. functions are threadsafe by default and acquire a globally shared lock even when you aren’t using threads (you can opt out, but it’s quite annoying and non-standard) which makes them horribly slow if you’re doing lots of small reads and writes. Because libc is a shared library, calls to its functions won’t get inlined, which can be a major performance issue as well. By comparison Rust’s read and write primitives don’t need to acquire a lock and can be inlined, so a small read or write (for example) could be just a couple of instructions instead of what a C program will do, which is a function call (maybe even an indirect one through the PLT) and then a lock acquisition, only to write a few bytes into a buffer. That’s a lot of overhead!

And finally, Rust’s promise of safe multithreading no doubt encourages programmers to write code that utilizes threads in situations where only the truly courageous would attempt it in C.

[+] agumonkey|4 years ago|reply

Anybody has stats on which coreutils programs are the most used ? I wonder the impact of gnu userland on system perf.

[+] sylvestre|4 years ago|reply

I don't have data but I guess ls, cp, mv, ln, chown, chmod, sort and cut are the most popular. Overall, I don't think it is a key part of the OS regular work. Most of the servers and system are busy with services, db, browsers, etc...

[+] hobofan|4 years ago|reply

Am I missing something, or how does a PR that bumps the version of an existing brew formula translate to "Brew is proposing coreutils"?

[+] The_rationalist|4 years ago|reply

How about making all ELF binaries symbol processing/retrieving faster? The gnu hashmap is very much obscolete and should be replaced by Swisstable. Key observations like this will keep being ignored for the decades to come.

[+] hiccuphippo|4 years ago|reply

If one were to replace coreutils in a GNU+Linux system with this, which is MIT licensed, would it be fair to call it MIT+Linux?

[+] wiz21c|4 years ago|reply

Yeah, I see the pun. But leaving the GPL license is a very bad idea in my opinion. If we have to rest son shoulder of giant, I'd prefer them being made of concrete, not dust. Coreutils are really a common and should be protected against appropriation.

Depending on how much a rewrite it is, GNU's intellectual property may be enforced at some point. But that's a tricky question for a lawyer. If it were me, I'd have asked GNU first.

[+] codeflo|4 years ago|reply

I also think “GNU+Linux” is a terrible name and a worse marketing move, but you might misunderstand what it’s about. There’s no “GNU license”, and nobody ever proposed calling the system “GPL+Linux”. GNU sees itself as a project to replace the entirety Unix (hence the name, GNU’s Not Unix). A kernel is one part of the full OS, so if you combine the Linux kernel with the rest of GNU’s OS, according to this logic you get GNU+Linux. (Or maybe it should be GNU-Hurd+Linux, or GNU=~s/Hurd/Linux/... it's terrible.)

[+] j16sdiz|4 years ago|reply

Iirc, most of the gnu coreutil came from its locale handling.

It supports lots of different legacy encoding.

[+] RcouF1uZ4gsC|4 years ago|reply

I can see this project really taking off.

It has two advantages on the GNU version apart from just memory safety with Rust.

1) Ease of portably building. Just type ‘cargo build —release’ and it will just work, even on Windows.

2) MIT License. A company can take these and distribute them as part of a commercial offering without having to worry about GPL compliance.

[+] trelane|4 years ago|reply

I'm not sure how point 2 is and advantage for anyone but the company tbh.

It's also going to be _really_ hard to be more portable than GNU coreutils, when it comes to platforms it's available on.

[+] cosmiccatnap|4 years ago|reply

[deleted]

98 comments