
Tracking developer build times to decide if the M3 MacBook is worth upgrading

653 points | paprikati | 2 years ago | incident.io | reply

408 comments

[+] Aurornis|2 years ago|reply
This is a great write-up and I love all the different ways they collected and analyzed data.

That said, it would have been much easier and more accurate to simply put the laptops side by side and run timed compilations of the exact same scenarios: a full build, an incremental build of a recent change set, an incremental build touching a module that must be rebuilt, and a couple more.

Or write a script that steps through the last 100 git commits, applies them incrementally, and does a timed incremental build to get a representation of incremental build times for actual code. It could be done in a day.
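A minimal sketch of that replay script (assuming Python, and assuming a Go repo where `go build ./...` is the incremental build command; adjust both for the codebase at hand):

    # Replay the last 100 commits, timing an incremental build at each step.
    import subprocess, time

    def run(*args):
        subprocess.run(args, check=True, capture_output=True)

    # Last 100 commits, oldest first.
    revs = subprocess.check_output(
        ["git", "rev-list", "--reverse", "-n", "100", "HEAD"],
        text=True).split()

    run("git", "checkout", revs[0])
    run("go", "build", "./...")  # warm the build cache

    for rev in revs[1:]:
        run("git", "checkout", rev)   # step forward one commit
        start = time.monotonic()
        run("go", "build", "./...")   # timed incremental build
        print(f"{rev[:8]} {time.monotonic() - start:.1f}s")

Run the same script on each laptop and compare the printed timings directly.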

Collecting company-wide stats leaves the door open to significant biases. The first that comes to mind is that newer employees will have M3 laptops while the oldest employees will be on M1 laptops. While not a strict ordering, newer employees (with their new M3 laptops) are more likely to be working on smaller changes while the more tenured employees might be deeper in the code or working in more complicated areas, doing things that require longer build times.

This is just one example of how the sampling isn’t truly as random and representative as it may seem.

So, cool analysis, and fun to see the way they've used various tools to analyze the data. But due to inherent biases in the sample set (older employees have older laptops, notably), I think anyone looking to answer these questions should start with the simpler method of benchmarking recent commits on each laptop before spending a lot of time architecting company-wide data collection.

[+] lawrjone|2 years ago|reply
I totally agree with your suggestion, and we (I am the author of this post) did spot-check the performance for a few common tasks first.

We ended up collecting all this data partly to compare machine-to-machine, but also because we want historical data on developer build times and a continual measure of how the builds are performing so we can catch regressions. We quite frequently tweak the architecture of our codebase to make builds more performant when we see the build times go up.
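(Purely as a hedged sketch of what such a regression watch could look like, not our actual implementation: compare the latest week's median against a trailing baseline. Column names and the 20% threshold here are invented.)

    import pandas as pd

    df = pd.read_csv("build_times.csv", parse_dates=["timestamp"])
    weekly = (df.set_index("timestamp")
                .resample("W")["build_seconds"].median())

    baseline = weekly.iloc[:-1].tail(4).mean()   # previous 4 weeks
    latest = weekly.iloc[-1]
    if latest > 1.2 * baseline:                  # 20% regression threshold
        print(f"Build-time regression: {latest:.1f}s vs baseline {baseline:.1f}s")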

Glad you enjoyed the post, though!

[+] nox101|2 years ago|reply
I didn't see any analysis of network builds as an alternative to M3s. For my project (~40 million lines), past a certain minimum threshold it doesn't matter how fast my machine is; it can't compete with the network build our infra team provides.

So sure, an M3 might make my build 30% faster than my M1 build, but the network build is 15x faster. Is it possible that, instead of giving the developers M3s, they should have invested in some kind of network build?

[+] dimask|2 years ago|reply
> This is a great write-up and I love all the different ways they collected and analyzed data.

> [..] due to inherent biases in the sample set [..]

But that is an analysis-methods issue. This serves as a reminder that one cannot depend on AI assistants on a topic one is not sufficiently knowledgeable about oneself. At least for the time being.

For one, as you point out, they conducted a t-test on data that were not independently sampled: multiple data points came from the same people, and there are very good reasons to believe that different people have tasks that are more or less compute-demanding, which confounds the data. That violates one of the fundamental assumptions of the t-test, and the code interpreter did not point it out. Instead, they could have modeled their data with what is called a "linear mixed-effects model", where the person a laptop belongs to (and possibly other things like seniority) goes into the model as a "random effect".
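(A hedged sketch of such a model in Python's statsmodels; the CSV and column names are hypothetical stand-ins for the post's export.)

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("build_times.csv")
    model = smf.mixedlm(
        "build_seconds ~ C(chip)",    # fixed effect: M1 / M2 / M3
        data=df,
        groups=df["developer"],       # random intercept per developer
    )
    print(model.fit().summary())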

Nevertheless, it is all quite interesting data. What I found most interesting is the RAM-related part: caching data can be very powerful, and more RAM brings bigger benefits than people usually realise. A laptop (or at least a MacBook) with more RAM than it strictly needs spends most of its time with the extra RAM filled by cache.

[+] j4yav|2 years ago|reply
I agree, it seems like they were trying to come up with the most expensive possible way to answer the question, for some reason. And why was the final recommendation to upgrade M1 users to the more expensive M3s when M2s were deemed sufficient?
[+] matthew-wegner|2 years ago|reply
I would think you'd want to capture what and how things were built, something like:

* Repo started at this commit

* With this diff applied

* Build was run with this command

Capture that for a week. Now you have a cross-section of real workloads, and you can repeat the builds on each hardware tier (and even on new hardware down the road).
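A sketch of what that capture could look like, as a wrapper around the build command (the field names and JSONL sink are assumptions):

    # Usage (hypothetical): python capture.py go build ./...
    import json, subprocess, sys, time

    def git(*args):
        return subprocess.check_output(["git", *args], text=True)

    record = {
        "commit": git("rev-parse", "HEAD").strip(),
        "diff": git("diff", "HEAD"),        # uncommitted changes
        "command": sys.argv[1:],            # the wrapped build command
    }
    start = time.monotonic()
    subprocess.run(sys.argv[1:], check=False)   # run the actual build
    record["seconds"] = time.monotonic() - start
    with open("build_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

Each record is then enough to reproduce the exact build on any other machine.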

[+] dash2|2 years ago|reply
As a scientist, I'm interested in how computer programmers work with data.

* They drew beautiful graphs!

* They used chatgpt to automate their analysis super-fast!

* ChatGPT punched out a reasonably sensible t test!

But:

* They had variation across both memory and chip type, but they never thought of using a linear regression (see the sketch below).

* They drew histograms, which are hard to compare. They could have supplemented them with simple means and error bars. (Or used cumulative distribution functions, where you can see if they overlap or one is shifted.)
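For instance, a sketch of that regression in statsmodels (CSV and column names hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("build_times.csv")
    # One coefficient per chip tier, plus one for RAM, on the same data.
    fit = smf.ols("build_seconds ~ C(chip) + ram_gb", data=df).fit()
    print(fit.summary())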

[+] dgacmu|2 years ago|reply
I'm glad you noted programmers; as a computer science researcher, my reaction was the same as yours. I don't think I ever used a CDF for data analysis until grad school (even with having had stats as a dual bio/cs undergrad).
[+] novok|2 years ago|reply
It's because that's usually the data scientist's job, and most eng infra teams don't have a data scientist and don't really need one most of the time.

Most of the time they deal with data the way their tools present it, which maps closely to most analytics, perf-analysis, and observability software suites.

Expecting the average software engineer to know what a CDF is is like expecting them to know 3D graphics basics like quaternions and writing shaders.

[+] azalemeth|2 years ago|reply
> ChatGPT punched out a reasonably sensible t test!

I think the distribution here is decidedly non-normal, and the difference in the medians may well also be of substantial interest -- I'd go for a Wilcoxon test to first order... or even some type of quantile regression. Honestly, the famous Jonckheere–Terpstra test for ordered medians would be _perfect_ for this bit of pseudoanalysis -- take the hypothesis that M3 > M2 > M1 and you're good to go, right?!

(Disclaimers apply!)
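(A sketch of the rank-based route: scipy's mannwhitneyu is the two-sample Wilcoxon rank-sum test. The lognormal arrays below are stand-ins for per-tier build times, not the post's data.)

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    m1 = rng.lognormal(3.5, 0.6, 500)   # stand-in: M1 build seconds
    m3 = rng.lognormal(3.2, 0.6, 500)   # stand-in: M3 build seconds

    # One-sided: are M3 builds stochastically faster than M1 builds?
    stat, p = mannwhitneyu(m3, m1, alternative="less")
    print(f"U = {stat:.0f}, p = {p:.2g}")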

[+] Herring|2 years ago|reply
>They drew histograms, which are hard to compare.

Note that in some places they used boxplots, which offer clearer comparisons. It would have been more effective to present all the data using boxplots.
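For what it's worth, side-by-side boxplots are a couple of lines in matplotlib (stand-in data here, not the post's export):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = [rng.lognormal(m, 0.6, 500) for m in (3.5, 3.3, 3.2)]

    plt.boxplot(data)                      # one box per chip tier
    plt.xticks([1, 2, 3], ["M1", "M2", "M3"])
    plt.ylabel("build seconds")
    plt.show()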

[+] tmoertel|2 years ago|reply
> They drew histograms, which are hard to compare.

Like you, I'd suggest empirical CDF plots for comparisons like these. Each distribution results in a curve, and the curves can be plotted together on the same graph for easy comparison. As an example, see the final plot on this page:

https://ggplot2.tidyverse.org/reference/stat_ecdf.html
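The same plot is easy without ggplot, too; a matplotlib sketch with stand-in data:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    samples = {
        "M1": rng.lognormal(3.5, 0.6, 500),  # stand-in build seconds
        "M3": rng.lognormal(3.2, 0.6, 500),
    }
    for label, xs in samples.items():
        xs = np.sort(xs)
        ys = np.arange(1, len(xs) + 1) / len(xs)   # empirical CDF
        plt.step(xs, ys, where="post", label=label)
    plt.xlabel("build seconds")
    plt.ylabel("fraction of builds <= x")
    plt.legend()
    plt.show()

If one curve sits entirely to the left of the other, that tier is faster across the whole distribution, not just on average.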

[+] mnming|2 years ago|reply
I think it's partly because the audience is often not familiar with those statistical details either.

Most people hate nuance when reading a data report.

[+] fallous|2 years ago|reply
I think you might want to add the caveat "young computer programmers." Some of us grew up in a time when we had to learn basic statistics and visualization to understand profiling at the "bare metal" level, and we carried that on throughout our careers.
[+] jxcl|2 years ago|reply
Yeah, I was looking at the histograms too, having trouble comparing them and thinking they were a strange choice for showing differences.
[+] NavinF|2 years ago|reply
> cumulative distribution functions, where you can see if they overlap or one is shifted

Why would this be preferred over a PDF? I've rarely seen CDF plots since high school, so I would have to convert the CDF into a PDF in my head to check whether the two distributions overlap or are shifted. CDFs are not a native representation for most people.

[+] LASR|2 years ago|reply
Solid analysis.

A word of warning from personal experience:

I am part of a medium-sized software company (2k employees). A few years ago, we wanted to improve dev productivity. Instead of going with new laptops, we decided to explore offloading the dev stack over to AWS boxes.

This turned out to be a multi-year project with a whole team of devs (~4) working on it full-time.

In hindsight, the tradeoff wasn't worth it. It's still way too difficult to replicate a fully-local dev experience with one that runs in the cloud.

So yeah, upgrade your laptops instead.

[+] tomaskafka|2 years ago|reply
My personal research for iOS development, taking the cost into consideration, concluded:

- The M2 Pro is nice, but the improvement over the 10-core (8 perf cores) M1 Pro is not that large (136 vs 120 s in the Xcode benchmark: https://github.com/devMEremenko/XcodeBenchmark)

- The M3 Pro is nerfed (only 6 perf cores) to better distinguish and sell the M3 Max; it's basically on par with the M2 Pro

So, in the end, I got a slightly used 10-core M1 Pro and am very happy, having spent less than half of what the base M3 Pro would cost for about 85% of its power (and also considering that you generally need a CPU at least 33-50% faster to even notice the difference :)).

[+] euos|2 years ago|reply
I am an ex-core contributor to Chromium and Node.js and a current core contributor to gRPC Core/C++.

I am never bothered by build times. There are "interactive" builds (incremental builds I use to rerun related unit tests as I work on code) and non-interactive builds (ones I launch and then go get coffee/read email). I have never seen a hardware refresh turn a non-interactive build into an interactive one.

My personal hardware (which I use now and then for a quick fix/code review) is a 5+ year old Intel i7 with 16GB of memory (I had to add 16GB when I realized that linking Node.js in WSL requires more memory).

My work laptop is an Intel MacBook Pro with a Touch Bar. I do not think it has any impact on my productivity. What matters is screen size and quality (e.g. resolution, contrast, and sharpness) and storage speed. The build system (e.g. the speed of incremental builds and support for distributed builds) has more impact than any CPU advances. I use Bazel for my personal projects.

[+] gizmo|2 years ago|reply
Somehow programmers have come to accept that a minuscule change in a single function, one that only results in a few bytes changing in the binary, takes forever to compile and link. Compilation and linking should be basically instantaneous, so fast that you don't even realize there is a compilation step at all.

Sure, release builds with whole program optimization and other fancy compiler techniques can take longer. That's fine. But the regular compile/debug/test loop can still be instant. For legacy reasons compilation in systems languages is unbelievably slow but it doesn't have to be this way.

[+] karolist|2 years ago|reply
I too came from Blaze and tried to use Bazel for my personal project, a dockerized backend + frontend. The build rules got weird and niche real quick, and I was spending so much time on the BUILD files that I began questioning the value over plain old Makefiles. This was 3 years ago; maybe the public ecosystem is better now.
[+] syntaxing|2 years ago|reply
Aren't the M-series screens and storage speeds significantly superior to your Intel MBP's? I transitioned from an Intel MBP to an M1 for work and the screen was significantly better (not sure about storage speed; our builds all run on a remote dev machine that is stacked).
[+] steeve|2 years ago|reply
This is because you’ve been spoiled by Bazel. As was I.
[+] Kwpolska|2 years ago|reply
Chromium is a massive project. In more normally-sized projects, you can build everything on your laptop in reasonable time.
[+] dilyevsky|2 years ago|reply
Bazel is significantly faster on M1 than on an i7, even when it isn't recompiling the protobuf compiler code (which it still attempts to do regularly).
[+] white_dragon88|2 years ago|reply
A 5+ year old i7 is a potato and would be a massive time waster today. Build times matter.
[+] drgo|2 years ago|reply
To people who are thinking about using AI for data analyses like the one described in the article:

- I think it is much easier to just load the data into R, Stata, etc. and interrogate it that way. The commands will be shorter, more precise, and, most importantly, more reproducible (see the sketch at the end of this comment).

- the most difficult task in data analysis is understanding the data and the mechanisms that have generated it. For that you will need a causal model of the problem domain. Not sure that AI is capable of building useful causal models unless they were somehow first trained using other data from the domain.

- it is impossible to reasonably interpret the data without reference to that model. I wonder if current AI models are capable of doing that, e.g., can they detect confounding or oversized influence of outliers or interesting effect modifiers.

Perhaps someone who knows more than I do about the current state of the technology can provide a better assessment of where we are in this effort.
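A minimal sketch of that kind of direct interrogation (shown in pandas rather than R; the CSV and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("build_times.csv")
    # Per-chip summary: count, mean, spread, and tail percentiles.
    print(df.groupby("chip")["build_seconds"]
            .describe(percentiles=[0.5, 0.9, 0.99]))

The point being that this one reproducible snippet replaces an entire back-and-forth chat session.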

[+] mcny|2 years ago|reply
> All developers work with a fully fledged incident.io environment locally on their laptops: it allows for a <30s feedback loop between changing code and running it, which is a key factor in how productively you can work with our codebase.

This, to me, is the biggest accomplishment. I've never worked at a company (besides brief stints helping out some startups) where I have been able to run a dev/local instance of the whole company on a single machine.

There's always this thing, or that, or the other that is not accessible. There's always a gotcha.

[+] lawrjone|2 years ago|reply
Author here, thanks for posting!

Lots of stuff in this from profiling Go compilations, building a hot-reloader, using AI to analyse the build dataset, etc.

We concluded that it was worth upgrading the M1s to an M3 Pro (the Max didn't make much of a difference in our tests), but the M2s are pretty close to the M3s, so those were not (for us) worth upgrading.

Happy to answer any questions if people have them.

[+] aranke|2 years ago|reply
Hi,

Thanks for the detailed analysis. I’m wondering if you factored in the cost of engineering time invested in this analysis, and how that affects the payback time (if at all).

Thanks!

[+] hedgehog|2 years ago|reply
I'm curious how you came to the conclusion that the Max SKUs aren't much faster. The distributions in the charts make them look faster, but the text below just says they look the same.
[+] firecall|2 years ago|reply
So can we assume the M3 Max offered little benefit because the workloads couldn’t use the cores?

Or the tasks maybe finished so fast that it didn’t make a difference in real world usage?

[+] afro88|2 years ago|reply
Great analysis! Thanks for writing it up and sharing.

Logistical question: did management move some deliverables out of the way to give you room to do this, or was it extracurricular?

[+] BlueToth|2 years ago|reply
Hi, thanks for the interesting comparison. What I would like to see added is a build on an 8GB-memory machine (if you have one available).
[+] ribit|2 years ago|reply
Interesting idea, but the quality of the data analysis is rather poor IMO, and I'm not sure they are actually learning what they think they are learning. Most importantly, I don't understand why they would see such a dramatic increase in sub-20s build times going from M1 Pro to M2 Pro; the real-world performance delta between the two on code compilation workloads is around 20-25%. It also makes little sense to me that M3 machines have fewer sub-20s builds than M2 machines, or that the M3 Pro, with half the cores, has more sub-20s builds than the M3 Max.

I suspect there may be considerable differences in developer behavior behind these results, such as people with different types of laptops typically working on different things.

And a few random observations after a very cursory reading (I might be missing something):

- The Go compiler seems to take little advantage of additional cores

- They are pooling data in ways that make me fundamentally uncomfortable

- They are not consistent in their comparisons: sometimes they use histograms, sometimes binned density plots (with different y-axis ranges); it's really unclear what is going on here...

- Macs do not throttle CPU performance on battery. If the builds really are slower on battery (which, looking at the graphs, I am not convinced of btw), it will be because the "Low Power" setting was activated.

[+] epolanski|2 years ago|reply
> also makes little sense to me that M3 machines have fewer sub-20s builds than M2 machines.

M3s have lower memory bandwidth; they are effectively a downgrade for some use cases.

[+] Tarball10|2 years ago|reply
This is only tangentially related, but I'm curious how other companies typically balance their endpoint management and security software with developer productivity.

The company I work for is now running 5+ background services on its developer laptops, both Mac and Windows: endpoint management, privilege-escalation interception, TLS interception and inspection, anti-malware, and VPN clients.

This combination heavily impacts performance. You can see these services chewing up CPU and I/O performance while doing anything on the machines, and developers have complained about random lockups and hitches.

I understand security is necessary, especially with the increase in things like ransomware and IP theft, but have other companies found better ways to provide this security without impacting developer productivity as much?

[+] rendaw|2 years ago|reply
> People with the M1 laptops are frequently waiting almost 2m for their builds to complete.

I don't see this at all... the peak for all 3 is right under 20s. The long tail (i.e. the infrequent builds) goes up to 2m, but for all 3. M2 looks slightly better than M1, but it's not clear to me from this data that there's any improvement from M2 to M3.

[+] vessenes|2 years ago|reply
The upshot (M3 Pro slightly better than M2, and significantly better than M1 Pro) matches what I've experienced running local LLMs on my Macs. Currently the M3 memory bandwidth options are lower than for the M2, and that may be hampering total performance.

Performance per watt and rendering performance are both better in the M3, but I ultimately decided to wait for an M3 Ultra with more memory bandwidth before upgrading my daily driver M1 Max.

[+] aschla|2 years ago|reply
Side note, I like the casual technical writing style used here, with the main points summarized along the way. Easily digestible and I can go back and get the details in the main text at any point if I want.
[+] JonChesterfield|2 years ago|reply
If dev machine speed is important, why would you develop on a laptop?

I really like my laptop. I spend a lot of time typing into it. It's limited to a 30W or similar power budget by thermal and battery constraints. Some of that is spent on a network chip, which grants access to machines with much higher power and thermal budgets.

Current employer has really scary hardware behind a VPN to run code on. Previous one ran a machine room with lots of servers. Both expected engineer laptops to be mostly thin clients. That seems obviously the right answer to me.

Thus marginally faster dev laptops don't seem very exciting.

[+] db48x|2 years ago|reply
This is bad science. You compared the thing you had to the thing you wanted, and found a reason to pick the thing you wanted. Honesty should have compelled you to at least compare against a desktop-class machine, or even a workstation with a Threadripper CPU. Since you know that at least part of your workload is concurrent, and 14 CPUs are better than 10, why not check whether 16, 32, or 64 are better still? And the linker is memory-bound, so it is worth considering not just the quantity of memory but the actual memory bandwidth and latency as well.
[+] balls187|2 years ago|reply
> a chat interface to your ... data

> Generally, the process includes:

> Exporting your data to a CSV

> Create an ‘Assistant’ with a prompt explaining your purpose, and provide it the CSV file with your data.

Once MSFT builds the aforementioned process into Excel, it's going to be a major game changer.

[+] hk1337|2 years ago|reply
I wonder why they didn't include Linux, since the project they're building is Go. Most CI tools, I believe, are going to be Linux. Sure, you can explicitly select macOS in GitHub CI, but Linux seems like it would be the better generic option?

*EDIT* I guess if you needed a macOS-specific build with Go you would use macOS, but I would have thought you'd use Linux otherwise. Can you build a Go project on Linux and have it run on macOS? I suppose architecture would be an issue: a build on Linux x86 would not run on macOS Apple Silicon, and the reverse is true too; a build on Apple Silicon would not work on Linux x86, maybe not even Linux ARM.
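(For reference, the Go toolchain cross-compiles in both directions by setting GOOS/GOARCH, as long as cgo isn't involved; `./cmd/app` below is a hypothetical package path.)

    # From Linux x86-64, build a macOS Apple Silicon binary:
    GOOS=darwin GOARCH=arm64 go build -o app_darwin_arm64 ./cmd/app
    # And the reverse, from an Apple Silicon Mac to Linux x86-64:
    GOOS=linux GOARCH=amd64 go build -o app_linux_amd64 ./cmd/app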

[+] neya|2 years ago|reply
Last month I upgraded from an M1 Pro to an M3 Pro (16GB RAM, 512GB). Here is an honest review of my experience. The chassis of the two machines is exactly the same. The bad thing is that the newer MacBooks come with a matte finish on the keyboard, and the finish wears off in just 3 weeks even with proper care. On the newer machine I used 70% IPA diluted with water to wipe it down after each use, thinking the wear was due to grease/oil from the fingers. The finish still wore off, suggesting it has more to do with the finish itself.

With that out of the way: I maintain a very large site, over 50,000 posts in a static-site setup. It is built on Phoenix/Elixir, which I wrote from scratch, and it has been serving my client well without hiccups for the last 5 years. From time to time, some of the writers mess something up and want us to run a sweep: backtrack over the last X posts and re-publish them, usually about 100-500 posts, covering a few weeks' or months' worth depending on the run.

So, for a 100-post sweep, the M3 Pro was faster, by a factor of about 1.5x. But for other everyday tasks, like opening Affinity Designer/Photo and editing files, I didn't notice much improvement. And Apple's website is notoriously deceptive about the improvements: their graphs always compare a top-specced M3 machine against a base M1 or M2 variant, which makes no sense to me as a developer. Disk (SSD) speeds were slightly slower on the M3 than on my old M1 variant, but I am unable to attribute that to the processor; it could be a bad driver or software version. For those interested, it is one of those internal Samsung SSDs connected via a SATA-to-USB-C converter.

Long story short: if you are on an M2, it's not worth upgrading. If you are on an M1 or M1 Pro, you can still hold on a bit longer; the only reason to upgrade is if you get a good deal like I did. I sold my 2021 M1 Pro for almost $1800 US (it still had AppleCare) and got a student discount on the new M3 Pro (I'm in the middle of some certification programs), so I bought it. Without those financial incentives it probably wouldn't have been worth upgrading. Hope this helps someone.

[+] actinium226|2 years ago|reply
Honestly, I found the analysis a little difficult to read. Some of the histograms weren't normalized, which made comparing them a bit difficult.

One thing that really stood out to me was:

> People with the M1 laptops are frequently waiting almost 2m for their builds to complete.

And yet looking at the graph, 120s is off the right edge of the scale, which suggests almost literally no one is waiting 2m for a build; most builds happen within 40s, with a long tail out to 1m50s.

[+] brailsafe|2 years ago|reply
Fun read, I like how overkill it is. When I was still employed, I was building our django/postgres thing locally in Docker with 32GB of RAM, and it was a wild improvement in feedback-loop latency over my shitty 13" Intel MBP. I think it's seriously underappreciated how important it is to keep that latency low, or as low as is cost-effective.

Now that I'm not employed (and don't hope to be for a while), I think the greatest bottleneck in my overall productivity is my own discipline, since I'm not compiling anything huge or using Docker. The few times I really notice how slow the machine is are big I/O or RAM operations like indexing, or the occasional Xcode build, but it's still low in absolute terms, and without a stressful deadline I don't worry much about it. That makes me happy in some ways, because I used to feel like I could just throw new hardware at my overall productivity and solve any issue, but I think that's true only for things that are extremely computationally expensive.

Normally I'd just spend the cash, even as a contractor, because investing in my tools is good, but the up-charge for RAM and SSD is so ludicrously high that I have no idea when that upgrade will come, and the refurbished older M1 and M2 models aren't that much cheaper. My battery life also sucks, but it's not $5k-sucky yet. Also worth noting I'm just a measly frontend developer, but there have been scenarios where I've been doing frontend inside a big Docker container or a massive Tomcat Java app, and for those I'd probably just go for it.
[+] karolist|2 years ago|reply
I don't know where in the world you are, but B&H in the US still sells new 16" M1 Max machines with 64GB memory and a 2TB SSD for $2,499-2,599 depending on the current deal. That is around the price of a base M3 Pro in the 18GB/512GB configuration. I figure you'd still get 5+ years of use out of such a machine and never worry about storage or memory.