The most copied StackOverflow snippet of all time is flawed (2019)

[+] TacticalCoder|2 years ago|reply

I find it interesting that all the answers using hardcoded values / if statements (or while) are all doing up to five comparisons.

It goes B, KiB, MiB, GiB, TiB, EiB and no more than that (in all the answers), so that can be solved with three if statements at most, not five.

I mean: if it's greater or equal to GiB, you know it won't be B, KiB or MiB. Dichotomy search for the win!

Not a single one of the hardcoded solutions does it that way.

Now let's go up to ZiB and YiB: still only three if statements at most, vs up to seven for the hardcoded solutions.

I mention it because I'd personally definitely not go for the whole log/pow/floating-points if I had to write a solution myself (because I precisely know all too well the SNAFU potential).

I'd hardcode if statements... But while doing a dichotomy search. I must be an oddball.

P.S: no horse in this race, no hill to die on, and all the usual disclaimers
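
A minimal sketch of the dichotomy idea above, in Java. It assumes the seven binary units B through EiB (PiB included, which the list above skips) and a non-negative `long`. The class and method names are made up for illustration, and this is one possible layout of the comparisons, not the only one.

```java
final class BinarySearchUnit {
    private static final long KIB = 1L << 10;
    private static final long MIB = 1L << 20;
    private static final long GIB = 1L << 30;
    private static final long TIB = 1L << 40;
    private static final long PIB = 1L << 50;
    private static final long EIB = 1L << 60;

    /** Picks the unit suffix with at most three threshold comparisons. */
    static String unitFor(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        if (bytes >= GIB) {                 // upper half: GiB, TiB, PiB or EiB
            if (bytes >= PIB) {
                return bytes >= EIB ? "EiB" : "PiB";
            }
            return bytes >= TIB ? "TiB" : "GiB";
        }
        if (bytes >= KIB) {                 // lower half, minus plain bytes
            return bytes >= MIB ? "MiB" : "KiB";
        }
        return "B";
    }
}
```

Calling `BinarySearchUnit.unitFor(1536)` takes the path GiB, KiB, MiB and returns "KiB". Plain bytes need only two comparisons.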

[+] IshKebab|2 years ago|reply

I would expect your binary search solution is possibly slower than just doing 6 checks because the latter is only going to take 1 branch. Branching is very slow. You want to keep code going in a straight line as much as possible.

[+] zeroonetwothree|2 years ago|reply

It depends on the input distribution. If it’s very common to have smaller values then the linear search could be superior.

[+] throwaway9870|2 years ago|reply

Your comment and mine are basically the same. This is what I call terrible engineering judgement. A random co-worker could review the simple solution without much effort. They could also see the corner cases clearly and verify the tests cover them. With this code, not so much. It seems like a lot of work to write slower, more complex, harder to test and harder to review code.

[+] roryokane|2 years ago|reply

(2019)

Past discussions:

https://news.ycombinator.com/item?id=21693431

https://news.ycombinator.com/item?id=21698619

https://news.ycombinator.com/item?id=27533684

[+] dang|2 years ago|reply

Thanks! Macroexpanded:

The most copied StackOverflow snippet of all time is flawed (2019) - https://news.ycombinator.com/item?id=27533684 - June 2021 (334 comments)

The most copied StackOverflow snippet of all time is flawed - https://news.ycombinator.com/item?id=21698619 - Dec 2019 (88 comments)

The most copied StackOverflow snippet of all time is flawed - https://news.ycombinator.com/item?id=21693431 - Dec 2019 (3 comments)

[+] throwaway9870|2 years ago|reply

I don't understand. There are 7 suffixes, can't you pick the right one with binary search? That would be 3 comparisons. Or just do it the dumb way and have 6 comparisons. How are two log() calls, one pow() call and ceil() better than just doing it the dumb way? The bug being described is a perfect example of trying to be too clever.
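
For comparison, a sketch of the "dumb way" in Java: six comparisons at worst, checked smallest unit first. The class and method names are hypothetical, and a non-negative `long` input is assumed.

```java
final class LinearUnit {
    /** Picks the unit suffix by checking thresholds from smallest to largest. */
    static String unitFor(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        if (bytes < (1L << 10)) return "B";
        if (bytes < (1L << 20)) return "KiB";
        if (bytes < (1L << 30)) return "MiB";
        if (bytes < (1L << 40)) return "GiB";
        if (bytes < (1L << 50)) return "TiB";
        if (bytes < (1L << 60)) return "PiB";
        return "EiB";                       // everything from 2^60 upward
    }
}
```

Each threshold is visible on its own line, which is what makes the corner cases easy to review and test.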

[+] emerongi|2 years ago|reply

The author apparently went back to using a loop after recognizing that it's not readable: https://programming.guide/java/formatting-byte-size-to-human...

Notably, it's still slightly better than the first code example in the original article, as it takes the rounding bug into account.

[+] zeroonetwothree|2 years ago|reply

The author says at the beginning that it’s not actually better than the loop.

Also 6 comparisons is only if you’d have the max value which seems unlikely in actual usage. Linear could be better if most of the time values are in B or KB ranges

[+] ComputerGuru|2 years ago|reply

Shameless plug: if you want to format sizes in a human-readable format quickly and correctly (other than by copying from S/O), you can use one of our open source PrettySize libraries, available for Rust [0] and .NET [1]. They also make performing type-safe logical operations on file sizes safe and easy!

The snippet from S/O may be four lines but these are much more extensive, come with tests, output formatting options, conversion between sizes, and more.

[0]: https://github.com/neosmart/prettysize-rs

[1]: https://github.com/neosmart/PrettySize.net

[+] drunkendog|2 years ago|reply

Replacing 4 line solutions with extensive libraries is what caused left-pad.

[+] oooyay|2 years ago|reply

Out of curiosity, is there a sizable number of developers that just copy and paste untrusted code from StackOverflow into their applications?

The conjecture that people just copy from StackOverflow is obviously popular but I always thought this was just conjecture and humor until I saw someone do it. Don't get me wrong, I use StackOverflow to give me a head start on solving a problem in an area I'm not as familiar with yet, but I've never just straight copied code from there. I don't do that because rarely does the snippet do exactly and only exactly what I need. It requires me to look at the APIs and form my own solution from the explained approach. StackOverflow has pointed me in the direction of some niche APIs that are useful to me, especially in Python.

[+] JimDabell|2 years ago|reply

I once worked with a developer who wouldn’t let anything come between him seeing an answer and copying it into his code. He wasn’t even reading the question to make sure it was the same problem he was having, let alone the answer. He would literally go Google => follow the first link to Stack Overflow he saw => copy and paste the first code block he saw. Sometimes it wasn’t even the right language. People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.

Now he was an extreme case, but yes, there are a lot of developers out there with the mindset of “I need code; Stack Overflow has code; problem solved!” that don’t put any thought at all into whether it’s an appropriate solution.

[+] sp332|2 years ago|reply

Yes, and it happens more for things that feel out of scope for the part of the program that I'm interested in. After all, we import library code from random strangers into our programs all the time for the parts we consider "plumbing" and beneath notice. If I wanted to dig in and understand something, I would be more likely to write my own. But if I want this part over here to "just work" so I can get on with the project, it's compiler-error-driven development.

[+] zelda-mazzy|2 years ago|reply

I hardly ever just copy and paste for the exact reason the author talks about. Instead, I try to make sense of the solution, and if I have to, I'll hand-copy it down line-by-line to make sure I properly understand and refactor from there. I also rename variables, since often times there are so many foos and bars and bazes that it's completely unreadable by a human.

Also if I come across the problem a second time, I'll have better luck remembering what I did (as opposed to blindly copying).

[+] ahoka|2 years ago|reply

Yes, people do that. After looking at a huge amount of incorrect TLS-related code and configuration on SO, I’m now pretty sure that most systems run without validating certificates properly.

[+] jihadjihad|2 years ago|reply

Oh boy, where to begin. You obviously haven't had the pleasure of working in a codebase written by Adderall-fueled 23-year-olds.

[+] shusaku|2 years ago|reply

I think the section “A Study on Attribution” and the associated paper might be as good an answer as you’ll get to that.

[+] foobarian|2 years ago|reply

Well. You (collective you) start by copying and pasting a code snippet first, and then modifying it as needed. Does that count? If no modifications are needed, then it stays.

[+] stringtoint|2 years ago|reply

Plenty of developers paste arbitrary bash commands posted on sites like GitHub without thinking because they look "legit", I suppose. I see it similarly as you do: StackOverflow (and Copilot) can be helpful to start but it's.

Had an exchange like this some time ago:

Me: Hey, I'm reviewing your PR. Looks pretty fine to me. Except for this function which looks like it was copy-pasted from SO: I literally found the same function in an answer on SO (it was written in pure JS while we were using TS in our project).

Dev: Yes, everyone copies from SO.

Me: Well, in that case I hope you always copy the right thing. Because this code might run but it is not good enough (e.g. the variable names are inexpressive, it creates DOM elements without removing them after they are not needed anymore).

[+] nikanj|2 years ago|reply

There really is, but people do give it a cursory read. See also: https://en.wikipedia.org/wiki/Underhanded_C_Contest

[+] hobs|2 years ago|reply

Yes. I was told from a reliable source that at one point they tried to log all the copy and paste events and it brought their systems to their knees.

[+] londons_explore|2 years ago|reply

I wouldn't do it in most professional settings due to licensing...

But for personal projects where I just want to get something running, then yes, I would copy paste and barely even read the code.

I don't really care about bugs like this either - I'm happy to make something that works 99% of the time, and only fix that last 1% if it turns out to be an issue.

[+] hattmall|2 years ago|reply

In the server side JavaScript world absolutely, it seems like it's standard practice, people are injecting entire dependencies without even remotely looking at the code. Bringing in an entire library for a single function that could be accomplished in a couple lines and usually is posted below the fold.

[+] naikrovek|2 years ago|reply

...you would not believe...

not long ago I worked on a team who actively chose libraries and frameworks based on the likelihood they felt their questions would be answered on StackOverflow.

[+] ehutch79|2 years ago|reply

Yes.

This is why PHP got such a bad reputation. A lot of new developers were copying and pasting quick example code from Stack Overflow, or code from other new developers who only kind of knew what they were doing.

[+] bornfreddy|2 years ago|reply

Less and less every day. Now they are using ChatGPT.

[+] yellowsir|2 years ago|reply

When I had to use Python, I felt like copy-pasting anything was out of scope due to indentation errors.

[+] TrackerFF|2 years ago|reply

Millions.

[+] berkle4455|2 years ago|reply

Wait til you find out about chatGPT

[+] marginalia_nu|2 years ago|reply

I don't understand why you'd use floating point logarithms if you want log 2?

Unless I'm missing something, this gives you an accurate value of floor(log2(value)) for anything positive less than 2^63 bytes, and it's much faster too:

  Long.bitCount( (Long.highestOneBit(value) << 1) - 1) - 1
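
A small Java sketch wrapping the expression above, next to an equivalent form built on `Long.numberOfLeadingZeros` (also a standard `java.lang.Long` method). The class name, the method names, and the `unitIndex` helper are made up for illustration. The helper assumes binary units, where each step (B, KiB, MiB, ...) spans 10 bits.

```java
final class IntLog2 {
    /** floor(log2(value)) via the bitCount expression; value must be positive. */
    static int viaBitCount(long value) {
        if (value <= 0) {
            throw new IllegalArgumentException("value must be positive");
        }
        // highestOneBit keeps only the top set bit; shifting and subtracting 1
        // yields a mask of (log2 + 1) ones. For bit 62 the shift wraps to
        // Long.MIN_VALUE and the subtraction wraps back to Long.MAX_VALUE.
        return Long.bitCount((Long.highestOneBit(value) << 1) - 1) - 1;
    }

    /** Same result, computed from the number of leading zero bits. */
    static int viaLeadingZeros(long value) {
        if (value <= 0) {
            throw new IllegalArgumentException("value must be positive");
        }
        return 63 - Long.numberOfLeadingZeros(value);
    }

    /** Hypothetical helper: index into {B, KiB, MiB, GiB, TiB, PiB, EiB}. */
    static int unitIndex(long bytes) {
        return bytes == 0 ? 0 : viaLeadingZeros(bytes) / 10;
    }
}
```

Both variants stay in integer arithmetic, so there is no floating-point rounding in the unit selection itself.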

[+] jprete|2 years ago|reply

I took one look at the snippet, saw a floating-point log operation and divisions applied to integers, and mentally discarded the entire snippet as too clever by half and inherently bug-prone.

[+] zeroonetwothree|2 years ago|reply

That’s basically the point of the article

[+] dleeftink|2 years ago|reply

Knowledge cascades all the way down; it goes to show how difficult it is to 'holster' even the smallest piece of knowledge once it's drawn.

I wonder with the rate Stack Exchange is losing active contributors, what it would take for 'fastest gun' answers to be corrected that are later found to be off mark, and what it would mean for our collective knowledge once these 'slightly off' answers are further cemented in our annals of search and increasingly, LLM history.

[+] dirtyv|2 years ago|reply

This reminds me of when I was in basic training. The drill sgts would give us new recruits a task that none of us knew how to do, purposefully without guidance, and then leave. One guy would try and start doing it, always the incorrect way, and everyone else would just copy that person.

[+] koromak|2 years ago|reply

In a way, I don't even consider floating point errors to be "flaws" with an algorithm like this. If the code defines a logical, mathematically correct solution, then it's "right". Solving floating point errors is a step above this, and only done in certain circumstances where it actually matters.

You can imagine some perfect future programming language where floating point errors don't exist, and don't have to be accounted for. That's the language I'm targeting with 99% of my algorithms.

[+] bloak|2 years ago|reply

This reminds me of a weirdness with some sat navs: the distance to your exit/destination is displayed as: 12 ... 11 ... 10 ... 10.0 ... 9.9 ... 9.8 ... with the value 10.0 shown only while the distance is between 9.95 and 10. It's not really a bug but it's strange seeing the display update from 10 to 10.0 as you pass the imaginary ten-mile milestone so perhaps it's a distraction worth avoiding.

[+] bombcar|2 years ago|reply

Mercedes for a while had a fuel gauge that showed 1/4 1/2 3/4 1/1

They had another one that went R 2/4 4/4

I'm still undecided which was more weird. You can see them both on eBay.

[+] envsubst|2 years ago|reply

Almost every top stack overflow answer is wrong. The correct one is usually at rank 3. The system promotes answers which the public believes to be correct (easy to read, resembles material they are familiar with, follows fads, etc).

Pay attention to comments and compare a few answers.

[+] crabbone|2 years ago|reply

Long time ago, when ActionScript was a thing, there was this one snippet in ActionScript documentation that illustrated how to deal with events dispatching, handling etc. In order to illustrate the concept the official documentation provided a code snippet that created a dummy object, attached handlers to it, and in those handlers defined some way of processing... I think it was XML loading and parsing, well, something very common.

The example implied that this object would be an instance of a class interested in handling events, but didn't want to blow up the size of this example with not so relevant bits of code.

There was a time when I very actively participated in various forums related to ActionScript. And, as you can imagine, loading of XML was paramount to success in that field. Invariably, I'd encounter code that copied the documentation example and had this useless dummy object with handlers defined (and subsequently struggled to extract information thus loaded).

It was simply amazing how regardless of the overall skill of the programmer or the purpose of the applet, the same exact useless object would appear in the same situation -- be it XML socket or XML loaded via HTTP, submitted and parsed by user... it was always there.

----

Today, I often encounter code like this in unit tests in various languages. Often programmers will copy some boilerplate code from an example in the manual and will create hundreds or even thousands of unit tests, all with some unnecessary code duplication / unnecessary objects. Not sure why in this specific area, but it looks like programmers treat these kinds of tests both as some sort of magic and as unimportant, worthless code that doesn't need attention.

----

Finally, specifically on the subject of human-readable encoding of byte sizes. Do you guys like parted? Because it's so fun to work with it because of this very issue! You should try it, if you have some spare time and don't feel misanthropic enough for today.

[+] derstander|2 years ago|reply

I feel like there ought to be a software analogue to that aphorism about models (if it doesn’t exist already) — maybe something like:

All code is wrong, but some is useful.

[+] corbezzoli|2 years ago|reply

Why do you need a 4-line dependency?

This is the reason.

[+] bauruine|2 years ago|reply

There is still the chance that the person that created the 4 line dependency also just copy pasted it from the flawed StackOverflow answer. Or is the same person or is also just a random person creating the package like the random person that created the SO answer. I'm not sure why random_person1 should be more trustworthy to produce non flawed code than random_person2.

OTOH: it's at least easily upgradeable, so it has an advantage.

[+] Rapzid|2 years ago|reply

The most impressive suggestion Copilot has given me was a solution to this that used a loop to divide and index further into an array of units.

It never dawned on me to approach it that way and I had never seen that solution(not that I ever looked). Not sure where it got that from but was pretty cool and.... Yeah, it gets simple stuff wrong all the time haha.
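
A rough Java sketch of that loop-and-array shape. This is a reconstruction of the general idea, not the actual Copilot output. The class name, the unit list, and the one-decimal output format are assumptions.

```java
import java.util.Locale;

final class LoopFormatter {
    private static final String[] UNITS = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};

    /** Divides by 1024 until the value fits, stepping through UNITS as it goes. */
    static String format(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        if (bytes < 1024) {
            return bytes + " B";            // plain bytes: no decimals needed
        }
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;                  // dividing by a power of two is exact
            unit++;
        }
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%.1f %s", value, UNITS[unit]);
    }
}
```

`LoopFormatter.format(1536)` returns "1.5 KiB". The sketch deliberately leaves in the kind of rounding edge the article is about: `format(1048575)` is just under 1 MiB but prints as "1024.0 KiB", because the rounding happens after the unit has been chosen.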

[+] seeknotfind|2 years ago|reply

I was surprised to find log implementations are loopless. Cool.

https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...

[+] zeroonetwothree|2 years ago|reply

It basically has the loop unrolled. But it looks like it’s evaluating a polynomial approximation so I suppose it makes sense

[+] bradley13|2 years ago|reply

When StackOverflow was new, it was an incredible resource. Unfortunately, so much cruft has accumulated that it is now nearly useless. Even if an answer was once correct (and many are not), it is likely years out of date and no longer applicable.

[+] meling|2 years ago|reply

While reading, I was wondering why Stack Overflow isn’t “mandating” that solutions have tests, so that this problem isn’t left to everyone else. See the comment at the end of the article:

Test all edge cases, especially for code copied from Stack Overflow.

[+] nelsonic|2 years ago|reply

How does the author determine this is the "most copied snippet" on SO? The Question/Answer has only been Viewed 351k times. There are posts with many millions of views e.g: https://stackoverflow.com/questions/927358/how-do-i-undo-the... which have definitely been copy-pasted more times. Yes, there may be many instances of this Java function on GitHub. But only because the people doing the copying are too lazy to think about how it works never mind alter the function name. If there's a bug, just update the SO answer and fix the problem. No need to write a lengthy self-promoting post about it.

233 comments