Notice the xchg, stosb and a loop instruction. This was definitely written by a skilled Asm programmer --- I've never seen a compiler generate code like that, even at -Os.
This also compels me to "code-golf" the function even more:
    push edi
    mov edi, [esp+8]
    mov ecx, [esp+12]
    jecxz label2
label1:
    push ecx
    call sub_416352
    stosb
    pop ecx
    test al, al
    loopnz label1
    jecxz label2
    dec edi
    salc
    stosb
label2:
    pop edi
    ret
Original: 58 bytes; patched: 44; mine: 30.
I've done plenty of patching like this, and indeed the relative "sparseness" of compiler output very often allows the more functional version to be smaller than the original. It's amazing how many instructions the original wastes --- notice how none of ebx, esi, or edi are used, yet they get needlessly pushed and popped; and despite saving those registers so they could be used locally, the compiler perplexingly decided to keep all the local variables on the stack instead. The "jump around a jump", with both of them being the "long" form (for destinations greater than 128 bytes away, not the case here) is equally horrible. This may actually be a case where today's compilers will generate smaller code for the same source.
Note that in 32-bit code, memcpy is typically implemented by first copying blocks of 4 bytes using the movsd (move double word) instruction, while any remaining bytes are then copied using movsb (move byte). This is efficient in terms of performance, but whoever was patching this noticed that some space can be freed by only using movsb, and perhaps sacrificing a nanosecond or two.
On older processors this was true, but since Ivy Bridge a REP MOVSB is essentially as fast while being smaller. Look up "enhanced REP MOVSB" for more information.
Modern compilers aren't that much better at code golf.
I tried equivalent C code and gcc-7.2 gets me 47 bytes, while clang-6.0 only manages 49 bytes (both with -m32 f.c -Os -fomit-frame-pointer).
I have a feeling that size optimization just isn't really important to (at least open source) compiler writers these days. There are more important things, like actual performance, standards compliance and nice diagnostics.
How does this square with the often-quoted mantra that you can only beat compilers today if you're an extremely skilled asm programmer? Or is the problem you describe just about executable size rather than speed?
"Note that in 32-bit code, memcpy is typically implemented by first copying blocks of 4 bytes using the movsd (move double word) instruction, while any remaining bytes are then copied using movsb (move byte)."
The semi-official Debian server, alioth.debian.org, where a lot of random developer stuff is hosted, is stuck on Debian wheezy for various reasons. Most users, including myself (a Debian Developer) don't have root access to upgrade the server nor install new software.
The version of libapt-inst is too old to support Debian packages with control.tar.xz members (only control.tar.gz members). So we can't upload newer Debian packages to various custom APT repos that we host on that server.
I worked around this by looking at the libapt-inst source code, figuring out how to make it support control.tar.xz instead of control.tar.gz, and binary-patched libapt-inst.so to have this effect instead. It's actually fairly simple:
1. there is a check for control.tar.gz, the failure branch prints an error and then returns. I overwrite this with NOP so it goes into the "success" branch.
2. then later it extracts the control.tar.gz member and pipes it through gzip. Luckily, nowhere else in the program uses the exact string "control.tar.gz" or "gzip" so I simply patch that string "control.tar.gz" -> "control.tar.xz" in the binary and also change "gzip" -> "xz\0\0".
(Actually given the change in (2), (1) is not necessary. But without it you get a bunch of spurious error messages.)
Applying this patch makes the resulting .so lose the ability to work with old control.tar.gz members (which are still needed, of course). So my workaround does this:
LD_PRELOAD=libapt-inst.so.patched apt-ftparchive [..] && apt-ftparchive [..]
i.e. it runs once with the hack to pick up the new-style debs, and once again without the hack to pick up the old-style debs.
My motto is, "dirty solutions for dirty problems". :D :D :D
Jeez. How is security maintained? That actually scares me a bit.
I'm surprised no one has noted the copyright is to Design Science - this is a small company in my hometown who are still around. I've spoken with their CEO a few times and I wouldn't be at all surprised if the source code was lost, or somehow at least wasn't being made available to Microsoft (I doubt it ever was). It's a really old school shop who seems to have largely been coasting on the licensing of this one component for the past couple decades and I wouldn't at all be shocked to find they no longer are capable of maintaining it themselves.
I noted it - thanks for the background info on the company! I also assume that either they are not able to maintain the software themselves, or they have lost the source code, but it might also be that setting up the toolchain to compile such an old piece of software is more effort than just patching the binary.
According to an Ars comment: "I've got an older version of Mac Office (2011 I think), and there's a version of Equation Editor in there with a 1990-2010 design science copyright on it, so they have some version of newer code they could swap the old office one for."
They also make other software meant to make math more accessible to people with various disabilities.
https://www.dessci.com/en/
I once worked at a place which lost part of the source code for their giant mission-defining application. They spent a decade linking in object code for which there was no corresponding source code.
The build team was very proud when they announced that the application would finally start being built from the source code in version control.
Stuff happens, indeed, and more often than most of us realise.
There is no way I could rely on something like that. Am I the only one who thinks relying on something that has no source code is just asking for trouble and headaches in the future?
How would you go an entire decade without noticing this? Wouldn't you have to use that missing source code for something within ten years?
Getting on for a decade ago now I was working at Red Gate when they bought .NET Reflector - a decompiler for .NET code - from Lutz Roeder. After the acquisition we started asking people what they were using it for.
Turns out a significant minority of them were trying to recover lost source code, or source code they never had in the first place (e.g., where a supplier went out of business). I don't remember the exact figure but it might even have run into low double-digit percentages. Bear in mind this is a tool that was being downloaded tens of thousands of times every month by all manner of people working for all kinds of organisations of every size, and you can see the scale of the problem.
There were a couple of Reflector add-ins that would allow you to take a .NET binary and generate a C# or VB.NET Visual Studio project with all source code from it. The source code was never perfect and wouldn't likely compile first time, but it was certainly better than starting from scratch. Not surprisingly these add-ins were among the most popular.
Granted, times have changed, and I think source control is probably the default for almost everyone these days - although I would have expected that even in 2008 - but, bottom line: I think this sort of thing happens a lot, for one reason or another.
A friend once told me a story about a software company that had offices in the World Trade Center in New York. Their offices were totally destroyed by 9/11; thankfully, all the staff got out alive, but it turned out they didn't have offsite backups of the source code repository, and it was lost completely. They found various bits of the source code floating around (e.g. some developers had bits of it on their home computers), but there were a few key components they could not locate any source for. Well, the customers still had the compiled binaries, so they got the binaries back from the customers, extracted the missing bits, ran them through a decompiler, and checked the result into the source code repository – since the application was written in Java, this actually worked quite well. Years later, new developers would find bits of obviously decompiled code still in the source repo (you can tell, it has a distinct look to it, e.g. variable names with numbers in them), and scratch their heads, and then get told the tale.
There are only two reasons I can think of why they'd patch the binary directly: either they've lost the source code, or they no longer have an environment they can build it in.
Another reason could be that it has dependencies that link to specific addresses in the exe. It's very peculiar that they made the effort to keep all the original addresses.
I have no specific insight to this patch, but I do have personal experience binary patching a popular Microsoft product.
My patch was to the VC++ compiler nearly 20 years ago. We had source, and my fix was also applied to the source (which I'd imagine is still there today), but a binary patch also made sense in the short term.
The binary that I patched was used to build another important Microsoft product, and this bug was found late in the product cycle where any compiler change was risky.
We weren't 100% confident we had the exact sources used to build that version of the compiler (git would have been handy then); we only knew, plus or minus one day, what the sources were.
After carefully evaluating the binary patch versus the risk of building from uncertain source, the binary patch was taken to reduce risk.
I'm no reverse engineer, but this was a pretty interesting exercise in RE even though I had sources. I had no symbols, and the binary was optimized so that functions were not contiguous; cold paths were moved to the end of the binary. Just finding the code I needed to patch was not easy.
The code review was fun - a dozen or so compiler engineers reviewed the change on paper printouts - the most thorough review I've had in my career, and the only one that used paper.
To the best of my knowledge, this binary was never used to build anything other than that specific version of the product which I won't name - not that it matters really, the product is still in use, but that version is unlikely to be in use anywhere anymore.
Thanks for sharing this. I suspected that "not being sure if you have the exactly right source code" could be a real world reason to patch a binary, and now I know.
I wonder if they patched this way because they wanted to maintain as much binary compatibility as possible, or if they don't have the original source/couldn't reproduce the build process.
Horrific? This is what you do when you want to make sure you don't introduce any unintentional changes. Computers aren't magic, and there is nothing wrong about patching a binary.
Compiling the software with a modern compiler or linking to a modern runtime is very likely to bring obscure bugs in the codebase to the surface. It's pretty hard to replicate the entire build process that produced the original binary, even if they have the source code and everything else on hand.
Building a new binary means running a full QA against it, which is probably not cost-effective for such an old component. In contrast, this patch has exactly known impact.
I know it seems like magic to you lot, but it's a day's work if you've the right skills.
They probably don't have the Office 97 or 2000 build pipeline around anymore, and back when Office XP or 2003 was made, they probably just copied the equation editor in binary form into the new repository.
Chances are that Microsoft doesn’t have a license for bug fixes from Design Science (makers of MathType) anymore and isn’t willing to pay for this fix.
Alternatively, Design Science may not be able to deliver a version that, for maximum backwards compatibility, has only this fix (to minimize risks, they would have to have kept an environment around that hosts the compiler used back then).
One reason for doing it this way is possibly this:
> Well, have you ever met a C/C++ compiler that would put all functions in a 500+ KB executable on exactly the same address in the module after rebuilding a modified source code, especially when these modifications changed the amount of code in several functions?
It's quite possible they are still contractually obligated to maintain some pretty old systems where changes to the .exe would produce unexpected behaviour. I had Access apps/databases crash on a system if they were built by a different version of Access.
Ah, this brings up a lot of fond memories of high school, preparing presentations using this fine piece of software[0] before replacing it with a 1GB open source equation editor called LaTeX.
[0] It was actually quite usable once you got to know its warts.
The article mentions that the timestamp of compilation gets embedded into the binary. When does this happen? I'm used to getting identical binaries when recompiling the same source code with the same flags (and the same compiler, and so on).
Binary patching is a really common requirement in attack/defense CTF, and there are a few projects floating around to help with it.
Keypatch helps you do assembly overwrites in IDA Pro.
Binary Ninja lets you do assembly (and C shellcode!) overwrite patches, and even has undo.
I have my own project [1] for patching ELFs that relies on injecting additional segments and injecting a hook at any address, so as to not require in-place patches. It can also massage GCC/Clang output and inject that reliably into an existing binary.
[1] https://github.com/lunixbochs/patchkit
I have my own story about this as well. A few years ago I released a port of Uplink: Hacker Elite for the OpenPandora handheld with a few game engine patches, and some people were running into a bug: the game would enter the "new game" screen on every launch, even if you already had a save game to load.
I couldn't find the exact source I'd used to build it and didn't want to spend time making sure I got all of my bugfixes into the vanilla repository, so... I went digging with IDA, found the topmost branch to the "new game" wizard, and patched the address to go to the main menu function instead. At that point you could still click "new game" from the menu and it wouldn't go through the patched address (so "new game" still worked), but you could also load an existing game, thus fixing the bug!
I still have nothing on Notaz, who statically recompiled StarCraft and Diablo for that community :)
IDA is widely regarded as the best disassembler and debugger out there. It comes with a price to match, though.
http://rada.re/r/
It's an old program whose source code may either no longer compile with a modern C++ compiler, or be lost entirely. Back in 2000, Microsoft was using Visual SourceSafe for managing its source code. I wouldn't be surprised if nobody can remember where the heck the VSS repository with that source code is located.
That leaves the binary monkey-patching as the only reasonable solution. I'm pretty sure Raymond Chen still works at Microsoft...
Binary patching is really only reasonable when the source code is indeed lost. If they had the code but simply needed a compiler that worked, they could have rebuilt it using the same toolchain and build environment it was built with to begin with. Old versions of Windows and MSVC are obviously still around.
Just a historical note: Patching used to be much more common. Back in the Vax VMS days the image file format (executables, not pictures) had a section for patches.
From the ANALYZE/IMAGE command…
Patch information --- Indicates whether the image has been patched (changed without having been recompiled or reassembled and relinked). If a patch is present, the actual patch code can be displayed. (VAX and Alpha only.)
They'd have to re-implement the patch in source before doing anything else to it. I wonder if they are no longer able to build from source anymore... why else would they resort to this?
They probably just lost the ability to build it, or the source code can't be found. Happens quite often. 17 years is a /long/ time to maintain build systems and remember where you put the files.
I'm glad that most of the software development community seems to have settled on git. I get the feeling that I'll still have all of the source code for my projects in 20 years.
Redundant backups are especially important for software companies. It's scary to think how many startups give all cofounders and developers admin access to everything. It helps that git is distributed, but it's not hard to imagine a scenario where a ticked off former employee wipes everyone's laptops and deletes the hosted source code.
Even if you don't update the mirrors regularly, it's good to know that you have some copies of data in BitBucket/GitLab/Heroku/Google Drive.