
p3n1s | 2 years ago

It looks like you got lucky and this proprietary format is nothing more than standard MIDI files concatenated together, with perhaps some additional data that you are able to ignore, +/- some header patch. Frankly this barely qualifies as reverse engineering, or at least it represents a trivial case. I mean, I'm happy it was easy, but reverse engineering rarely ever works out so straightforwardly.

And I would expect someone with competence in their scripting language of choice to pop out that script, which is a loop and file IO, in a few minutes, not hours (assuming it is even correct). And anyone with basic experience working with binary files should know how to google the necessary info about MIDI in seconds.

However, looking at the transcript I am also confused, because it says (correctly): MIDI files typically start with the header "MThd" followed by the header length, format type, number of tracks, and division. It goes on: "Once a MIDI section is found, we'll extract it according to the MIDI file structure". OK. But the script does NOT do that. It reads 4 bytes starting from offset 8 as a 32-bit big-endian "length", which is not "according to the MIDI file structure". In the standard format the 32-bit header length sits at offset 4 (and is 6 in every standard file); offset 8 holds 2 bytes for a format specifier (AKA type) (0, 1, or 2), and then 2 bytes for the number of tracks.

ie, this is wrong in some way:

    # Read the MIDI header and length (14 bytes in total: 'MThd' + 4 bytes for header length + 6 bytes header data)
    midi_chunk = io.read(14)

    # Extracting the length of the MIDI data from the header
    midi_data_length = midi_chunk[8..11].unpack('N').first

So one possibility is that the proprietary format you're dealing with actually does have a variation on the header of the embedded MIDI file. If that's the case, I would have to deduct points from ChatGPT, because I would expect a competent developer to comment/document this fact, and nowhere in the transcript is this stated.

The other possibility I can see is that if your file is a bunch of standard Type 1 MIDI files, the unpack/parse is going to read that as 65536 + some small amount (the track count) and will extract files that are all around that size. Since the next step is to look for another MThd magic it will just gleefully resync (I assume these are small segments), but you will end up skipping a whole bunch of files and they will be unceremoniously tacked onto others (where they will just be ignored by many players).
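To make the arithmetic concrete, here is a minimal Ruby sketch of the standard 14-byte header layout next to the transcript's read of bytes 8..11. The helper names `parse_mthd` and `misread_length` and the sample header values are made up for illustration; only the field layout comes from the MIDI file format.

```ruby
# Standard MIDI header chunk (14 bytes):
#   "MThd" | uint32 BE chunk length | uint16 BE format | uint16 BE ntrks | uint16 BE division
def parse_mthd(bytes)
  magic, length, format, ntrks, division = bytes.unpack('a4Nnnn')
  raise ArgumentError, 'not an MThd chunk' unless magic == 'MThd'

  { length: length, format: format, ntrks: ntrks, division: division }
end

# What the quoted script does: treat bytes 8..11 (format + ntrks) as one
# big-endian 32-bit "length".
def misread_length(bytes)
  bytes[8..11].unpack1('N')
end

# Hypothetical header: format 1, 3 tracks, 480 ticks per quarter note.
header = ['MThd', 6, 1, 3, 480].pack('a4Nnnn')

p parse_mthd(header)     # format and ntrks come out as separate fields
p misread_length(header) # (1 << 16) + 3, ie 65536 plus the track count
```

For a format 0 file with a single track the same misread yields 1, so the symptom depends entirely on which format the embedded files use.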

So what did it end up being? If it was the second case, I would also be suspicious that a first crack LLM follow-up "fix" isn't subtly wrong and prone to false splits.

On further thought, how could it be the first case? If it were, the output files would not be standard MIDI. So something is fucked here. Either you have something totally broken, or you had a further follow-up and we have to believe it is not subtly broken.

"There was a whole lot of reading the MIDI spec, searching for strings in a hex viewer and calculating values in a hex to decimal calculator."

One pearl I would lend in relation to this: use your REPL, that is a productivity accelerator.

I am also sincerely interested in examples of LLMs reverse engineering something with compression or encryption or some checksum, or some actual complicated structure that has to be teased out (this is something humans do all the time), maybe something that is most easily solved by cracking open the compiled parser. I'm not saying they can't do it, but plainly put, this example is too trivial to be interesting and frankly barely qualifies as reverse engineering, at least insofar as some sort of RE Turing Test analogue.

----

If the format works the way I think it does (and this is based on nothing more than general experience and this thread, so give me a break), there are only two robust ways to deal with this. One is to figure out where in the proprietary data some type of length field is; clearly ChatGPT was not going in that direction, nor do I believe it would be able to divine that information from a file upload. The other is to use this slightly wonkier method but actually read every MIDI chunk header, since standard MIDI has no total file size encoded in it. The loop should be: look for MThd, read the NEXT 4 bytes for the length, skip, then read and write out chunks (ie 4 byte magic followed by 4 byte length), and split when you hit something that is not a known chunk type (that's what makes this a bit fragile, but it's probably good enough). If you just look for MThd, you'll split if the MIDI data has an 'MThd' in it.
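A rough Ruby sketch of that chunk-walking loop, under the assumption that each embedded file is an MThd chunk followed directly by MTrk chunks and that anything else marks the end of the file. The method names `midi_end` and `extract_midis` are made up here; this is not the script from the transcript.

```ruby
# Given the offset of an "MThd", walk chunk by chunk (4 byte type + 4 byte
# big-endian length) and return the offset just past the last chunk.
def midi_end(blob, start)
  pos = start
  loop do
    type = blob.byteslice(pos, 4)
    break if type.nil? || type.bytesize < 4

    # Assumption: the first chunk is MThd and every later chunk is MTrk.
    expected = pos == start ? 'MThd' : 'MTrk'
    break unless type == expected

    len_bytes = blob.byteslice(pos + 4, 4)
    break if len_bytes.nil? || len_bytes.bytesize < 4

    pos += 8 + len_bytes.unpack1('N')
  end
  [pos, blob.bytesize].min # a truncated final chunk must not run past the blob
end

# Split a binary blob into its embedded MIDI files.
def extract_midis(blob)
  blob = blob.b
  files = []
  pos = 0
  while (start = blob.index('MThd'.b, pos))
    stop = midi_end(blob, start)
    files << blob.byteslice(start, stop - start)
    # Resume after this file, so an 'MThd' inside track data is never a split point.
    pos = [stop, start + 4].max
  end
  files
end
```

Because the scan resumes after the last chunk of each file, bytes that happen to spell 'MThd' inside a track are skipped rather than treated as a new file.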


peteforde|2 years ago

Ha! First: I appreciate the detailed and thoughtful reply, even if I feel wildly judged.

It's distinctly possible that you're simply "better" at reverse engineering than I am, which really just means that you might do it frequently and I might do it a few times a decade. This isn't going to keep me up tonight, because my identity isn't tied to being someone who reverse engineers things.

That said, I am pretty thrilled with this solution. I launched a web-enabled version last night and so far about 1100 people have used it to convert 6800 files after I replied to some posts on relevant musician forums around the web.

In my defense, what you're not taking into consideration is that until 48 hours ago, I'd never looked at the MIDI spec or opened a MIDI file. You clearly have a huge amount of domain knowledge that I don't pretend to have.

I also, shocking as it may seem, haven't worked with binary formats in over a decade. I'm a web developer. Binary formats aren't an alien mystery to me, but all of the tools for working with them had to be re-learned as I was working on this.

Anyhow, don't fall into the trap of equating typing speed with the time it takes to learn a domain and consider (design) an approach. If I could think at the speed I can type, John Carmack would have nothing on me.

In the end, I absolutely did get lucky. The proprietary format was, as you proposed, a bunch of 1 track/format 0 MIDI files, bounded by hierarchy metadata that was discarded.

p3n1s|2 years ago

Curiosity did get the better of me and you seem to be spamming every fucking forum so WTH. The longest time it took for anything was waiting for the file to download. The rest of this was about 5 minutes of effort. For the record I do not know the MIDI spec, but it must be one of the most easy to google and well documented things out there, and it's pretty simple likely because it's nearly 40 years old and had to run on potatoes.

The file you could have reverse engineered is not enough of a challenge for an interview question in a low level/systems field, and yet you didn't even attempt to do that part. The file hierarchy is absolute offsets and sizes in 4 byte little endian quantities. It took less than 30 seconds to figure that out. How? Find the MThd string, note what offset it is at, and search for that offset as a value. Notice that it is adjacent to a null terminated string with a name ending in .mid, and to a small quantity. Is that small quantity, added to the original offset, the start of the next MThd? Yes? Done. Anyone with the shittiest hex editor and Ctrl-F can do this in a minute. Rebuilding the hierarchy is quite simple, since the absolute file offset to the entries is adjacent to every directory name. These midinfo things are interesting; they also have some sidecar data. Their content might also be something novel to reverse, and that might be worth bragging about.
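The same Ctrl-F trick can be sketched in Ruby. The method names `find_u32le` and `size_matches?` are made up, and the toy blob below uses an invented entry layout (name, then offset, then size) purely for demonstration; the real field order is not stated in this thread.

```ruby
# Every position where `value` appears as a 4 byte little endian integer.
def find_u32le(blob, value)
  blob = blob.b
  needle = [value].pack('V')
  hits = []
  pos = 0
  while (i = blob.index(needle, pos))
    hits << i
    pos = i + 1
  end
  hits
end

# Sanity check for a candidate size field: does offset + size land on the
# next "MThd"? (This cannot confirm the last file in the blob.)
def size_matches?(blob, offset, size)
  blob.b.byteslice(offset + size, 4) == 'MThd'
end

# Toy blob with a HYPOTHETICAL directory entry: "a.mid\0", offset 14, size 20.
toy = "a.mid\0".b + [14, 20].pack('VV') +
      'MThd'.b + ("\0" * 16).b + # first "file": 20 bytes starting at offset 14
      'MThd'.b                   # start of the next one

first_midi = toy.index('MThd'.b)         # where the first MThd lives
p find_u32le(toy, first_midi)            # where that offset is stored as a value
p size_matches?(toy, first_midi, 20)     # is the neighbouring quantity a size?
```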

> I spent almost a week (!) reverse engineering their absurd proprietary format using a hex editor and the MIDI spec.

Since this whole affair seems to have just boiled down to naively iterating for the 4 byte MIDI file magic and then examining the MIDI file metadata, you didn't even need all the bluster of breaking out the hex editor and a calculator... Bam https://github.com/jimm/midilib, done (the actual get-the-MIDI-data part can be done in 1 line with a string split). It's too bad ChatGPT didn't suggest that. That should extract the 3 pieces of metadata you attempted to store in the directories.
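For reference, the string split version might look roughly like this in Ruby. `naive_split` is a made-up name, and this is the naive approach: it will also split on an 'MThd' that happens to occur inside track data, and any trailing non-MIDI bytes stay attached to the end of each piece.

```ruby
# Split a binary blob in front of every "MThd" and keep only the pieces
# that start with the magic (this drops whatever precedes the first file).
def naive_split(blob)
  blob.b.split(/(?=MThd)/n).select { |piece| piece.start_with?('MThd') }
end
```

The lookahead keeps the magic at the start of each piece, which a plain split on the literal 'MThd' would discard.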

Sidebar: nothing here stands out as absurd. It just looks like the obvious solution some working stiff would put together to bundle some data up. They don't obfuscate or do anything that stands out as fucky. Since it's read only, it's not like not using sqlite is some cardinal sin, and they probably gave it all of the 3 seconds of thought it deserved.

If you were using this as a learning exercise, fine, but then go back and check your work, because your tempos and key signatures are off. eg the tempos that you categorize as 104 have an actual encoded tempo of 571429 µs/beat, or 1.7499 bps, or 104.9999 BPM, ie it's what humans would call 105 BPM. Not the end of the world, but this is a pretty rookie floating point mistake. And I'm pretty sure you bungled the key signature, because at least the ones that say Abm are Fm. Why is that... ah, you have ignored relative keys. For instance, the meta event FF 59 02 FC 01 has an sf value of 0xFC, which is 2's complement -4, so that's 4 flats. If this were a major key that would be Ab, but it's a minor key so it's Fm. Oh no, even simple keys are fucked. An sf of 0xFF (1 flat, or F major) is coming out as Cb (7 flats), which seems like a bungling of 2's complement. My music education culminated in 2nd chair trombone in the 8th grade and everything I know about the MIDI spec I got from the top 3 hits of google, so caveat emptor.
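A small Ruby sketch of both conversions, assuming the usual meaning of the tempo meta event (microseconds per quarter note) and of the key signature bytes (sf as a signed count of sharps or flats, mi as 0 for major and 1 for minor). The method and constant names are made up for illustration.

```ruby
# Tempo: microseconds per beat to BPM, rounded instead of truncated.
def bpm(us_per_beat)
  (60_000_000.0 / us_per_beat).round
end

# Key names indexed by sf + 7, from 7 flats (-7) up to 7 sharps (+7).
MAJOR_KEYS = %w[Cb Gb Db Ab Eb Bb F C G D A E B F# C#].freeze
MINOR_KEYS = %w[Abm Ebm Bbm Fm Cm Gm Dm Am Em Bm F#m C#m G#m D#m A#m].freeze

# sf_byte is the raw unsigned byte from the file; mi is 0 (major) or 1 (minor).
def key_signature(sf_byte, mi)
  sf = sf_byte > 127 ? sf_byte - 256 : sf_byte # two's complement int8
  raise ArgumentError, "sf out of range: #{sf}" unless (-7..7).cover?(sf)

  (mi == 1 ? MINOR_KEYS : MAJOR_KEYS)[sf + 7]
end

p bpm(571_429)            # rounds 104.9999... up
p 60_000_000 / 571_429    # integer division truncates instead
p key_signature(0xFC, 1)  # 4 flats, minor
p key_signature(0xFF, 0)  # 1 flat, major
```

Truncation is only one plausible way to land on 104; the sketch does not claim to reproduce the actual bug in the converter.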

https://en.wikipedia.org/wiki/Key_signature https://www.music.mcgill.ca/~ich/classes/mumt306/StandardMID...

The only reason I knew it was wrong was because I checked the files with mido (I know you claim Ruby is superior, but I seem to be doing ok with dull old Python like a rube).

Also, in the file I count 640 MIDI headers and I only got 639 files out of your thingy, so you might want to revisit that part I mentioned about bugs.

It might actually be interesting if you reconstructed the hierarchy in the metadata. That is not hard, it is all absolute offsets, sizes and null terminated ASCII strings.

But what you seem to have done is write a (possibly buggy) loop that calls basic standard library functions in Ruby, and then poorly attempt to parse/convert a few MIDI meta-events. It's not reverse engineering if you have a nearly perfect spec in front of you; you completely skipped the (simple) reversing part and bungled the easy part. And this task shouldn't take a few days, or even hours, for someone experienced with basic scripting. https://adventofcode.com/2023

> the entire process of iterating through the binary blob, pulling out the MIDI header/track chunks, and then creating entries with smart naming in a zip archive is done entirely in-memory. For the non-programmers, this is otherwise known as "hard" or "showing off".

We have very different definitions of hard. Unless you were writing a shell script, doing it any other way is bananas. These are like a few hundred bytes apiece.

Your technical chops are discordant with the arrogance and incivility you've displayed towards others on this forum. (And name dropping Carmack, that's bold)

eg: "Serious question: are you wickedly combative in all discussions, or just when you get the cold feeling that perhaps you might not know as much as you think you do?" in response to someone who simply had a differing opinion to this: "you keep implying HTMX is a good choice, when there are vastly more powerful solutions in play."

"I assume that the person who downvoted my comment is a bootcamp grad. Good luck with your future endeavors."

I see a common denominator here.

p3n1s|2 years ago

[deleted]