top | item 39391688

Magika: AI powered fast and efficient file type identification

695 points| alphabetting | 2 years ago |opensource.googleblog.com | reply

251 comments

[+] TomNomNom|2 years ago|reply
This looks cool. I ran this on some web crawl data I have locally, so all files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document", whereas the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data", others as "Unknown binary data (unknown)" with low-confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
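For reference, the standard fix for that pet peeve in a Python CLI is a one-line `isatty()` check before emitting escapes; a minimal sketch (the names here are illustrative, not Magika's actual code):

```python
import os
import sys

# ANSI escapes similar to the ones leaking through above.
BOLD_WHITE = "\x1b[1;37m"
RESET = "\x1b[0;39m"

def colorize(text):
    """Wrap text in colour escapes only when stdout is an interactive
    terminal; also honour the NO_COLOR convention."""
    if sys.stdout.isatty() and "NO_COLOR" not in os.environ:
        return BOLD_WHITE + text + RESET
    return text
```

When output is piped (so `isatty()` is false), the text passes through untouched, which is exactly the behaviour `ls --color=auto` and friends implement.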

[+] ebursztein|2 years ago|reply
Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce -- send us an email at [email protected] if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure which use cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best efforts to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika -- this is very useful.

[+] michaelmior|2 years ago|reply
> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)

[+] stevepike|2 years ago|reply
Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" patterns that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3 KB of the file to figure out the type; they had a hard limit that they wouldn't look past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?

[+] ebursztein|2 years ago|reply
Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did take them.

Overall, `file` takes about 6ms on a single file and 2.26ms per file when scanning multiples. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.

So in the worst case Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detection. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.

[+] metafunctor|2 years ago|reply
I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem equipped to handle the zip manifest being ordered differently than expected.
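A sketch of such a layer using only the stdlib `zipfile` module; the marker-to-MIME mapping below is illustrative and far from exhaustive, and `refine_zip_type` is a hypothetical name:

```python
import zipfile

# Telltale member names inside OOXML containers; membership, not order,
# is what matters, so a reshuffled manifest still matches.
_OFFICE_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def refine_zip_type(path):
    """If path is a zip, inspect its member list to pick a more specific
    MIME type than the generic application/zip."""
    if not zipfile.is_zipfile(path):
        return "not a zip"
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    for marker, mime in _OFFICE_MARKERS.items():
        if marker in names:
            return mime
    return "application/zip"
```

Because it keys on a set of member names, entry order in the central directory is irrelevant, which addresses the robustness gap described above.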

But this deep learning approach... I don't know. It might be hard to shoehorn in to many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.

[+] renonce|2 years ago|reply
From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?

[+] brabel|2 years ago|reply
> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65 KB (+ 22 bytes) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length after it (and as the comment length field takes 2 bytes, the comment can be up to 65K long).
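A sketch of that tail scan in Python, following the sizes described above (65535-byte max comment plus the 22-byte fixed record; helper names are made up):

```python
def find_eocd(data):
    """Scan the last 65535 + 22 bytes of data for the End of Central
    Directory signature (b"PK\\x05\\x06"); return its offset or -1."""
    max_scan = 65535 + 22  # max comment length + fixed EOCD size
    tail = data[-max_scan:]
    pos = tail.rfind(b"PK\x05\x06")  # rfind: the EOCD is the last one
    if pos == -1:
        return -1
    return len(data) - len(tail) + pos

def looks_like_zip(data):
    # A leading PK\x03\x04 check would miss self-extracting archives;
    # scanning for the EOCD tolerates a prepended stub or garbage.
    return find_eocd(data) != -1
```

Note how a self-extracting archive (executable stub + zip) still passes, exactly the case a 256-byte head-only check can never handle.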

[+] aidenn0|2 years ago|reply
FWIW, file can now distinguish many types of zip containers, including OOXML files.
[+] m0shen|2 years ago|reply
As someone that has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ) , I have to say I really love seeing new entries into the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also as others have mentioned. The model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.

[+] m0shen|2 years ago|reply
Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'
[+] ebursztein|2 years ago|reply
We did release the npm package because we indeed created a web demo and thought people might want to use it as well. We know it is not as fast as the Python version or a C++ version would be -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, and we hope it will be useful to the community.

Glad to hear it worked on your files!

[+] invernizzi|2 years ago|reply
Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model once per file. Other than that, it's just as fast for single-file lookups, which are the most common use case.
[+] tudorw|2 years ago|reply
Can we use the 1600 rules when the type is known and, if not, let the AI take a guess?
[+] michaelt|2 years ago|reply
> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.

[+] lebean|2 years ago|reply
It's for researchers, probably.
[+] Eiim|2 years ago|reply
I ran a quick test on 100 semi-random files I had lying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific type (unknown binary/generic text) when a more specific one existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception: a JSON file containing a lot of JS code as text, which was detected as JS code.

For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" count here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files.

Not sure what to make of these tests, but perhaps they're useful to somebody.
[+] pizzalife|2 years ago|reply
> The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code

To be fair though, a snippet of GLSL shader code can be perfectly valid C.

[+] lifthrasiir|2 years ago|reply
I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which are very regular. They should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well, but at that point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which files were causing the unexpected drop in precision or recall.
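The rule sketched above is a few lines with the stdlib `zipfile` module (`classify_archive` is a hypothetical helper; real-world archives can deviate, e.g. JARs shipped with META-INF stripped):

```python
import zipfile

def classify_archive(path):
    """Rule-based check: a JAR is a zip carrying META-INF/MANIFEST.MF,
    and an APK additionally carries classes.dex."""
    if not zipfile.is_zipfile(path):
        return "unknown"
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    if "META-INF/MANIFEST.MF" not in names:
        return "zip"
    if "classes.dex" in names:
        return "apk"
    return "jar"
```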
[+] HtmlProgrammer|2 years ago|reply
Minecraft mods 14 years ago used to tell you to open the JAR and delete the META-INF folder when installing them, so you can't rely on that one…
[+] supriyo-biswas|2 years ago|reply
The `file` command checks only the first few bytes, and doesn’t parse the structure of the file. APK files are indeed reported as Zip archives by the latest version of `file`.
[+] lopkeny12ko|2 years ago|reply
I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same first few magic bytes. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
[+] TacticalCoder|2 years ago|reply
> What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?

I use the file command all the time. The value is when you get this:

    ... $  file somefile.xyz
    somefile.xyz: data
AIUI from reading TFA, magika can determine more filetypes than what the file command can detect.

It'd actually be very easy to determine whether there's any value in magika: run file on every file on your filesystem, and for every file where the file command returns "data", run magika and see if magika is right.

If it's right, there's your value.

P.S.: it may also be easier to run on Windows than the file command? But then, I can't do much to help people who are on Windows.

[+] cle|2 years ago|reply
It's not always deterministic; sometimes it's fuzzy depending on the file type. An example of this is a one-line CSV file. I tested one case of that: libmagic detects it as a text file, while magika correctly detects it as a CSV (and gives a confidence score, which is killer).
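For what it's worth, even rule-based tooling can get partway there with the stdlib: `csv.Sniffer` will often commit to a delimiter from a single line. A rough sketch (the thresholds inside Sniffer are heuristic too, and unlike Magika there's no confidence score -- it either commits or raises):

```python
import csv

def looks_like_csv(sample):
    """Heuristic CSV check: ask the stdlib Sniffer to infer a dialect,
    restricted to common delimiters."""
    try:
        csv.Sniffer().sniff(sample, delimiters=",;\t")
        return True
    except csv.Error:
        return False
```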
[+] alkonaut|2 years ago|reply
But even with determinism, it's not always right. It's not too rare to find a text file with a byte order mark indicating UTF-16 (0xFE 0xFF) but then actually containing UTF-8. What "format" does it have then? Is it UTF-8 or UTF-16? Same with e.g. a jar file missing a manifest: that's just a zip, even though I'm sure some runtime might eat it.
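That mismatch is easy to construct; a small sketch (helper names are made up):

```python
# What each byte order mark claims about the encoding.
BOMS = {
    b"\xef\xbb\xbf": "UTF-8",
    b"\xff\xfe": "UTF-16 LE",
    b"\xfe\xff": "UTF-16 BE",
}

def bom_says(data):
    """Encoding claimed by the BOM, or None if there is no BOM."""
    for bom, name in BOMS.items():
        if data.startswith(bom):
            return name
    return None

def is_valid_utf8(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The case described above: a UTF-16 BE BOM stapled onto a UTF-8 payload.
sample = b"\xfe\xff" + "héllo wörld".encode("utf-8")
```

The BOM claims UTF-16 BE, yet the payload after it only decodes as UTF-8 -- any detector, rule-based or learned, has to pick a side.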

But the question is: when do you have the issue of having to guess the format of a file? Is it when reverse engineering? Last time I did something like this was in the '90s, trying to pick apart some texture from a directory of files called asset0001.k, and it turned out it was a bitmap or whatever. Fun times.

[+] vintermann|2 years ago|reply
Consider, it's perfectly possible for a file to fit two or more file formats - polyglot files are a hobby for some people.

And there are also a billion formats that are not uniquely determined by magic bytes. You don't have to go further than text files.

[+] potatoman22|2 years ago|reply
This also works for formats like Python, HTML, and JSON.
[+] YoshiRulz|2 years ago|reply
So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.
[+] thorum|2 years ago|reply
[+] s1mon|2 years ago|reply
It's surprising that there are so many file types that seem relatively common which are missing from this list. There are no raw image file formats. There's nothing for CAD - either source files or neutral files. There's no MIDI files, or any other music creation types. There's no APL, Pascal, COBOL, assembly source file formats etc.
[+] vunderba|2 years ago|reply
As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.
[+] NiloCK|2 years ago|reply
A somewhat surprising and genuinely useful application of the family of techniques.

I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.

[+] nicklecompte|2 years ago|reply
Elsewhere in the thread kevincox[1] points out that it's extremely susceptible to adversarial binaries:

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

Seems like this is genuinely useless for anybody but AI researchers.

[1] https://news.ycombinator.com/item?id=39395677

[+] jamesdwilson|2 years ago|reply
For the extremely limited number of file types supported, I question the utility of this compared to `magic`
[+] star4040|2 years ago|reply
It gets a lot of binary file formats wrong for me out-of-the-box. I think it needs to be a bit more effective before we can truly determine the effectiveness of such exploits.
[+] dghlsakjg|2 years ago|reply
“These aren’t the binaries you are looking for…”
[+] Vt71fcAqt7|2 years ago|reply
This feels like old-school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life-changing than those, but still a great option to have when I need this.
[+] userbinator|2 years ago|reply
> Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.

"web browsers"? Odd to see this coming from Google itself. https://en.wikipedia.org/wiki/Content_sniffing was widely criticised for being problematic for security.

[+] TacticalCoder|2 years ago|reply
To me the obvious use case is to first use the file command but then, when file returns "data" (meaning it couldn't guess the file type), call magika.

I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around file doing just that when I come back from vacation. I hate it when file cannot do its thing.

Put it this way: I use file a lot and I know at times it cannot detect a filetype. But is file often wrong when it does have a match? I don't think so...

So in most cases I'd have file correctly give me the filetype very quickly, but in those rare cases where file cannot find anything, I'd use the slower but apparently more capable magika.
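That wrapper is only a few lines; a Python sketch that assumes the `file` and `magika` CLIs are on PATH (`identify` is a hypothetical name, and the detectors are injectable so the magika dependency stays optional):

```python
import subprocess

def run_cli(cmd, path):
    """Run a detector CLI on path and return its trimmed stdout."""
    out = subprocess.run(cmd + [path], capture_output=True, text=True, check=True)
    return out.stdout.strip()

def identify(path, primary=None, fallback=None):
    """Fast detector first; pay for the slower model-based one only
    when the fast one shrugs and answers "data"."""
    primary = primary or (lambda p: run_cli(["file", "--brief"], p))
    fallback = fallback or (lambda p: run_cli(["magika"], p))
    label = primary(path)
    return fallback(path) if label == "data" else label
```

Because the two detectors are plain callables, the dispatch logic can be tested without either tool installed.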

[+] SnowflakeOnIce|2 years ago|reply
I have seen 'file' misclassify many things when running it at large scale (millions of files) from a hodgepodge of sources. Unrelated types getting called 'GPG Private Keys', for example.

For textual data types, 'file' gets confused often, or doesn't give a precise type. GitHub's 'linguist' [1] tool does much better here, but is structured in such a way that it is difficult to call it on an arbitrary file or bytestring that doesn't reside in a git repo.

I'd love to have a classification tool that can more granularly classify textual files! It may not be Magika _today_ since it only supports 116-something types. For this use case, an ML-based approach will be more successful than an approach based solely on handwritten heuristic rules. I'm excited to see where this goes.

[+] krick|2 years ago|reply
What are the use cases for this? I mean, obviously detecting the file type is useful, but we kinda already have plenty of tools to do that, and I can't imagine why we need some "smart" way of doing it. If you are not a human and you are not sure what something is (like an unknown file being uploaded to a server), you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-the-safer-side heuristic, and you wouldn't want to trust that thing to protect you from malicious payloads.
[+] nindalf|2 years ago|reply
> no way an "AI powered" tool can be more reliable

The article provides accuracy benchmarks.

> you would be better off just rejecting it completely

They mention using it in gmail and Drive, neither of which have the luxury of rejecting files willy-nilly.

[+] n2d4|2 years ago|reply
Virus detection is mentioned in the article. Code editors need to find the programming language for syntax highlighting before you give the file a name. Your desktop OS needs to know which program to open files with. Or recovering files from a corrupted drive. Etc.

It's easy to distinguish, say, a PNG from a JPG file (or anything else that has well-defined magic bytes). But some files look virtually identical (e.g. .jar files are really just .zip files). Also see polyglot files [1].

If you allow an `unknown` label or human intervention, then yes, magic bytes might be enough, but sometimes you'd rather have a 99% chance to be right about 95% of files vs. a 100% chance to be right about 50% of files.

[1] https://en.wikipedia.org/wiki/Polyglot_(computing)

[+] woliveirajr|2 years ago|reply
Reminds me of when someone asked (on StackOverflow) how to recognize binaries for different architectures, like x86 or ARM-something or Apple M1 and so on.

I suggested using the technique of NCD (normalized compression distance), based on Kolmogorov complexity. Cilibrasi, R. was one great researcher in this area, and I think he worked at Google at some point.

Using AI seems to follow the same path: "learn" what represents some specific file and then compare the unknown file to those references (AI: all the parameters, NCD: compression against a known type).
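For the curious, NCD takes only a few lines with a real compressor (zlib here) standing in for Kolmogorov complexity; this is the standard formula, not anything Magika itself uses:

```python
import zlib

def ncd(x, y):
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is compressed length. Values near 0 mean "similar",
    values near 1 mean "unrelated"."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

To classify an unknown blob, compute its NCD against one reference sample per known type and pick the smallest distance -- the "compression against a known type" step described above.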

[+] jjsimpso|2 years ago|reply
I wrote an implementation of libmagic in Racket a few years ago (https://github.com/jjsimpso/magic). File type identification is a pretty interesting topic.

As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.

[+] VikingCoder|2 years ago|reply
What does it do with an Actually Portable Executable compiled by Cosmopolitan libc compiler?
[+] 20after4|2 years ago|reply
I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors going through the effort to get it approved for open source release. It would be great if the model training data were included (or at least documentation about how to reproduce it), but that doesn't preclude this from being useful. Thanks!