TomNomNom | 2 years ago
It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".
Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".
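The WOFF and WOFF2 misidentifications are surprising because both formats are trivially distinguishable by their leading magic bytes, which is exactly the signature check the `file` utility's magic database performs. A minimal sketch (the function name is illustrative, not from either tool):

```python
# Identify common web/desktop font formats by their magic bytes.
# Signatures per the published format specs: WOFF starts with "wOFF",
# WOFF2 with "wOF2", TrueType with sfnt version 0x00010000 (or "true"),
# and CFF-flavoured OpenType with "OTTO".
def sniff_font(data: bytes) -> str:
    if data[:4] == b"wOFF":
        return "Web Open Font Format (WOFF)"
    if data[:4] == b"wOF2":
        return "Web Open Font Format (WOFF2)"
    if data[:4] in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType font data"
    if data[:4] == b"OTTO":
        return "OpenType font data"
    return "unknown"
```

A pure-ML identifier that skips this kind of cheap signature check gives up easy wins on formats with unambiguous headers.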
I like the idea, but the current implementation can't be relied on IMO, especially not for automation.
A minor pet peeve: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
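The usual fix for the colour-escape issue is to check whether stdout is a terminal before emitting ANSI codes. A hedged sketch of the pattern (the helper name and colours are illustrative; the two escapes are the ones quoted above):

```python
import sys

BOLD_WHITE = "\x1b[1;37m"   # renders as ^[[1;37m when piped raw
RESET = "\x1b[0;39m"        # renders as ^[[0;39m when piped raw

def colorize(line: str, stream=sys.stdout) -> str:
    # Only wrap the line in ANSI colour escapes when the output stream
    # is an interactive terminal; when piped (e.g. into vim), pass the
    # line through unchanged.
    if stream.isatty():
        return f"{BOLD_WHITE}{line}{RESET}"
    return line
```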
ebursztein | 2 years ago
For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.
We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best efforts to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.
Thanks again for sharing your experience with Magika -- this is very useful.
TomNomNom | 2 years ago
Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff
These are files that were in one of my crawl datasets.
[0]: https://poc.lol/files/magika-test.tgz
westurner | 2 years ago
hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...
File signature: https://en.wikipedia.org/wiki/File_signature
PhotoRec: https://en.wikipedia.org/wiki/PhotoRec
"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/
Cross-reference table: https://formats.kaitai.io/xref.html
AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...
Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression
Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database
sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...
clamav sigtool: https://www.google.com/search?q=clamav+sigtool
https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also: clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums` / `rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they are supposed to have been installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good; ~gdesktop LLM tools scan every file; there are extended filesystem attributes for label-based MAC systems like SELinux; oh, and NTFS ADS.
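The file-based HIDS idea above can be sketched in a few lines: record a baseline of per-file SHA-256 hashes, then diff a later scan against it. This is a minimal illustration, not the design of any particular HIDS:

```python
import hashlib
from pathlib import Path

def snapshot(root: str) -> dict[str, str]:
    # Map every regular file under root to the SHA-256 of its contents.
    hashes = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            hashes[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes

def diff(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    # Report files that appeared, disappeared, or changed since baseline.
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in baseline.keys() & current.keys()
                           if baseline[p] != current[p]),
    }
```

Real HIDS tools additionally protect the baseline itself (signing it, storing it off-host), since an attacker who can rewrite files can usually rewrite the hash database too.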
A sufficient cryptographic hash function yields random bits with uniform probability. DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the RNG. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of, e.g., only two rounds of SHA-256?
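The uniformity part of (1) and (2) is easy to check empirically: over many SHA-256 digests, roughly half of all output bits should be set, regardless of the input distribution. A quick sketch (the function name and sample inputs are illustrative; note this says nothing about question (3), where Argon2 is typically chosen for being deliberately slow and memory-hard, which plain SHA-256 rounds are not):

```python
import hashlib

def set_bit_fraction(n_samples: int = 1000) -> float:
    # Hash a series of trivially structured inputs and count what
    # fraction of all digest bits are 1; a good hash should give ~0.5
    # even though the inputs themselves are highly non-random.
    ones = 0
    total_bits = n_samples * 256
    for i in range(n_samples):
        digest = hashlib.sha256(i.to_bytes(8, "big")).digest()
        ones += sum(bin(b).count("1") for b in digest)
    return ones / total_bits
```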
https://github.com/google/osv.dev/blob/master/README.md#usin... :
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ...

That works with package metadata, not with a (file hash, package) database -- which could be generated from OSV and the actual package files instead of their manifests of already-calculated checksums.
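The package-metadata query shape the OSV API accepts (`POST https://api.osv.dev/v1/query`) looks like the payload below; this sketch only builds the JSON and sends nothing, and the helper name is illustrative:

```python
import json

def osv_query(name: str, ecosystem: str, version: str) -> str:
    # Build an OSV /v1/query request body keyed on package metadata
    # (name + ecosystem + version), not on file hashes.
    payload = {
        "version": version,
        "package": {"name": name, "ecosystem": ecosystem},
    }
    return json.dumps(payload)
```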
Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.
Add'l useful formats:
> Currently it is able to scan various lockfiles, Debian docker containers, SPDX and CycloneDX SBOMs, and git repositories
Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage
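The core of content-addressable storage is that a blob's address *is* the hash of its bytes, so identical content always maps to the same key. A minimal in-memory sketch using bare SHA-256 (magnet URIs and IPFS use their own infohash/multihash encodings rather than plain hex digests):

```python
import hashlib

class ContentStore:
    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself, so storing the
        # same bytes twice is idempotent and deduplicates for free.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```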
Solvency | 2 years ago
[deleted]
michaelmior | 2 years ago
What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)
jdiff | 2 years ago
TomNomNom | 2 years ago
If you wanted to, for example, use this tool to route different files to different format-specific handlers, it would sometimes send files to the wrong handlers.
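The routing pattern described above can hedge against misidentification by falling back to a safe handler whenever the detector's confidence is low, so a wrongly labelled file is quarantined rather than mishandled. A sketch under assumed names (the handlers, fallback, and threshold are all hypothetical, not part of Magika's API):

```python
def route(filename: str, detected_type: str, confidence: float,
          handlers: dict, fallback, threshold: float = 0.9):
    # Dispatch on the detected type, but divert to the fallback when
    # the detector is unsure or has no registered handler for the type.
    if confidence < threshold or detected_type not in handlers:
        return fallback(filename)
    return handlers[detected_type](filename)
```

Even with a confidence gate, a *confident* misidentification (like the HTML-as-VBA cases above) still routes to the wrong handler, which is why a signature check as a cross-validation step is attractive for automation.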