This is cool, but it's only the first part of extracting an ML model for use. The second part is reverse engineering the tokenizer and the input transformations that are needed before passing data to the model, and then turning the model's output into a human-readable format.
It would be interesting if someone could detail the approach for decoding the pre- and post-processing steps around the model, and how to find the correct input encoding.
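For a keyboard model like Gboard's, the pre-processing is often little more than a character-to-id lookup plus padding. Purely as a hypothetical sketch (the vocabulary, pad id, and sequence length below are invented for illustration, not taken from any real app), reverse engineering it means reproducing something of this shape:

```python
# Hypothetical sketch of the kind of pre-processing a small on-device
# text model might use: map characters to integer ids and pad to a
# fixed length. Vocabulary, pad id, and MAX_LEN are invented here.

PAD_ID = 0
UNK_ID = 1
# ids 2..27 for a-z
VOCAB = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
MAX_LEN = 16

def encode(text: str) -> list[int]:
    """Turn raw text into the fixed-length id sequence the model expects."""
    ids = [VOCAB.get(ch, UNK_ID) for ch in text.lower()[:MAX_LEN]]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

def decode(ids: list[int]) -> str:
    """Invert the mapping for the model's output ids."""
    inv = {i: ch for ch, i in VOCAB.items()}
    return "".join(inv.get(i, "?") for i in ids if i != PAD_ID)
```

One way to confirm a guessed format like this is to hook the app's own encoding function with Frida and diff its output against yours; Netron at least tells you the tensor shapes and dtypes you need to hit.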
This is a good comment, but only in the sense that it documents that a model file doesn't run the model by itself.
An analogous situation is seeing a blog that purports to "show you code", where the code returns an object, and commenting "This is cool, but doesn't show you how to turn a function return value into a human readable format". More noise than signal.
The techniques in the article are trivially understood to also apply to discovering the input tokenization format, and Netron shows you the types of inputs and outputs.
One thing I noticed in Gboard is that it uses homomorphic encryption to do federated learning of words commonly used among the public, to make encrypted suggestions.
E.g. there are two common spellings of bizarre which are popular on Gboard: bizzare and bizarre.
Author here, no clue about homomorphic (or whatever) encryption; what could certainly be done is some sort of encryption of the model into the inference engine.
So e.g.: Apple CoreML issues a Public Key, the model is encrypted with that Public Key, and somewhere in a trusted computing environment the model is decrypted using a private key, and then inferred.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way.
When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process, so I guess they won't add some sort of encryption technique to each and every cycle.
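The "encrypt the model, decrypt only inside the inference engine" idea above can be sketched in a few lines. This is a toy, stdlib-only illustration: a SHA-256-derived XOR keystream stands in for the real public-key cryptography and trusted environment described, and it is not secure — it only shows the shape of the scheme.

```python
# Toy illustration of "ship an encrypted model, decrypt it only inside
# the inference engine". A hash-derived XOR keystream stands in for
# real asymmetric crypto; do NOT use this for anything real.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Expand a key into n pseudorandom bytes via SHA-256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

model_bytes = b"\x00\x01\x02fake-tflite-weights"
key = b"device-bound-secret"          # would live in a TEE / keystore
blob = xor_crypt(model_bytes, key)    # what would ship inside the APK
assert blob != model_bytes
assert xor_crypt(blob, key) == model_bytes  # engine decrypts before inference
```

As the comment says, this only relocates the problem: on a rooted device you hook the decrypt routine and dump the plaintext buffer anyway.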
A lot of comments here seem to think that there's no novelty. I disagree. As a new ML engineer I am not very familiar with any reverse engineering techniques, and this is a good starting point. It's about ML yet simple enough to follow, and my 17-year-old cousin, who is keen to get into cyber security, would love this article. Maybe it's too advanced for him!
My general writing style is directed mainly towards my non-technical colleagues, whom I wish to inspire to learn about computers.
This is no novelty, by far; it is a pretty standard use-case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".
Especially in mobile apps, many devs get sloppy on their mobile APIs because you can't just open the developer tools.
I’m a huge fan of ML on device. It’s a big improvement in privacy for the user. That said, there’s always a chance for the user to extract your model, so on-device models will need to be fairly generic.
(and a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")
frida is an amazing tool - it has empowered me to do things that would have otherwise taken weeks or even months. This video is a little old, but the creator is also cracked https://www.youtube.com/watch?v=CLpW1tZCblo
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack imo. If you are going to take "secrets" and shove it into a mobile app, they can't really be considered secret. I suppose it's a tradeoff - if you want to do this kind of thing client-side - the secret sauce isn't so secret.
To be honest, that was my first thought on reading that headline as well. Given that especially the large companies (and who knows how the smaller ones got their training data) received a huge amount of backlash for their unprecedented collection of data all over the web, and not just there but everywhere else, it's kind of ironic to talk about intellectual property.
If you use one of those AI models as a basis for your own AI model, the real danger could be that the owners of the originating data come after you at some point as well.
Standard disclaimer. Like inserting a bunch of 'hypothetically' in a comment telling one where to find some piece of abandoned media where using an unsanctioned channel would entail infringing upon someone's intellectual property.
I understand that it's not very clear whether a neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it; I think lawmakers would start to sue AWS if they just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice as others said, I live in Germany and our law regarding "hacking" is quite ancient.
The simple fact that models are released under license, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
1. the current, unproven-in-court legal understanding,
2. standard disclaimer to cover OP's ass
3. tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data
If I understand the position of major players in this field, downloading models in bulk and training a ML model on that corpus shouldn't violate anybody's IP.
IANAL, but this is not true; the model would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. And models have licenses of their own; for example, LLaMA is released under this license [1]
Can you launder an AI model by feeding it to some other model or training process? After all, that is how it was originally created, so it cannot be any less legal...
There is a family of techniques, often called something like “distillation”. There are also various synthetic training data strategies; it’s a very active area of research.
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
For app developers considering tflite, a safer way would be to host the models on Firebase and delete them when their job is done. It comes with other features like versioning for model updates, A/B tests, lower APK size, etc.
https://firebase.google.com/docs/ml/manage-hosted-models
That wouldn't help against the technique explained in the article, would it? Since the model makes its way onto the device, it can be intercepted in a similar fashion.
I'm not quite sure I understand the firebase feature btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
and therefore everyone has the necessary rights to read works, the necessary rights to critique of the works including for commercial purposes, and the necessary rights to derivative works including for commercial purposes
You’re applying a double standard to LLM’s and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
... Because if he did this with a model that's not open that's sure going to keep everyone happy and not result in lawsuit(s)...
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
“ Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner.”
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
I'm not even sure if even the first part is true. Has it been determined that AI models are intellectual property? Machine-generated content may not be copyrightable, and it isn't just the output of generative AI that falls under this; the models themselves do.
Can you copyright a set of coefficients for a formula? In the case of a JPEG, it would be considered that the image being reproduced is the thing that has the copyright. Being the first to run the calculations that produce a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
No, copyright violation occurs at the first unauthorized copying or creation of a derivative work or exercise of any of the other exclusive rights of the copyright holder (that does not fall into an exception like that for fair use.) That distribution is required for a copyright violation is a persistent myth. Distribution is a means by which a violation becomes more likely to be detected and also more likely to involve significant liability for damages.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
It's also worth noting that there is still no legal clarity on these issues, even if a license claims to provide specific permissions.
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
I doubt the models are copyrighted; aren't works created by a machine ineligible? Otherwise you get into cases like autogenerating all possible musical note combinations and claiming ownership of them.
It's hard to say, because as far as I know this stuff hasn't been definitively tested in any courts that I know of. Europe, not America.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
If the weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract.
refulgentis|1 year ago
Thanks for the article OP, really fascinating.
nthingtohide|1 year ago
Can something similar help in model encryption?
JTyQZSnP3cQGa8B|1 year ago
Is it ironic or missing a /s? I can't really tell here.
zitterbewegung|1 year ago
[1] https://github.com/meta-llama/llama/blob/main/LICENSE
asciii|1 year ago
The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?
biosboiii|1 year ago
But the Objective-C code is actually compiled, and decompilation is a lot harder than with the JVM languages on Android.
My next article will be about CoreML on iOS, doing the same exact thing :)
bangaladore|1 year ago
Basically it's just a synthetic loop of using a previously developed, then-SOTA model like GPT-4 to train your model.
This can produce models with seemingly similar performance at a smaller size, but to some extent, fewer bits will be less good.
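The soft-target idea behind distillation can be shown numerically. This is a minimal pure-Python sketch (in practice you would use a training framework, and the teacher distribution would come from the large model's logits; the numbers here are made up): the student is trained to minimize the divergence from the teacher's temperature-softened output distribution.

```python
# Minimal sketch of knowledge distillation's soft-target loss: the
# student matches the teacher's softened output distribution rather
# than hard labels. Example logits are invented for illustration.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0, 0.2]   # from the large "teacher" model
student_logits = [3.5, 1.2, 0.3]   # from the small "student" model
T = 2.0                            # temperature > 1 softens both distributions

loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
# A trainer would minimize this loss over many prompts/samples.
```

The temperature is the key trick: softening the distributions exposes the teacher's relative preferences among wrong answers, which carries more signal than a one-hot label.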
smitop|1 year ago
The file is only 7.7kb, so it couldn't contain many weights anyways.
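Back-of-the-envelope, assuming uncompressed float32 weights (which may not hold — a quantized int8 .tflite could fit roughly 4x more):

```python
# Rough upper bound on how many parameters a 7.7 kB file can hold,
# assuming 4-byte float32 weights.
file_bytes = 7.7 * 1000
bytes_per_float32 = 4
max_weights = file_bytes / bytes_per_float32
print(int(max_weights))  # under two thousand parameters at most
```

So even in the best case this file holds on the order of a couple thousand parameters — tiny by model standards.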
Workaccount2|1 year ago
And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".
1vuio0pswjnm7|1 year ago
See, e.g., https://news.ycombinator.com/item?id=42617889
jonpo|1 year ago
Paper: https://arxiv.org/pdf/2204.03738
Code: https://github.com/microsoft/banknote-net Training data: https://raw.githubusercontent.com/microsoft/banknote-net/ref...
model: https://github.com/microsoft/banknote-net/blob/main/models/b...
Kinda easier to download it straight from GitHub.
It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
But let's not let that get in the way of hating on AI, shall we?
dang|1 year ago
Can you please edit this kind of thing out of your HN comments? (This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)
It leads to a downward spiral, as one can see in the progression to https://news.ycombinator.com/item?id=42604422 and https://news.ycombinator.com/item?id=42604728. That's what we're trying to avoid here.
Your post is informative and would be just fine without the last sentence (well, plus the snarky first two words).
biosboiii|1 year ago
I was rather interested in the process of instrumenting TF to make this "attack" scalable to other apps.
jdietrich|1 year ago
https://www.law.cornell.edu/uscode/text/17/1201
23B1|1 year ago
Laundering IP. FTFY.
1vuio0pswjnm7|1 year ago
https://www.arxiv.org/pdf/2407.13493