This is cool, but it's only the first part of extracting an ML model for use. The second part is reverse engineering the tokenizer and the input transformations that are needed before passing data to the model, and then turning the model's output into a human-readable format.
It would be interesting if someone could detail the approach for decoding the pre- and post-processing steps around the model, and how to find the correct input encoding.
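For a keyboard model like Gboard's, the pre-processing is often little more than a character-to-id lookup plus padding. Purely as a hypothetical sketch (the vocabulary, pad id, and sequence length below are invented for illustration, not taken from any real app), reverse engineering it means reproducing something of this shape:

```python
# Hypothetical sketch of the kind of pre-processing a small on-device
# text model might use: map characters to integer ids and pad to a
# fixed length. Vocabulary, pad id, and MAX_LEN are invented here.

PAD_ID = 0
UNK_ID = 1
# ids 2..27 for a-z
VOCAB = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
MAX_LEN = 16

def encode(text: str) -> list[int]:
    """Turn raw text into the fixed-length id sequence the model expects."""
    ids = [VOCAB.get(ch, UNK_ID) for ch in text.lower()[:MAX_LEN]]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

def decode(ids: list[int]) -> str:
    """Invert the mapping for the model's output ids."""
    inv = {i: ch for ch, i in VOCAB.items()}
    return "".join(inv.get(i, "?") for i in ids if i != PAD_ID)
```

One way to confirm a guessed format like this is to hook the app's own encoding function with Frida and diff its output against yours; Netron at least tells you the tensor shapes and dtypes you need to hit.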
This is a good comment, but only in the sense that it documents that a model file doesn't run the model by itself.
An analogous situation is seeing a blog that purports to "show you code", where the code returns an object, and commenting "This is cool, but doesn't show you how to turn a function return value into a human readable format". More noise than signal.
The techniques in the article are trivially understood to also apply to discovering the input tokenization format, and Netron shows you the types of inputs and outputs.
One thing I noticed in Gboard is that it uses homomorphic encryption to do federated learning of words commonly used among the public, to make encrypted suggestions.
E.g. there are two common spellings of bizarre which are popular on Gboard: bizzare and bizarre.
Author here, no clue about homomorphic (or whatever) encryption; what could certainly be done is some sort of encryption of the model into the inference engine.
So e.g.: Apple CoreML issues a Public Key, the model is encrypted with that Public Key, and somewhere in a trusted computing environment the model is decrypted using a private key, and then inferred.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way.
When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process, so I guess they won't add some sort of encryption technique to each and every cycle.
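The "encrypt the model, decrypt only inside the inference engine" idea above can be sketched in a few lines. This is a toy, stdlib-only illustration: a SHA-256-derived XOR keystream stands in for the real public-key cryptography and trusted environment described, and it is not secure — it only shows the shape of the scheme.

```python
# Toy illustration of "ship an encrypted model, decrypt it only inside
# the inference engine". A hash-derived XOR keystream stands in for
# real asymmetric crypto; do NOT use this for anything real.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Expand a key into n pseudorandom bytes via SHA-256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

model_bytes = b"\x00\x01\x02fake-tflite-weights"
key = b"device-bound-secret"          # would live in a TEE / keystore
blob = xor_crypt(model_bytes, key)    # what would ship inside the APK
assert blob != model_bytes
assert xor_crypt(blob, key) == model_bytes  # engine decrypts before inference
```

As the comment says, this only relocates the problem: on a rooted device you hook the decrypt routine and dump the plaintext buffer anyway.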
A lot of comments here seem to think that there's no novelty. I disagree. As a new ML engineer I am not very familiar with any reverse engineering techniques, and this is a good starting point. It's about ML yet simple enough to follow, and my 17-year-old cousin, who is keen to get into cyber security, would love this article. Maybe it's too advanced for him!
My general writing style is directed mainly towards my non-technical colleagues, whom I wish to inspire to learn about computers.
This is no novelty, by far; it is a pretty standard use-case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".
Especially in mobile apps, many devs get sloppy on their mobile APIs because you can't just open the developer tools.
I’m a huge fan of ML on device. It’s a big improvement in privacy for the user. That said, there’s always a chance for the user to extract your model, so on-device models will need to be fairly generic.
(and a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")
frida is an amazing tool - it has empowered me to do things that would have otherwise taken weeks or even months. This video is a little old, but the creator is also cracked https://www.youtube.com/watch?v=CLpW1tZCblo
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack imo. If you are going to take "secrets" and shove it into a mobile app, they can't really be considered secret. I suppose it's a tradeoff - if you want to do this kind of thing client-side - the secret sauce isn't so secret.
To be honest, that was my first thought on reading that headline as well. Given that especially the large companies (and who knows how the smaller ones got their training data) received a huge amount of backlash for their unprecedented collection of data all over the web, and not just there but everywhere else, it's kind of ironic to talk about intellectual property.
If you use one of those AI models as a basis for your own AI model, the real danger could be that the owners of the originating data come after you at some point as well.
Standard disclaimer. Like inserting a bunch of 'hypothetically' in a comment telling one where to find some piece of abandoned media where using an unsanctioned channel would entail infringing upon someone's intellectual property.
I understand that it's not very clear whether a neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it; I think lawmakers would start to sue AWS if they just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice as others said, I live in Germany and our law regarding "hacking" is quite ancient.
The simple fact that models are released under license, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
1. the current, unproven-in-court legal understanding,
2. standard disclaimer to cover OP's ass
3. tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data
If I understand the position of major players in this field, downloading models in bulk and training a ML model on that corpus shouldn't violate anybody's IP.
IANAL, but this is not true; the model would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. And models have licenses of their own; for example, LLaMA is released under this license [1]
Can you launder an AI model by feeding it to some other model or training process? After all, that is how it was originally created, so it cannot be any less legal...
There is a family of techniques, often called something like “distillation”. There are also various synthetic training data strategies; it’s a very active area of research.
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
For app developers considering tflite, a safer way would be to host the models on Firebase and delete them when their job is done. It comes with other features like versioning for model updates, A/B tests, lower APK size, etc.
https://firebase.google.com/docs/ml/manage-hosted-models
That wouldn't help against the technique explained in the article, would it? Since the model makes its way onto the device, it can be intercepted in a similar fashion.
I'm not quite sure I understand the firebase feature btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
and therefore everyone has the necessary rights to read works, the necessary rights to critique of the works including for commercial purposes, and the necessary rights to derivative works including for commercial purposes
You’re applying a double standard to LLM’s and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
... Because if he did this with a model that's not open that's sure going to keep everyone happy and not result in lawsuit(s)...
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
“ Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner.”
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
I'm not even sure if even the first part is true. Has it been determined that AI models are intellectual property? Machine-generated content may not be copyrightable, and it isn't just the output of generative AI that falls under this; the models themselves do.
Can you copyright a set of coefficients for a formula? In the case of a JPEG, it would be considered that the image being reproduced is the thing that has the copyright. Being the first to run the calculations that produce a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
No, copyright violation occurs at the first unauthorized copying or creation of a derivative work or exercise of any of the other exclusive rights of the copyright holder (that does not fall into an exception like that for fair use.) That distribution is required for a copyright violation is a persistent myth. Distribution is a means by which a violation becomes more likely to be detected and also more likely to involve significant liability for damages.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
It's also worth noting that there is still no legal clarity on these issues, even if a license claims to provide specific permissions.
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
I doubt the models are copyrighted; aren't works created by a machine ineligible? Otherwise you get into cases like autogenerating all possible musical note combinations and claiming ownership of them.
It's hard to say, because as far as I know this stuff hasn't been definitively tested in any courts that I know of. Europe, not America.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
If the weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract.
refulgentis|1 year ago
Thanks for the article OP, really fascinating.
nthingtohide|1 year ago
Can something similar help in model encryption?
JTyQZSnP3cQGa8B|1 year ago
Is it ironic or missing a /s? I can't really tell here.
zitterbewegung|1 year ago
[1] https://github.com/meta-llama/llama/blob/main/LICENSE
asciii|1 year ago
The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?
biosboiii|1 year ago
But the Objective-C code is actually compiled, and decompilation is a lot harder than with the JVM languages on Android.
My next article will be about CoreML on iOS, doing the same exact thing :)
bangaladore|1 year ago
Basically it's just a synthetic loop of using a previously developed, then-SOTA model like GPT-4 to train your model.
This can produce models with seemingly similar performance at a smaller size, but to some extent, fewer bits will be less good.
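The soft-target idea behind distillation can be shown numerically. This is a minimal pure-Python sketch (in practice you would use a training framework, and the teacher distribution would come from the large model's logits; the numbers here are made up): the student is trained to minimize the divergence from the teacher's temperature-softened output distribution.

```python
# Minimal sketch of knowledge distillation's soft-target loss: the
# student matches the teacher's softened output distribution rather
# than hard labels. Example logits are invented for illustration.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0, 0.2]   # from the large "teacher" model
student_logits = [3.5, 1.2, 0.3]   # from the small "student" model
T = 2.0                            # temperature > 1 softens both distributions

loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
# A trainer would minimize this loss over many prompts/samples.
```

The temperature is the key trick: softening the distributions exposes the teacher's relative preferences among wrong answers, which carries more signal than a one-hot label.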
smitop|1 year ago
The file is only 7.7kb, so it couldn't contain many weights anyways.
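Back-of-the-envelope, assuming uncompressed float32 weights (which may not hold — a quantized int8 .tflite could fit roughly 4x more):

```python
# Rough upper bound on how many parameters a 7.7 kB file can hold,
# assuming 4-byte float32 weights.
file_bytes = 7.7 * 1000
bytes_per_float32 = 4
max_weights = file_bytes / bytes_per_float32
print(int(max_weights))  # under two thousand parameters at most
```

So even in the best case this file holds on the order of a couple thousand parameters — tiny by model standards.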
Workaccount2|1 year ago
And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".
1vuio0pswjnm7|1 year ago
See, e.g., https://news.ycombinator.com/item?id=42617889
jonpo|1 year ago
Paper: https://arxiv.org/pdf/2204.03738
Code: https://github.com/microsoft/banknote-net Training data: https://raw.githubusercontent.com/microsoft/banknote-net/ref...
model: https://github.com/microsoft/banknote-net/blob/main/models/b...
Kinda easier to download it straight from GitHub.
It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
But let's not let that get in the way of hating on AI, shall we?
dang|1 year ago
Can you please edit this kind of thing out of your HN comments? (This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)
It leads to a downward spiral, as one can see in the progression to https://news.ycombinator.com/item?id=42604422 and https://news.ycombinator.com/item?id=42604728. That's what we're trying to avoid here.
Your post is informative and would be just fine without the last sentence (well, plus the snarky first two words).
biosboiii|1 year ago
I was rather interested in the process of instrumenting TF to make this "attack" scalable to other apps.
jdietrich|1 year ago
https://www.law.cornell.edu/uscode/text/17/1201
23B1|1 year ago
Laundering IP. FTFY.
1vuio0pswjnm7|1 year ago
https://www.arxiv.org/pdf/2407.13493