So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outliers out so you don't get extreme values in any one weight.
Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates.
Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
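For intuition, here's a tiny NumPy sketch (my own toy setup, not the paper's algorithm) showing how a random rotation evens out per-coordinate magnitudes and lowers 1-bit quantization error:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a Gaussian matrix gives a uniformly random orthogonal matrix
    # (after fixing the signs using the diagonal of R).
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def one_bit_error(v):
    # 1-bit quantization: keep only the signs, rescale to preserve the norm.
    q = np.sign(v) * np.linalg.norm(v, axis=1, keepdims=True) / np.sqrt(v.shape[1])
    return np.linalg.norm(v - q, axis=1).mean()

d = 128
# Vectors whose coordinates have wildly different scales ("outlier" dims).
x = rng.normal(size=(1000, d)) * rng.exponential(size=d)

R = random_rotation(d)
print(one_bit_error(x))      # error with outlier-heavy coordinates
print(one_bit_error(x @ R))  # typically noticeably lower after rotation
```

After the rotation every coordinate is a mixture of all the original ones, so no single dimension dominates the quantization range.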
> But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
Funny enough, if you visualize a vector-embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use — and you assume a really low quantization, say, 1-bit — then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions to reduce the storage costs down to storing a raster bitmap.
And that problem has a name: dithering!
Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.
And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!
Interestingly, FAISS does exactly that before doing Product Quantization, and it works very well (errors are much lower compared to no rotation). They call it "OPQ" (Optimized Product Quantization). At training time, they iterate to find a good candidate rotation and save the best one.
Perhaps not entirely coincidentally, FAISS is also maintained by FB.
I find the geometrical intuition of rotating a vector in high dimensional space to minimize its largest values (vector basis projections) beautiful.
I'm no expert and I'm sure this has been tried by many people already, but would it be possible to reduce the computational effort by instead using an SVD: spreading out the singular values, then reapplying the original singular values and recomposing the matrix from the quantized versions of the SVD factors?
Tangentially related to the idea of "apply a random rotation matrix" is one where you apply a random matrix to a set of points to preserve distances between them but transform them into a lower dimensional space. This is captured by the JL Lemma [1].
Actually, “apply a random matrix” is often the solution to a high-dimensional problem involving near neighbours.
The Johnson-Lindenstrauss lemma asserts that multiplying by a random matrix (some conditions apply, but IIRC rotation matrices satisfy them) approximately preserves, in many senses, the distances between points even when the dimension drops very significantly (some conditions apply, but they're usually satisfied by real-world data).
This is, in fact, the theoretical underpinning of compressed sensing.
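A quick NumPy demo of the lemma in action; the dimensions and tolerances here are arbitrary choices, not anything from the lemma's formal statement:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 10_000, 1_000   # 500 points, projected from 10k dims down to 1k

x = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaling keeps distances unbiased
y = x @ P

# Compare pairwise distances before and after projection.
i = rng.integers(0, n, 200)
j = rng.integers(0, n, 200)
mask = i != j                  # skip accidental self-pairs
orig = np.linalg.norm(x[i[mask]] - x[j[mask]], axis=1)
proj = np.linalg.norm(y[i[mask]] - y[j[mask]], axis=1)
ratio = proj / orig
print(ratio.min(), ratio.max())   # both close to 1.0
```

The distortion shrinks like 1/sqrt(k), so even a 10x dimension reduction barely moves the pairwise distances.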
It's pretty interesting that the new SpinQuant method did not manage to beat good old NF4 QLoRA training (Tim Dettmers really cooked with that one).
Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.
Aside from the weirdness of calling something released 17 months ago "good old" :-D I mean, deep learning is evolving at a crazy pace, but you still can't assume a good paper gets written in days.
That said, as others have pointed out, and as it's also written on the blog post, they are entirely different methods. QLoRA requires access to the full training data, while theoretically you can apply SpinQuant to any given model. For example, they also apply it to Mistral, not only to their LLaMA.
(QLoRA also takes some time and compute to apply, but since SpinQuant also implies learning some weights, I don't know if it's actually faster/cheaper, too)
I mean, it's no free lunch, you still need to expend significantly more compute for the QLoRA training compared to any usual PTQ method, be it SpinQuant or any other more conventional quantization approaches.
May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/
3B models are perfectly capable, I've had great luck with Phi 3.5.
> For example, they seem to not care about instructions to only write a response and no explanation
You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.
You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)
Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth, the less you are stretching the model's abilities.
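As a sketch of the "parse out the part you want" option, something like this goes a long way (the fence/brace heuristics are just my guesses at common failure modes):

```python
import json
import re

def extract_json(text):
    """Pull the first JSON object/array out of a chatty model response."""
    candidates = []
    # Try a fenced code block first...
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        candidates.append(fence.group(1))
    # ...then fall back to the widest {...} / [...] span.
    brace = re.search(r"[\[{].*[\]}]", text, re.DOTALL)
    if brace:
        candidates.append(brace.group(0))
    for c in candidates:
        try:
            return json.loads(c)
        except json.JSONDecodeError:
            continue
    return None

resp = 'Sure! Here is your structured output:\n```json\n{"ads": [[12.5, 44.0]]}\n```'
print(extract_json(resp))  # {'ads': [[12.5, 44.0]]}
```

It's cruder than grammar-constrained sampling, but it survives the "here's your structured output" preambles small models love to add.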
For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.
For me, it was almost random whether I would get a little spiel at the beginning of my response - even on the unquantized 8B instruct. Since ollama doesn't support grammars, I was trying to get it to work where I had one prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out structured JSON output. It was much better than trying to do it in one prompt, but still far too random, even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.
Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response, and it's a lot faster asking a 3B model than an 8B model. I could set up a test harness and replay the responses... but this was a lot simpler.
I've tried using 3B outside of production.
Asked it to play the character needed, use about 30 words, and respond in German. Instructions were consistently ignored; sometimes sentences devolved into gibberish, or English was mixed in halfway through.
Don't even want to know how lobotomized 1B is.
> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline
I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.
Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.
You can't expect a 1B model to perform as well as a 7B model or ChatGPT; probably the best use cases are speculative decoding or fine-tuning it for a specific task.
Just tried asking Llama 3.2 3B to write a YAML file with a Kubernetes Deployment definition. It spat the YAML out, but along with a ton of explanations. However, when I followed up with the prompt below, it did what I wanted it to do.
>>> Remove the explanation parts and only leave yaml in place from above response.
```yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
...
```
Alternatively this worked as well
>>> Write a YAML file with kubernetes deployment object in it. Response should only contain the yaml file, no explanations.
```yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-container
        image: nginx:latest
        ports:
        - containerPort: 80
```
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and the Arm kernels in this blog. If you have any questions about quantization or performance more generally, feel free to ask!
What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared with:
These quantized models show much less degradation compared to a "vanilla post-training-quantization" but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?
Looking at how to deploy 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to have an emulated shell and then install as if it's Linux, using ollama for example. However, this forces you into a manual installation process, and also most of the people don't know what Termux is, and would be afraid to install it from F-Droid.
Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?
I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.
That's because it's both easily calculable and, at the same time, somewhat impossible to state in any meaningful sense.
Most weights are released as fp16/bf16, so 2 bytes per weight. So just double the number of parameters to get the number of gigabytes of VRAM: Llama 3.1 8B ~= 16GB of weights in fp16. At 4-bit quantization it's half a byte per weight, so Llama 3.1 8B ~= 4GB of weights.
But this is just weights. The real issue is context and output length: how much data are you feeding in? This is where VRAM can explode, and it's entirely use-case dependent. So for a 128k context model, the range of VRAM usage is huge.
The reality is, if you're not able to quickly estimate the above, you're probably not running local models anyway.
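That back-of-envelope math can be wrapped in a helper; the layer/head defaults below are the published Llama 3.1 8B figures as far as I know, so treat them as assumptions:

```python
# Rough VRAM estimate for batch size 1: weights + KV cache.
# Defaults assume Llama 3.1 8B's architecture (32 layers, 8 KV heads
# via GQA, head_dim 128) -- adjust for other models.
def est_vram_gb(params_b, bits_per_weight, ctx_len=0,
                n_layers=32, n_kv_heads=8, head_dim=128, kv_bits=16):
    weights = params_b * 1e9 * bits_per_weight / 8                      # bytes
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8  # K and V
    return (weights + kv) / 1e9

print(est_vram_gb(8, 16))                            # 16.0 -> fp16, weights only
print(round(est_vram_gb(8, 4, ctx_len=128_000), 1))  # ~20.8 -> 4-bit weights,
                                                     # but a huge 128k KV cache
```

Note how at full context the KV cache dwarfs the quantized weights, which is the "VRAM can explode" point above.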
Oh cool! I’ve been playing with quantized llama 3B for the last week. (4-bit spinquant). The code for spinquant has been public for a bit.
It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool use once you get the chat template right.
But it struggles with JSON and HTML syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agentic uses.
My plan was to let llama communicate with more advanced AI’s, using natural language to offload tool use to them, but very quickly llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
The last table shows memory usage and performance on an Android phone.
> Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage reduced by 41% on average. The benchmarks can be reproducible today via ExecuTorch Llama instructions. The table above shows results using an Android OnePlus 12 device—however, we’ve also verified similar relative performance on Samsung S24+ for 1B and 3B and Samsung S22 for 1B.
For me it is nothing short of a bad experience. It is way over-engineered with poor quality and just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.
ExecuTorch is a runtime for mobile and embedded devices that runs PyTorch models directly. Currently it runs pretty fast on CPU, but we're expanding to mobile accelerators and GPUs.
We're still in our early stages (just turned beta status). But try it out and let us know.
Regarding Llama Stack, it is built by my colleagues. What concrete issues have you experienced? If you have error/bug reports, I'll be happy to pass them along.
Does anyone know why the most common method to speed up inference time is quantization? I keep hearing about all sorts of new methods but nearly none of them is implemented in practice (except for flash attention).
In addition to the other answers in this thread, there's a practical one: sometimes (ok, often) you want to run a model on a card that doesn't have enough VRAM for it. Quantisation is a way to squeeze it down so it fits. For instance I've got a 4090 that won't fit the original Llama3 70b at 16 bits per param, but it will give me usable token rates at 2 bits.
It's particularly useful in memory bound workflows like batch size = 1 LLM inference where you're bottlenecked by how quickly you can send weights to your GPU. This is why at least in torchao we strongly recommend people try out int4 quantization.
At larger batch sizes you become compute bound so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8
Because the way LLMs work is more-or-less "for every token, read the entire matrix from memory and do math on it". Math is fast, so if you manage to use only half the bits to store each item in the matrix, you only have to do half as much work. Of course, sometimes those least-significant-bits were relied-upon in the original training.
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization.
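That gives a handy back-of-envelope decode ceiling: tokens/sec can't exceed memory bandwidth divided by model size. A sketch, where the ~1 TB/s bandwidth figure is a rough assumption for a high-end GPU:

```python
# Rough decode-speed ceiling for batch size 1: every generated token
# has to stream the entire set of weights from memory once.
def max_tokens_per_sec(model_gbytes, mem_bandwidth_gbps):
    return mem_bandwidth_gbps / model_gbytes

print(max_tokens_per_sec(16, 1000))  # 8B model in fp16 at ~1 TB/s: 62.5 tok/s
print(max_tokens_per_sec(4, 1000))   # same model at 4-bit: 250.0 tok/s
```

Halving the bits per weight roughly doubles the ceiling, which is why quantization is the go-to speedup for local, batch-size-1 inference.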
MLC Chat is a great iPhone app for running models (it's on Android too) and currently ships with Llama 3.2 3B Instruct - not the version Meta released today; it's a quantized version of their previous release.
I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.
This was just recently open sourced and is pretty nice. Only issue I've had is very minor UI stuff (on Android, sounds like it runs better on iOS from skimming comments)
I'm on Android, however my somewhat elaborate solution was to install Ollama on my home laptop computer and then ssh in when I want to query a model. I figured that'd be better for my phone battery. Since my home computer is behind NAT I run yggdrasil on everything so I can access my AI on the go.
These undergo additional fine tuning (QLoRA) using some or all of the original dataset, so they're able to get the weights to align to the nf4 dtype better, which increases the accuracy.
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
I wouldn't be so haughty, nor so presumptive that your understanding of things is how they actually are: this doesn't have practical applications.
No one serious is going to build on some horror of a Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run 3B models with llama.cpp on iOS for almost a year now, whereas here, they're just starting to be able to. (Allocating 6 GB is a quick way to get autokilled on iOS... 2.5 GB? Doable.)
It looks like SpinQuant is effectively Q8. In widespread blind testing over months, we empirically found Q5 is assuredly indistinguishable from the base model.
(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from ppl who don't have the same luck as to work on it fulltime. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or, you're stuck being talked down to about things you've seen, even if they don't match a first-pass based on theory) https://news.ycombinator.com/item?id=41939841
I don’t get the comment. For one I’m excited for developments in the field. Not afraid it will “replace me” as technology has replaced me multiple times over. I’m looking towards working with these models more and more.
What kind of fundamental discussion are you hoping to see under an article about an iterative improvement to a known model?
"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.
I mean, this outcome of LLMs is expected, and LLM drops come too frequently, definitely too frequently for Meta to wait for an annual conference with a ton of hype. Furthermore, these releases are just prerequisites for the real fun: a massive lemming rush of altering these models, which happens in other communities.
[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
https://faiss.ai/cpp_api/struct/structfaiss_1_1OPQMatrix.htm...
[1] - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
tl;dr: round((2^R)*x) is not a great idea for an R-bit quantization.
[1] https://arxiv.org/abs/2410.13780
Definitely nice to see them not cherry-pick results - it makes the claims more believable when the new method isn't the best along all axes.
I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.
And Claude did everything perfectly ;)
Works as expected if you provide a few system prompts with context.
https://huggingface.co/docs/hub/en/gguf#quantization-types
It might even mean a non-GGUF quantization scheme; I'm just an intermediate user of local models, not an expert user or developer.
In vanilla PyTorch I have the following expression:
If `inds` is int8, I get "IndexError: tensors used as indices must be long, int, byte or bool tensors". Is this still true if I use torchao?
[1] https://ollama.com/library/llama3.2/tags
https://github.com/TugdualKerjan/bunny/tree/main
https://github.com/a-ghorbani/pocketpal-ai
You should customise your sampler to mandate JSON grammar after ```json tokens.
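For example, a llama.cpp-style GBNF grammar along these lines (the field names are made up for illustration) constrains sampling so only tokens that keep the output inside the grammar can be produced:

```
root   ::= "{" ws "\"ads\"" ws ":" ws array ws "}"
array  ::= "[" ws (pair ("," ws pair)*)? ws "]"
pair   ::= "[" ws number "," ws number ws "]"
number ::= [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n]*
```

With a grammar like this active, the model physically cannot emit a "here's your output" preamble.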
That, and average inference times on common hardware, is what I'm curious about.
Is ExecuTorch any better?
> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B
No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!
https://apps.apple.com/us/app/mlc-chat/id6448482937
They were already pretty small but I guess the smaller the better as long as accuracy doesn't suffer too much.