item 36215651

GGML – AI at the Edge

899 points | georgehill | 2 years ago | ggml.ai

236 comments

[+] samwillis|2 years ago|reply
ggml and llama.cpp are such a good platform for local LLMs, and having some financial backing to support development is brilliant. We should be concentrating as much as possible on doing local inference (and training) based on private data.

I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!

[+] SparkyMcUnicorn|2 years ago|reply
Maybe I'm wrong, but I don't think you want it fine-tuned on your data.

Pretty sure you might be looking for this: https://github.com/SamurAIGPT/privateGPT

Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.

[+] rvz|2 years ago|reply
> ggml and llama.cpp are such a good platform for local LLMs, having some financial backing to support development is brilliant

The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

> I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!

I think you are setting yourself up for disappointment in the future.

[+] ignoramous|2 years ago|reply
Can LLaMA be used for commercial purposes though (might limit external contributors)? I believe FOSS alternatives like Databricks Dolly / Together RedPajama / EleutherAI GPT-NeoX (et al.) are where the most progress is likely to be.
[+] shostack|2 years ago|reply
I've been trying to figure out what I might need to do in order to turn my Obsidian vault into a dataset to fine-tune against. I'd invest a lot more into it now if I thought it would be a key to an AI learning about me the way it does in the movie Her.
[+] behnamoh|2 years ago|reply
I wonder if ClosedAI and other companies use the findings of the open source community in their products. For example, do they use QLoRA to reduce the costs of training and inference? Do they quantize their models to serve non-subscribing consumers?
[+] yukIttEft|2 years ago|reply
Its graph execution is still full of busyloops, e.g.:

https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...

I wonder how much more efficient it would be if the Taskflow library were used instead, or even Intel TBB.

[+] moffkalast|2 years ago|reply
Someone ought to be along with a PR eventually.
[+] mhh__|2 years ago|reply
It's not a very good library IMO.
[+] boywitharupee|2 years ago|reply
is graph execution used for training only or inference also?
[+] make3|2 years ago|reply
does tbb work with apple Silicon?
[+] graycat|2 years ago|reply
WOW! They are using BFGS! Haven't heard of that in decades! Had to think a little: Yup, the full name is Broyden–Fletcher–Goldfarb–Shanno for iterative unconstrained non-linear optimization!

Some of the earlier descriptions of the optimization used in AI learning were about steepest descent: just find the gradient of the function you are trying to minimize and move some distance in that direction. Using the gradient alone is concerning, since that method tends to zig-zag; after, say, 100 iterations, the total distance moved can be several times the distance from the starting point to the final one. You can visualize the zig-zag in just two dimensions: picture following a curving river down a valley it cut over a million years or so, a valley with steep sides. Gradient descent may keep crossing the river, going maybe 10 feet for each foot of progress downstream!

Right, if just trying to go downhill on a tilted flat plane, then the gradient will point in the steepest descent on the plane and gradient descent will go all way downhill in just one iteration.

In even moderately challenging problems, BFGS can be a big improvement.

[+] TechBro8615|2 years ago|reply
I believe ggml is the basis of llama.cpp (the OP says it's "used by llama.cpp")? I don't know much about either, but when I read the llama.cpp code to see how it was created so quickly, I got the sense that the original project was ggml, given the amount of pasted code I saw. It seemed like quite an impressive library.
[+] kgwgk|2 years ago|reply
https://news.ycombinator.com/item?id=33877893

“OpenAI recently released a model for automatic speech recognition called Whisper. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this I implemented a minimalistic tensor library in C and ported the high-level architecture of the model in C++.”

That “minimalistic tensor library” was ggml.

[+] make3|2 years ago|reply
it's the library used for tensor operations inside of llama.cpp, yes
[+] noman-land|2 years ago|reply
Georgi, if you're reading this: I've had a lot of fun with whisper.cpp and llama.cpp because of you, so thank you very much.
[+] hanselot|2 years ago|reply
I envy his drive and ambition. I can't force myself to finish writing a simple alarm clock app for Android, never mind paving the literal road to the future of open source AI.

Would someone else have taken his place had he not been around? Maybe, but I'm insanely happy that he is around.

The amount of hours I've sunk into LLM's is crazy, and it's mostly thanks to his work that I can both run and download models in meaningful timeframes.

And yes, I have tested llama.cpp on my android and it works 100% on termux. (Your biggest enemy here will be Android process reaper when you hit the memory cap)

[+] iamflimflam1|2 years ago|reply
I've always thought of the edge as IoT-type stuff, i.e. running on embedded devices. But maybe that's not the case?
[+] Y_Y|2 years ago|reply
Like any new term, (mis)usage broadens the meaning over time until either it's widely known, it's unfashionable, or, most likely, it becomes so broad as to be meaningless and hence achieves buzzword apotheosis.

My old job title had "edge" in it, and I still don't know what it's supposed to mean, although "not cloud" is a good approximation.

[+] timerol|2 years ago|reply
"Edge computing" is a pretty vague term, and can encompass anything from a 8MHz ARM core that can barely talk compliant BLE, all the way to a multi-thousand dollar setup on something like a self-checkout machine, which may have more compute available than your average laptop. In that range are home assistants, which normally have some basic ML for wake word detection, and then send the next bit of audio to the cloud with a more advanced model for full speech-to-text (and response)
[+] anentropic|2 years ago|reply
Here it means more 'on your own device' rather than 'in the cloud'

You could consider that the real edge, whereas edge computing often means 'at the edge of the cloud' i.e. local CDN node

[+] java_beyb|2 years ago|reply
edge brings compute close to where data is generated, cloud brings data to compute.

even processing something in a web browser is called edge. i guess due to this impression the industry is moving towards "on-device"

[+] KronisLV|2 years ago|reply
Just today I finished a blog post (also my latest submission; felt like it could be useful to some) about how to get something like this working as a bundle of a model runner plus a web UI for easier interaction. In my case that was koboldcpp, which can run GGML models both on the CPU (with OpenBLAS) and on the GPU (with CLBlast). Thanks to Hugging Face, getting Metharme, WizardLM, or other models is also extremely easy, and the 4-bit quantized ones provide decent performance even on commodity hardware!

I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41 instance (8 AMD cores, 16 GB of RAM, no GPU); the latter costs about 25 EUR per month and can still generate decent responses in less than half a minute, while my local machine needs approximately double that time. While not quite as good as one might expect (decent response times mean maxing out the CPU for a single request, if you don't have a compatible GPU with enough VRAM), the technology is definitely at a point where, with some supervision, it can make people's lives easier in select use cases (e.g. customer support).

What an interesting time to be alive, I wonder where we'll be in a decade.

[+] digitallyfree|2 years ago|reply
The fact that this is commodity hardware makes ggml extremely impressive and puts the tech in the hands of everyone. I recently reported my experience running 7B llama.cpp on a 15-year-old Core 2 Quad [1]; when that machine came out it was a completely different world, and I certainly never imagined what AI would look like today. That was around when the first iPhone was released and everyone began talking about how smartphones would become the next big thing. We saw what happened 15 years later...

Today, with the new k-quants, users are reporting that 30B models work with 2-bit quantization on machines with 16GB of RAM or VRAM [2]. That puts access within reach of millions of consumers, and the optimizations will only improve from there.

[1] https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...

[2] https://github.com/ggerganov/llama.cpp/pull/1684, https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...

[+] b33j0r|2 years ago|reply
I wish everyone in tech had your perspective. That is what I see, as well.

There is a lull right now between gpt4 and gpt5 (literally and metaphorically). Consumer models are plateauing around 40B for a barely-reasonable RTX 3090 (ggml made this possible).

Now is the time to launch your ideas, all!

[+] _20p0|2 years ago|reply
This guy is damned good. I sponsored him on Github because his software is dope. I also like how when some controversy erupted on the project he just ejected the controversial people and moved on. Good stewardship. Great code.

I recall that when he first ported it and it worked on my M1 Max, he hadn't even tested it on Apple Silicon yet, since he didn't have the hardware.

Honestly, with this and whisper, I am a huge fan. Good luck to him and the new company.

[+] aryamaan|2 years ago|reply
Could someone talk, at a high level, about how one starts contributing to these kinds of problems?

For people who build solutions for data handling, ranging from CRUD to highly scalable systems, these things are alien concepts. (Or maybe that's just me.)

[+] world2vec|2 years ago|reply
Might be a silly question but is GGML a similar/competing library to George Hotz's tinygrad [0]?

[0] https://github.com/geohot/tinygrad

[+] xiphias2|2 years ago|reply
They are competing (although they are very different: tinygrad is full-stack Python, while ggml focuses on a few very important models), but in my opinion George Hotz lost focus a bit by not working more on getting the low-level optimizations perfect.
[+] qeternity|2 years ago|reply
No, GGML is a CPU-optimized tensor library and quantized weight format that is closely linked to his other project, llama.cpp.
[+] sva_|2 years ago|reply
Really impressive work, and I've asked this before, but is it really a good thing to have basically the whole library in a single 16k-line file?
[+] regularfry|2 years ago|reply
It makes syncing between llama.cpp, whisper.cpp, and ggml itself quite straightforward.

I think the lesson here is that this setup has enabled some very high-speed project evolution, or at least not gotten in its way. If that is surprising and you were expecting downsides: a) why, and b) where did they go?

[+] baobabKoodaa|2 years ago|reply
I guess the "clean code" crowd would like to refactor this into hundreds of files that all call each other in an incomprehensible maze, plus pulling in 20GB of dependencies from the internet during install. Because that is the way™.
[+] ankitg12|2 years ago|reply
Quite impressive, being able to run an LLM on my local Mac:

    % ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "Let's talk about Machine Learning now"
    main: seed = 1686112244
    gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
    gpt2_model_load: n_vocab = 50257
    gpt2_model_load: n_ctx   = 1024
    gpt2_model_load: n_embd  = 768
    gpt2_model_load: n_head  = 12
    gpt2_model_load: n_layer = 12
    gpt2_model_load: ftype   = 1
    gpt2_model_load: qntvr   = 0
    gpt2_model_load: ggml tensor size = 224 bytes
    gpt2_model_load: ggml ctx size = 384.77 MB
    gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
    gpt2_model_load: model size  =   239.08 MB
    extract_tests_from_file : No test file found.
    test_gpt_tokenizer : 0 tests failed out of 0 tests.
    main: prompt: 'Let's talk about Machine Learning now'
    main: number of tokens in prompt = 7, first 8 tokens: 5756 338 1561 546 10850 18252 783

    Let's talk about Machine Learning now.

    The first step is to get a good understanding of what machine learning is. This is where things get messy. What do you think is the most difficult aspect of machine learning?

    Machine learning is the process of transforming data into an understanding of its contents and its operations. For example, in the following diagram, you can see that we use a machine learning approach to model an object.

    The object is a piece of a puzzle with many different components and some of the problems it solves will be difficult to solve for humans.

    What do you think of machine learning as?

    Machine learning is one of the most important, because it can help us understand how our data are structured. You can understand the structure of the data as the object is represented in its representation.

    What about data structures? How do you find out where a data structure or a structure is located in your data?

    In a lot of fields, you can think of structures as

    main: mem per token =  2008284 bytes
    main:     load time =   366.33 ms
    main:   sample time =    39.59 ms
    main:  predict time =  3448.31 ms / 16.74 ms per token
    main:    total time =  3894.15 ms
[+] boringuser2|2 years ago|reply
Looking at the source of this kind of underlines the difference between machine learning scientist types and actual computer scientists.
[+] danieljanes|2 years ago|reply
Does GGML support training on the edge? We're especially interested in training support for Android+iOS
[+] svantana|2 years ago|reply
Yes - look at the file tests/test-opt.c. Unfortunately there's almost no documentation about its training/autodiff.
[+] statusfailed|2 years ago|reply
What kind of applications do you see for training on mobile devices? Is anyone using this in industry?
[+] edfletcher_t137|2 years ago|reply
This is a bang-up idea, you absolutely love to see capital investment on this type of open, commodity-hardware-focused foundational technology. Rock on GGMLers & thank you!
[+] doxeddaily|2 years ago|reply
This scratches my itch for no dependencies.
[+] huevosabio|2 years ago|reply
Very exciting!

Now, we just need a post that benchmarks the different options (ggml, tvm, AItemplate, hippoml) and helps deciding which route to take.

[+] mliker|2 years ago|reply
Congrats! I was just listening to your Changelog interview from a few months ago, in which you said you were going to move on from this after brushing up the code a bit, but it seems the momentum is too great. Glad to see you carrying these amazing projects forward!
[+] pawelduda|2 years ago|reply
I happen to have an RPi 4B running Home Assistant. Is this something I could set up on it and integrate with HA to control things with speech, or is it overkill?