Are these libraries for connecting to an ollama service that the user has already installed or do they work without the user installing anything? Sorry for not checking the code but maybe someone has the same question here.
On the subject of installing Ollama, I found it to be a frustrating and user-hostile experience. I instead recommend the much more user-friendly LLM[0] by Simon Willison.
I used the Ollama Docker image to integrate with Gait Analyzer[1], a self-hosted gait analysis tool; all I had to do was set up the Docker Compose file.
How does this fine-tuning work? I can see that you are loading a train.jsonl file and then some instructions, but is the output model generated, or is this some kind of new way of training the models?
An off-topic question: is there such a thing as a "small-ish language model"? A model that you could simply give instructions / "capabilities", which a user can then interact with. Almost like a Siri level of intelligence.
There are "smaller" models, for example TinyLlama 1.1B (tiny seems like an exaggeration). Phi-2 is 2.7B parameters. I can't name a 500M parameter model but there is probably one.
Not really. You can use small models for tasks like text classification etc. (traditional NLP), and those run on pretty much anything. We're talking about BERT-like models, like DistilBERT for example.
Not directly related to what Ollama aims to achieve. But, I’ll ask nevertheless.
There are two main ways to "add documents to LLMs": using documents in retrieval-augmented generation (RAG), and training/fine-tuning models. I believe you can use RAG with Ollama; however, Ollama doesn't do the training of models.
Used ollama as part of a bash pipeline for a tiny throwaway app.
I made something pretty similar over winter break so I could have something read books to me. ... Then it turned into a prompting mechanism of course! It uses Whisper, Ollama, and TTS from CoquiAI. It's written in shell and should hopefully be "Posix-compliant", but it does use zenity from Ubuntu; not sure how widely used zenity is.
Noob question, and probably asked in the wrong place.
Is there any way to find out the minimum system requirements for running `ollama run` commands with different models?
On my 32G M2 Pro Mac, I can run up to about 30B models using 4 bit quantization. It is fast unless I am generating a lot of text. If I ask a 30B model to generate 5 pages of text it can take over 1 minute. Running smaller models like Mistral 7B is very fast.
They have a high level summary of ram requirements for the parameter size of each model and how much storage each model uses on their GitHub: https://github.com/ollama/ollama#model-library
I posted about my awesome experiences using Ollama a few months ago: https://news.ycombinator.com/item?id=37662915. Ollama is definitely the easiest way to run LLMs locally, and that means it’s the best building block for applications that need to use inference. It’s like how Docker made it so any application can execute something kinda portably kinda safely on any machine. With Ollama, any application can run LLM inference on any machine.
Also one feature request - if the library (or another related library) could also transparently spin up a local Ollama instance if the user doesn’t have one already. “Transparent-on-demand-Ollama” or something.
So cool! I have been using Ollama for weeks now and I just love it! Easiest way to run local LLMs. We are actually embedding them into our product right now and are super excited about it!
I used this half a year ago, love the UX but it was not possible to accelerate the workloads using an AMD GPU. How's the support for AMD GPUs under Ollama today?
What I hate about ollama is that it makes server configuration a PITA. ollama relies on llama.cpp to run GGUF models but while llama.cpp can keep the model in memory using `mlock` (helpful to reduce inference times), ollama simply won't let you do that:
I love Ollama's simplicity for downloading and consuming different models with its REST API. I've never used it in a "production" environment; does anyone know how Ollama performs there? Or is it better to move to something like vLLM for that?
API wise, it looks very similar to the OpenAI python SDK but not quite the same. I was hoping I could swap out one client for another. Can anyone confirm they’re intentionally using an incompatible interface?
Same question here. Ollama is fantastic as it makes it very easy to run models locally, but if you already have a lot of code that processes OpenAI API responses (with retry, streaming, async, caching etc.), it would be nice to be able to simply switch the API client to Ollama, without needing a whole other branch of code that handles Ollama API responses. One way to do an easy switch is using the litellm library as a go-between, but it’s not ideal.
I love ollama. The engine underneath is llama.cpp, and they have the first version of self-extend about to be merged into main, so with any luck it will be available in ollama soon too!
Is anyone using this as an api behind a multi user web application? Or does it need to be fed off of a message queue or something to basically keep it single threaded?
ollama feels like llama.cpp with extra undesired complexities. It feels like the former project is desperately trying to differentiate and monetize while the latter is where all the things that matter happen.
This looks really nice, but it’s good to point out that this project can use the Ollama HTTP API or any other API and does not run models itself. So it is not a replacement for Ollama, but rather for the Ollama npm package. Perhaps that was obvious because the post is about that, but I briefly thought this could run models too.
The Rust+Wasm stack provides a strong alternative to Python in AI inference.
I wish JS libraries would stop using default exports. They are not ergonomic as soon as you want to export one more thing in your package, which includes types, so all but the most trivial package requires multiple exports.
rgbrgb|2 years ago
I looked at using ollama when I started making FreeChat [0] but couldn't figure out a way to make it work without asking the user to install it first (think I asked in your discord at the time). I wanted FreeChat to be 1-click install from the mac app store so I ended up bundling the llama.cpp server instead which it runs on localhost for inference. At some point I'd love to swap it out for ollama and take advantage of all the cool model pulling stuff you guys have done, I just need it to be embeddable.
My ideal setup would be importing an ollama package in swift which would start the server if the user doesn't already have it running. I know this is just js and python to start but a dev can dream :)
Either way, congrats on the release!
[0]: https://github.com/psugihara/FreeChat
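The "start the server if the user doesn't already have it running" idea can be roughly sketched as below. This is only an illustration, not how any Ollama library works: it assumes the `ollama` binary is already installed and on PATH (so it does not solve the embedding problem), that the server listens on its default port 11434, and the helper names `is_listening` and `ensure_server` are made up for this example.

```python
import shutil
import socket
import subprocess
import time
from typing import Optional

OLLAMA_HOST = "127.0.0.1"
OLLAMA_PORT = 11434  # Ollama's default port; adjust if your setup differs


def is_listening(host: str = OLLAMA_HOST, port: int = OLLAMA_PORT,
                 timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_server(wait_seconds: float = 10.0) -> Optional[subprocess.Popen]:
    """Start `ollama serve` if nothing is listening yet (hypothetical helper).

    Returns the child process if one was started, or None if a server
    was already running.
    """
    if is_listening():
        return None
    binary = shutil.which("ollama")
    if binary is None:
        raise FileNotFoundError("ollama binary not found on PATH")
    proc = subprocess.Popen([binary, "serve"])
    # Poll until the port opens or we give up.
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if is_listening():
            return proc
        time.sleep(0.2)
    proc.terminate()
    raise TimeoutError("ollama serve did not start listening in time")
```

A caller would run `ensure_server()` once at startup and terminate the returned process (if any) on exit.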
SnowLprd|2 years ago
The problems with Ollama include:
* Ollama silently adds a login item with no way to opt out: <https://github.com/jmorganca/ollama/issues/162>
* Ollama spawns at least four processes, some persistently in the background: 1 x Ollama application, 1 x `ollama` server component, 2 x Ollama Helper
* Ollama provides no information at install time about what directories will be created or where models will be downloaded.
* Ollama prompts users to install the `ollama` CLI tool, with admin access required, with no way to cancel, and with no way to even quit the application at that point. Ollama provides no clarity about what is actually happening during this step: all it is doing is symlinking `/Applications/Ollama.app/Contents/Resources/ollama` to `/usr/local/bin/`
The worst part is that not only is none of this explained at install time, but the project README doesn’t tell you any of this information either. Potential users deserve to know what will happen on first launch, but when a PR arrived to at least provide that clarification in the README, Ollama maintainers summarily closed that PR and still have not rectified the aforementioned UX problems.
As an open source maintainer myself, I understand and appreciate that Ollama developers volunteer their time and energy into the project, and they can run it as they see fit. So I intend no disrespect. But these problems, and a seeming unwillingness to prioritize their resolution, caused me to delete Ollama from my system entirely.
As I said above, I think LLM[0] by Simon Willison is an excellent and user-friendly alternative.
[0]: https://llm.datasette.io/
icyfox|2 years ago
Abishek_Muthian|2 years ago
I was able to get the setup done with a single script for the end user and I used langchain to interact with Ollama.
[1] https://github.com/abishekmuthian/gaitanalyzer
ivanfioravanti|2 years ago
I created a gist with a quick and dirty way of generating a dataset for fine-tuning a Mistral model using the instruction format on a given topic: https://gist.github.com/ivanfioravanti/bcacc48ef68b02e9b7a40...
jumperabg|2 years ago
eurekin|2 years ago
tinyhouse|2 years ago
pknerd|2 years ago
LoganDark|2 years ago
filleokus|2 years ago
Imagine you have an API-endpoint where you can set the level of some lights and you give the chat a system prompt explaining how to build the JSON body of the request, and the user can prompt it with stuff like "Turn off all the lights" or "Make it bright in the bedroom" etc.
How low could the memory consumption of such a model be? We don't need to store who the first kaiser of Germany was, "just" enough to kinda map human speech onto available API's.
andy99|2 years ago
The problem is they are all still broadly trained and so they end up being Jack of all trades master of none. You'd have to fine tune them if you want them good at some narrow task and other than code completion I don't know that anyone has done that.
If you want to generate json or other structured output, there is Outlines https://github.com/outlines-dev/outlines that constrains the output to match a regex so it guarantees e.g. the model will generate a valid API call, although it could still be nonsense if the model doesn't understand, it will just match the regex. There are other similar tools around. I believe llama.cpp also has something built in that will constrain the output to some grammar.
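Outlines and llama.cpp grammars constrain the output while it is being generated. A simpler, different technique is to validate the reply after the fact and retry on failure. The sketch below does only that for the lights example; the request shape (`room`, `level`), the room list and the function name are all hypothetical.

```python
import json

# Hypothetical rooms the lights API knows about.
ALLOWED_ROOMS = {"bedroom", "kitchen", "living_room", "all"}


def parse_light_command(reply: str) -> dict:
    """Check that a model reply looks like {"room": <str>, "level": <0..100>}.

    Raises ValueError if the reply is not valid JSON or has the wrong shape.
    """
    data = json.loads(reply)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"room", "level"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    room, level = data["room"], data["level"]
    if not isinstance(room, str) or room not in ALLOWED_ROOMS:
        raise ValueError(f"unknown room: {room!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 100:
        raise ValueError(f"level must be an integer from 0 to 100: {level!r}")
    return {"room": room, "level": level}
```

As with regex or grammar constraints, this only confirms the shape of the call. A well-formed `{"room": "kitchen", "level": 0}` can still be the wrong answer to "make it bright in the bedroom".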
spaniard89277|2 years ago
Now, models that have "reasoning" as an emergent property... I haven't seen anything under 3B that's capable of doing anything useful. The smallest I've seen is litellama and while it's not 100% useless, it's really just an experiment.
Also, everything requires new and/or expensive hardware. For a GPU you are really looking at about 1k€ at minimum for something decent for running models. CPU inference is way slower, and forget about anything that has no AVX, and preferably AVX2.
I try models on my old ThinkPad X260 with 8 GB of RAM, which is perfectly capable for developing stuff and for those small task-oriented models I've told you about, but even though I've tried everything under the sun, with quantization etc., it's safe to say you can only run decent LLMs at a decent inference speed on expensive hardware right now.
Now, if you want tasks like language detection, classifying text into categories, or very basic question answering, then go on Hugging Face and try it yourself; you'll be able to run most of those models on modest hardware.
In fact, I have a website (https://github.com/iagovar/cometocoruna/tree/main) where I'm using a small Flask server in my data pipeline to extract event information from text blobs I get from scraping sites. That runs every day on an old Atom laptop with 4 GB of RAM that I use as a server.
Experts in the field say that might change (somewhat) with mamba models, but I can't really say more.
I've been playing with the idea of dumping some money. But I'm 36, unemployed and just got into coding about 1.5 years ago, so until I secure some income I don't want to hit my savings hard. This is not the US, where I can land a job easily (junior looking for a job, just in case someone here needs one).
oblio|2 years ago
3abiton|2 years ago
reacharavindh|2 years ago
Local LLMs are great! But, it would be more useful once we can _easily_ throw our own data for them to use as reference or even as a source of truth. This is where it opens doors that a closed system like OpenAI cannot - I’m never going to upload some data to ChatGPT for them to train on.
Could Ollama make it easier and standardize the way to add documents to local LLMs?
I’m not talking about uploading one image or model and asking a question about it. I’m referring to pointing a repository of 1000 text files and asking LLMs questions based on their contents.
jerpint|2 years ago
I’ve implemented a RAG library if you’re ever interested but they are a dime a dozen now :)
https://www.github.com/jerpint/buster
jampekka|2 years ago
emmanueloga_|2 years ago
reacharavindh|2 years ago
asterix_pano|2 years ago
BeetleB|2 years ago
If you use the API, they do not train on it.
(However, that doesn't mean they don't retain it for a while).
As others have said, RAG is probably the way to go - although I don't know how well RAG performs on local LLMs.
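For anyone unsure what RAG amounts to in practice, here is a deliberately tiny sketch: score each document against the question, keep the best few, and paste them into the prompt that goes to the local model. It uses bag-of-words cosine similarity as a stand-in for the embedding search a real setup would use, and every name in it is made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: dict, k: int = 2) -> list:
    """Return the names of the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda name: cosine(q, Counter(tokenize(documents[name]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: dict, k: int = 2) -> str:
    """Assemble the retrieved documents and the question into one prompt."""
    context = "\n\n".join(
        f"[{name}]\n{documents[name]}" for name in retrieve(question, documents, k)
    )
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
```

With 1000 text files you would load them into the `documents` dict (file name to contents), likely split into smaller chunks first, and send the result of `build_prompt` to whatever model you run locally.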
sciolist|2 years ago
CubsFan1060|2 years ago
It's meant to do exactly what you want. I've had mixed results.
porridgeraisin|2 years ago
It blocks until there is something on the mic, then sends the wav to whisper.cpp, which then sends it to llama which picks out a structured "remind me" object from it, which gets saved to a text file.
awayto|2 years ago
https://github.com/jcmccormick/runtts
killermouse0|2 years ago
nbbaier|2 years ago
deepsquirrelnet|2 years ago
I’ll give this Python library a try. I’ve been wanting to try some fine tuning with LLMs in the loop experiments.
palashkulsh|2 years ago
mike978|2 years ago
mark_l_watson|2 years ago
Install Ollama from https://ollama.ai and experiment with it using the command line interface. I mostly use Ollama’s local API from Common Lisp or Racket - so simple to do.
EDIT: if you only have 8G RAM, try some of the 3B models. I suggest using at least 4 bit quantization.
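A rough way to sanity-check these numbers is parameters times bits per weight, divided by 8 to get bytes. The sketch below is back-of-the-envelope only: it covers the weights alone and ignores the context (KV cache), runtime overhead and whatever else the machine is doing, and the function name is made up.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 4-bit quantization:
#   3B  -> about 1.5 GB
#   7B  -> about 3.5 GB
#   30B -> about 15 GB
for billions in (3, 7, 30):
    print(f"{billions}B @ 4-bit: ~{weight_size_gb(billions * 1e9, 4):.1f} GB")
```

Actual quantized files tend to come out somewhat larger than this estimate, so treat it as a lower bound when deciding what fits in RAM.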
hellsten|2 years ago
You can easily experiment with smaller models, for example, Mistral 7B or Phi-2 on M1/M2/M3 processors. With more memory, you can run larger models, and better memory bandwidth (M2 Ultra vs. M2 base model) means improved performance (tokens/second).
slawr1805|2 years ago
nextlevelwizard|2 years ago
I have not run into a llama that won't run, but if it doesn't fit into my GPU I have to count seconds per token instead of tokens per second.
wazoox|2 years ago
palashkulsh|2 years ago
explorigin|2 years ago
sqs|2 years ago
Since that post, we shipped experimental support in our product for Ollama-based local inference. We had to write our own client in TypeScript but will probably be able to switch to this instead.
keyle|2 years ago
All it took for me to get going is `make` and I basically have it working locally as a console app.
acd10j|2 years ago
sqs|2 years ago
donpdonp|2 years ago
refulgentis|2 years ago
Nitro outstripped them, 3 MB executable with OpenAI HTTP server and persistent model load
joaomdmoura|2 years ago
visarga|2 years ago
nbbaier|2 years ago
Kostic|2 years ago
mchiang|2 years ago
If you do build from source, it should work (Instructions below):
https://github.com/ollama/ollama/blob/main/docs/development....
The reason why it's not in released builds is because we are still testing ROCm.
accelbred|2 years ago
brucethemoose2|2 years ago
You can be a linux/python dev and set up rocm.
Or you can run llama.cpp's very slow OpenCL backend, but with easy setup.
Or you can run MLC's very fast Vulkan backend, but with no model splitting and medium-hard setup.
jquaint|2 years ago
imrehg|2 years ago
There are a bunch of methods that need to be implemented for it to work, but then the usual OpenAI bits can be switched out for anything else, e.g. see the code stub in https://vanna.ai/docs/bigquery-other-llm-vannadb.html
Looking forward to more remixes for other tools too.
hatmanstack|2 years ago
behnamoh|2 years ago
https://github.com/ollama/ollama/issues/1536
Not to mention, they hide all the server configs in favor of their own "sane defaults".
jmorgan|2 years ago
You can enable mlock manually in the /api/generate and /api/chat endpoints by specifying the "use_mlock" option:
{"options": {"use_mlock": true}}
Many other server configurations are also available there: https://github.com/ollama/ollama/blob/main/docs/api.md#reque...
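For illustration, this is roughly what passing those options looks like from Python with only the standard library. It is a sketch under assumptions: the server is on the default `http://localhost:11434`, the model name is whatever you have pulled, and `build_generate_request` / `generate` are made-up helper names, not part of any Ollama library.

```python
import json
import urllib.request

# Default local endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, **options) -> urllib.request.Request:
    """Build a POST request for /api/generate with per-request options."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,     # ask for a single JSON reply instead of a stream
        "options": options,  # e.g. use_mlock=True
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, **options) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, **options)) as resp:
        return json.loads(resp.read())["response"]
```

Usage would be something like `generate("mistral", "Why is the sky blue?", use_mlock=True)`, with the server running and the model already pulled.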
mfalcon|2 years ago
hellsten|2 years ago
Try to, for example, set 'num_gpu' to 99 and 'use_mlock' to true.
jerpint|2 years ago
techn00|2 years ago
visarga|2 years ago
pamelafox|2 years ago
WiSaGaN|2 years ago
[1] https://github.com/ollama/ollama/issues/305
d4rkp4ttern|2 years ago
For an OpenAI compatible API my current favorite method is to spin up models using oobabooga TGW. Your OpenAI API code then works seamlessly by simply switching out the api_base to the ooba endpoint. Regarding chat formatting, even ooba’s Mistral formatting has issues[1] so I am doing my own in Langroid using HuggingFace tokenizer.apply_chat_template [2]
[1] https://github.com/oobabooga/text-generation-webui/issues/53...
[2] https://github.com/langroid/langroid/blob/main/langroid/lang...
Related question - I assume ollama auto detects and applies the right chat formatting template for a model?
lhenault|2 years ago
WhackyIdeas|2 years ago
malux85|2 years ago
brucethemoose2|2 years ago
Also, you really want to wait until flash attention is merged before using mega context with llama.cpp. The 8 bit KV cache would be ideal too.
dchuk|2 years ago
Havoc|2 years ago
mchiang|2 years ago
https://github.com/ollama/ollama/blob/main/docs/import.md
awongh|2 years ago
brucethemoose2|2 years ago
cranberryturkey|2 years ago
lobocinza|2 years ago
sjwhevvvvvsj|2 years ago
bearjaws|2 years ago
It is far more robust, integrates with any LLM local or hosted, supports multi-modal, retries, structure parsing using zod and more.
kvz|2 years ago
nextlevelwizard|2 years ago
Ollama already exposes a REST API that you can query with whatever language (or, you know, just using curl) - why do I want to use Python or JS?
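One small thing a client library takes off your hands is the streamed reply. As far as I understand the API docs, /api/generate by default sends one JSON object per line, each carrying a `response` fragment and a `done` flag. Stitching that together yourself is not much code, as this sketch shows (the function name is made up, and the sample lines are illustrative, not captured output):

```python
import json


def join_stream(lines) -> str:
    """Concatenate the `response` fragments from newline-delimited JSON chunks.

    `lines` can be any iterable of str or bytes lines, e.g. an HTTP
    response object that is being read line by line.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)


sample = [
    b'{"response": "Hello", "done": false}\n',
    b'{"response": ", world", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(join_stream(sample))
```

So plain REST is perfectly workable; the libraries mostly save you this kind of glue plus typing and error handling.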
JrProgrammer|2 years ago
girvo|2 years ago
leansensei|2 years ago
3Sophons|2 years ago
* Lightweight. Total runtime size is 30MB as opposed to 4GB for Python and 350MB for Ollama.
* Fast. Full native speed on GPUs.
* Portable. Single cross-platform binary on different CPUs, GPUs and OSes.
* Secure. Sandboxed and isolated execution on untrusted devices.
* Modern languages for inference apps.
* Container-ready. Supported in Docker, containerd, Podman, and Kubernetes.
* OpenAI compatible. Seamlessly integrate into the OpenAI tooling ecosystem.
Give it a try --- https://www.secondstate.io/articles/wasm-runtime-agi/
anhldbk|2 years ago
For ollama, llama2:7b is 3.8 GB. See: https://ollama.ai/library/llama2/tags. Still I see ollama requires less RAM to run llama 2
fillskills|2 years ago
jdlyga|2 years ago
rezonant|2 years ago
Just use a sensibly named export, you were going to write a "how to use" code snippet for the top of your readme anyway.
Also means that all of the code snippets your users send you will be immediately sensible, even without them having to include their import statements (assuming they don't use "as" renaming, which only makes sense when there's conflicts anyway)