Show HN: I built a free in-browser Llama 3 chatbot powered by WebGPU
547 points | abi | 1 year ago | github.com
I’ve used MLC AI’s WebLLM project for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.
I’ve been using it as a therapist and coach. And it’s wonderful knowing that my personal information never leaves my local computer.
Should work on desktop with Chrome or Edge. Other browsers are adding WebGPU support as well; see the GitHub repo for details on how to get it working in other browsers.
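For readers wondering what the browser-support check looks like in practice, here is a hedged sketch of WebGPU feature detection. The `nav` parameter stands in for the browser's global `navigator` (it is a parameter only so the logic can run outside a browser); `navigator.gpu.requestAdapter()` is the real WebGPU entry point.

```javascript
// Feature-detect WebGPU before attempting to load a model.
// In a real page you would pass the global `navigator`.
async function checkWebGPU(nav) {
  if (!nav || !nav.gpu) {
    return { ok: false, reason: "navigator.gpu missing (no WebGPU support)" };
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: "no suitable GPU adapter" };
  }
  return { ok: true };
}
```

An app like this would run such a check on page load and show an "unsupported browser" message instead of a broken chat box.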
Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection. But on subsequent page loads, the model should be loaded from the IndexedDB cache so it should be much faster.
The project is open source (Apache 2.0) on GitHub. If you like it, I’d love contributions, particularly around making the first load faster.
GitHub: https://github.com/abi/secret-llama Demo: https://secretllama.com
NikhilVerma|1 year ago
I tried this on my M1 and ran Llama 3 (I think it's the quantized 7B version). It ran at around 4-5 tokens per second, which was way faster than I expected in my browser.
abi|1 year ago
dosinga|1 year ago
What does that mean? Instruct: I'm sorry, but I'm not sure what you mean by "wutt". Output: Oh, I'm sorry. I meant "what does that mean?" Instruct: No problem, "wutt" means "that doesn't make any sense". Output: Ah, I see. Thank you for clarifying! Instruct: You're welcome! If you have any more questions, feel free to ask. Instruct: I'm glad I could help! Instruct: I hope you continue to enjoy learning about language arts. Instruct: You're doing a great job! Keep up the good work! Instruct: Have a fantastic day! Instruct: Goodbye! Instruct: Goodbye!
PhilippGille|1 year ago
andai|1 year ago
Phi 3 is great though.
moffkalast|1 year ago
ComputerGuru|1 year ago
_akhe|1 year ago
Would be interesting if there was a web browser that managed the download/install of models, so you could go to a site like this (or any other LLM site/app) and it detects whether or not you have models, similar to detecting whether you have a webcam or mic for a video call. The user can click "Allow" to permit use of the GPU and the running of models in the background.
flawsofar|1 year ago
KennyBlanken|1 year ago
When someone pointed out that Chrome has had this functionality for years and the world has not imploded...and has enabled many open source projects and web-based microcontroller IDEs to provide enormous user convenience...the response was a condescending sneer along the lines of "well we actually care about user privacy."
(If Chrome is such a user privacy dumpsterfire, why not implement WebSerial so that people don't have to run Chrome in order to communicate with and program microcontrollers?)
Given they claimed that people's pacemakers and blood glucose monitors would be tampered with if WebSerial were implemented, I'd be shocked if they allowed such low level access to a GPU...
abi|1 year ago
Cheer2171|1 year ago
low_tech_punk|1 year ago
abi|1 year ago
Just wrappers all the way down
joshstrange|1 year ago
Also, if you click the "New Chat" button while an answer is generating, I think some of the output gets fed back into the model. It causes some weird output [0] but was kind of cool/fun. Here is a video of it as well [1]; I almost think this should be some kind of special mode you can run. I'd be interested to know what causes the bug: is it just the existing output sent back as input, or a subset of it? It might be fun to watch a chat bot just randomly hallucinate, especially on a local model.
[0] https://cs.joshstrange.com/07kPLPPW
[1] https://cs.joshstrange.com/4sxvt1Mc
EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, but I'm not sure why it errors after a while (maybe it runs out of tokens for output? Not sure). It would be cool to have this run until you stop it, automatically changing every 10-30 seconds or so.
brianzelip|1 year ago
abi|1 year ago
Chat history shouldn’t be hard to add with localStorage and IndexedDB.
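A rough sketch of the localStorage half of that idea (the `storage` parameter stands in for `window.localStorage`; the key naming and message shape are illustrative, not the project's actual schema):

```javascript
// Persist and restore chat transcripts via the Web Storage API.
// `storage` is anything with getItem/setItem (e.g. window.localStorage).
function saveChat(storage, id, messages) {
  storage.setItem("chat:" + id, JSON.stringify(messages));
}

function loadChat(storage, id) {
  const raw = storage.getItem("chat:" + id);
  return raw ? JSON.parse(raw) : [];
}
```

IndexedDB would be the better fit for long transcripts (localStorage is synchronous and usually capped around 5 MB), but the shape of the code is the same: serialize messages on send, rehydrate on page load.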
manlobster|1 year ago
doctorpangloss|1 year ago
So what games are in this LLM? Can it do solitaire yet?
swores|1 year ago
For anyone not old enough to remember, here's an example on YouTube (and a faster loading time than I remember often being the case!): https://youtube.com/watch?v=ra0EG9lbP7Y
zozbot234|1 year ago
threatofrain|1 year ago
KeplerBoy|1 year ago
Different webapps can't share common dependencies stored in localStorage, afaik.
knowaveragejoe|1 year ago
abi|1 year ago
Approx sizes are listed in the GitHub README.
Models are stored in IndexedDB and will be managed by the browser. They might get evicted.
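The eviction caveat can be inspected from a page via the Storage API. A hedged sketch, with `nav` standing in for the browser's `navigator` (which exposes `storage.persisted()` and `storage.estimate()` in modern browsers):

```javascript
// Check whether cached data (e.g. model weights in IndexedDB) is
// protected from eviction, and how much of the quota is in use.
async function storageStatus(nav) {
  const persisted = await nav.storage.persisted();
  const { usage, quota } = await nav.storage.estimate();
  return { persisted, usage, quota };
}
```

A page can also call `navigator.storage.persist()` to request eviction protection, though the browser is free to refuse.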
sp332|1 year ago
manlobster|1 year ago
[0] https://webllm.mlc.ai/
BoorishBears|1 year ago
wg0|1 year ago
Side note: impressive project. The future of AI is mostly offline, with maybe a few APIs in the cloud.
intended|1 year ago
If I have the presence of mind, I offload the work here. At the same time I have a strong understanding of how coaching works, as does my brain.
I suspect that with all things LLM, some amount of proficiency is needed to truly get the prompts to work.
The simplest option is to ask it to be a coach for you. This is going to be hit and miss.
The better version is to specify the kind of coaching you want, or provide a rough outline of the issues on your mind and then ask for what kind of coach or therapist would make sense.
I use either of these, for example: 1) over-designed: https://chat.openai.com/g/g-KD6jm0l4c-thought-council 2) base version: https://chat.openai.com/g/g-Cdq3drl87-two-guides
Sadly OpenAI doesn’t let you share active chats anymore, so it’s going to need a Plus subscription.
pseudosavant|1 year ago
cal85|1 year ago
kushie|1 year ago
nojvek|1 year ago
This is the future. I predict Apple will make progress on Groq-like chipsets built into their newer devices for hyper-fast inference.
LLMs leave a lot to be desired, but since they are trained on all publicly available human knowledge, they know something about everything.
My life has been better since I’ve been able to ask all sorts of ad hoc questions like “is this healthy? Why healthy?” and it gives me pointers on where to look.
zitterbewegung|1 year ago
Even llama.cpp is performant already on macOS.
mcculley|1 year ago
It will be a huge step forward for humanity when we can run algorithms across all human knowledge. We are far from that.
maxboone|1 year ago
andrewfromx|1 year ago
abi|1 year ago
mentos|1 year ago
This has sparked a curiosity in me to play with more LLMs locally, thank you!
bastawhiz|1 year ago
navigate8310|1 year ago
moffkalast|1 year ago
New models just keep rolling in day after day on r/locallama, tunes for this or that, new prompt formats, new quantization types, people doing all kinds of tests and analyses, new arxiv papers on some breakthrough and llama.cpp implementing it 3 days later. Every few weeks a new base model drops from somebody. So many things to try that nobody has tried before. It's genuinely like crack.
NayamAmarshe|1 year ago
raylad|1 year ago
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML...
Also on Mistral 7B, again after a supposedly full download:
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16...
Maybe memory? But if so, it would be good to say so. I'm on a 32GB system, btw.
abi|1 year ago
r0fl|1 year ago
MayeulC|1 year ago
On Linux, I had to go to chrome://flags/#skia-graphite and chrome://flags/#enable-vulkan and chrome://flags/#enable-unsafe-webgpu
I think only one of the first two is actually required, but I enabled both. That allowed me to make use of TinyLlama with my AMD GPU (R9 Fury, OSS drivers), but I think I'd need Chromium Canary to enable "shader-f16" and use the other models, as I was not able to make it work on regular Chromium.
I haven't tried with Firefox.
pjmlp|1 year ago
Lex-2008|1 year ago
> Should work on Desktop with Chrome or Edge.
littlestymaar|1 year ago
spmurrayzzz|1 year ago
Was there something specifically about the install that didn't feel right? I ask because Ollama is just a thin Go wrapper around llama.cpp (it's actually starting a modified version of the llama.cpp server in the background, not even going through the Go FFI, likely for perf reasons). In that sense, you could just install the CUDA toolkit via your package manager and run `make LLAMA_CUDA=1; ./server` from the llama.cpp repo root to get effectively the same thing in two simple steps with no extra overhead.
notarealllama|1 year ago
It really is too bad WebGPU isn't supported on Linux, I mean, that's a no-brainer right there.
Jedd|1 year ago
I'm wary of any 'web' prefix on what could / should otherwise be desktop applications, mostly due to doubts about browser security.
geor9e|1 year ago
abi|1 year ago
hpeter|1 year ago
I'd prefer it if webapps supported Ollama, or gave an option to either use that or store a model in the browser.
Or at least make it an extension.
simple10|1 year ago
Do you think it would be feasible to extend it to support web browsing?
I'd like to help if you could give some pointers on how to extend it.
When asked about web browsing, the bot said it could fetch web pages, but then it obviously didn't work when asked to summarize a web page.
[EDIT] The Llama 3 model was able to summarize web pages!
simple10|1 year ago
indit|1 year ago
Its_Padar|1 year ago
koolala|1 year ago
eyegor|1 year ago
Apparently nightly is supposed to support it: https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Exp...
1f60c|1 year ago
Snoozus|1 year ago
lukan|1 year ago
So maybe the author forgot to include the high-performance request, or your OS does not give you the high-performance GPU by default (as it might be in eco mode). This behavior can be changed in OS settings.
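The "high-performance request" here is WebGPU's `powerPreference` hint to `requestAdapter`. A minimal sketch, with `nav` standing in for the browser's `navigator`:

```javascript
// Ask WebGPU for the discrete GPU rather than the integrated one.
// powerPreference is only a hint; the browser/OS may still override it.
function requestFastAdapter(nav) {
  return nav.gpu.requestAdapter({ powerPreference: "high-performance" });
}
```

Omitting the hint leaves the choice entirely to the browser, which on a dual-GPU laptop in eco mode often means the integrated GPU.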
lastdong|1 year ago
1f60c|1 year ago
spacebanana7|1 year ago
https://github.com/abi/secret-llama?tab=readme-ov-file#syste...
pjmlp|1 year ago
gitinit|1 year ago
zerop|1 year ago
adontz|1 year ago
valine|1 year ago
Dowwie|1 year ago
Y_Y|1 year ago
ngshiheng|1 year ago
On a side note, I've been trying to do something similar, for similar reasons (privacy).
Based on my recent experience, I find that running an LLM directly in the browser with decent UX (e.g. sub-1-2-second response times, no lag, no crashes) is still somewhat impossible given the current state of things. Plus, I think relying on users' own GPU hardware for UX improvement via WebGPU is not exactly practical at large scale (but it is still something!), since not everyone has access to GPU hardware.
But yeah, if there's anything to look forward to in this space, I personally hope to see improved feasibility of running LLMs in browsers.
Jackson_Fleck|1 year ago
Jackson_Fleck|1 year ago
Bradd3rs|1 year ago
BargirPezza|1 year ago
It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting" "What a Time to Be Alive" - referencing the 1960s rock and roll era"
I write: "The song names should only be two to three words. Can you read what we have written before?"
Llama answers me: "Sure! Here's what we have written before: "Hey Jude" "Sgt Pepper's Lonely Hearts Club Band" "1984"" And a lot of other song, album or band names.
I tell it that it is not true and ask Llama to write me the first prompt I wrote.
Llama answers: "You asked me to help you with a math problem."
Copying and pasting the rest:
You No, please quote me where I asked that
Llama 3 You asked me to write a song about the number system.
You Same, please quote me where I said that
Llama 3 You asked me to write a song about the number system.
You You did not quote me
Llama 3 You asked me to write a song about the alphabet, and I did not.
You Okay.. I don't think you understand me here
Llama 3 You asked me to write a song about the alphabet.
You Can you read what we have written?
Llama 3 You asked me to help you with a math problem.
You Yes or no?
Llama 3 I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.
You Alright, thanks for being honest. I will give you a break, sorry if it have been too much for you!
Llama 3 You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?