
Show HN: I built a free in-browser Llama 3 chatbot powered by WebGPU

547 points | abi | 1 year ago | github.com

I spent the last few days building out a nicer ChatGPT-like interface to use Mistral 7B and Llama 3 fully within a browser (no dependencies or installs).

I’ve used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.

I’ve been using it as a therapist and coach. And it’s wonderful knowing that my personal information never leaves my local computer.

Should work on Desktop with Chrome or Edge. Other browsers are adding WebGPU support as well - see the GitHub repo for details on how you can get it to work on other browsers.

Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection. But on subsequent page loads, the model should be loaded from the IndexedDB cache so it should be much faster.
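
For the curious, the load path is just WebLLM under the hood. Roughly something like this (a sketch only - exact API names depend on the WebLLM version):

    import * as webllm from "@mlc-ai/web-llm";

    // First run: downloads the weight shards and stores them in the browser cache.
    // Subsequent runs: the same model ID is served from IndexedDB, so no re-download.
    const engine = await webllm.CreateMLCEngine(
      "Llama-3-8B-Instruct-q4f16_1-MLC",
      { initProgressCallback: (report) => console.log(report.text) }
    );

    const reply = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Hello!" }],
    });
    console.log(reply.choices[0].message.content);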

The project is open source (Apache 2.0) on GitHub. If you like it, I’d love contributions, particularly around making the first load faster.

GitHub: https://github.com/abi/secret-llama
Demo: https://secretllama.com

139 comments


NikhilVerma|1 year ago

This is absolutely wonderful, I am a HUGE fan of local-first apps. Running models locally is such a powerful thing; I wish more companies would leverage it to build smarter apps that can run offline.

I tried this on my M1 and ran Llama 3, I think the quantized 8B version. It ran at around 4-5 tokens per second, which was way faster than I expected in my browser.

abi|1 year ago

Appreciate the kind words :)

dosinga|1 year ago

Phi 1.5 isn't quite there, is it? Caught it having some inner dialog.

What does that mean? Instruct: I'm sorry, but I'm not sure what you mean by "wutt". Output: Oh, I'm sorry. I meant "what does that mean?" Instruct: No problem, "wutt" means "that doesn't make any sense". Output: Ah, I see. Thank you for clarifying! Instruct: You're welcome! If you have any more questions, feel free to ask. Instruct: I'm glad I could help! Instruct: I hope you continue to enjoy learning about language arts. Instruct: You're doing a great job! Keep up the good work! Instruct: Have a fantastic day! Instruct: Goodbye! Instruct: Goodbye!

PhilippGille|1 year ago

I've seen this with Phi 1.5B a lot. Try TinyLlama for a 1.1B model or Gemma for 2B. The latter is not available on OP's website yet but maybe in the future.

andai|1 year ago

I wasn't able to get much use from Phi 1.5 (except for leaking Microsoft's proprietary training data).

Phi 3 is great though.

moffkalast|1 year ago

That boy needs therapy.

ComputerGuru|1 year ago

That seems almost like the instruction template is wrong.

_akhe|1 year ago

Amazing work, feels like a step forward for LLM usability.

Would be interesting if there was a web browser that managed the download/install of models so you could go to a site like this, or any other LLM site/app and it detects whether or not you have models, similar to detecting if you have a webcam or mic for a video call. The user can click "Allow" to allow use of GPU and allow running of models in the background.

flawsofar|1 year ago

They should just be ubiquitous OS daemons at this point. They’re clearly very valuable

KennyBlanken|1 year ago

Mozilla won't even allow WebSerial to be implemented because it was deemed "too dangerous" - with all sorts of absurd whinging about the devastation that could be unleashed by unsuspecting users allowing a malicious site to access USB serial devices.

When someone pointed out that Chrome has had this functionality for years and the world has not imploded...and has enabled many open source projects and web-based microcontroller IDEs to provide enormous user convenience...the response was a condescending sneer along the lines of "well we actually care about user privacy."

(If Chrome is such a user privacy dumpsterfire, why not implement WebSerial so that people don't have to run Chrome in order to communicate with and program microcontrollers?)

Given they claimed that people's pacemakers and blood glucose monitors would be tampered with if WebSerial were implemented, I'd be shocked if they allowed such low level access to a GPU...

abi|1 year ago

Window AI (https://windowai.io/) is an attempt to do something like this with a browser extension.

Cheer2171|1 year ago

Sounds like ollama with open webui

joshstrange|1 year ago

Very cool! I wish there was chat history.

Also, if you click the "New Chat" button while an answer is generating, I think some of the output gets fed back into the model. It causes some weird output [0] but was kind of cool/fun. Here is a video of it as well [1]; I almost think this should be some kind of special mode you can run. I'd be interested to know what causes the bug: is it just the existing output being sent back as input, or a subset of it? It might be fun to watch a chat bot just randomly hallucinate, especially on a local model.

[0] https://cs.joshstrange.com/07kPLPPW

[1] https://cs.joshstrange.com/4sxvt1Mc

EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, though I'm not sure why it errors after a while (maybe it runs out of output tokens?). It would be cool to have this run until you stop it, automatically changing every 10-30 seconds or so.
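
Rough repro sketch of what I think is going on (assumes WebLLM's OpenAI-style streaming API; `interruptGenerate` is my guess at the right way to stop cleanly first, not necessarily what the app does):

    // Kick off a streamed generation, then reset mid-stream.
    const chunks = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Tell me a long story." }],
      stream: true,
    });

    // Resetting while tokens are still streaming seems to reproduce the garbled
    // output above; presumably calling engine.interruptGenerate() first would avoid it.
    setTimeout(() => engine.resetChat(), 1000);

    for await (const chunk of chunks) {
      console.log(chunk.choices[0]?.delta?.content ?? "");
    }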

brianzelip|1 year ago

Nice personal hosted image service!

abi|1 year ago

Thanks for the bug report. Yeah, it’s a bug with not resetting the state properly when new chat is clicked. Will fix tomorrow.

Chat history shouldn’t be hard to add with local storage and IndexedDB.
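
Something along these lines, probably (a rough sketch; the key name and shape are made up, and a real version would likely use IndexedDB for larger histories):

    const KEY = "secret-llama-chats"; // hypothetical storage key

    // Load all saved conversations (empty array if none yet).
    function loadChats() {
      return JSON.parse(localStorage.getItem(KEY) ?? "[]");
    }

    // Append the current conversation's messages and persist.
    function saveChat(messages) {
      const chats = loadChats();
      chats.push({ id: crypto.randomUUID(), savedAt: Date.now(), messages });
      localStorage.setItem(KEY, JSON.stringify(chats));
    }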

manlobster|1 year ago

It's truly amazing how quickly my browser loads 0.6GB of data. I remember when downloading a 1MB file involved phoning up a sysop in advance and leaving the modem on all night. We've come so far.

doctorpangloss|1 year ago

97MB for the Worms 3 demo felt like an eternity.

So what games are in this LLM? Can it do solitaire yet?

swores|1 year ago

When I think about numbers like that it just seems (to me, and wrongly) like general progress that's not so crazy - the thought that really makes the speed of progress stand out to me is remembering when loading a single image - photo sized but not crazily high resolution - over dial-up was slow enough that you'd gradually see the image loading from top to bottom, and could see it gradually getting taller as more lines of pixels were downloaded and shown below the already loaded part. Contrasting that memory against the ability to now watch videos with much higher resolution per frame than those images were 30 years ago is what really makes me go "wow".

For anyone not old enough to remember, here's an example on YouTube (and a faster loading time than I remember often being the case!): https://youtube.com/watch?v=ra0EG9lbP7Y

zozbot234|1 year ago

You could more or less fit the full model on a single CD (or a DVD for the larger model sizes), but of course forget about trying to do inference for it on period hardware; it would be unusably slow.

threatofrain|1 year ago

IMO eventually users should be able to advertise what embedding models they have so we don't redundantly redownload.

KeplerBoy|1 year ago

That's not possible with current web tech, is it?

Different webapps can't share common dependencies stored in localstorage afaik.

knowaveragejoe|1 year ago

Is this downloading a ~5gb model to my machine and storing it locally for subsequent use?

abi|1 year ago

Yes, it only starts the download after you send the first message so visiting the site won’t use up any space.

Approx sizes are listed in the GitHub README.

Models are stored in IndexedDB and will be managed by the browser. Might get evicted.
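
If eviction becomes an issue, asking the browser for persistent storage might help (a sketch; persistence is only a hint and the browser can refuse):

    // Ask the browser not to evict this origin's storage (includes the cached weights).
    if (navigator.storage && navigator.storage.persist) {
      const persisted = await navigator.storage.persist();
      console.log("Persistent storage granted:", persisted);
    }

    // Check how much of the quota the cached models are using.
    const { usage, quota } = await navigator.storage.estimate();
    console.log(`Using ${(usage / 1e9).toFixed(1)} GB of ${(quota / 1e9).toFixed(1)} GB`);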

manlobster|1 year ago

Looks like all the heavy lifting is being done by webllm [0]. What we have here is basically one of the demos from that.

[0] https://webllm.mlc.ai/

BoorishBears|1 year ago

> I’ve used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.

wg0|1 year ago

How do people use something like this as a coach or therapist? This is a genuine question.

Side note: impressive project. The future of AI is mostly offline, with maybe a few APIs in the cloud.

intended|1 year ago

Ever had a day where your bandwidth was constrained and you just knew something was wrong with a situation, but your brain lacked the juice or dexterity to connect/articulate the issue?

If I have the presence of mind, I offload the work here. At the same time I have a strong understanding of how coaching works, as does my brain.

I suspect that with all things LLM, some amount of proficiency is needed to truly get the prompts to work.

The simplest option is to ask it to be a coach for you. This is going to be hit and miss.

The better version is to specify the kind of coaching you want, or provide a rough outline of the issues on your mind and then ask for what kind of coach or therapist would make sense.

I use either of these for example - 1) over designed - https://chat.openai.com/g/g-KD6jm0l4c-thought-council 2) base ver - https://chat.openai.com/g/g-Cdq3drl87-two-guides

Sadly OpenAI doesn't let you share active chats anymore, so it's going to need a Plus subscription.

pseudosavant|1 year ago

I tried using Claude and ChatGPT like this: I would just write a free form journal entry. The feedback it gave was typically very useful and made journaling more rewarding.

cal85|1 year ago

Genuine answer: you say “Be a coach/therapist” followed by whatever you’d say to a coach/therapist.

kushie|1 year ago

it's great at offering alternative perspectives

nojvek|1 year ago

Yasssssss! Thank you.

This is the future. I am predicting Apple will make progress on Groq-like chipsets built into their newer devices for hyper-fast inference.

LLMs leave a lot to be desired, but since they are trained on all publicly available human knowledge they know something about nearly everything.

My life has been better since I’ve been able to ask all sorts of ad hoc questions about “is this healthy? Why healthy?” and it gives me pointers on where to look.

zitterbewegung|1 year ago

I actually think Apple has been putting neural engines in everything and might have been training something like Llama 3 for a very long time. Their conversational Siri is probably being neglected on purpose until they replace it. They have released papers on faster inference and released their own models. I think their new Siri will largely use on-device inference but with a very different LLM.

Even llama.cpp is already performant on macOS.

mcculley|1 year ago

They are not “trained on all publicly available human knowledge”. Go look at the training data sets used. Most human knowledge that has been digitized is not publicly available (e.g., Google Books). These models are not able to get to data sets behind paywalls (e.g., scientific journals).

It will be a huge step forward for humanity when we can run algorithms across all human knowledge. We are far from that.

maxboone|1 year ago

Groq is not general-purpose enough; you'd be stuck with a specific model on your chip.

andrewfromx|1 year ago

I asked it "what happens if you are bit by a radio active spider?" and it told me all about radiation poisoning. Then I asked a follow-up question: "would you become spiderman?" and it told me it was unable to become anything but an AI assistant. I also asked if time machines are real and how to build one. It said yes and told me! (Duh, you use a flux capacitor, basic physics.)

abi|1 year ago

Try switching models to something other than TinyLlama (it's the default only because it’s the fastest to load). Mistral and Llama 3 are great.

mentos|1 year ago

This is awesome. I have been using ChatGPT4 for almost a year and haven't really experimented with locally running LLMs because I assumed that the processing time would take too long per token. This demo has shown me that my RTX 2080 running Llama 3 can compete with ChatGPT4 for a lot of my prompts.

This has sparked a curiosity in me to play with more LLMs locally, thank you!

bastawhiz|1 year ago

My pixel 6 was able to run tinyllama and answer questions with alarming accuracy. I'm honestly blown away.

moffkalast|1 year ago

Uh oh, I had that same moment a bit over a year ago with MLC's old WebLLM. Take a deep breath before you jump into this rabbit hole because once you're in there's no escape :)

New models just keep rolling in day after day on r/LocalLLaMA, tunes for this or that, new prompt formats, new quantization types, people doing all kinds of tests and analyses, new arxiv papers on some breakthrough and llama.cpp implementing it 3 days later. Every few weeks a new base model drops from somebody. So many things to try that nobody has tried before. It's genuinely like crack.

NayamAmarshe|1 year ago

This is amazing! I always wanted something like this, thank you so much!

raylad|1 year ago

After the model is supposedly fully downloaded (about 4GB) I get:

Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML...

Also on Mistral 7B again after supposedly full download:

Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16...

Maybe memory? But if so, it would be good to say so. I'm on a 32GB system btw.

abi|1 year ago

I’ve experienced that issue as well. Clearing the cache and redownloading seemed to fix it for me. It’s an issue with the upstream library tvmjs that I need to dig deeper into. You should be totally fine on a 32GB system.

r0fl|1 year ago

Could not load the model because Error: Cannot find WebGPU in the environment

MayeulC|1 year ago

See: https://github.com/gpuweb/gpuweb/wiki/Implementation-Status#... (I got there from Chromium's console).

On Linux, I had to go to chrome://flags/#skia-graphite and chrome://flags/#enable-vulkan and chrome://flags/#enable-unsafe-webgpu

I think only one of the first two is actually required, but I enabled both. That allowed me to make use of TinyLlama with my AMD GPU (R9 Fury, OSS drivers), but I think I'd need Chromium Canary to enable "shader-f16" and use the other models, as I was not able to make it work on regular Chromium.
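
For anyone else poking at this, a quick console check for what the browser actually exposes (plain WebGPU feature detection, nothing specific to this app):

    const adapter = await navigator.gpu?.requestAdapter();
    if (!adapter) {
      console.log("No WebGPU adapter available");
    } else {
      // The larger q4f16 models need the shader-f16 feature.
      console.log("shader-f16 supported:", adapter.features.has("shader-f16"));
    }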

I haven't tried with Firefox.

pjmlp|1 year ago

For the foreseeable future, WebGPU is "Works best on Chrome (TM)".

Lex-2008|1 year ago

Safari, Firefox, or IE? Note the text says:

> Should work on Desktop with Chrome or Edge.

littlestymaar|1 year ago

This is very cool, it's something I've wished existed since Llama came out. Having to install Ollama + CUDA to get a locally working LLM didn't feel right to me when everything needed is already in the browser. Llamafile solves the first half of the problem, but you still need to install CUDA/ROCm for it to work with GPU acceleration. WebGPU is the way to go if we want to put AI on consumer hardware and break the oligopoly, I just wish it were more broadly available (on Linux, no browser supports it yet).

spmurrayzzz|1 year ago

> Having to install Ollama + CUDA to get a locally working LLM didn't feel right to me when everything needed is already in the browser

Was there something specifically about the install that didn't feel right? I ask because ollama is just a thin Go wrapper around llama.cpp (it's actually starting a modified version of the llama.cpp server in the background, not even going through the Go FFI, likely for perf reasons). In that sense, you could just install the CUDA toolkit via your package manager and call `make LLAMA_CUDA=1; ./server` from the llama.cpp repo root to get effectively the same thing in two simple steps with no extra overhead.

notarealllama|1 year ago

Tested on Ubuntu 22.04 with Chrome, sure enough, "Could not load the model because Error: Cannot find adapter that matches the request".

It really is too bad WebGPU isn't supported on Linux, I mean, that's a no-brainer right there.

Jedd|1 year ago

I've managed to avoid ollama and just toyed with LM Studio. It's non-free software, but extremely easy to get into, uses llama.cpp under the hood, cross-platform, yada yada. There's https://jan.ai/docs as well, which is AGPL3 and promises inference as well as training - doubtless there are many other similar offerings.

I'm wary of any 'web' prefix on what could / should otherwise be desktop applications, mostly due to doubts about browser security.

geor9e|1 year ago

I'm just seeing ERR_SSL_VERSION_OR_CIPHER_MISMATCH at https://secretllama.com/ and at http://secretllama.com/ I see "secretllama.com has been registered at Porkbun but the owner has not put up a site yet. Visit again soon to see what amazing website they decide to build."

abi|1 year ago

Just bought the domain a couple of hours ago so DNS might not have propagated. Try back tomorrow or download and install it from GitHub (it’s just 2 steps)

hpeter|1 year ago

It's great, but I hope it doesn't catch on, because then every website will make me download models. My hard drive will be full; too much bloat. I think the web is not good for this.

I'd prefer it if webapps supported Ollama or gave an option to either use that or store a model in the browser.

Or at least make it an extension

simple10|1 year ago

Amazing! It's surprisingly fast to load and run given the size of the downloaded models.

Do you think it would be feasible to extend it to support web browsing?

I'd like to help if you could give some pointers on how to extend it.

When asked about web browsing, the bot said it could fetch web pages, but then it obviously didn't work when asked to summarize a web page.

[EDIT] The Llama 3 model was able to summarize web pages!

simple10|1 year ago

I commented too soon. The TinyLlama model didn't seem to be able to summarize web pages but Llama 3 worked perfectly! Very cool.

indit|1 year ago

Could we use an already downloaded .gguf file?

Its_Padar|1 year ago

Very interesting! I would be quite interested to see this implemented as some sort of API for browser chatbots, or possibly even local AI-powered web games. If you don't know what Ollama is, I suggest checking it out. Also, I think adding the Phi-3 model to this would be a good idea.

koolala|1 year ago

On Firefox Nightly on my Steam Deck it "cannot find WebGPU in the environment".

1f60c|1 year ago

I had the same issue on my iPhone! You can (temporarily) enable WebGPU by going to Settings > Safari > Advanced > Experimental features (I don't know what it's called in English, but it's the bottom one).

Snoozus|1 year ago

Tried this in Chrome under Windows; it does work but does not seem to use the RTX 4060, only the integrated Iris Xe. Is this a bug or intentional?

lukan|1 year ago

When you use the GPU in the browser, you can only request the high performance GPU. It is up to the OS to grant it or not.

So maybe the author forgot to include the high performance request, or your OS does not give the high performance GPU by default (as it might be in eco mode). This behavior can be changed in OS settings.
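
For reference, the request looks roughly like this (plain WebGPU, not necessarily what this app does):

    // Ask for the discrete GPU; without powerPreference the browser/OS may
    // hand back the integrated one.
    const adapter = await navigator.gpu.requestAdapter({
      powerPreference: "high-performance",
    });
    console.log(adapter);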

lastdong|1 year ago

I think neither. You need to configure Windows to use the RTX with Chrome, e.g. by going into the Windows graphics settings and setting Chrome to "High performance". A quick web search for "force Chrome to use dedicated GPU" should give you all the steps you need.

gitinit|1 year ago

This works great on my Pixel 6a, surprisingly.

zerop|1 year ago

Question: do I compromise on answer quality if I use models via WebLLM (like this) compared to running them from the system console?

adontz|1 year ago

If anyone knows, is this about the best model one can run locally on an old consumer-grade GPU (GTX 1080 in my case)?

valine|1 year ago

Llama 3 8B is pretty much the king of its model class right now, so yeah. Meta’s instruct fine-tune is also a safe choice; really the only thing you have to play with is the quantization level. Llama 3 8B at 4-bit isn’t great, but 8-bit might be pushing it on the GTX 1080. I’d almost consider offloading a few layers to the CPU just to avoid dealing with the 4-bit model.

Dowwie|1 year ago

What therapy prompts have you found useful?

Y_Y|1 year ago

I usually just go with "and how does that make you feel?"

ngshiheng|1 year ago

Nice demo! I briefly tried it out and it felt much better than the original WebLLM one!

On a side note, I've been trying to do something similar too, for similar reasons (privacy).

Based on my recent experience, I find that running an LLM directly in the browser with decent UX (e.g. sub 1-2 second response time, no lag, no crashes) is still not really feasible given the current state of things. Plus, I think that relying on users' own GPU hardware for UX improvement via WebGPU is not exactly practical on a large scale (but it is still something!), since not everyone has access to GPU hardware.

But yeah, if there's anything to look forward to in this space, I personally hope to see improved feasibility of running LLMs in browsers.

Jackson_Fleck|1 year ago

This is amazing, but can we please set the .prose width to be dynamic? The text column is 3 inches wide on my monitor; it should take up a % of the browser window.

Bradd3rs|1 year ago

pretty cool, nice work!

BargirPezza|1 year ago

Llama 3 seems to not be there yet at all. Can it read earlier prompts and what you have written before? It doesn't seem like it. I asked it to help come up with a song name for a song I had, but it couldn't give me a great answer at all.

Prompt: "Hi! Could you help me come up with a name for a song? The genre of the song is Techno, Punk and Jazz. The name should be two to three words. And each song name should be different from the other one. Please write me 4 to 6 names. Thank you!"

It gives me four famous song names that already exist and writes that I can use them as inspiration for my "next composition". I tell it that it should give me song names and not song ideas. It tells me sorry and that it will do better :p I ask again for song name ideas and it gives me a whole bunch of famous song names and repeatedly writes "Can't stop the feeling" until it sends an error. I write that I should've added that it shouldn't suggest song names that already exist. It says sorry again and I ask for new song names.

It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting" "What a Time to Be Alive" - referencing the 1960s rock and roll era"

I write: "The song names should only be two to three words. Can you read what we have written before?"

Llama answers me: "Sure! Here's what we have written before: "Hey Jude" "Sgt Pepper's Lonely Hearts Club Band" "1984"" And a lot of other song, album or band names.

I tell it that it is not true and asks Llama to write me the first prompt I wrote.

Llama answers: "You asked me to help you with a math problem."

Copying and pasting the rest:

You: No, please quote me where I asked that

Llama 3: You asked me to write a song about the number system.

You: Same, please quote me where I said that

Llama 3: You asked me to write a song about the number system.

You: You did not quote me

Llama 3: You asked me to write a song about the alphabet, and I did not.

You: Okay.. I don't think you understand me here

Llama 3: You asked me to write a song about the alphabet.

You: Can you read what we have written?

Llama 3: You asked me to help you with a math problem.

You: Yes or no?

Llama 3: I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.

You: Alright, thanks for being honest. I will give you a break, sorry if it have been too much for you!

Llama 3: You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?