

w4yai | 1 month ago

> It proves that modern LLMs can run without Python, PyTorch, or GPUs.

Did we need any proof of that?


jdefr89|1 month ago

Python and PyTorch both call out to C libraries… I don’t get what he means by “proving LLMs can run without Python and PyTorch” at all. Seems like they don’t understand the fundamentals here…

jasonjmcghee|1 month ago

I guess llama.cpp isn't quite as popular as I had assumed.

avadodin|1 month ago

llama.cpp being the best choice doesn't make it popular.

When I got started, I was led to ollama and other local-llm freemium.

I didn't necessarily assume that they weren't C++ (I don't even know), but I do think that, as implied, Python duct-tape solutions are more popular than llama.cpp.

christianqchung|1 month ago

A bizarre claim like that would be what happens when you let an LLM write the README without reading it first.

skybrian|1 month ago

Knowing the performance is interesting. Apparently it's 1-3 tokens/second.

kgeist|1 month ago

ik_llama.cpp is a fork of llama.cpp that specializes in CPU inference. Some benchmarks from 1 year ago: https://github.com/ikawrakow/ik_llama.cpp/discussions/164

tolerance|1 month ago

I imagine so regarding GPUs, right? If this is a legitimate project, then doesn’t it provide a proof of concept for the performance constraints that relate to them? Couldn't the environmentally concerned take this as an indicator that the technology can progress without relying on as much energy as is potentially spent now? Shouldn’t researchers in the industry be thinking of ways to prevent the future capabilities of the technology from outrunning the capacity of the infrastructure?

I know very little about AI but these are things that come to mind here for me.

yorwba|1 month ago

GPUs are more efficient than CPUs for LLM inference, using less energy per token and being cheaper overall. Yes, a single data center GPU draws a lot of power and costs a fortune, but it can also serve a lot more people in the time your CPU or consumer GPU needs to respond to a single prompt.
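
To make the energy-per-token point concrete, here is a minimal sketch of the arithmetic. Every number in it is a hypothetical placeholder, not a measurement: the CPU throughput is picked from the 1-3 tokens/second range mentioned upthread, and the power draws and the GPU's aggregate batched throughput are assumed purely for illustration. The function name is made up for this example.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token.

    A watt is one joule per second, so dividing power draw by
    throughput (tokens per second) gives joules per token.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Hypothetical placeholder figures, NOT benchmarks:
# a CPU drawing 100 W while producing 2 tokens/s for one user.
cpu = joules_per_token(power_watts=100.0, tokens_per_second=2.0)

# A data center GPU drawing 700 W while producing an assumed
# 1000 tokens/s in total across many batched requests.
gpu = joules_per_token(power_watts=700.0, tokens_per_second=1000.0)

print(f"CPU: {cpu:.1f} J/token, GPU: {gpu:.1f} J/token")
```

Under these assumed inputs the GPU draws seven times the power but spends far less energy per token, because its throughput is shared across many requests. Plug in measured power and throughput for real hardware before drawing any conclusion.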