The Assistants API is promising, but earlier versions have many issues, especially with how costs are calculated. As per the OpenAI docs, you pay for data storage, a fixed price per API call, plus token usage. It sounds straightforward until you start using it.
Here is how it works. When you upload attachments, in my case a very large PDF, it chunks that PDF into small parts and stores them in a vector database. It seems like the chunking part is not that great, as every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt, which inflates your per-request cost to 10 times more than the prompt + response tokens combined. So, be mindful of the hidden costs and monitor your usage.
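To make the effect concrete, here is a minimal Python sketch of the token accounting, assuming retrieved chunks are billed as ordinary input tokens. The class and function names are hypothetical, and the token counts and prices are invented placeholders rather than OpenAI's real numbers.

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    prompt_tokens: int     # tokens you wrote yourself
    retrieved_tokens: int  # tokens from chunks injected by retrieval (assumed billed as input)
    response_tokens: int   # tokens the model generated


def overhead_ratio(usage: RequestUsage) -> float:
    """Billed tokens divided by the prompt + response tokens alone."""
    visible = usage.prompt_tokens + usage.response_tokens
    billed = visible + usage.retrieved_tokens
    return billed / visible


def request_cost(usage: RequestUsage, input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request; the prices are placeholders supplied by the caller."""
    input_tokens = usage.prompt_tokens + usage.retrieved_tokens
    return (input_tokens * input_price_per_1k
            + usage.response_tokens * output_price_per_1k) / 1000


if __name__ == "__main__":
    usage = RequestUsage(prompt_tokens=200, retrieved_tokens=4500, response_tokens=300)
    print(overhead_ratio(usage))
```

With 200 prompt tokens, 300 response tokens, and 4,500 tokens of retrieved chunks, the billed total is 10 times the prompt + response tokens, which is the scale of overhead described above. Logging the usage reported for each call is the simplest way to see what the real ratio is for a given document.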
> as every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt,
This is how RAG works.
While you can come up with workarounds like using lesser LLMs as a pre-filtering step, the fact is that if you need GPT to read the doc, you need GPT to read the doc.
Yup, this seems right. You pay for tokens no matter what, even in other APIs. Did you know you can set an expiration for files, vector stores, etc.? No need to pay for long-term storage on those. Also, threads are free.
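As far as I know, the expiration mentioned here is set on vector stores through an `expires_after` policy (an anchor of `last_active_at` plus a number of days). Here is a minimal sketch that only builds the request body. The helper name is hypothetical, and the field names should be checked against the current API reference before use.

```python
def vector_store_body(name: str, expire_days: int) -> dict:
    """Build a vector store creation body with an expiration policy.

    Assumption: the API accepts `expires_after` with an `anchor` of
    "last_active_at" and a `days` count. Verify against the current docs.
    """
    if expire_days < 1:
        raise ValueError("expire_days must be at least 1")
    return {
        "name": name,
        "expires_after": {"anchor": "last_active_at", "days": expire_days},
    }


if __name__ == "__main__":
    print(vector_store_body("big-pdf", 7))
```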
There isn’t really any other way for this to work. The only way for the model to answer questions about your PDF is for the information to be somewhere in the prompt.
I'd be interested in knowing if anyone is seriously using the Assistants API. It feels like such a lock-in to OpenAI's platform when you can alternatively just use completions, which are much more easily interchanged.
I've indeed refused to work with some providers that offer only a chat interface and not a completion interface, because it made the communication "less natural" to the model (like adding new system messages in between for function calling on models which don't officially support it, or adding categories other than system/user/assistant).
I've used it, and in some cases it cuts days or weeks of development before you can get to testing the market.
In some cases the lock-in is what it is for now, because a particular model really is far ahead, or staying ahead.
It doesn't mean other options won't become available, but it does matter to match your actions to your needs.
Getting something working consistently, for example, might be the first goal, and learning to implement it with multiple models might be secondary. In some cases, the chances of that working improve the later other models are explored.
It should be possible to tell pretty quickly whether something works in the leading model, how others compare to it, and how to track the rate of change between them.
I know at least one team at work is using the Assistants API, and I'm talking with another team that is leaning pretty heavily towards using it over building a custom RAG solution themselves, or even over other in-house frameworks.
I use it almost exclusively (I've even developed a Python library for it, https://github.com/skorokithakis/ez-openai), because it does RAG and function calling out of the box. It's pretty convenient, even if OpenAI's APIs are generally a trash fire.
I've not seen any of these "agentic" systems be all that useful in practice. They are complicated chains of software where a lot can go wrong at any step, and the probability of failure explodes when you have many steps.
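A quick back-of-the-envelope check of the "probability of failure explodes" point, assuming every step succeeds independently with the same probability (a simplification, and the 95% figure is an arbitrary example):

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        # Prints 0.95, 0.774, 0.599, 0.358 for a 95% per-step success rate.
        print(n, round(chain_success_probability(0.95, n), 3))
```

Under these assumptions, a chain of ten steps that are each 95% reliable finishes cleanly only about 60% of the time, and a chain of twenty about 36% of the time.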
- Writing what I want in Python/other-lingo gives me much more customizability than these frameworks offer.
- No worries about the future plans of the repo and having to deal with abandonware.
- No vendor lock-in. Currently most repos like this focus on OpenAI's models, but I prefer to work with local models of all kinds, and any abstraction above llama.cpp or llama-cpp-python is a no-no for me.
The last point means I refuse to build on top of ollama's API as it's yet another wrapper around llama.cpp.
There is really no way to make the ensemble behave with an acceptable level of consistency.
Where we ended up is having a frontier model generate a whole tree of possible execution plans, then having the user select one of those paths, and then just running whatever the user chose in a plain sequence until the next decision point that needs user approval.
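The flow described above might be sketched like this. Everything here is hypothetical: `PlanNode`, `run_plan`, and the two callbacks stand in for the model-generated plan tree, the approval UI, and the step executor.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PlanNode:
    """A plain sequence of steps, followed by an optional decision point."""
    steps: List[str]
    # Maps a user-facing option label to the sub-plan it leads to.
    choices: Dict[str, "PlanNode"] = field(default_factory=dict)


def run_plan(
    root: PlanNode,
    choose: Callable[[List[str]], str],
    execute: Callable[[str], None],
) -> None:
    """Run steps in order, pausing for a user choice at each decision point."""
    node = root
    while True:
        for step in node.steps:
            execute(step)
        if not node.choices:
            return  # leaf reached: nothing left to approve
        label = choose(list(node.choices))
        if label not in node.choices:
            raise ValueError(f"unknown choice: {label!r}")
        node = node.choices[label]


if __name__ == "__main__":
    tree = PlanNode(
        ["fetch document"],
        {
            "summarize": PlanNode(["summarize"]),
            "translate": PlanNode(["translate", "review"]),
        },
    )
    run_plan(tree, choose=lambda options: options[0], execute=print)
```

The point of the design is that the model only proposes the tree. Execution between decision points is deterministic, so the inconsistency is confined to plan generation, where the user can see and reject it.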
I've encountered two viable cases. The first is when the instructions are too complex, there are too many tools, or the processing steps are wildly different. In that case it simplifies the processing a lot to have a few well-defined steps, each doing its own thing, and a coordinator on top, either sequential or intelligent, that is focused only on next-step routing.
The other is memory for conversational retrieval. AI memory is still quite limited, especially if a lot of tokens need to be in context, and an overly long context impedes the LLM's ability to focus on the task itself, especially if the context is itself a conversation or a request. Spreading the context across a few agents, propagating the user request among them, and having them produce answer fragments for another LLM to assemble into an answer lets you keep the conversational context without swamping the LLM with noise.
The problem remains latency, though: as soon as you nest them, latency explodes, because you can only stream the last layer of LLM output.
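The first case above, a few well-defined steps with a coordinator that only decides what runs next, might look roughly like this. The names are hypothetical, and `route` stands in for either a fixed sequence or an LLM-backed router.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]
Step = Callable[[State], State]
Router = Callable[[State], Optional[str]]


def run_coordinator(state: State, steps: Dict[str, Step], route: Router, max_hops: int = 10) -> State:
    """Repeatedly ask the router for the next step until it returns None."""
    for _ in range(max_hops):
        next_step = route(state)
        if next_step is None:
            return state
        state = steps[next_step](state)
    raise RuntimeError("router did not finish within max_hops")


# Example steps: each one does a single, narrow job.
def extract(state: State) -> State:
    return {**state, "facts": str(state["text"]).split(". ")}


def summarize(state: State) -> State:
    return {**state, "summary": state["facts"][0]}


def sequential_route(state: State) -> Optional[str]:
    """A fixed-order router; an 'intelligent' one would inspect the state with a model."""
    if "facts" not in state:
        return "extract"
    if "summary" not in state:
        return "summarize"
    return None
```

The `max_hops` cap is there because a model-backed router can loop. Each hop is also a separate model call, which is where the latency cost mentioned above comes from.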
I imagine having an agent set up with specific RAG context to solve a specific problem and having another with a different RAG context to solve a different problem can be useful.
“A lot of research has been done in this area and we can expect a lot more in 2024 in this space. I promise to share some clarity around where I think this industry is headed. In personal talks I have warned that multi-agent systems are complex and hard to get right. I've seen little evidence of real-world use cases too”
These assistant systems fascinate me, but I just don’t have the time and energy to set something up. I was going to ask if anyone had a good experience with it, but the above makes it sound like there’s not much hope at the moment. Curious what other people’s experiences are.
We tried using a multi-agent system for a complex NLP-type task and we found:
- Too many errors that just propagate on top of each other. If a single agent in the chain generates something even a little bit off, then the whole system goes off the rails.
- You often end up having to pass a massive amount of shared context to every agent which just increases the cost dramatically.
Curiously enough we had an architect from OpenAI tell us the same thing about agent systems a few days ago (our company is a big spender so they serve a consulting function), so I don't think anybody is really finding success with multi-agent systems currently. IMO the core tech is nowhere near good enough yet.
Thanks @beoberha, I am too. I like one take I heard on Twitter. The sentiment was something like: these types of systems are useful under the AI-Powered Productivity industry, which has incremental gains, no big bangs. Said another way, if your job was to help a TON of your employees be more productive individually, it is worth it because companies measure those efforts broadly and the payoff is there. But again, not big. My advice is for folks to stay lower level and hook AI automation up with simple, closed-loop LLM patterns that feel more like basic API calls in a choreographed manner. OMG, hope all that made sense.
A lot of folks I've spoken with say that single-agent systems are still extremely limited, let alone multi-agent platforms. In general, it seems to boil down to:
- Agents need lots of manual tuning and guardrails to make them useful
- Agents with too many guardrails are not general-purpose enough to be worth the time and effort to build
I believe truly great agents will only come from models whose weights are dynamically updated. I hope I'm wrong.
My main conversation “loop” at https://olympia.chat has tool functions connected to “helper AIs” for things such as integrating with email. It lets me minimize functions on the main loop and actually works really well.
yatz|1 year ago
nl|1 year ago
metaskills|1 year ago
iamflimflam1|1 year ago
hnuser123456|1 year ago
xrendan|1 year ago
Nedomas|1 year ago
phh|1 year ago
j45|1 year ago
oddthink|1 year ago
stavros|1 year ago
__loam|1 year ago
behnamoh|1 year ago
rcarmo|1 year ago
csouzaf|1 year ago
LASR|1 year ago
avereveard|1 year ago
coffeebeqn|1 year ago
ww520|1 year ago
beoberha|1 year ago
dongobread|1 year ago
metaskills|1 year ago
fzliu|1 year ago
m3kw9|1 year ago
mentos|1 year ago
Would be great to be able to ask it, 'have we completed the X process with contractor Y yet?'
valiant-comma|1 year ago
https://github.com/h2oai/h2ogpt
squirrel|1 year ago
david_shi|1 year ago
jackbravo|1 year ago
> In my opinion, exploration of multi-agent systems is going to require a broader audience of engineers. For AI to become a true commodity, it needs to move out of the Python origins and into more popular languages like JavaScript, a major factor in why I wrote Experts.js.
I wholeheartedly agree
ec109685|1 year ago
What is unfriendly about this?
It’s easy to collect the streaming output and return it all when the llm’s response is done.
wokwokwok|1 year ago
> assistant.on("textDelta", () => …
Callbacks, which are not async and can’t be streamed that way directly without wrapping them in some helper function.
(Which does seem obvious; I’m also not sure why they called it out specifically as not being async friendly. I guess most callback-style functions these days have async equivalents in popular libraries and these ones don’t.)
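The "helper function" idea can be illustrated with a Python analogue (this is not the Experts.js API): a queue bridges callback-style events into an async iterator. `FakeEmitter` is a hypothetical stand-in for a callback-based streaming client, and the sketch assumes callbacks fire on the event loop's thread.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List

_DONE = object()  # sentinel marking the end of the stream


class FakeEmitter:
    """Hypothetical callback-style client: register handlers, then start."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = chunks
        self._handlers: Dict[str, Callable] = {}

    def on(self, event: str, handler: Callable) -> None:
        self._handlers[event] = handler

    def start(self) -> None:
        for chunk in self._chunks:
            self._handlers["textDelta"](chunk)
        self._handlers["end"]()


async def stream_deltas(emitter: FakeEmitter) -> AsyncIterator[str]:
    """Wrap the callbacks so callers can use `async for`."""
    queue: asyncio.Queue = asyncio.Queue()
    emitter.on("textDelta", queue.put_nowait)
    emitter.on("end", lambda: queue.put_nowait(_DONE))
    emitter.start()
    while True:
        item = await queue.get()
        if item is _DONE:
            return
        yield item


async def collect(emitter: FakeEmitter) -> str:
    """Gather the whole response, for callers that do not need streaming."""
    return "".join([delta async for delta in stream_deltas(emitter)])


if __name__ == "__main__":
    print(asyncio.run(collect(FakeEmitter(["Hel", "lo"]))))
```

If the callbacks fired on another thread, the `put_nowait` calls would need to go through `loop.call_soon_threadsafe` instead.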
arresin|1 year ago
Is this right? Aren’t you prematurely unwrapping the promise here?
tr14|1 year ago
spiritplumber|1 year ago
moltar|1 year ago
shepherdjerred|1 year ago
metaskills|1 year ago
obiefernandez|1 year ago
Terretta|1 year ago
bongodongobob|1 year ago