I didn't bring examples because I said personal experience. Here's my "evidence": GPT-4 took multiple shots and iterations and couldn't stay coherent with a prompt longer than 20k tokens (in my experience). Then when GPT-4o came out it improved on that (in my experience). o1 took 1-2 shots with fewer iterations (in my experience). o3 zero-shots most of the tasks I throw at it and stays coherent with very long prompts (in my experience).

Here's something else to think about: try telling everybody to go back to using GPT-4. Then try telling people to go back to using o1-full. You likely won't find any takers. It's almost like the newer models are improved and generally more useful.
ModernMech|6 months ago
I'm not saying they're not delivering better incremental results for people on specific tasks; I'm saying they're not improving as a technology in the way big tech is selling.
The technology itself is not really improving, because all of the showstopping downsides from day one are still there: Hallucinations. Limited context windows. Expensive to operate and train. Inability to recall simple information, stay on task, support their output, or do long-term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.
Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.
nonhaver|6 months ago
However, a lot of your claims are false; progress is being made in nearly all the areas you mentioned:
> hallucinations
are reduced with GPT-5
https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb...
"gpt-5-thinking has a hallucination rate 65% smaller than OpenAI o3"
> limited context window
Same deal: Gemini 2.5 Pro has a 1 million token context window, and GPT-5's is 400k, up from the 200k of o3.
https://blog.google/technology/google-deepmind/gemini-model-...
"native multimodality and a long context window. 2.5 Pro ships today with a 1 million token context window (2 million coming soon)"
> expensive to operate and train
We don't know their internal costs for certain, but on pricing, GPT-5 provides the most intelligence for the lowest price yet at $10 per 1 million output tokens (about $1 for a 100k-token response), which is unprecedented.
https://platform.openai.com/docs/models/gpt-5
> guardrails
are well implemented in certain offerings; Google, for example, provides multiple adjustable safety levels in the Gemini API (a minimal sketch follows the quote below).
https://ai.google.dev/gemini-api/docs/safety-settings
"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."
Now I'd like to ask you for evidence that none of these aspects have improved, since you claim my examples are vague while making statements like:
> Inability to recall simple information
> inability to stay on task
> (doesn't) support its output
> (no) long term planning
I've experienced the exact opposite. Not 100% of the time, but compared to GPT-4, all of these areas have massively improved. Sorry I can't provide every single chat log I've ever had with these models to satisfy your vagueness-o-meter, or provide benchmarks, which I assume you would brush aside.
On top of the examples I've provided above: you seem to be making claims out of thin air, then claiming others aren't providing examples up to your standard.