skeyo|2 years ago
My initial thought on the timing is how much it coincides with Twitter's API becoming ridiculously expensive ($42k/mo). Maybe Reddit thought they could hop on that bandwagon and make some coin. But the LLM angle also makes a ton of sense.
nicce|2 years ago
The biggest parties have already mined the data, which is enough to train models for a long time.
Unless you want the model to find some specific comment from yesterday.
Frost1x|2 years ago
The case where LLMs will need this data is in compiling more recent, useful human knowledge as our knowledge base grows. Information in existing LLMs can shift. This is the same issue academic textbooks face when publishing what is considered foundational knowledge: sometimes we discover something new that makes it either not quite correct or outright invalid.
These are the sorts of obvious failures and disconnects that should become apparent if training lags behind. LLM services interested in revenue, but without plans for continuously updating training data, are betting that not too much will change from most end users' perspectives for a while. For some use cases that may be true, but the training cutoff of public GPT instances, for example, has already hindered some.
Much of prompting, from my anecdata, needs to take that into consideration as a base constraint: does this model even have up-to-date information it could query and produce something useful from? To some degree those training limits also help expose "hallucinations", i.e. the model's interpolation/extrapolation attempts. If I know certain information isn't in the training set and I test the system against it, I can observe how well it interpolates, extrapolates, and is transparent about when that's happening.

For example, if I ask existing models about new syntax and structures introduced in Java 21, most should respond that the feature doesn't exist or that they lack that newer information, or something to that effect. If instead the model starts producing code samples it couldn't possibly have knowledge of, I know it's passing back garbage. If it's being continually updated at some frequency, I'm no longer so sure, and it may actually be providing new useful information.
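As a concrete sketch of the kind of probe described above: Java 21 finalized record patterns (JEP 440) and pattern matching for switch (JEP 441), so code like the following is syntax a model with a pre-2023 training cutoff could not have seen in finished form. The class and type names here are hypothetical, chosen just for illustration.

```java
// A minimal probe snippet: if a model claims to "know" Java 21, it should
// recognize this as valid; a stale model may insist record patterns in
// switch don't exist, or hallucinate a different syntax.
public class Java21Probe {
    sealed interface Shape permits Circle, Rect {}
    record Circle(double r) implements Shape {}
    record Rect(double w, double h) implements Shape {}

    static double area(Shape s) {
        // Record patterns deconstruct components directly in the switch;
        // exhaustiveness over the sealed interface means no default branch.
        return switch (s) {
            case Circle(double r) -> Math.PI * r * r;
            case Rect(double w, double h) -> w * h;
        };
    }

    public static void main(String[] args) {
        System.out.println(area(new Rect(3, 4)));
    }
}
```

Asking a model to explain or extend a snippet like this, and watching whether it admits unfamiliarity versus confidently fabricating, is exactly the transparency test the comment describes.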
CTDOCodebases|2 years ago
By allowing LLMs low-cost or free access to their users' data, companies like Reddit would essentially be helping companies like OpenAI choke off their traffic.
Over time, as people realise it's quicker to ask ChatGPT a question than to post it on Reddit, they will start losing content as well.
Also, think about the difficulty of policing content generated by swarms of LLM bots with API access, run as part of PR campaigns.