
Associated Press clarifies standards around generative AI

83 points | jyunwai | 2 years ago | niemanlab.org | reply

36 comments

[+] LispSporks22|2 years ago|reply
> OpenAI, the ChatGPT maker committed to paying to train its models on AP news stories going back to 1985

Why are they paying, or even asking for permission, to train on this data?

[+] nl|2 years ago|reply
Despite the cynical responses here, there is actually a practical reason why OpenAI is paying for this: the AP News Archive is not available online to be crawled. See https://www.ap.org/content/archive

There's a reasonably strong argument that crawling public pages for "indexing" (aka learning) is fair use, based on the Google precedent and case law from the early 2000s.

The argument is much less strong if those records aren't available.

[+] ceejayoz|2 years ago|reply
Because it’s gonna occasionally spit out direct uncredited quotes from AP articles, and they know it. There’s gonna be a whole new field of law cropping up around this stuff; getting the big players appeased will be important.

DALL-E helpfully slaps a ShutterStock logo on some creations, for example. https://twitter.com/amoebadesign/status/1534542037814591490

[+] claytonjy|2 years ago|reply
Because they see the lawsuits and shifts in public perception. Now that they've "cheated" their way to the top, they can play by the rules to entrench themselves so others can't catch up.
[+] mrtksn|2 years ago|reply
Interesting, isn't it? I'm sure it has some legal or PR reason or something but IMHO the more important part is about acknowledging the problem: The current copyright system doesn't work and something is needed to compensate for the work.

The internet shook the system, but content producers were able to adapt, albeit at the cost of lower-quality content.

However, with the rise of AI, the system has completely shattered. Previously, someone reading the content and telling others about it wasn't a problem that broke the compensation scheme for content producers, but with ChatGPT and similar tools we have a situation where this "person" can tell literally everyone about it. Some new compensation scheme is needed, and OpenAI is probably trying to act as the "nice guy" to head off the urgent need for a scheme that might limit their ability to consume other people's content.

[+] noobermin|2 years ago|reply
That's nice, so OpenAI will pay you if you're big enough but if you're a small fry they won't, good to know.
[+] colechristensen|2 years ago|reply
Copyright law doesn’t quite have a handle on machine learning training rules but it will. Belligerence at this point will only encourage stronger rules in the future.
[+] Retric|2 years ago|reply
So they can make a local copy of all AP stories before training the AI. The AI is arguably a derivative work, and it's unclear whether it's distinct enough to avoid being a problem. Finally, some of what ChatGPT spits out is going to very closely resemble AP stories, which may itself be a problem.

This stuff is a legal minefield. As a for-profit company building their core product, it's very difficult to argue each of these is fair use, etc. Though I am sure that argument will be made, it's risky when dealing with companies whose business model centers around IP.

[+] sacado2|2 years ago|reply
Because they don't own it? AFAIK, ChatGPT's model isn't open source, so they seem to know data has some value.
[+] skepticATX|2 years ago|reply
Seems like a very reasonable policy. I wish that other companies would adopt similar policies instead of trying to stuff AI in at every chance.
[+] gnicholas|2 years ago|reply
I'm about to go to the Online News Association (ONA) conference, which starts this week. I'm really interested to see what is said at the various sessions about generative AI. Last year there was a mention of generative AI, but this was before ChatGPT officially launched. I'm sure this year will be very different, as different organizations embrace or fend off this undeniable new development. Some orgs/journalists will surely use it to churn out more articles in less time. Others will probably adopt a purist stance and eschew it entirely.

There are good reasons to keep confidential info out of LLMs that you don't control, but I'd think it would make sense for anyone to run text through a locally-hosted LLM for editing suggestions and the like.

[+] RheingoldRiver|2 years ago|reply
I'm interested in the style guide point about using non-gendered pronouns for LLMs. If I were to write a style guide, I would say, "use the gender appropriate for the persona designed by the company": for example, Siri-the-LLM would be "she," but ChatGPT or Sydney would be "it," a male-persona LLM would be "he," an explicitly nonbinary one would be "they," etc. Respect the company's style guide and so on.

But I can see the potential for harm in over-humanizing by a news outlet. I'd be interested to hear what their decision-making process was for this point, if it was obvious or if they went back and forth, what arguments they had for which direction, etc.

[+] thomastjeffery|2 years ago|reply
The core problem surrounding LLMs is personification. Narratives that surround LLMs have failed to draw a clear distinction between what they hope to be, and what they are.

An LLM hopes to be an "intelligence" that can understand and manipulate text along the logical boundaries of language; and do so intentionally.

What an LLM really is, is an inference model that can reorganize text across boundaries that closely "align to" real language patterns. This is accomplished by creating a completely new pattern (the model) by inferring whatever patterns already exist in the training corpus's text.

A Large Language Model (LLM) serves as an alternative to true language comprehension. It is not an equivalent replacement, nor does it intelligently navigate itself with any explicit intent.

The act of "intelligently navigating the content of language" is at the core of journalism. It's incredibly important for journalists to both recognize and articulate the difference between an Artificial Intelligence actually realized and any technology that merely belongs to the category of AI pursuit.

[+] michaelt|2 years ago|reply
> Respect the company's style guide etc.

Really? I would have said the opposite - journalists have no obligation to parrot what companies' marketing departments feed them, and in fact usually ought to do the opposite.

Russia might name its invasion of Ukraine "Anti-Nazi Operation Freedom Eagle" but you wouldn't expect a war correspondent to repeat such obvious propaganda. In general, journalists have no obligation to follow companies' and governments' naming preferences.

[+] thomastjeffery|2 years ago|reply
A bunch of very reasonable conclusions that would have been obvious from the start if we just stopped calling these tools "Artificial Intelligence" and called them what they are.
[+] pipingdog|2 years ago|reply
A replacement for artists, on the other hand... (the graphic for the article was quite obviously AI generated.)
[+] croes|2 years ago|reply
Niemanlab != Associated Press
[+] donachristoper|2 years ago|reply
Hey user! Could you explain it a bit more so it's easier for the other members to understand?