
vbo | 1 year ago

I don't want to defend Altman. He may or may not be a good actor. But as an engineer, I love the idea of building something magical, yet lately that's not straightforward tinkering - unless you force your way - because people raise all sorts of concerns that they wouldn't have raised 30 years ago. Google (search) was built on similar data harvesting and we all loved it in the early days, because it was immensely useful. So is ChatGPT, but people are far more vocal nowadays about how what it's doing is wrong from various angles. And all their concerns are valid. But if OpenAI had started out by seeking permission to train on any and every piece of content out there (like this comment, for example) they wouldn't have been able to create something as good (and bad) as ChatGPT. In the early search days, this was settled (for a while) via robots.txt, which for all intents and purposes OpenAI should be adhering to anyway.

But it's more nuanced for LLMs, because LLMs create derivative content, and we're going to have to decide how we think about and regulate what is essentially a new domain and method and angle on existing legislation. Until that happens, there will be friction, and given we live in these particular times, people will be outraged.

That said, using SJ's voice given she explicitly refused is unacceptable. It gets interesting if there really is a voice actor that sounds just like her, but now that openai ceased using that voice, the chances of seeing that play out in court are slimmer.

marcus_holmes|1 year ago

Google search linked to your content on your site. It didn't steal your content, it helped people find it.

ChatGPT does not help people find your content on your site. It takes your content and plays it back to people who might have been interested in your site, keeping them on its site. This is the opposite of search, the opposite of helping.

And robots.txt is a way of allowing/disallowing search indexing, not stealing all the content from the site. I agree that something like robots.txt would be useful, but consenting to search indexing is a long, long way from consenting to AI plagiarism.
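For what it's worth, the mechanics of a robots.txt-style opt-out are simple. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, the `SearchBot` name and the `may_fetch` helper are all made up for illustration, and `GPTBot` is assumed here to be the user-agent token of an AI-training crawler. It only shows how a well-behaved crawler would check the file; nothing in it forces a crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block a crawler identifying as "GPTBot"
# (assumed AI-training user agent), allow every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # placeholder URL
    for agent in ("GPTBot", "SearchBot"):
        print(agent, may_fetch(ROBOTS_TXT, agent, url))
```

Note that this is purely advisory: the file expresses the site owner's wishes per user agent, and the check only happens if the crawler chooses to run it.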

vbo|1 year ago

Point is, we couldn't have a way of consenting to AI training until after we had LLMs. And I'm guessing we will, pretty quickly.

brabel|1 year ago

> But if openai had started out by seeking permission to train on any and every piece of content out there...

But why would anyone seek permission to use public data? Unless you've got Terms and Conditions on reading your website or you gatekeep it to registered users, it's public information, isn't it? Isn't public information what makes the web great? I just don't understand why people are upset about public data being used by AI (or literally anything else. Like open source, you can't choose who can use the information you're providing).

In the case being discussed here, it's obviously different, they used the voice of a particular person without their consent for profit. That's a totally separate discussion.

johnnyanmac|1 year ago

>why would anyone seek permission to use public data?

First of all, it's not all public data. Software licenses should already establish that just because something is on the internet doesn't mean it's fair game.

>Unless you've got Terms and Conditions

The New York Times did:

https://help.nytimes.com/hc/en-us/articles/115014893428-Term...

Even if you want to bring up an archive of the pre-lawsuit TOS, I'd be surprised if it wasn't mostly the same TOS for decades. OpenAI didn't care.

>Isn't public information what makes the web great?

No. Twitter is "public information" (not really, but I'll go with your informal definition here). If that's what "public information" becomes, then maybe we should curate for quality instead of quantity.

Spam is also public information, and I don't need to explain how that only makes the internet worse. And honestly, that's what AI will become if left unchecked.

> Like open source, you can't choose who can use the information you're providing

That's literally what software licenses are for. You can't stop people from ignoring your license, but breaking that license leaves them wide open to lawsuits.

philistine|1 year ago

The right to copy public information to read it does not grant the right to copy public information to feed it into a for-profit system to make an LLM that cannot function without the collective material that you took.

yellowapple|1 year ago

> Google (search) was built on similar data harvesting and we all loved it in the early days, because it was immensely useful. So is ChatGPT, but people are far more vocal nowadays about how what it's doing is wrong from various angles.

Part of that is that we've seen what Google has become as a result of that data harvesting. If even basic search engines are able to evolve into something as cancerous to the modern web as Google, then what sorts of monstrosities will these LLM-hosting corporations like OpenAI become? People of such a mindset are more vocal now because they believe it was a mistake to have not been as vocal then.

The other part is that Google is (typically) upfront about where its results originate. Most LLMs don't provide links to their source material, and most LLMs are prone to hallucinations and other wild yet confident inaccuracies.

So if you can't trust ChatGPT to respect users, and you can't trust ChatGPT to provide accurate results, then what can you trust ChatGPT to do?

> It gets interesting if there really is a voice actor that sounds just like her, but now that openai ceased using that voice, the chances of seeing that play out in court are slimmer.

It's common to pull things temporarily while lawyers pick through them with fine-toothed combs. While it doesn't sound like SJ's lawyers have shown an intent to sue yet, that seems like a highly probable outcome; if I were in either legal team's shoes, I'd be pulling lines from SJ's movies and interviews and such and having the Sky model recite them to verify whether or not they're too similar - and OpenAI would be smart to restrict that ability to their own lawyers, even if they're innocent after all.

moogly|1 year ago

> Google (search) was built on similar data harvesting and we all loved it in the early days

Google Search linked back to the original source. That was the use case: to find a place to go, and you went there. A way less scummy start than OpenAI's.

beeboobaa3|1 year ago

As an engineer, I find the current state of LLMs just uninteresting. They basically made a magical box that may or may not do what you want if you manage to convince it to, and there's a fair chance it'll spout bullshit. This is like the opposite of engineering.

xanderlewis|1 year ago

In my opinion, they're extremely interesting... for about a week. After that, you realise the limitations, and good old-fashioned algorithms and software with some semblance of reliability start to look quite attractive.