I found this article frustratingly vague on how prosecraft.io actually worked. As far as I can tell, the author scraped the web for books, including in-copyright books. Then he analyzed them with "classical" natural language processing techniques, rather than transformers or deep learning. He appears to have retained the books he scraped for future analysis. The site itself seems to use only snippets.
However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.
For what it's worth, the Computational Story Lab's hedonometer [1] seems to rely largely on out-of-copyright books from Project Gutenberg, plus the Harry Potter series.
Edit: Apparently he was working on an LLM project. https://twitter.com/stealcase/status/1688721685585809408. It's unclear whether he was planning to use the books he scraped (although as @stealcase points out, GPT-Neox itself was trained on books that were pirated).
[0]: https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...
[1]: https://hedonometer.org/books/v3/863/
I am a bit confused about what's so outrageous about this tool. It seems that both the book authors and some of the people in the discussion here conflate rudimentary statistics about a book (the number of words of a certain kind) with the latest wave of generative AI. They are very different in both the value they provide and the risk they pose to book authors.
The tool that book authors got outraged about only provides basic metrics, not dissimilar to other metrics such as "page count", and can't be used to produce new content which could deprive the book authors of revenue.
If you read through the angry Twitter thread, it's clear that almost everyone thinks that either a) the site is a pirate site that lets you download books, or b) the site lets you generate works in the style of an author. Neither is true, of course.
There are a handful (fewer than three people, by my count) who seem to understand what the site actually does and were still angry, because the creator seems to have pirated the books. I actually don't know about the legality of something like that. Surely providing pirated books is illegal, but I don't know whether acquiring pirated books actually is.
I think it's clear though that most of the outrage would still be there even if the author had purchased each and every book.
If you want to do this kind of thing, let authors (or publishers) opt in.
Yes, it will take effort and probably go slow, but if the tool is really useful and amazing, it should be doable.
I suspect the authors are put off by a couple of things:
- the text of the works scanned seems like it may be from pirated sources. For many authors, that poisons the project, no matter what it does with the scans.
- the use of these scans in a commercial product
The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.
Summary: prosecraft.io counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you obviously are allowed to do, without any permissions. This is not generative AI, this is old school statistics.
And then it sometimes presented a page's worth of quoted text from a book, which should fall under fair use.
> counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you obviously are allowed to do, without any permissions
You're pretty much describing exactly what an LLM "learns" about text. I agree that it should obviously fall under fair use, but as the author of this article found out, there are quite a few who (very vocally) disagree.
You shouldn't, at least for posting basic statistics. They're facts, not copyrightable.
Hrm. It seems like the authors are caught up in things like the "vividness" score and the "sentiment analysis" of the text, I guess because they're loosely related to AI?
But it seems like a bulk of the stats collected are things that I would find really useful. I've probably asked myself, "how many words are in this book" on 10+ separate occasions, both as a reader and as a writer.
It also seems like there were counts of things like adjectives, verbs, adverbs, passive verbs, etc. -- stats that I might want to know about a novel.
The bulk of the service seems rather "boring" and non-AI. Unfortunate that the whole thing was taken down because of a few features. Hopefully it'll come back.
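(For concreteness, the kind of "boring" counts described above take only a few lines of ordinary code. Below is a minimal, self-contained sketch, not Prosecraft's actual implementation; the "-ly" adverb heuristic and the sample sentence are purely illustrative, and a real tool would use a proper part-of-speech tagger for adjective, verb, and passive-voice counts.)

    import re
    from collections import Counter

    def prose_stats(text):
        # Split into sentences and lowercase word tokens.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        return {
            "word_count": len(words),
            "unique_words": len(counts),
            "avg_sentence_length": round(len(words) / max(len(sentences), 1), 1),
            # Crude adverb proxy: words ending in "-ly" (a real tool would POS-tag instead).
            "ly_adverbs": sum(n for w, n in counts.items() if w.endswith("ly") and len(w) > 3),
            "most_common": counts.most_common(5),
        }

    print(prose_stats("She ran quickly down the long, dark hall. The door was opened quietly."))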
For this particular example, the tool doesn't seem like it's a big deal. It just analyzes works for data. I'm not sure how this would be any different from a literary critic doing the same thing manually.
In general, though, I think artists would be less hostile to technological innovations if the people imploring them to "figure out how to embrace the technology rather than fear it" weren't actively trying to destroy their livelihoods, almost always without the slightest interest in helping them figure out the new economic situation. The attitude is, "It's the reality now, deal with it," all while enjoying the job security and high salaries of tech jobs. You can see the same attitude displayed when it comes to piracy: "too bad, deal with it, I have a good job, I don't care if you don't anymore."
This stuff would be received far better by the creative community if AI companies were to, say, establish an artist sponsorship program, push for UBI, or otherwise show that they care even a tiny bit about the people they're making redundant.
I agree with you. There’s a pattern that I see a lot, of having:
1. large powerful players doing something not entirely helpful;
2. victims of that protesting that change vehemently; all that in vain because the players are powerful and have sheltered themselves from criticism, usually via lobbying;
3. regulatory capture or protests go after a smaller player, which is widely advertised to accuse 2. of going too far — even when the problem in 1. is still entirely there, and now ignored.
It’s definitely the case with globalization (large conglomerates benefit, people protest, and a small artisan who started selling abroad is featured being victimized by tariffs), fossil fuels (large oil extractor, climate advocate, farmer seeing fertilizer prices go up), immigration, American cultural hegemony, car dominance over cities, etc.
That pattern allows larger players still doing harm to wash their morals. I feel like we need better antibodies to say: No, this does not absolve them.
Sam Altman, for all his faults, is actually a massive proponent of UBI. I mean, that was one of the claimed objectives of Worldcoin (though he advocates for UBI in general: https://thewalrus.ca/will-universal-basic-income-save-us-fro... )
I will admit that I am mildly confused by this outrage, but it is X/twitter so the standards are different.
All that said, I remember doing basic text analysis in college and then sentiment analysis in my MBA class... Is the concern out there because of how the source material was acquired?
Not an artist myself, but this basic assumption in tech that you can just take somebody's shit without informing them, without permission, without compensation, without basic due diligence, and then go do whatever the hell you want with it needs to stop.
For the artists' sake but also for tech's sake. This model can't work, it's a complete dead-end that will wipe out livelihoods and culture.
But I can assure you artists can/will be equally hypocritical themselves. Surely they've pirated things themselves, removed paywalls from articles, blocked ads, or borrowed the neighbor's Netflix account.
I think it applies to many technologies other than generative AI. How many devs actually think about ethics nowadays? I think it's all lost in the big companies they work for, behind the excuse that "it is not their job to figure out how their work is being used".
Interestingly, I think most devs would think twice before being paid for designing a missile. But somehow they don't really seem to think about the impact of work that is not obviously a weapon. Social network, Stable Diffusion, ChatGPT, SpaceX... everything disruptive has the potential to be very bad (I see a lot more harmful use-cases for ChatGPT than legit ones, but maybe that's just me). But somehow engineers seem to believe that it is not their problem.
My summary of the case: Someone did statistical analysis of a bunch of texts and created a tool that evaluates your text according to the developed model. Writers accused him of plagiarizing/using the content of their works.
Something that we need to learn is that these brief outbreaks on social media burn themselves out pretty quickly. Everyone shouts for a bit and then moves on to the next bit of manufactured outrage.
Always feel bad for people who cave to the mob, usually if the mob is yelling at you you're on the right track
I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.
We already tried this with human communication and gave birth to the dystopian nightmare that is social media, why keep repeating our mistakes?
Relying on Authors Guild, Inc. v. Google, Inc. in order to determine fair use for AI models, there are a few key aspects to consider. The outcome should not supersede, supplant, or become a replacement for the original works, nor should it sell portions of them. It should also preferably enhance the sale of the original work, to the benefit of the copyright holder.
In this specific case regarding prosecraft, all those criteria might be fulfilled, and it might be that under those specific conditions the use of any copyrighted work for the creation of AI models is fair (at least under US law).
it's sad he took the site down, it looks like a neat project. it seems to be fair use, so it really is just an issue of consent and keeping people happy. the issue is some people will always be fearful/miserable. should the rest of us be held back in exploring culture because they refuse to play?
anyway i'm sure there will be ten other similar sites by the end of the week...
“That book’s vividness score (TM) is 75% as opposed to that other book’s vividness score (TM), which is only 50%! That’s, like, a 50% higher culture score (TM)!”
Doesn't seem like fear of AI, more just authors being petulant. Didn't we have the same thing with some hack Star Wars book author attacking the Internet Archive for daring to host a copy of his book, before the AI fad?
What I find so odd about all this stuff is the target is very rarely OpenAI/ChatGPT. I understand it can be a useful tool, but if your concern is that AI has scanned your books without your consent and can generate new content in your writing style, then OpenAI is who you should be complaining about.
Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.
OpenAI is who people should be targeting, since they are the ones who have all the money and the politicians in their pockets to basically stomp out any competition. My real fear is not that people find creative uses for AI on a small scale, but that Microsoft/OpenAI builds a centralized system that works on their terms, where you are forced to play by their rules and they decide what is fact and fiction.
I haven’t read Zach Rosenberg before, but I put this prompt into ChatGPT, and sure enough, it generated what I presume to be writing in his style:
> Could you write two paragraphs in the style of Zach Rosenberg arguing in favor of shutting down a tool that uses AI to analyze the text of his books?
Did the author consent to OpenAI scanning the text of his books to generate new text emulating his writing style? Where is the outrage over that?
> Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.
Isn't it obvious? Bullies always go after easy targets. And nothing is more popular or loathsome than self-righteous causes for bullies.
"Fair use" only applies to instances of copying / redistributing. The hint is in the name: copy-right.
There's a notion, which seems to have taken off among creators who are paranoid about AI eating their livelihoods (which it might eat a chunk of), that copyright prevents people from doing anything with works they [legally] acquired other than personally read, listen to, or watch them.
That's not how copyright, as it has existed in the past, works. You can do all the algorithmic processing of your ebook collection that you want. You might be able to display small portions of a book to others, depending on the situation.
Quoting one or two paragraphs out of an entire book seems like reasonably safe fair use, but that won't stop a copyright-maximalist creator (or their publisher) from suing you, and won't stop some copyright-maximalist judge from ruling against you, so it's probably best to minimize the amount of content from a book that you redisplay directly. But you can do all the analysis and statistics generation you want, and display those results to others.
It remains to be seen what judges will do with AI generation of works based on ingesting gigantic amounts of copyrighted work. The entire framework of copyright is going to be broken, and until Congress steps in and changes it, judges are going to go every which way. There's no bright line for 4-factor analysis; it's always been a gut-level "is this a reasonable use that doesn't impact commercial sales too much". There's no possible rational way to draw a line. AI models can generate a painting of a new subject only loosely in the style of a contemporary painter, which would not be copyright infringement, or it can generate a near-clone of an existing work with the right prompting, and depending on how clever the prompter is, a lot of intermediate stages of likeness. Who decides how close to an existing work is too close?
> Similarly, Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.
Furthermore, is this actually AI training? This just looks like stats based on heuristics to me, i.e., garden-variety sentiment analysis.
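(Heuristic sentiment analysis of the kind described above often amounts to averaging fixed per-word scores from a lexicon; the hedonometer mentioned earlier in the thread works roughly this way. A toy sketch with a made-up five-word lexicon, purely illustrative and not the site's actual method:)

    # Made-up scores on a 1-9 "happiness" scale; a real lexicon has thousands of words.
    TOY_LEXICON = {"joy": 8.2, "love": 8.4, "grief": 2.0, "dark": 3.5, "bright": 7.0}

    def lexicon_sentiment(text, default=5.0):
        words = [w.strip(".,!?;:") for w in text.lower().split()]
        scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
        # Average the scores of known words; fall back to neutral if none match.
        return sum(scores) / len(scores) if scores else default

    print(lexicon_sentiment("Grief hung over the dark house, but her love was bright."))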
IANAL. But what's your reasoning for thinking this usage of the books wouldn't be "fair use"?
E.g., in the US (https://en.wikipedia.org/wiki/Fair_use), the factors of fair use are "purpose/character of the work", "nature of the copyrighted work", "amount/substantiality of the copyrighted work", and "effect on the market for the copyrighted work".
The website shows a few statistics computed from a book, and a few excerpts from the book.
I'd think a consideration of those fair use factors favours the website: e.g. you're not going to look at those statistics/excerpts instead of reading the book. The website only shows a small portion of the book. The website's intention is to be educational.
There is no “AI” in the project. It literally just seems to do Bayesian sentiment analysis of books. It does NOT then mine that data to score OTHER books. Please actually read the article next time before commenting
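("Bayesian sentiment analysis" here most likely means something like a naive Bayes classifier over word counts, a decades-old technique with no deep learning involved. A minimal sketch with scikit-learn and a toy training set; the training sentences and labels are made up, and this is not the actual model the site used:)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny toy training set; a real model would be trained on many labeled passages.
    train_texts = [
        "a warm, hopeful ending that left me smiling",
        "tense, bleak, and utterly joyless",
        "delightful characters and sparkling dialogue",
        "a grim slog through misery and dread",
    ]
    train_labels = ["positive", "negative", "positive", "negative"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Classify each paragraph of a book to trace its emotional arc.
    paragraphs = [
        "The rain would not stop, and neither would the dread.",
        "At last the letter came, and the whole house felt lighter.",
    ]
    for p in paragraphs:
        print(model.predict([p])[0], "-", p)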
Why wouldn't it be? Indexing is fair use. I think where it would get murky with fair use is if AI could actually plagiarize the book, but other than that, it should be fair use.
Unfortunately, it seems the majority of HN clings to "AI is fair use!" the same way GIFs (of movie and TV snippets) are "perceived to be fair use" (but are simply not fair use in most cases).
I honestly don't understand how people can refuse the idea that some parts most certainly can be reviewed with the help of a computer. There wasn't a "how good is this book" score, because a computer might not be able to tell that yet; I don't understand the issue with looking at the number of adverbs in a book with the help of a computer.
I just think it's really funny that a third or so of the article is the author struggling to figure out why this would be useful to anyone.
> scanned and analyzed a whole bunch of books and would let you call up really useful data on books. [...] Frankly, all of that sounds amazing. And amazingly useful. Even more amazing is that he built it, and it worked. It would produce useful analysis of books.
> This is all quite interesting. It’s also the kind of thing that data scientists do on all kinds of work for useful purposes. Smith built Prosecraft into Shaxpir, again, making it a more useful tool.
Author's general illiteracy aside, he's really giving the game away here. I can't even think about the ethical implications of the project, because why would I care to count the number of adverbs and passive-voice constructions in all books ever, and why would you need a state-of-the-art LLM-powered AI to do it?
he is not even using an LLM, which is kind of the point. the tool is thrown in and judged as an AI tool when it is just simple statistics that anyone who studied some math can build with a little bit of effort.
Also where do you read the author of the website was using LLM?
Imagine seeing the trends in books throughout a certain time period, wars, etc. Or larger trends over the history of all written works. Or all kinds of other neat and useful information that can impact decisions we make today. Do you think all analysis of things you aren't personally interested in is useless?
I generally appreciate TechDirt but this is a very weak argument, the logic is inverted. That some people who are opposed to LLMs were wrong to attack a tool that has nothing to do with LLMs does not allow us to conclude they are wrong about LLMs.
The tool looks like it was useful to a certain kind of person. If it actually made money, I would gladly (and probably easily) replicate it, because I don't really care that much if Internet randos hate me. But I don't have a good idea for how to keep the return / effort ratio high on this.
I don't need consent for a lot of this, and I probably wouldn't bother. If I made a "List of books with terrible sentences" I wouldn't ask for opt-in or even bother contacting the authors. I will just make the list and quote the sentence.
The law and public opinion is on my side, though I only need the former.
My personal opinion is that these tools are mostly useful to suck the "soul" out of a book. They give you templates and stuff and useful statistics to help you go to the lowest common denominator.
The problem is more visible in the movie industry, where they have had script templates for a hundred years now (actual time interval pulled out of a*), but it's starting to show up in books too.
For those happy to "consume" Netflix series and Marvel movies that are indistinguishable from each other except maybe filmed with different actors, it should be fine.
If you want originality in your entertainment it's sad news.
I wonder about a parallel for paintings. What if there was an analysis stating exactly the brushes the painter used, the number of strokes, the exact pigments, etc? Would that, in your opinion, "suck the 'soul' out of a painting"?
I could see this as a brilliant learning tool. A tool to provide deep insight into something that would be very challenging to quantify personally. I think all this would make future authors better, not worse.