Programmer creates 800,000 books algorithmically, starts selling them on Amazon

151 points| Libertatea | 13 years ago |extremetech.com | reply

138 comments

[+] duey|13 years ago|reply
I work in ecommerce and this guy and others like him have been creating this stuff for 3-4 years. It's a massive annoyance for us, as the only real identifier we get from suppliers is the publisher name. We blocked them for a while, but then we started seeing a massive range of publisher names coming through in order to get around the blocks (I assume every other retailer was blocking them as well).

When we first started seeing the automated content books, literally overnight our product set increased by several million titles - I now estimate that about 20 million of the 50 million products we have are automated books (they come out with new "editions" all the time as well). This obviously has a massive impact on our search results - these books have keyword-laden titles and descriptions, and without a solid identifier it was very difficult to get rid of them. Thankfully, the suppliers that print these products have recently started flagging them as crapware.

As for the customer response to this type of product - it's definitely negative, with a massive return rate. As far as I am concerned, this is a massive scam - banking on the fact that some people are too lazy to return the books.

[+] powershop-co|13 years ago|reply
I wonder if it wouldn't be a good idea to "hellban" them. When they visit the site, they see the books - but for all other users they'd effectively be invisible. It might not be worth the programming effort, though. I wonder how many books this guy sells.
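
A minimal sketch of the hellban idea described above, assuming each listing carries a publisher name (the only identifier mentioned upthread) and that the site can tell which publisher, if any, is viewing. The names `Listing`, `visible_listings`, and the sample publishers are hypothetical, not any retailer's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    title: str
    publisher: str  # assumed to be the only identifier available


def visible_listings(listings, viewer_publisher, hellbanned):
    """Return the listings a given viewer should see.

    A hellbanned publisher still sees its own listings; every other
    viewer gets the catalogue with those listings filtered out.
    """
    return [
        item
        for item in listings
        if item.publisher not in hellbanned or item.publisher == viewer_publisher
    ]


# Hypothetical sample data, for illustration only.
CATALOGUE = [
    Listing("A Hand-Written Novel", "Small Press"),
    Listing("Outlook for Widgets", "AutoBooks"),
]
HELLBANNED = {"AutoBooks"}

# An ordinary shopper (no publisher identity) does not see the banned listing.
shopper_view = visible_listings(CATALOGUE, None, HELLBANNED)
# The banned publisher still sees everything, including its own book.
publisher_view = visible_listings(CATALOGUE, "AutoBooks", HELLBANNED)
```

The filter itself is trivial; the hard part, as noted upthread, is knowing which publisher names belong to the same operator and recognising them when they visit.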
[+] rwg|13 years ago|reply
Just what Amazon needs: 800,000 computer-generated books crapping up the listings along with the thousands of on-demand printed copies of Wikipedia articles. Wonderful.
[+] msutherl|13 years ago|reply
I can't believe this guy has no sense of humor[1]! He's completely serious about providing a service and making tons of money from it.

In one question the interviewer asks: 'could you make a novel for instance?' to which he replies 'well, novels don't usually make money'. Wouldn't you think the correct answer would be 'well, it wouldn't be very good'?

Hauntology at its best[2].

[1] http://www.youtube.com/watch?v=m8WuGKyBR90

[2] http://booktwo.org/notebook/hauntological-futures/

[+] podperson|13 years ago|reply
Although, to be fair, while Amazon doesn't need this, it kind of deserves it.
[+] rockmeamedee|13 years ago|reply
These aren't actually 'books'. They're mostly business reports, which I hate to say, are already programmatically compiled by interns repetitively applying formulas in spreadsheets and building graphs from database queries. That's why they cost more than paperbacks, because "The 2009-2014 Outlook for Plastics Lamp Shades in the United States" is essential for the people in the business.

He also has medical dictionaries which seem to confuse his cause, as he explains "For health titles, only the format editing and production side is automated. The text in the health books was written by medical professionals and edited by a professional editor; the computer expedited formatting using about 50 odd routines (the preface, chapter intros, glossaries, indexes, headings, margins, etc.)"

[+] drucken|13 years ago|reply
So ... automated content aggregation in book format?

Is it just me or does it seem odd that the agents with the most,

* technology

* resources (spanning all non-technology areas including business and legal)

* digital content access

* familiarity

* etc.

chose not to create such offerings, not even via subsidiary or ex-employees.

I'm talking, of course, of agents such as Google.

To me this speaks volumes of whether this is ultimately (legally) viable...

Note. Publicity around this particular offering has been around since at least 2007, from what I can tell.

[+] Zenst|13 years ago|reply
I'm sure this will really not help phrase searching in future digital library archives.

Next we will have entire Usenet archives published as ebooks; at least a few threads in there were, in their day, of more interest.

[+] wdr1|13 years ago|reply
As long as you have good search functionality, it shouldn't matter. From what I can tell, so far Amazon does.
[+] jrabone|13 years ago|reply
Aw, man. I used to hate this guy when I worked on the title authority team. The programmatic titles alone often caused the title matchers to infer that all the books were somehow related, and cluster them all together. Oh, the happy hours spent unpicking the resulting mess. The Wikipedia guy was the worst in terms of sheer pointlessness though. Personally I'd delete the lot of them. Hard to think that your job is worthwhile when it consists of cleaning up after other people's crappy perl scripts...
[+] moe|13 years ago|reply
"Hard to think that your job is worthwhile when it consists of cleaning up after other people's crappy perl scripts..."

At least you are not alone. Every sysadmin in the world knows how you feel.

[+] mapt|13 years ago|reply
Why does Amazon allow it? It's somewhere between outright fraud and DoSing of Amazon's search functionality; it's clearly something that constitutes abuse.

Similarly, if I were to fund a human team of title-writers to create a hundred plausible original titles per hour, and fill the actual pages with random words from the OED... Advertising that as a book on a particular topic would likely be some type of fraud. The result is not far from what I've seen of this guy's work.

[+] robryan|13 years ago|reply
This extends into lots of areas on the marketplace. I spend a bit of time looking at some of the listings our products have been matched to by UPC and working out just which product the listing is trying to sell. Often the title, image, and description each describe a different product.
[+] thematt|13 years ago|reply
Ugh. It feels like the cesspool of the SEO underworld is physically manifesting itself.
[+] nickpinkston|13 years ago|reply
I was waiting for that to come from the 3D printing crowd, but books on-demand work too.
[+] philparker|13 years ago|reply
Phil Parker here. Interesting posts. Here are some links that can clarify for some:

Here is a piece about reaching underserved subjects/languages: http://www.huffingtonpost.com/jeff-jarvis/davos-2011-too-lit...

Here is a current project dealing with agriculture: http://gulfnews.com/news/gulf/uae/education/campus-in-abu-dh...

Here is a poetry project (graph theoretic stuff): http://totopoetry.com/poetry/credits/edgepoetry.htm

My favorite page: http://www.totopoetry.com/search.asp?word=truth

I have used this approach to write definitions as well (www.websters-online-dictionary.org). The following contrasts definitions of zealously:

1. In a zealous manner. [Human]

2. In an enthusiastic, fervid, ardent or fervent manner. [graph theoretic]

3. In a fanatical manner. [graph theoretic]

A vid on fiction automation: http://vimeo.com/17168987

A debate/reaction amongst literature people: http://www.thepassivevoice.com/10/2012/can-robots-really-wri...

Cheers

Phil

p.s. most of the “books” are used by businesses in narrow markets, and are econometrically estimated, not compiled from internet sources.

[+] Magenta|13 years ago|reply
Phil! Is that Phil Phil or motion-capture salamander Phil?
[+] bangbang|13 years ago|reply
Do you not see your texts as noise in the literary signal?
[+] tejaswidp|13 years ago|reply
Could you tell us how many books you have been able to sell on Amazon?
[+] jsilence|13 years ago|reply
In the video it can be seen that a fair amount of what the automated system does is versatile formatting ... in Word and Excel. But why oh why? (Semi)-automated formatting of documents is a problem that has been solved with expert results in LaTeX.

Also, deviations in the data are recognized and highlighted, but not (yet?) examined and elaborated upon. No doubt this will be possible in the near future.

So, kudos for the general approach. Can't wait for the automated reading programs for digesting these books. And that is meant only half jokingly. The data is there, the general knowledge is there and there is enough reasoning power to draw conclusions. The next step would be to make automated decisions based on the available data, so politicians could join the authors in being unemployed...

Well I for one welcome our new text blasting overlords.

[+] justhw|13 years ago|reply
"The 2009-2014 Outlook for Wood Toilet Seats in Greater China" [1] seems to be the most reviewed book (35 reviews) at 3.3 stars, being sold for $495 USD.

[1] http://www.amazon.com/2009-2014-Outlook-Toilet-Seats-Greater...

[+] jyap|13 years ago|reply
Should be noted that they are all joke reviews.
[+] blazingfrog2|13 years ago|reply
Probably the best line in the wealth of well-written comments (yes, I'll admit I did read too many):

Also, I didn't like the Scratch N Sniff parts.

[+] thetrb|13 years ago|reply
If it was $4.95 I might have ordered it as a Christmas gift. But $495 is slightly too much...
[+] greenyoda|13 years ago|reply
Check out the "customer images" for this book...
[+] JasonFruit|13 years ago|reply
At 29 USD for 38 pages, Basketry[0] doesn't look like a particularly useful buy.

Also, from the description: "…editorial decisions to include or exclude events is purely a linguistic process." Is it really correct to describe that as an editorial decision? (Not to mention "editorial decisions…is"?)

[0]: http://www.amazon.com/Basketry-Websters-Timeline-History-700...

[+] greenyoda|13 years ago|reply
The index for this book is also useless. It seems to index every word in the text, whether it's significant or not; for example, "September". Also the index has entries for "Nova" and "Scotia", but not "Nova Scotia" (which is presumably where these two words came from).
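
For illustration, a naive word-level indexer of the kind that would produce exactly this behaviour. This is a guess at the failure mode, not the book's actual tooling; the function name `naive_index` and the sample sentence are made up.

```python
import re
from collections import defaultdict


def naive_index(pages):
    """Map every lowercased word to the sorted page numbers it appears on.

    Nothing is filtered for significance and nothing is grouped into
    phrases, so multi-word names are split into separate entries.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[A-Za-z]+", text):
            index[word.lower()].add(page_no)
    return {word: sorted(page_nos) for word, page_nos in index.items()}


# Hypothetical page text, for illustration only.
SAMPLE_PAGES = {1: "In September a basket maker settled in Nova Scotia."}
INDEX = naive_index(SAMPLE_PAGES)
```

Here `INDEX` has separate entries for "september", "nova", and "scotia", and no entry for "nova scotia". A useful index would need at least a stop list and some phrase or named-entity detection.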

I don't think that real authors would face any significant competition from this guy.

[+] krenoten|13 years ago|reply
The reviews tend to paint a pretty grim view of the quality of these books, but the sarcastic ones are some of the funniest reviews I've ever read.

Here's an honest one that shines a light on the quality:

"The description for this book is TOTALLY misleading. It is NOT a book of quotations and phrases. It is a reference book of where to look to find possible quotations - like the old filecard cabinets in the library. On the few pages where you can actually find a quote, it reads like this one: Jack London, from Jerry of the Islands, "I am writing these lines in Honolulu, Hawaii." Huh?? That's it. That's all there is! I'm not sure who would use this book. Certainly not me! I was very disappointed as I was looking for a collection of quotes from notables like Mark Twain, Jack London, etc."

Edit: Sort by avg rating. Goldmine of comedy in the reviews for "Butts" and "Scrotum" books. Wow.

Here's the review for "The 2009-2014 Outlook for Plastics Lamp Shades in the United States":

"(4/5 stars) An instant classic in the Icon style

While this outlook hardly holds a candle to comparable classics such as The World Market for Silica Sands and Quartz Sands: A 2009 Global Trade Perspective, the information is invaluable for any red-blooded American. The five-year span is parsed in fascinating prose, and the 176 pages fly by, feeling almost like a 150-page work.

Don't let your lack of background knowledge deter you - there isn't too much reference to the 2004-2009 report, and most of the important information is explained in exposition.

Luther Blaze runs the show in this non-stop thrill ride of an economic adventure. His last outing ended in the government setting up a secret commission to investigate his possible wrongdoing in stopping the mysterious project Mantis, but now he's back, and ready to run roughshod over anyone in his way.

At a price of approximately 2.81 per page, you know you're getting your money's worth with this paperback. It's a perfect read for the park, or a lazy Sunday afternoon. On a side note, this book is a real pick-up gem! I personally attracted no less than three beautiful women, all of whom wanted my thoughts on the challenging themes and motifs. They all gave their numbers, and there's been no looking back!

The biggest problems with this book are largely physical aspects of the book. I didn't care for the font too much. And the beige background on the cover betrays the intrigue within.

The Icon International group has hit another classic out of the park. I look forward to the next book with baited breath, and I can't wait to see how Agent 71 and Dash get out of this jam."

[+] jonnathanson|13 years ago|reply
The funniest part of "Scrotum" is that it's priced at $28.95. For an 84-page paperback, whose sole purpose is to trace the usage of the word "scrotum" throughout an oddly specific period in English linguistic history (1678 - 2007). I mean, perhaps the content farmer should also write a pricing algorithm?

The review is, indeed, ball-bustingly hilarious.

[+] iyulaev|13 years ago|reply
Reminds me of SCIgen: http://pdos.csail.mit.edu/scigen/

Generate conference-ready CS papers in seconds!

[+] wazoox|13 years ago|reply
Thank you, you made my day. This is really fantastic. I particularly appreciate the nonsensical scales on the graphs (time expressed in dB, number of CPUs in joules) and the hilarious fake reference papers scattered with well-known names (even Erdos :).
[+] hashmymustache|13 years ago|reply
What a greedy misappropriation of an otherwise incredible tool.

I assume what pushed him into the market was the software's economic analysis of the latent market for spam books on Amazon, 2010-2014.

As a medical student, I particularly love books like this[0] with their description: "If your time is valuable, this book is for you. First, you will not waste time searching the Internet while missing a lot of relevant information. Second, the book also saves you time indexing and defining entries. Finally, you will not waste time and money printing hundreds of web pages."

How is this not fraud?

[0]: http://www.amazon.com/Stevens-Johnson-Syndrome-Dictionary-Bi...

[+] guimarin|13 years ago|reply
This guy is clearly in the wrong business. Books? Who cares about books? He should turn his system to writing patents.
[+] xefer|13 years ago|reply
How does it determine the price? By the pound? Amazon should clearly mark these as computer generated. I'll bet most of the purchases are from people who didn't know what they were even getting.
[+] tjr|13 years ago|reply
Maybe that random Amazon-buying-bot will end up getting some of these.
[+] Thrall|13 years ago|reply
Is there evidence that anyone's actually bought one?
[+] westicle|13 years ago|reply
Roald Dahl came up with this idea decades ago.

For anyone not familiar with his work, I highly recommend his collection of short stories titled "The Great Automatic Grammatizator". The major plot of the eponymous story is about a hacker who builds a novel-writing machine so that he can drive authors out of the market by out-producing them.

Very entertaining and, it seems, prophetic.

[+] DanBC|13 years ago|reply
Does the Library of Congress (and the other national library collections) have to keep copies of all these books?

Because actually now I'm angry that tax dollars are spent buying, shipping, and storing garbage.

[+] jebblue|13 years ago|reply
>> Because of his specialty in marketing, it’s easy to assume that these books are designed for spam-like purposes

I just want to know, is this even legal? Is he selling books?

http://en.wikipedia.org/wiki/Book

"The body of all written works including books is literature."

http://en.wikipedia.org/wiki/Literature

"Literature is commonly classified as having two major forms—fiction and non-fiction—and two major techniques—poetry and prose."

http://en.wikipedia.org/wiki/Prose

"Prose is a form of language which applies ordinary grammatical structure and natural flow of speech rather than rhythmic structure (as in traditional poetry)."

http://en.wikipedia.org/wiki/Natural_speech

"In the philosophy of language, a natural language (or ordinary language) is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect."

The answer is no, since the text did not come from human intellect but computer programming; they can't be classified as books.

[+] Thrall|13 years ago|reply
If you go looking for overly specific definitions of words in natural language, which is effectively defined by its usage, any conclusions you draw from these 'definitions' will be flawed at best. Obtaining your fundamental definitions from selected lines in Wikipedia adds bias as well as inaccuracies.

Also, even assuming your definitions were reasonable, you fail to explore the possibility that his books are poetry.

[+] kgc|13 years ago|reply
Title is incorrect. There are only 522,826 of this author's books listed on Amazon, not 800,000: http://goo.gl/byxKn
[+] Yhippa|13 years ago|reply
So this guy has been able to find a common schema across several different domains and add rules to it to churn out content. I like the concept. I watched the video in the article and could see how this could be used for instruction in cases where resources like trained teachers are scarce.

I've had a feeling for a long time that, due to the predictability of humans and our processes, this is inevitable. I think it's great that he can do this for things like instruction, but if this were to get "smart" enough it would put a lot of people out of work.

[+] DanBC|13 years ago|reply
I'm curious about the licensing - he's scraping some stuff from Wikipedia, but it's not clear which bits.

I'm also confused about the super high price. Is this deliberate, to avoid having to refund very many unhappy customers?

And while these books are probably awful, he's going to be known in the future as one of the innovators of auto-generated content. At least he's not breaking spam filters with Markov chains.

It's gently odd that AI got stuck for a while; I very much hope that AI research and practice gets a bit more attention and funding.