item 13779644

Why literature is the ultimate big-data challenge

85 points | Hooke | 9 years ago | economist.com

21 comments

DrNuke | 9 years ago
Can't really agree: literature is literature because of form, not content; every literary work is its own separate world, with references to other similarly separate worlds from the past. Data analysis can help find invariants, as in Meter in Poetry by Prof. Nigel Fabb (https://www.amazon.co.uk/Meter-Poetry-Theory-Nigel-Fabb/dp/0...), but not the reason why literature is literary at all, which comes from the social sciences. Mythopoesis as the cumulative sum of the real and imaginary worlds thought up by humans is another possible result, but that would not be literature; it would be languages and the theory of language instead.
wjn0 | 9 years ago
> every literary work is its own separate world with references to other similarly separate worlds from the past.

This sounds like the basis of hypertext fiction (https://en.wikipedia.org/wiki/Hypertext_fiction), which has of course existed for much longer than big data as a concept.

As for what characterizes literature, I'm inclined to agree with you. However, is there not inherent value in another (albeit computational) reading, so to speak? If we take Barthes to be correct, what does it say when a well-trained NLP model draws similar conclusions to humans with regard to imagery, analogy, irony, metaphor, etc. when reading major literary works? Different conclusions? Or no conclusions? What if some works "compute" and some don't? What if some works' features are culture-independent (i.e., a model trained on an Eastern literary corpus computes features similar to those of a model trained on a Western corpus) while other features aren't?

Perhaps these questions are more superficial than I'm making them out to be, but it seems presumptuous to assume that methods that look at this problem from this angle won't get at _any_ literary features.

hackuser | 9 years ago
We often start with the null hypothesis of artistic "exceptionalism, which imagines him [or her] as a freak of isolated genius", i.e., we assume that the credited creator worked alone.

I think that's wrong: few serious human endeavors are accomplished alone. In the arts, look at stories about how works were really created. Pop music is an easy example because it's well known: songs are written with a suggestion from a friend or input from the producer, are based on something the performer heard on the train, or come together when someone uncredited sits in on the session and provides the key hook. A friend is writing a book, and I spent hours reading it and offering suggestions; I'll receive no credit (and I don't want or deserve it). In the code people write, how much is done without any help from others, without using existing code and ideas? As the saying goes: good artists borrow; great artists steal.

aghillo | 9 years ago
Howard Becker's book Art Worlds looks at this very issue; it's a very interesting read.
Jun8 | 9 years ago
Although the field had its crackpots, it was never as off-the-rails as the Economist writer makes it out to be. Mosteller and Wallace's analysis of the Federalist Papers is a well-known early effort (https://priceonomics.com/how-statistics-solved-a-175-year-ol...).

Part of my MA covered this topic (stylometry); you can take a look (https://www.dropbox.com/s/3q9ljrgnntgs6ee/t.pdf). It lists some of the references up to that time (c. 2004).
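
The core of the function-word approach mentioned above can be sketched in a few lines. This is a toy illustration, not Mosteller and Wallace's actual method or word list: the texts are invented and the marker words are a small subset of the roughly 70 they examined.

```python
from collections import Counter

# A few of the marker words associated with the Federalist study
# (illustrative subset only, not the study's full list).
FUNCTION_WORDS = ["upon", "while", "whilst", "by", "to", "also"]

def function_word_rates(text, per=1000):
    """Relative frequency of each marker word per `per` tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens)
    return {w: counts[w] * per / n for w in FUNCTION_WORDS}

# Invented toy texts mimicking the kind of habit the study exploited:
# one author leans on "upon", the other on "whilst".
hamilton_like = "upon the whole it appears that the power is vested upon congress " * 50
madison_like = "whilst the states retain their powers the union is preserved by law " * 50

print(function_word_rates(hamilton_like))
print(function_word_rates(madison_like))
```

Function words work well for attribution precisely because authors use them habitually and unconsciously, largely independent of topic.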

Isamu | 9 years ago
I would be interested in reading some technical papers on the subject; could someone post a few, whether about Shakespeare or other literary analysis?

That said, this is SO not big data. It is small data, but potentially interesting analysis.

NarcolepticFrog | 9 years ago
I think it depends on what you mean. I don't think the number of bytes alone should be used to measure the size of a dataset. For example, if I have 1TB of all 1's, that's a lot of data, but not very interesting.

I think a more nuanced notion of size is the /information content/ of the datasets. I haven't thought about it carefully, but I'm sure you can quantify this more explicitly in terms of information theory or other non-bit-based complexity measures.

From this point of view, literature may not be large (in terms of the number of bytes), but in terms of information content, it is incredibly dense. A large portion of human knowledge is written somewhere. From this point of view, it is in fact big data.
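
One quick way to make the "1TB of all 1's" point concrete is compression: compressed size is a rough, practical proxy for information content (a sketch only, not a rigorous complexity measure; scaled down to 1MB so it runs quickly).

```python
import os
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size over raw size: a crude proxy for information density."""
    return len(zlib.compress(data, level=9)) / len(data)

ones = b"1" * 1_000_000        # highly redundant: lots of bytes, little information
noise = os.urandom(1_000_000)  # incompressible: nearly every byte carries information

print(f"all 1's: {compressed_ratio(ones):.4%}")
print(f"random : {compressed_ratio(noise):.4%}")
```

The redundant data shrinks to a tiny fraction of its raw size, while the random data barely shrinks at all, which is exactly the distinction between byte count and information content made above.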

woliveirajr | 9 years ago
I'm interested in this subject, as I've studied it before. It was really nice to see how many techniques are used, with function words being just one (with good results, it must be said). Even reducing words to character n-grams is interesting, since it captures the root of each word (leaving prefixes and suffixes out of the question).
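
The n-gram point can be shown directly: inflected forms of the same stem share most of their interior character n-grams, so n-gram profiles capture the root while downweighting affixes. A minimal sketch (the boundary marker `_` and n=4 are arbitrary choices for illustration):

```python
def char_ngrams(word: str, n: int = 4) -> list[str]:
    """All character n-grams of a word, padded with boundary markers."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Two inflections of the same stem share the n-grams covering the root,
# while the suffix n-grams ("ing_", "ner_") differ.
a = set(char_ngrams("running"))
b = set(char_ngrams("runner"))
print(sorted(a & b))  # the shared n-grams cluster around the stem
```

In stylometric practice, vectors of such n-gram frequencies are compared across texts, which makes the method robust to inflection without needing a language-specific stemmer.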
awinter-py | 9 years ago
Interesting that they use Romeo & Juliet as the first example and attribute it to Marlowe.

Most of Romeo & Juliet is copied scene-for-scene from the English translation of an Italian play of the same name. Most of the lines you remember from the play were added in Shakespeare's version.

In the example in this article, I'm getting chills from how much better the Shakespeare line is than the Marlowe line he stole.

If you read R&J side by side with its Italian source, start with the 'if I profane' scene ('let lips do what hands do').