3wolf|4 years ago
I'd be curious to see what the plot of average annual cosine distance would look like when using different sets of pre-trained embeddings. I suspect the corpus used is biased toward more recent documents. It wouldn't surprise me if there's more variance in the embeddings of documents that look less like those in the training set, e.g. if you were to embed documents written in German you may get some extreme outliers.
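The "average annual cosine distance" metric being discussed is easy to sketch: bucket document embeddings by year, then average the pairwise cosine distances within each bucket. A minimal sketch with NumPy, assuming you've already produced the embeddings (the `embeddings_by_year` structure and comparing multiple pre-trained embedding sets are both hypothetical here):

```python
import numpy as np

def avg_annual_cosine_distance(embeddings_by_year):
    """For each year, return the mean pairwise cosine distance
    among that year's document embeddings."""
    result = {}
    for year, embs in embeddings_by_year.items():
        X = np.asarray(embs, dtype=float)
        # Normalize rows so dot products become cosine similarities.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = X @ X.T
        # Average distance (1 - similarity) over distinct pairs only.
        i, j = np.triu_indices(len(X), k=1)
        result[year] = float(np.mean(1.0 - sims[i, j]))
    return result

# Toy example: orthogonal vectors are maximally distant,
# identical vectors have zero distance.
by_year = {
    2000: [[1.0, 0.0], [0.0, 1.0]],
    2001: [[1.0, 0.0], [2.0, 0.0]],
}
print(avg_annual_cosine_distance(by_year))
```

Re-running this with embeddings trained on different corpora (or different eras) would show how much of the trend is an artifact of the embedding model rather than the documents, which is the commenter's point about out-of-distribution inputs like German text.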
avs733|4 years ago
The paper graph shows a decrease from like .1 to .075 over 30 years
The blog post shows a decrease from .35 to .1 over 100 years
However, the Crimson had a notable drop around 2000 - on the order of the entire decrease in the research study - and then looks fairly stable.
It’s almost like there are policy and effective communication reasons for narrowing your vocabulary to that appropriate for a target audience.
If you look at the graph from the underlying study, there's a bit of a "shoulder" around 1997. That is telling. 1997 was when the NSF introduced a new, clear, and explicit set of review criteria - broader impacts and intellectual merit [0]. That change alone is likely a cause of significant (meaningful?) linguistic narrowing. To get NSF funding, researchers now had to explain why their research matters using explicit language that aligned with specific strategic objectives of the funding agency.
Then you add a layer of Goodhart's law. During this time period, two other things were also happening. First, universities increasingly began to rely on external funding - especially public universities. Second, the field of "faculty development" was increasingly formalizing and offering training and support on things like grant writing. Those trainings include a lot of focus on using normalized, almost shibboleth-like, language in grant applications - ensuring that there is repetition of key words so that it is easy for the reviewers to establish that a particular grant application addresses the required review criteria.
So the data set used here is part of the problem - if they had used submitted rather than funded applications, they would likely see different results. Not because ideas are narrowing, but because grant applications are basically a human-manifested API, and someone tried to actually standardize it.
[0] https://stem.colostate.edu/a-history-of-the-broader-impacts-...