3wolf|4 years ago
I'd be curious to see what the plot of average annual cosine distance would look like when using different sets of pre-trained embeddings. I suspect the corpus used is biased toward more recent documents. It wouldn't surprise me if there's more variance in the embeddings of documents that look less like those in the training set, e.g. if you were to embed documents written in German you may get some extreme outliers.
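The "average annual cosine distance" metric being discussed is easy to sketch: bucket document embeddings by year, then average the pairwise cosine distances within each bucket. A minimal sketch with NumPy, assuming you've already produced the embeddings (the `embeddings_by_year` structure and comparing multiple pre-trained embedding sets are both hypothetical here):

```python
import numpy as np

def avg_annual_cosine_distance(embeddings_by_year):
    """For each year, return the mean pairwise cosine distance
    among that year's document embeddings."""
    result = {}
    for year, embs in embeddings_by_year.items():
        X = np.asarray(embs, dtype=float)
        # Normalize rows so dot products become cosine similarities.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = X @ X.T
        # Average distance (1 - similarity) over distinct pairs only.
        i, j = np.triu_indices(len(X), k=1)
        result[year] = float(np.mean(1.0 - sims[i, j]))
    return result

# Toy example: orthogonal vectors are maximally distant,
# identical vectors have zero distance.
by_year = {
    2000: [[1.0, 0.0], [0.0, 1.0]],
    2001: [[1.0, 0.0], [2.0, 0.0]],
}
print(avg_annual_cosine_distance(by_year))
```

Re-running this with embeddings trained on different corpora (or different eras) would show how much of the trend is an artifact of the embedding model rather than the documents, which is the commenter's point about out-of-distribution inputs like German text.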
avs733|4 years ago
The paper graph shows a decrease from like .1 to .075 over 30 years
The blog post shows a decrease from .35 to .1 over 100 years
However, the Crimson had a notable drop around 2000 - on the order of the entire decrease in the research study - and then looks fairly stable.
It’s almost like there are policy and effective communication reasons for narrowing your vocabulary to that appropriate for a target audience.
If you look at the graph from the underlying study, there's a bit of a "shoulder" around 1997. That is telling. 1997 was when the NSF introduced a new, clear, and explicit set of review criteria - broader impacts and intellectual merit [0]. That change alone is likely a cause of significant (meaningful?) linguistic narrowing. To get NSF funding, researchers now had to explain why their research matters using explicit language that aligned with specific strategic objectives of the funding agency.
Then you add a layer of Goodhart's law. During this time period, two other things were also happening. First, universities increasingly began to rely on external funding - especially public universities. Second, the field of "faculty development" was increasingly formalizing and offering training and support on things like grant writing. Those trainings include a lot of focus on using normalized, almost shibboleth-like, language in grant applications - ensuring that there is repetition of key words so that it is easy for the reviewers to establish that a particular grant application addresses the required review criteria.
So the data set used here is part of the problem - if they had used submitted rather than funded applications, they would likely see different results. Not because ideas are narrowing, but because grant applications are basically a human-manifested API, and someone tried to actually standardize it.
[0] https://stem.colostate.edu/a-history-of-the-broader-impacts-...