I've been interested in NLP for tagging stories by topic and theme (detectives, werewolves, murder mysteries, etc.), so I need accurate part-of-speech disambiguation and ways of detecting the metaphors, similes, etc. used to describe them. I also want to be able to assess how much of a text is about a given topic, so that if I'm interested in reading a detective story from, e.g., the Project Gutenberg collection, I don't pick up a story where a detective is only mentioned in one paragraph.
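A crude way to quantify that "how much of the text is about the topic" criterion (this is my own sketch, not any framework's API; the function name and keyword approach are made up for illustration) is to measure the fraction of paragraphs that mention a topic keyword:

```python
import re

def topic_coverage(text: str, keywords: set[str]) -> float:
    """Fraction of paragraphs that mention at least one topic keyword."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(
        1 for p in paragraphs
        if any(re.search(rf"\b{re.escape(k)}\b", p, re.IGNORECASE) for k in keywords)
    )
    return hits / len(paragraphs)

story = ("The detective examined the scene.\n\n"
         "It rained all night.\n\n"
         "A detective's work is never done.\n\n"
         "They ate breakfast.")
print(topic_coverage(story, {"detective"}))  # 0.5
```

A story where the detective appears in one paragraph out of fifty scores near zero, while a genuine detective story scores high, which is exactly the filter described above (though real tagging would want lemmas and synonyms, not raw keywords).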
I've looked at several existing NLP frameworks (Open NLP, Stanford NLP) and none of them are accurate enough -- they fail on things like adjectives and Old English second-person pronouns. This makes them practically unusable for proper sense disambiguation, lemma- and part-of-speech-based rules, etc.
The Open NLP tokenizer is also terrible at tokenizing title abbreviations ("Dr", etc.) and things like the use of "--" to delimit text, which is frequently found in various Project Gutenberg texts. You can train the Open NLP tokenizer, but it only handles what it has seen, so you need to give it every variation of "(Mr|Mrs|Miss|Ms|Rev|Dr|...). [A-Z]" for it to tokenize those titles; the same goes for other tokens.
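For comparison, a hand-rolled rule for this case (a minimal sketch of my own, not the Open NLP API; the abbreviation list is illustrative) can simply treat a closed list of titles as tokens that keep their trailing period, and treat "--" as a token in its own right:

```python
import re

# Hypothetical title-abbreviation list; extend it for your corpus.
TITLES = {"Mr", "Mrs", "Miss", "Ms", "Rev", "Dr"}

def tokenize(text: str) -> list[str]:
    """Whitespace/punctuation tokenizer that keeps 'Dr.' etc. as one token
    and emits '--' as its own delimiter token."""
    tokens = []
    for tok in re.findall(r"[A-Za-z]+\.?|--|[^\sA-Za-z]", text):
        # Split off a trailing period unless the word is a known title.
        if tok.endswith(".") and tok[:-1] not in TITLES:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Dr. Watson -- said Holmes."))
# ['Dr.', 'Watson', '--', 'said', 'Holmes', '.']
```

The point is that a dictionary lookup covers every "Title. Name" variation at once, where a purely example-driven trained tokenizer needs to have seen each combination.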
Consider the letters section of Physical Review, which was spun off as the weekly journal Physical Review Letters back in the 50s. It was effectively a blog or forum back when papers were distributed on paper.
Basically, because of the slow pace of review and publication, the letters column became a way to talk about recent results or problems, and then follow-up letters (i.e. comments on the blog posts) became common. So the editors decided to hive it off and speed up its publication schedule.
Thank **ing god! Hiding behind jargon and "the process" is an indicator of having nothing to say. I see this as a rolling up of the metaphorical sleeves, a sign that stuff is actually happening.
That particular headline pattern is used a lot, and 99% of the time it does indeed front opinion pieces written by individuals. But this time it's different. In this specific syntactic form, the plural and the singular come out the same. E.g. in the sentence "I accompanied my friend to his parent's house", the singular "parent's" and the plural "parents'" sound identical: it can be either the house of his single mom or dad, or the house of his two parents living together.
Yeah, those aren't good, especially in areas like healthcare where a PhD student can't get the needed data.
The other issue is that if you do focus on LLMs, the area is so hyped that your research would be too overlapping/competing, especially as you've got a dissertation to write. It's a hard problem.
According to the article, the original research on language models was kick-started by Claude Shannon's early work on Markov chain models of English words.
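Shannon's word-level approximations are easy to reproduce in miniature; this is a generic bigram sketch of my own (the tiny sample "corpus" is a fragment reminiscent of Shannon's second-order word approximation, not real training data):

```python
import random
from collections import defaultdict

def build_bigram_model(words):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, n, seed=0):
    """Shannon-style random walk over the bigram table."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the head and in frontal attack on an english writer that the".split()
model = build_bigram_model(corpus)
print(generate(model, "the", 5))  # the head and in frontal
```

Scaled-up versions of exactly this table-plus-random-walk idea are what the n-gram era of language modeling was built on.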
If you are in the field of Information and Communication Technology (ICT), there is hardly any area whose fundamentals do not have Shannon's hands in them.
Leonard Kleinrock once remarked that he had to focus on the then-exotic field of queueing theory, which later led to packet switching and then the Internet, because most of the fundamental problems in electrical and computer engineering (the older version of ICT) had already been solved by Shannon.
Isn't the main problem with NLP research now that you'll need a ton of money to run your experiments? How can an "average" PhD researcher hope to validate their hypothesis if they need several thousand dollars per test?
Much of the interesting pure research can be done at smaller scale, the larger models are arguably more product engineering than research. At least from a certain perspective.
rhdunn | 2 years ago
nl | 2 years ago
I find it substantially better than other tools as a PoS tagger.
Also worth noting that your assertion that you need these features to classify genres isn't obviously true to me at all.
viksit | 2 years ago
dontupvoteme | 2 years ago
teruakohatu | 2 years ago
codethief | 2 years ago
gumby | 2 years ago
sdwr | 2 years ago
SecretDreams | 2 years ago
penguin_booze | 2 years ago
vpastore | 2 years ago
dcl | 2 years ago
SecretDreams | 2 years ago
est31 | 2 years ago
reesul | 2 years ago
totorovirus | 2 years ago
whatyesaid | 2 years ago
teleforce | 2 years ago
etamponi | 2 years ago
nl | 2 years ago
There are plenty of research directions outlined in this document that don't require a huge compute budget.
jasmer | 2 years ago
Randomizer42 | 2 years ago
al__be__rt | 2 years ago
lgessler | 2 years ago