
Things I Learned Building The Most Powerful Language Processing Engine

46 points | drakaal | 13 years ago | xyhd.tv | reply

59 comments

[+] gliese1337|13 years ago|reply

  Linguists don’t know squat about grammar in modern times. Everything is a verb these days. I Google things. I FedEx things. My game gets nerfed.

That's a bit broad. Talk to the right kind of linguists, and yes, they do know these things.

Every problem he talks about is well-known. Not to say that they aren't still problems; they are problems, and hard ones, that linguists (computational and otherwise) don't have good standard solutions for. Thing is, most of the time, you can get away with not bothering to solve the real problem. Most of the time, you can pretend that white-space separates words and periods separate sentences and that 8 parts of speech is good enough, and it will be good enough. And thus the incentive to spend lots of time and research money on solving all of the big problems in natural language processing is reduced.
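Those "good enough" assumptions fail fast on ordinary prose. A toy sketch of both failure modes (the example sentence is invented for illustration):

```python
text = "Dr. Smith bought 2.5 kg of apples. Mr. Jones didn't."

# Naive assumption: periods separate sentences.
naive_sentences = [s.strip() for s in text.split(".") if s.strip()]
# This splits inside "Dr.", "2.5", and "Mr." too, yielding five
# fragments where there are really two sentences.
print(naive_sentences)

# Naive assumption: white-space separates words.
naive_words = text.split()
# Punctuation stays glued to tokens ("apples.", "didn't."), and
# the clitic in "didn't" is never separated out.
print(naive_words)
```

Both outputs are plausible-looking garbage, which is exactly why the naive approach survives until the day it can't.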

Except when it isn't good enough, and we wonder "why the heck hasn't this been solved yet? Doesn't anybody realize that this library is totally broken?" But yeah, we know that it's broken, we know what the problems are, they're just really frickin' hard.

[+] drakaal|13 years ago|reply
You are right, there are a select few. But it is very few. I linked to the Quora answer as just one example. The challenge is that they focus on what is "right", not what is "common".

It is like "swum". It is a real word, and I have swum many times in my pool, but it is not how humans write. Practical linguistics is more about how people use words than how they are supposed to use them.

[+] gavinh|13 years ago|reply
I've been looking at stremor.com to find some justification for your bold claims. I'm still looking; these are some ancillary points:

-The copy throughout the site does not inspire confidence. Your point that many texts are written poorly is valid. That does not require you to also write poorly. I've seen far too many sentence fragments on your site. Your site also includes some embarrassing misuses of words, like "Some may believe using heuristic science in language analysis infers it is a learning system." These incidents discredit your claims about your technology.

-I am having difficulty finding specific information about how your technology works.

-The people responsible for your graphics, visual design, and the video about the summarization app should not have those responsibilities. Summly had a nice aesthetic; your aesthetic is jeopardizing your credibility.

I apologize if my comments sound hostile. When you make claims as bold as yours you should prepare for scrutiny. The problems you are addressing are interesting; I wish you good luck.

[+] drakaal|13 years ago|reply
You judge the technology based on the marketing? Summly had no tech and lots of marketing. SRI had no marketing and lots of tech. I'd rather be SRI than Summly. Take note of which one got a billion-dollar valuation.
[+] bane|13 years ago|reply
"200,000 words. Ha! 400K words. I laugh. 3.2 Million words. I still know I am missing stuff. Single word nouns in just the singular form exceeds 150k. 40k verbs and conjugates. 37k adjectives. 10K adverbs. I know I am still missing things."

I was all set to hate this post, but I found I ended up largely agreeing with it, especially the quoted bit above. I remember working on a very focused NLP tool a few years ago and needing a comprehensive English lexicon. No problem, I thought, I'll just scrape WordNet or similar. Not even close. Then you start dealing with stemming and conjugations and such and realize that almost all of the algorithms for dealing with this kind of thing would barely even qualify as hacks in software terms, yet there they sit, regurgitated in countless libraries, generating garbage stems all over the place. It ends up being easier to collect all the inflected forms as well and build some smart in-memory indexes and data structures for searching millions of words.
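The garbage-stems problem is easy to reproduce. Below is a toy suffix-stripping stemmer (invented for illustration, much cruder than Porter, but the failure modes are the same in kind), contrasted with the enumerate-everything approach the comment describes:

```python
def crude_stem(word):
    """A toy suffix-stripping stemmer, in the spirit of the hacks
    the comment describes. Hypothetical; not any real library."""
    for suffix in ("ing", "ed", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

# Rule-based stripping mangles doubled consonants and irregulars:
print(crude_stem("swimming"))  # "swimm" (garbage stem)
print(crude_stem("flies"))     # "fl"    (should be "fly")
print(crude_stem("swum"))      # "swum"  (irregular; never linked to "swim")

# The alternative: enumerate every inflected form and index each
# one straight back to its lemma. More data, no surprises.
LEMMA_INDEX = {
    "swim": "swim", "swims": "swim", "swimming": "swim",
    "swam": "swim", "swum": "swim",
    "fly": "fly", "flies": "fly", "flew": "fly", "flown": "fly",
}
print(LEMMA_INDEX["swum"])  # "swim"
```

At millions of words the lookup table costs memory, which is where the smart in-memory indexes come in.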

Vector space models? Why do they work? Nobody really seems to know! Just jam all your words into some matrices, run some simple calculations, and voila, you get something that sorta works some of the time.
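The "sorta works" behavior is easy to see in a minimal count-vector model with cosine similarity; the vocabulary and documents below are invented for illustration:

```python
import math

# Tiny fixed vocabulary; real systems use tens of thousands of terms.
VOCAB = ["cat", "dog", "pet", "stock", "market"]

def vectorize(doc):
    """Map a document onto raw term counts over VOCAB."""
    words = doc.lower().split()
    return [words.count(term) for term in VOCAB]

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = vectorize("the cat is a pet")
d2 = vectorize("the dog is a pet")
d3 = vectorize("the stock market fell")

# d1 and d2 share "pet", so they score 0.5 with each other and 0.0
# against d3. Plausible similarity from nothing but word counts;
# nothing in the math explains *why* this should track meaning.
print(cosine(d1, d2), cosine(d1, d3))
```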

Sentence tokenization is stupid hard, but shouldn't be. Parentheses have all kinds of different meanings, commas are a mess... English is stupid.

The worst bit though really is that most of the research->turned into software assumes astonishingly brittle models about the language which almost never seems to describe any actual usage of the language which always means very frustrating almost right results out of NLP systems. My previous sentence, for example, would cause most NLP software to blow a gasket.

[+] drakaal|13 years ago|reply
I ran it through our stuff; we do fine with segmentation of your comments. We still have trouble with certain poorly formatted numbers, or if you do something like forget a space after a sentence that ends in a number: "I'll take 2.In case I need one later."

Typos are really hard for rules systems.
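A sketch of why that exact case defeats rule systems. The splitter below is a typical regex-based segmenter (hypothetical, not Stremor's), breaking after sentence punctuation followed by whitespace and a capital:

```python
import re

def split_sentences(text):
    """Break after ., !, or ? when followed by space and a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

ok = "I'll take 2. In case I need one later."
bad = "I'll take 2.In case I need one later."

print(split_sentences(ok))   # two sentences, as intended
print(split_sentences(bad))  # one blob: no space after the "2."

# Widening the rule to break after digit-period even without a
# space would fix this, but would then wrongly split well-formed
# text like section references ("see 2.B"); every patch for a typo
# breaks something that was fine before.
```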

PS Glad you didn't hate it.

[+] Groxx|13 years ago|reply
Interesting blog post, though I wish they would give us some meat rather than what's essentially a rant. I do find it a bit amusing that it doesn't perform well in their own system:

>I am the CTO of Stremor, we make TLDR Reader. Sentence objects Sentenceobjects represent information extracted from a single sentence within a document. Attributes and methods available on Sentenceobjects: ∗ text: The raw text of the sentence as a string. ∗ names: A list of all names detected in the sentence.

Yeah, this is similar to picking at Google Translate not translating Google into the same text as they use on their other-language homepages. I'm honestly not complaining, I just found it funny :) To be fair, it does a significantly better job at other blocks of text I've thrown at it. Not great, but surprisingly good - I'll have to prod it more.

[+] drakaal|13 years ago|reply
As I mentioned in comments, I didn't mark up my "code paste" so it decided the documentation was more important than my comments.

The TLDR version is optimized for news at up to 5000 words. We have some other stuff for Fiction but it isn't public, and a version that is specific to politics.

[+] drakaal|13 years ago|reply
Oh, and the 25% view "short" is pretty good.
[+] akavlie|13 years ago|reply
One of the Stremor devs here. I'm helping to build Liquid Helium, and wrote the cited README. Never expected that it would make its way to a public blog post :-).

Yes, language processing is hard. There are two challenges here:

1) Understanding the ambiguities of language, when every word in the sentence can be 2+ parts of speech.

2) Making it fast, and making it fit within the relatively constrained RAM limits of App Engine instances.

We're wrestling with both while we greatly expand on what Liquid Helium can do. It's not easy, but some of the things we're able to do with it are pretty magical.

[+] drakaal|13 years ago|reply
To akavlie's RAM limit comment: We run on Google App Engine. It gives us great scalability, but it limits us to about 512 MB of RAM.

We made this choice because with a small team it gave us the most freedom to focus on the code, rather than the scalability and the infrastructure. We don't manage servers, or routers, or load balancers.

[+] drakaal|13 years ago|reply
To the ambiguity comment:

Can is really annoying. Can is a verb. Can is a Noun. It is part of a Noun Adjunct.

The rules to distinguish "The chicken soup can help when you are sick" from "The chicken soup can landed on my foot" and "We can peaches to eat later" are very hard. More so if you put them in a complex compound sentence.
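A toy version of the disambiguation, keyed on the word after "can". These rules are illustrative only (the word lists and cutoffs are invented, and this is certainly not Stremor's actual logic), but they show the shape of the problem: each reading hinges on the form of the following word.

```python
BASE_VERBS = {"help", "land", "go", "see"}   # bare infinitives
PAST_VERBS = {"landed", "fell", "helped"}    # finite past forms

def tag_can(next_word):
    """Guess the part of speech of 'can' from the next word alone."""
    if next_word in BASE_VERBS:
        return "MODAL"   # "the soup can help": can + bare verb
    if next_word in PAST_VERBS:
        return "NOUN"    # "the soup can landed": can heads the subject
    return "VERB"        # "we can peaches": to can = to preserve

for nxt in ("help", "landed", "peaches"):
    print(nxt, "->", tag_can(nxt))
```

One word of lookahead handles these three sentences and little else; embed them in a compound sentence and the cue word may be clauses away.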

[+] kurumo|13 years ago|reply
What precisely makes this particular engine 'the most powerful in the world'? Does it do domain independent named entity recognition with an F score better than 0.8? For what classes of entities? Is it at least adaptable without oodles of training data? Does it do syntactic parsing? With F scores of 0.9 or better? Faster than 200ms per sentence? Across domains? Does it do anything at all in languages other than English? If there is a page on that site where it answers these types of questions, I couldn't find it.
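For context, the F score cited here is the standard harmonic mean of precision and recall used to grade NER and parsing output. A minimal computation (the counts are an invented example):

```python
def f_score(true_positives, false_positives, false_negatives):
    """F1: harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# E.g. a tagger that finds 80 of 100 real entities (20 missed)
# while also emitting 10 spurious ones:
print(round(f_score(80, 10, 20), 3))  # 0.842, just clearing the 0.8 bar
```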
[+] drakaal|13 years ago|reply
Install the TLDR plugin. Pick a web site. Or better yet, go out to Project Gutenberg and pick a book. Tom Sawyer. Push TLDR. Way faster than 200ms per sentence.

Yes it does most Germanic and Romance languages.

Yes, it does domain independent named entities with a higher score than anything else on the planet. ALL English classes: Medical, Dental, Animal (that doesn't include Latin uses of animal names), Technical.

As I said we are just stepping out of stealth. I linked a PDF in the comments here.

[+] zwegner|13 years ago|reply
Aside from the NLP-related aspects, which are pretty interesting by themselves, I was glad to see this:

> The biggest thing I learned. The thing I also hope my team has learned. Everyone else has hit the limits of what they can do because they weren’t willing to burn it all to the ground and start over. We start over a lot. We code for 3 days and then decide this won’t work, and we do it over again. We take the lessons we learned but little of the code. The second, or thrid time we do it right based on what we now know.

I am a big fan of rewrites, and am sad to see how much of the software engineering world has centered around the Spolskyesque idea of rewriting being a horrible waste of time. Truth is, 99.9...% of software just sucks. The more and more infrastructure we build up around it, the more constrained we become. Just look at the most common languages/tools/technologies we use on a daily basis. C++/Java/JS/Python? These are just awful. Hell, I even like Python compared to most languages, but there's still so much historical cruft, and it's ridiculously slow to boot.

I think this is mostly because people just suck at writing software in general. Rewriting in the industry usually doesn't buy you anything since it's the same people writing it, and they're still constrained by all the other software that they interact with. But if everyone was more willing to rewrite, focusing on code quality, I believe our tools would be less intertwined, we could achieve much faster rates of progress, and the life of a software engineer would be a lot more tolerable.

[+] drakaal|13 years ago|reply
I fixed my misspelling of third...

I think for us the biggest thing is that often we don't know what approach will work best until we try, and often we have to balance more than just readability or performance. Memory usage issues meant that we prototyped one way of getting all the words into memory, proved that we had enough information about the words for 99% of what we were going to do (and that we could get the other 1% later), and then re-wrote with a different "loader" and different information about the words loaded.
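That two-stage idea can be sketched roughly like this: keep only the fields needed on almost every lookup resident, and reach for the full record on the rare path. Field names and the storage layout are hypothetical, not Stremor's design:

```python
CORE_FIELDS = ("pos", "lemma")  # needed on ~99% of lookups

class Lexicon:
    def __init__(self, records):
        # records: word -> full attribute dict (in a real system this
        # would sit on disk or in a datastore, not in memory).
        self._full = records
        # Resident index keeps only the core fields, stored as tuples
        # to minimize per-word overhead.
        self._core = {w: tuple(r[f] for f in CORE_FIELDS)
                      for w, r in records.items()}

    def core(self, word):
        """The cheap, common path: core fields only."""
        return dict(zip(CORE_FIELDS, self._core[word]))

    def full(self, word):
        """The rare path: everything, paying the cost only here."""
        return self._full[word]

lex = Lexicon({"swum": {"pos": "VBN", "lemma": "swim", "etymology": "OE"}})
print(lex.core("swum"))  # {'pos': 'VBN', 'lemma': 'swim'}
print(lex.full("swum"))  # includes the rarely needed fields too
```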

I like functional programming, but it made sense to make parts of the code object oriented, not just for readability or ease of programming but because it was more performant.

So our rewrites are partly for performance, partly because the end goal moved slightly, and partly for maintainability. As we stack on new features often the way we should do things changes drastically.

[+] jbg4|13 years ago|reply
When smart people get cocky about the hard work they've done, it gives me great confidence in their claims. Cockiness for smart people is reserved for only such occasions when the problem has been solved and tested, born into a growth state, ready to evolve. I am following this story with great interest because I'm rooting for the creative geniuses at Stremor and for the technological advances they are producing.
[+] homosaur|13 years ago|reply
"I am sure I will have a lot of mistakes for the grammar Nazi’s to point out in this post."

I'm hoping this was a subtle and brilliant joke and not just a typo.

[+] drakaal|13 years ago|reply
In a VC pitch you always include a small, easy-to-fix, minor issue, one you know the answer to. That way they point it out, you tell them how smart they were for mentioning it, and then give the solution. They feel like they proved you aren't perfect, and you avoid them poking hard enough to find your real issues.
[+] zomgbbq|13 years ago|reply
It would be cool to be able to download an SDK and build experiments with this without having to email sales and engage in a licensing agreement for the software first. I've always appreciated the fact that SaaS products like parse.com, twilio.com, and stripe.com have a low barrier to experimentation, which is probably part of the reason why so many solutions today use their technology.
[+] drakaal|13 years ago|reply
We are working on getting it into the Azure Data Marketplace with a free monthly tier of a certain number of API calls. Microsoft is being slow (more than 5 weeks). We are looking at Mashape, but we have not heard good things about their uptime or their accuracy in billing.

As a developer I feel your pain. Balancing our building, business concerns, and support for an API is hard. We are a small company and are doing our best. But if you do email sales we will make sure to get you access to the API as soon as it is available.

[+] drakaal|13 years ago|reply
We were mostly flying under the radar. But with two companies doing summarization being bought for $30M in the past month we are becoming less stealth.

I know it sounds like a bold statement, but I believe it to be true. Doing things right required we build our own tools, and not rely on libraries from third parties. I think we benefitted a lot from that philosophy.

[+] anigbrowl|13 years ago|reply
I installed the Chrome plugin with interest, but it only seems to work on the stremor.com page. However, for all I know it's in alpha/beta stage.

Like the general approach, which looks very promising.

[+] drakaal|13 years ago|reply
Did you restart Chrome after? It should work everywhere.
[+] esperluette|13 years ago|reply
I can't be the only person who wanted to write this:

tl;dr

[+] drakaal|13 years ago|reply
We have an app for that. http://www.tldrstuff.com

The "short" 25% view is pretty good on this. The summary doesn't read as well because it picks up the ReadMe, which I didn't mark as a code snippet because I'm sucky at WordPress editing.

[+] jorah|13 years ago|reply
I am supposed to buy tools for analyzing writing from people who cannot write?
[+] drakaal|13 years ago|reply
People become editors at book companies not because they can write, but because they know how others should.

I'm confident few of the Engineers for F1 racing are spectacular drivers.

I know my PE teacher couldn't touch his toes.

[+] jaytaylor|13 years ago|reply
Any chance this project will ever be open-sourced?
[+] drakaal|13 years ago|reply
Parts of it maybe. But as a whole not likely. Much like SRI we are hoping that we will license the tech to companies. We could build full products, but we think that the way this can be best applied is as enhancements to other products. Search, Technical and Customer Support, Content Authoring, News Aggregation.

We also have tech that could vastly improve book and movie recommendations based on the content and themes of books, not just "did someone like you like this" kinds of systems.