Natural Language Processing for the Working Programmer

[+] microtonal|15 years ago|reply

We just started out one and a half weeks ago, joining the Pragmatic Programmer's writing month. We thought a 'release early, release often' approach would be best, which is why there are just a few in-progress chapters.

We will keep you posted, and thanks for the encouragement!

[+] angrycoder|15 years ago|reply

Thanks for your work, I am really enjoying it so far. As a counterpoint to some of the comments regarding the choice of language, I had the opposite response: oh neat, I get to learn Haskell and natural language processing at the same time.

If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.

[+] roel_v|15 years ago|reply

You seem to know a lot about NLP. I've asked this question in various places and never found anyone who knew even a little, so I hope you don't mind a small question: can my problem be solved with NLP at all?

I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (zip codes before the city or after it, no zip codes at all, P.O. box instead of street name, ...). All of it is in text format (no graphics, no OCR problem) but inside HTML tags, in various forms (as a row in a table, inside one or multiple <div>'s, as a <ul>, etc).

- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?
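
For illustration only, here is a minimal rule-based sketch of the kind of starting point this problem invites. It is not an NLP solution and not a 98% one. It assumes US-style five-digit ZIP codes, and the names `TextExtractor` and `find_address_candidates` are hypothetical helpers made up for this example. It strips the markup with the standard library's `html.parser` and keeps the text chunks that contain a ZIP-like number or a P.O. box.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Assumption: US-style ZIP codes (12345 or 12345-6789).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
# Matches spellings such as "P.O. Box 12", "PO Box 12" and "p/o box 12".
PO_BOX_RE = re.compile(r"\bP\.?\s*/?\s*O\.?\s*Box\s+\d+", re.IGNORECASE)


def find_address_candidates(html):
    """Return the text chunks that look like they might hold an address."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [
        chunk
        for chunk in parser.chunks
        if ZIP_RE.search(chunk) or PO_BOX_RE.search(chunk)
    ]
```

Addresses without a ZIP code or P.O. box slip through, any other five-digit number is a false positive, and an address split across several tags arrives as separate chunks. Those gaps are where a trained sequence model would have to take over from hand-written rules.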

[+] jimmyjim|15 years ago|reply

Hey Daniel,

I actually remember reading your Slackware book a few years back. I've no doubt that the quality of this text will be as superb as that one's! Cheers!

[+] hvs|15 years ago|reply

It's an interesting paper that I intend to dig into more carefully, but I kind of wish that a paper "for the Working Programmer" used a language like Python rather than Haskell. I'm aware that Haskell has a very nice type system for doing things like this -- and I'm a language nerd myself, so it's not that -- but it just seems like it would be more practical in something more "mainstream."

That said, it is interesting from what I've read so far.

[+] albertsun|15 years ago|reply

Perhaps try this instead?

http://www.nltk.org/book

[+] 1331|15 years ago|reply

The title is likely a hat tip to a famous functional textbook: _ML for the Working Programmer_.

http://www.cl.cam.ac.uk/~lp15/MLbook/

[+] microtonal|15 years ago|reply

Or maybe it's a cover-up to bring fun languages to the working programmer.

[+] warfangle|15 years ago|reply

I may try to translate the examples into JavaScript (CommonJS platform, not client-side .. on second thought, I may simply do it on Node.js with CoffeeScript) - just so I can learn the topic better. I find that, much like taking notes during a lecture, it helps me retain the knowledge better.

Maybe you should try to do the same with Python? :)

[+] jlees|15 years ago|reply

Agreed, I saw Haskell in the TOC and stopped reading. Still, an interesting project - and about as appropriate for "working" programmers as Larry Paulson's book, so no issues with the title.

It'd be easy enough to rewrite most of the examples in another language anyway (I'd hope), even if elegance is lost in the process...

[+] dons|15 years ago|reply

It is practical if it gets the job done.

[+] sabat|15 years ago|reply

I was just wondering if anyone's done something like this for Ruby.

[+] waterside81|15 years ago|reply

I've posted this link before, but these NLP posts keep popping up on HN, so I'll keep posting.

Over at http://www.repustate.com, we're taking the more common functions that NLTK performs (and the ones it should) and porting them over as web services. NLTK is kind of buggy here & there, and it's not too great if you're dealing with big data sets. Our API, with the obvious handicap of network latency, is lightning fast because we ported many NLTK functions down to raw C.

Our API is free so have at it, let me know if you want to see us add anything.

[+] riffraff|15 years ago|reply

While I'm sure something nice will come out of it, you may wish to temporarily disable the NER feature, because it seems to amount to "select capitalized words", at least on the few pages I tried (Wikipedia, Bitcoin, NYTimes).

It is blazing fast though :)
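
To make the "select capitalized words" behaviour concrete, here is a minimal sketch of such a baseline. It is an assumption about what a naive heuristic looks like, not the service's actual code, and `capitalized_spans` is a hypothetical name.

```python
import re

# One or more consecutive capitalized words, e.g. "New York Times".
CAPITALIZED_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def capitalized_spans(text):
    """Return every run of capitalized words, treated as an 'entity'."""
    return CAPITALIZED_RE.findall(text)


print(capitalized_spans("The New York Times wrote about Bitcoin yesterday."))
# → ['The New York Times', 'Bitcoin']
```

The sentence-initial "The" gets swallowed into the entity, which is the kind of error a real NER model is supposed to avoid by looking at context instead of casing alone.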

[+] LeBlanc|15 years ago|reply

I'll have to put this on my 'to read' list; it looks really interesting. I think natural language processing/understanding may become one of those next 'big things' like mobile and social media, simply because understanding what a user is trying to do will become very important.

If anyone is interested in playing around with a robust natural language processing tool, I built an API for the Stanford Parser. http://nlp.naturalparsing.com/browserparser/parse

[+] mark_l_watson|15 years ago|reply

Thanks Daniël, this is cool!

I am not a very good Haskell programmer, but I spend an occasional evening with it, and I am also interested in NLP (I have been working on NLP off and on since the early 1980s).

From skimming through the book, it looks like a nice read, and it just went on my reading list.

[+] samratjp|15 years ago|reply

This is pretty neat. At the risk of sounding childish, here I go -- I wish books like these could be given life like tryruby.org where you could try out examples and learn along the way. That would be wicked cool.

For now, OpenStudy will do the trick. I created a "StudyPad" if anyone wants to go through this book together. http://openstudy.com/studypads/Natural-Language-Processing-f...

[+] jasonjei|15 years ago|reply

It's interesting to note that a lot of natural language processing is English-centered. It's clear that English natural language processing is way ahead of the curve, but based on the quality of Chinese results on Google Translate, I take it Asian languages don't do so hot when it comes to natural language processing?

[+] syllogism|15 years ago|reply

Translation is a lot harder for language pairs that are less related. Most of the European languages are fairly close cousins, so translation between, say, English and French isn't that hard.

That said, it's generally true that for most NLP tasks, we're doing much better on languages similar to English.

[+] brettbender|15 years ago|reply

Also note that translation is a very different task than parsing, or part-of-speech tagging, for example. Summarization and translation are both open research topics in NLP from what I understand, and aren't really 'solved' in any language.

[+] _corbett|15 years ago|reply

Translation is also a lot harder for data-poor pairs of languages. Machine translation relies on training on a parallel corpus (the same text in different languages), and gets better the bigger that corpus is.
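
As a rough sketch of why corpus size matters, here is a toy parallel corpus and the simplest statistic one can collect from it: how often a source word and a target word appear in aligned sentences. The three sentence pairs are made up for illustration, `cooccurrence_counts` is a hypothetical helper, and real systems use far more sophisticated alignment models than raw counts.

```python
from collections import Counter
from itertools import product

# Made-up toy data: each entry is the same sentence in English and French.
parallel_corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]


def cooccurrence_counts(corpus):
    """Count how often each (source word, target word) pair shares a sentence pair."""
    counts = Counter()
    for source, target in corpus:
        for pair in product(source.split(), target.split()):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(parallel_corpus)
print(counts[("house", "maison")])  # → 2
print(counts[("house", "la")])      # → 1
```

With only three sentence pairs the correct pairing barely stands out from the noise; more aligned text sharpens that signal, which is why data-poor language pairs suffer.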

36 comments