Show HN: TL;DRizer - an algorithmic summarizer webapp/API in Java (weekend hack)

[+] HunterV|13 years ago|reply

Lorem Ipsum:

From:

Etiam tincidunt dolor at est sagittis a rhoncus turpis egestas. Integer elementum erat nec nisi molestie eu tempus magna feugiat. Mauris eu ligula et ligula vulputate tempor. Etiam vel lectus et mi vulputate rutrum. Cras libero ipsum, rhoncus at accumsan id, adipiscing iaculis turpis. Cras vel metus nec enim consectetur aliquet vel nec nunc. Proin at mauris purus. Nullam nulla dui, interdum nec pharetra sit amet, vulputate a lectus. Nunc vulputate pellentesque purus at euismod. Nam in justo quis ante porttitor pellentesque. Quisque quis purus a magna scelerisque egestas quis id sapien. Ut non felis sit amet ipsum sodales placerat. Proin nibh massa, sollicitudin et posuere a, placerat convallis magna. Duis lacinia mauris sit amet ante pharetra sed bibendum lorem euismod.

To:

Mauris eu ligula et ligula vulputate tempor. Etiam vel lectus et mi vulputate rutrum. Cras libero ipsum, rhoncus at accumsan id, adipiscing iaculis turpis. Nullam nulla dui, interdum nec pharetra sit amet, vulputate a lectus. Duis lacinia mauris sit amet ante pharetra sed bibendum lorem euismod.

So much faster to read, I never had the time to read through all those design mockups!

[+] mohaps|13 years ago|reply

well, it ain't called TL;DRizer for nothing! :P

[+] MojoJolo|13 years ago|reply

Hello. I recently finished my thesis for my MS CS degree, which is about automatic summarization. It went through research and a defense, and I think the results are good enough for me. It uses a statistical approach and machine learning. My main issue is not the summarization part but the text extraction part: I can't seem to extract the article from a web page well enough. I'm using boilerpipe (https://code.google.com/p/boilerpipe/) for it. It handles most cases, but it's not good enough for me. May I ask how you extract the main article from the page?

Here's a preview of mine (http://www.textteaser.com/ui/article?link=http%3A%2F%2Fwww.p...). Go to its home page to read more news. It caters to Philippine news and will soon enter the alpha stage. I'm planning to either open up the API or open source it. HN, which is better? The API is ready; registration is the only thing it lacks.

You can try the API here: http://api.textteaser.com/api/?url=http://www.theverge.com/2...

Just replace the url parameter with the URL of what you want to summarize. Some URLs are not tested yet, and may produce errors. :)
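For anyone scripting against this, here is a minimal sketch of building such a request URL in Java. Only the endpoint prefix comes from the comment above. The class name, the method name and the example target URL are made up for illustration, and percent-encoding the target is an assumption (it is not confirmed that the API requires it, although the preview link above does encode its `link` parameter).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SummaryRequestUrl {
    // Endpoint prefix as quoted in the comment; everything else here is hypothetical.
    static final String API_BASE = "http://api.textteaser.com/api/?url=";

    /** Appends the target URL, percent-encoded so its own '?' and '&' are not read as API parameters. */
    static String forTarget(String targetUrl) {
        return API_BASE + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // example.com is a placeholder target, not a URL from the thread.
        System.out.println(forTarget("http://example.com/news?id=42&lang=en"));
    }
}
```

This only builds the string; it does not call the service. `URLEncoder.encode(String, Charset)` needs Java 10 or later.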

[+] midko|13 years ago|reply

Hi. I study CS with an inclination towards ML, but I don't know anything about the topic of automated summarizing. I'm curious: since you've taken an ML approach, did you still need to rely on NLP, and if so, was this very problematic? Also, do you perhaps know an article or a paper that could serve as a good starting point/overview of what approaches there are to summarizing and what the current difficulties are? Thanks

[+] mailshanx|13 years ago|reply

You should try diffbot. They use a vision based method to extract text from webpages. The tool looks pretty polished and seems to work rather well.

[+] logn|13 years ago|reply

Generated 5 Sentence Summary for http://www.businessinsider.com/why-marissa-mayer-bought-a-30...

Back in March, Yahoo bought a startup called Summly for $30 million. Before Yahoo shut it down, Summly was a news aggregation app for smartphones. According to Summly's own Web site, the technology behind the app was "built" by an organization called "SRI International," not by the startup's employees. And indeed, inside Yahoo, Summly is called "Yahoo's Siri." A source close to Yahoo says that CEO Marissa Mayer believes summarization technology is "going to be huge for Yahoo" as it builds "personalized news feeds" into mobile versions of its "core experiences," including Yahoo Finance and Yahoo Sports. The job of implementing this technology at Yahoo will not be given to anyone from Summly, including its young CEO.

--

Edit: Adding this...

Generated 3 Sentence Summary of Gettysburg Address

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract.

[+] drakaal|13 years ago|reply

http://TLDRstuff.com comes back with:

Why Marissa Mayer Bought A $30M Startup - Business Insider: The deal got a lot of attention because Summly's CEO is 17-year-old Nick D'Aloisio. Acquiring Summly seems to have been an almost incidental side effect of a deal Yahoo made with SRI for a piece of "summarization technology." Until Yahoo bought it, SRI International held equity in Summly. The job of implementing this technology at Yahoo will not be given to anyone from Summly, including its young CEO.

Notice that the version from TLDR Stuff actually has the answer in it: "incidental side effect of a deal Yahoo made with SRI".

It also tells you that they aren't interested in the young CEO.

This is possible because it is not a keyword-density algorithm. The core technology, called Liquid Helium, is a Language Heuristics Engine that can weight sentences by whether they express causality or subject matter. This creates a version of the text that tells you Who, What, Why and, if there is still space, How. You can't do that with a system that relies only on keyword density or on where a sentence sits in the article.

Summly claimed to have that tech, and SRI has some of it, but what they really have is a nice Concept Tree and a sentence parser.

That is a far cry from a system that knows which points are important, not just which points are most talked about. As you can see in this Business Insider article, the important part isn't "what is Summly", "who is Nick" or "who is Mayer"; it is "why did Yahoo do this". That is captured in the TLDRStuff/Stremor version, and not in TLDRizer.

[+] sherjilozair|13 years ago|reply

This misses the most important line in the article: Acquiring Summly seems to have been an almost incidental side effect of a deal Yahoo made with SRI for a piece of "summarization technology".

[+] cosx|13 years ago|reply

I have to say this summary seems to make the whole summly purchase a lot clearer for me, rightly or wrongly, intentional or not.

$30 million to have the entire tech sphere spread the word on how Yahoo are evolving and are going to make personalised, summarized, on-the-go news a core feature seems like a bargain (there certainly is a market for something like that, I would say, or at the very least, if implemented well, it certainly would be a nice-to-have feature, possibly a habit changer).

To me at least, they no longer seem like the old "who are they again?" company that they once were.

It's a shame they won't be changing their name anytime soon though. That on its own puts me off a little. Forgive me for the snark.

[+] pokoleo|13 years ago|reply

I really like this.

Here's an example of one of PG's essays run through the algorithm: [http://paulgraham.com/startupideas.html]

The most important thing to understand about paths out of the initial idea is the meta-fact that these are hard to see. Empirically, the way to have good startup ideas is to become the sort of person who has them. If you know a lot about programming and you start learning about some other field, you'll probably see problems that software could solve. Some of the most valuable new ideas take root first among people in their teens and early twenties. So if you're a young founder (under 23 say), are there things you and your friends would like to do that current technology won't let you. But there may still be money to be made from something like journalism. Similarly, since the most successful startups generally ride some wave bigger than themselves, it could be a good trick to look for waves and ask how one could benefit from them.

If you ran examples of PG's essays through this, people would see the immediate benefit.

[+] Charlesmigli|13 years ago|reply

I think no algorithm can perform such a summarizing task. If you're looking for summaries of PG's essays see here http://tldr.io/discover/paulgraham.com.

[+] mohaps|13 years ago|reply

Thanks for the idea. Try this :) http://tldrzr.herokuapp.com/tldr/?feed_url=http://www.paulgr...

[+] mohaps|13 years ago|reply

haha! :) Yeah, this was kinda fueled by the news of the summly acquisition and too many red bulls drunk during the drive from LA to SFO after wondercon.

[+] shakeel_mohamed|13 years ago|reply

YES. As a college student, this is amazing for those long readings for classes one isn't interested in. I wanted to build sort of the reverse of this at one point (take a question/prompt as input, generate a response).

Are you planning on open sourcing this?

[+] mohaps|13 years ago|reply

yeah, I plan on open sourcing this. waiting on some technicalities.

[+] micheleg|13 years ago|reply

Sorta cool. The Yahoo purchase of Summly is not much more than a PR play. The technology wasn't/isn't there. And, while this "weekend hack" is neat, the quality of summaries isn't close to that of the TLDR plug-in (http://www.tldrstuff.com/#desktop). Not only does the Stremor plug-in get "what is important with the article", it is simply on all of my browsers and works FAST. Fun discussion though, and props to little Nick.

[+] cpio|13 years ago|reply

I got an IOException while trying to summarize http://matt-welsh.blogspot.com/2013/04/running-software-team... and a different one for http://googleblog.blogspot.com as well. I guess you should put more effort into your HTML parser. Try Apache Tika, perhaps.

[+] mohaps|13 years ago|reply

try this url: http://matt-welsh.blogspot.com/feeds/posts/default?alt=rss it works. same feed url pattern will work for google blog too http://googleblog.blogspot.com/feeds/posts/default?alt=rss

[+] mohaps|13 years ago|reply

okay, added a fix to try and extract article text from non-feed urls. try http://tldrzr.herokuapp.com/tldr/?feed_url=http://matt-welsh... :)

[+] mohaps|13 years ago|reply

ah, it won't work directly for webpages (html). the url is expected to be that of a RSS/Atom feed. for the html web pages, copy pasting the text to the textarea works.

Will try to add URL content-type detection in the next cut; summarizing non-feed URLs is next up.

[+] mohaps|13 years ago|reply

Now the url can be a page, I try to extract the article text using boilerpipe. :) Also added a simple GET endpoint for linking. Try this summary of PG's "Writing and Speaking" essay: http://tldrzr.herokuapp.com/tldr/?feed_url=http://www.paulgr...

[+] rodrigoavie|13 years ago|reply

So, when is Yahoo! buying it? How much is the deal?

[+] Trezoid|13 years ago|reply

Yahoo will buy something that USES this tech, and outsources the actual building as well...

[+] dpcx|13 years ago|reply

It seems to not properly handle embedded HTML; using my feed (http://www.dp.cx/blog/rss.xml), look at the story titled "The Difficulty of Parsing the Web" and notice the <select /> box that is rendered.

[+] drakaal|13 years ago|reply

Not bad for a weekend, but http://www.tldrstuff.com does a much better job. Especially where the sentences don't break on . like where J. R. R. Tolkien is concerned.

And The TLDR Plugin works with HTML and on all western languages.
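To make the sentence-breaking problem concrete, here is a minimal sketch of a splitter that does not break after single-letter initials such as the ones in "J. R. R. Tolkien". It is not how TLDR Stuff or TL;DRizer do it (neither implementation is shown in this thread). The class name and the "single uppercase letter before the period" rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceSplitter {
    /** Splits on '.', '!' or '?' followed by whitespace or end of text, skipping periods after initials. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length() || Character.isWhitespace(text.charAt(i + 1));
            if (!terminator || !atBoundary) continue;
            if (c == '.' && isInitial(text, i)) continue; // "J." or "R." is not a sentence end
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) sentences.add(sentence);
            start = i + 1;
        }
        String tail = text.substring(start).trim(); // text without a closing terminator
        if (!tail.isEmpty()) sentences.add(tail);
        return sentences;
    }

    /** True when the period follows one uppercase letter that stands alone as a token. */
    private static boolean isInitial(String text, int dotIndex) {
        if (dotIndex < 1 || !Character.isUpperCase(text.charAt(dotIndex - 1))) return false;
        return dotIndex == 1 || Character.isWhitespace(text.charAt(dotIndex - 2));
    }
}
```

The rule is deliberately crude: a sentence that really ends in a lone capital letter ("So did I.") gets glued to the next one, and abbreviations like "Dr." or "e.g." are not handled at all.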

[+] Charlesmigli|13 years ago|reply

As a cofounder of http://tldr.io, I think this confirms our vision that for now (and for many years) only people can perform a task as hard as summarizing.

[+] MojoJolo|13 years ago|reply

I agree with "for now" but not "for many years". Right now, most or all automatic summarizers are doing extraction, which is just lifting sentences from the original article itself. That is different from the human notion of a summary, which is abstraction: taking the most important parts of the article and paraphrasing them for easy reading.
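As an illustration of what "extraction" means here, a minimal sketch of a word-frequency extractive summarizer follows. It is a generic textbook-style approach, not the algorithm behind TL;DRizer, TextTeaser or any other tool in this thread. The class name, the tiny stop-word list and the averaging rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ExtractiveSummarizer {
    // Tiny illustrative stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = Set.of(
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
        "it", "that", "this", "for", "on", "with", "as", "was", "by");

    /** Returns up to maxSentences (non-negative) of the input sentences, unchanged and in original order. */
    static List<String> summarize(List<String> sentences, int maxSentences) {
        // 1. Count how often each non-stop word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : words(s)) freq.merge(w, 1, Integer::sum);
        }
        // 2. Score each sentence by the average frequency of its words.
        double[] score = new double[sentences.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            List<String> ws = words(sentences.get(i));
            double total = 0;
            for (String w : ws) total += freq.get(w);
            score[i] = ws.isEmpty() ? 0 : total / ws.size();
            order.add(i);
        }
        // 3. Take the highest-scoring sentences (stable sort keeps earlier ones on ties).
        order.sort(Comparator.comparingDouble((Integer i) -> -score[i]));
        List<Integer> chosen = new ArrayList<>(order.subList(0, Math.min(maxSentences, order.size())));
        // 4. Put them back in document order; the text itself is lifted verbatim.
        chosen.sort(Comparator.naturalOrder());
        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences.get(i));
        return summary;
    }

    private static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String t : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }
}
```

Averaging rather than summing the word frequencies keeps long sentences from winning just by being long. Nothing here paraphrases anything, which is exactly the gap between extraction and abstraction.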

Right now, abstraction or paraphrasing is hard for a computer to do, but I think, and hope, that it will be possible in a few years' time. There are various open source and academic tools that can do some pretty good NLP. I'm looking into Apache OpenNLP and WordNet. I'm hoping for 2 or 3 years' time.

BTW, I have an app similar to your tldr.io. Check my HN comment (https://news.ycombinator.com/item?id=5523770) for more info about it. ;)

[+] hayksaakian|13 years ago|reply

It seems like your comment was well intended, but you come off as a bit presumptuous.

The problem TODAY is not whether or not a computer can summarize, but rather to what extent we as humans are satisfied with the computer's summary.

In some cases a dumb summary is good enough (first 200 characters for example). Given this baseline, and a target (human summary), you have to admit it's really an incremental process.
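For reference, the "dumb" baseline is only a few lines. This is a minimal sketch; the class and method names are made up, and cutting at the last word boundary (rather than exactly at the limit) is an added assumption.

```java
final class BaselineSummary {
    /** Returns the text itself if it fits, otherwise its first `limit` characters cut back to a word boundary. */
    static String firstChars(String text, int limit) {
        String normalized = text.trim().replaceAll("\\s+", " ");
        if (normalized.length() <= limit) return normalized;
        int cut = normalized.lastIndexOf(' ', limit); // last space at or before the limit
        if (cut <= 0) cut = limit;                    // no space found: hard cut
        return normalized.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        System.out.println(firstChars("one two three four", 9)); // prints "one two..."
    }
}
```

Calling `firstChars(article, 200)` gives the baseline mentioned above; anything a smarter summarizer produces can then be compared against it.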

[+] billirvine|13 years ago|reply

People are inherently biased. It's impossible for any person to read news and not inject their own personal tendencies into a summary they write.

[+] peter_l_downs|13 years ago|reply

Awesome! Looks super similar to an old sideproject of mine, www.bookshrink.com. The algorithm's different -- yours is aimed more towards summaries, while mine was aimed at sentence importance.

[+] mohaps|13 years ago|reply

yeah, I'm working on adding more summarizer algorithms. I've been thinking along the lines of weighting up rhetorical questions, weighting down exclamation marks (cheap sarcasm detection), etc.
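A minimal sketch of what such punctuation-based weighting could look like, assuming each sentence already has a base score from some other scorer. The class name and both multipliers are hypothetical, and it treats every question as rhetorical, which a real implementation would need to refine.

```java
final class PunctuationWeights {
    static final double QUESTION_BOOST = 1.25;      // hypothetical value
    static final double EXCLAMATION_PENALTY = 0.75; // hypothetical value

    /** Scales a sentence's base score up for a trailing '?' and down for a trailing '!'. */
    static double adjust(double baseScore, String sentence) {
        String s = sentence.trim();
        if (s.endsWith("?")) return baseScore * QUESTION_BOOST;
        if (s.endsWith("!")) return baseScore * EXCLAMATION_PENALTY;
        return baseScore;
    }
}
```

Using multipliers rather than fixed bonuses keeps the adjustment proportional to whatever scale the base scorer uses.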

[+] mohaps|13 years ago|reply

Updated the app with some goodies like links to summaries, ability to summarize all types of urls (not just feed urls) and a "spiffy" new logo :) Also did some css fixes etc.

[+] mohaps|13 years ago|reply

TL;DRzr is now open source! https://news.ycombinator.com/item?id=5535827

[+] bambax|13 years ago|reply

Tried the rss feed from my blog and got an NPE:

http://blog.medusis.com/rss

Does it expect a specific format?

[+] mohaps|13 years ago|reply

try now. the blog.medusis.com/rss link works now. Thanks for the feedback. since this grabs the page text (when no rss text is found) a lot of junk like copyright notices etc. shows up in summary. Will have to add some logic to scrub those. It also behaves horribly with code snippets.

[+] mohaps|13 years ago|reply

no, i use ROME to parse RSS feeds. So it should be able to handle whatever that can handle. Let me check

[+] devopstom|13 years ago|reply

Now to sell it to Yahoo for $30 Million!

[+] keeran|13 years ago|reply

Also see http://tldr.it (a RailsRumble 2010 entry)

[+] mohaps|13 years ago|reply

nice :) much better UI. As you can tell, I really suck at HTML/JS coding.

[+] jbrooksuk|13 years ago|reply

Very cool! I can't wait for it to be open sourced.

[+] mohaps|13 years ago|reply

trying to figure out (short of creating a new repo from current code) how to mirror the heroku git repo for this on github

67 comments