Example: Why your Title to URL algorithm shouldn't chop off partial words...

101 points| spxdcz | 16 years ago |bbc.co.uk | reply

23 comments

[+] kljensen|16 years ago|reply
Not only funny, but bad SEO. They'd be better off separating the words rather than concatenating them.
[+] ojbyrne|16 years ago|reply
Funny, but more suited to reddit.
[+] spxdcz|16 years ago|reply
That's an interesting comment. I'm still trying to 'work out' the Hacker News crowd (even though I've been here almost two years!).

I've submitted plenty of things that get to #1, or front page, but can't seem to find much consistency in what people like / don't like. Not as obviously as Reddit/Digg/Slashdot, anyway - which is a good thing! I'm very pleased that HN isn't as one-dimensional as some other sites; it's what keeps me coming back.

But it is weird, what becomes popular and what doesn't. For example, I submitted this particular story, which I was a bit unsure about (like you say, more suited to Reddit), but still people vote it up.

Other times, I can submit content that I think is genuinely interesting/fascinating, that is much more technical and comprehensive than this rather silly BBC/URL story, and it doesn't get a single vote. Maybe it's the time of day / day of the week.

On that note, does anyone know of any data repositories for Hacker News front page items? An API or raw-data download that can be analyzed? I remember someone a few months ago doing some analysis on the best time of day to submit, but was wondering if there was any public data out there, or whether I should start spidering/collecting my own?

[+] jerf|16 years ago|reply
Yes. A "hurr hurr" link is now the top HN link. A sad day.
[+] j_baker|16 years ago|reply
To be totally honest, I'd rather have HN allow posts of dubious topicality than have it turn into another stackoverflow, where topics are closed/deleted if they don't meet the strictest definition of what's allowed on the site. Sometimes the harm of off-topic posts is much less than the harm in doing away with them.
[+] retube|16 years ago|reply
indeed. this is a cross-post.
[+] adulau|16 years ago|reply
URL normalization/canonicalization is already hard work, but chopping off a URL in a way that avoids dirty words in any language also looks difficult. How could we implement that?

First, we need to know the existing dirty or vulgar words in a specific language (and maybe a region). Do we have a WordNet for that? In any language? With a classification? A dirty word is not the same as an insult, but sometimes an insult can be a dirty word. I don't know of any good reference for that. There are some websites listing insults, but a good dictionary...

As for the algorithm itself, you would also need to know the language used on the page. It's often fine for a French-language page to have a chopped-off URL containing "cum", but not for an English one.

Wait a minute. Is this really an important problem? In the end, it's the only way to publish a recipe on HN...
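
For what it's worth, the language-aware check described above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `BLOCKLISTS` table, its single placeholder entry, and the `flagged_tokens` name are made up, the page language is assumed to be known already, and the slug is assumed to separate words with hyphens or underscores.

```python
import re

# Hypothetical per-language blocklists. The single entry is a placeholder;
# a real list would need a proper dictionary, which may not exist.
BLOCKLISTS = {
    "en": {"cum"},
    "fr": set(),
}


def flagged_tokens(slug: str, lang: str) -> list[str]:
    """Return the slug tokens that are on the blocklist for `lang`."""
    # Unknown languages fall back to an empty blocklist.
    blocked = BLOCKLISTS.get(lang, set())
    # Split on hyphens/underscores and drop empty pieces.
    tokens = [t for t in re.split(r"[-_]+", slug.lower()) if t]
    return [t for t in tokens if t in blocked]
```

With these placeholder lists, `flagged_tokens("carrots-glazed-with-cum", "en")` reports the last token, while the same slug checked against `"fr"` reports nothing. It does not help with concatenated slugs, where there are no separators to split on.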

[+] harpastum|16 years ago|reply
A simple enough solution would be to not include partial words at all. They don't help SEO, and when your phrase is long enough to require it, it's unlikely you'll have uniqueness issues.
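
A rough sketch of that idea in Python, in case it helps. The `slugify` name, the `max_len` default, and the example title are my own assumptions, not anything the BBC actually uses. It separates words with hyphens and stops before the first word that would not fit whole.

```python
import re


def slugify(title: str, max_len: int = 40) -> str:
    """Build a hyphen-separated slug, never cutting a word in half."""
    # Keep only runs of ASCII letters and digits as words.
    words = re.findall(r"[a-z0-9]+", title.lower())
    kept = []
    length = 0
    for word in words:
        # Every word after the first also costs one hyphen.
        extra = len(word) + (1 if kept else 0)
        if length + extra > max_len:
            break  # drop this word and everything after it
        kept.append(word)
        length += extra
    return "-".join(kept)
```

For a made-up title like "Carrots glazed with cumin and honey", a limit of 24 gives `carrots-glazed-with` rather than a slug ending in a fragment of "cumin". If the very first word is already longer than the limit, this sketch returns an empty string, so a real version would need a fallback.
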
[+] endtime|16 years ago|reply
Isn't this an example of why it should chop off partial words?
[+] dkimball|16 years ago|reply
This is evocative of ferrethandjobs.com, although in that case it was a matter of capital letters not coming through.

URLs seem to require their own grammatical rules to avoid outrageous results, and this is just considering English; incorporating other languages likely to be in one's target audience would be important, too. One saving grace is that most languages don't sound much like each other... most of the time. (Which makes it all the more painful when they do.)

Beware foreign borrowings, too...

[+] r0s|16 years ago|reply
Having braved the harsh wilds of mod_rewrite myself recently, I can sympathize. Getting it working at all seems to be the challenge, with compromise a usual accomplice.

On the other hand, I was only invested in a personal project. The BBC should have higher standards.

[+] MicahWedemeyer|16 years ago|reply
For our non-American friends, "cum" is a bit of a dirty word, and something most people would prefer not to have their carrots glazed with.
[+] bandris|16 years ago|reply
And people in the UK don't know it? :)