While it's always a bit annoying to see a meme regularly take up article slots on Hacker News, I don't mind this one. I see sites like Mahalo as doing a massive disservice to huge numbers of people --
* thousands of AdWords advertisers that have paid for their ads to be matched with content sites, not scraper pages with a huge ad to text ratio
* thousands of publishers whose work is being scraped, aggregated and outranked without so much as a backlink
* millions of web searchers that are hitting these pages instead of the real sources of the content they were searching for
And calling out companies that harm the fabric of the web for everyone else is worth doing.
The question is: why doesn't Google do anything about this? There are countless such pages. Why not just blacklist the domains? Or artificially lower their PageRank?
It's really a shame how low Google's standards are for AdSense. I've tried advertising on the content network several times, but each time I end up wasting so much time blocking MFA sites that I give up.
Although I understand your points, the AdWords and search issues are Google's problem. People pay for a product with flaws in it, and Google is the one selling it, not Mahalo. Mahalo has the freedom to do whatever they want with their pages, and they can't be held responsible for what Google, in its turn, does with them.
Now, stealing other people's work, that is obviously Mahalo's doing and should be a case for the courts.
I'm not really sure what the point of this article is. Mahalo spams Google -- so what? It's up to them to do this and up to Google to prevent it. Business is business, even on the internet.
Here's one thing to consider: Google's algorithms for detecting the actual value of clicks for advertisers have improved greatly over the past few years.
If Mahalo’s traffic was utter crap, it would be dropped.
It's really too bad Google doesn't let you simply blacklist domains from your search results permanently. The thing that frustrates me most about these spam sites is that they're constantly popping up, and my downvoting of a result seems to do absolutely nothing unless I'm using the exact same search query. Just let me blacklist Mahalo and other sites like it permanently. Better yet, make it possible to subscribe to a blocklist so the community can pool its resources and fight back.
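The blocklist idea above is simple enough to sketch client-side: keep a set of banned domains and drop any result whose host falls under one of them. A toy Python sketch -- the function name and the blocklist contents are illustrative only, not any real product:

```python
# Toy sketch of a personal search blocklist: drop results whose host
# matches, or is a subdomain of, any banned domain.
from urllib.parse import urlparse

BLOCKLIST = {"mahalo.com", "experts-exchange.com"}  # example entries only

def filter_results(urls, blocklist=BLOCKLIST):
    """Return only the URLs whose host is not covered by the blocklist."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if not any(host == d or host.endswith("." + d) for d in blocklist):
            kept.append(url)
    return kept
```

A shared, community-maintained blocklist would just be this same set fetched from a subscription URL instead of hardcoded.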
Mahalo is not particularly interesting, not particularly evil, he doesn't really do anything, and yet we keep talking about him and putting his shit on the front page of Hacker News.
It's a mediocre aggregator/link farm, with some Mechanical Turk-style incentives for humans to contribute, and a nice chunk of $ in the bank. Just ignore/mock him for another year or two until the funding runs dry.
You are absolutely right, it is mediocre crap. The issue is that through the spam techniques described in the article, he can now gain undeserved top 10 rankings for many phrases using pages that have no business being there.
If you are not part of the web development community then these discussions will most likely bore the hell out of you. However, if you are, and if you are aware of how many innocent sites Google bans or penalizes on a daily basis, or AdSense accounts that get canceled with no appeal for offenses much less than his, then this stuff actually matters.
One of the things I really miss in Google is a persistent blocking preference, a.k.a. a site blacklist. Mahalo would go straight in there, along with expert-sexchange, Sedo parking pages and a few others.
That is the angle Jason used when he created his steaming pile. It doesn't mean that anyone in the industry agrees with Jason's strategy.
And if you want to place the blame where it belongs, remember that Google is the company funding all this content scraping with its ad programs.
I just tried searching for a recent post from an official Google blog (about AdSense using referral data for more relevant ad targeting) and found a scraper site with their ads outranking them for their own content. Pretty sad.
Not everything that uses the phrase "SEO" should be painted with the same brush. A lot of SEO best practices are making the HTML markup more semantic and human-readable. White-hat SEO dovetails pretty well with a human-readable web and accessibility standards, and is one of the better business cases for kicking the Flash habit. Just because Calacanis makes his name by abusing loopholes in the algorithms doesn't mean that's the only sort of thing that the name "SEO" applies to.
No, different datacenters will show different results... sometimes very different. It also matters if you are visiting Google.com or one of the country variants.
According to what Jason said in another comment, however, all of their pages are listed in their xml sitemap, and all of those are listed in a master xml sitemap index located here (warning! huge files if you follow the links in the first one!):
this is getting really old and we're not interested in doing anything black hat or even gray hat. as such we're doing the following:
1. we're removing (or building out) any page in our system created by our users with under 200 words of original content. This will take a couple of weeks but it's started.
2. we're not letting users create stub pages (short pages) until we can noindex them and put them in a different directory (i.e. /stubs/) so google can easily tell the difference between them.
these pages are < 1% of our revenue and low single digits of our traffic. we don't benefit from them materially, and I think we're being targeted by Aaron Wall and other SEOs for my "seo is bullshit" comment from 2005 or so.
I guess that is fine... I gotta live with the ramifications of what I say. however, for the record I don't believe that SEO is BS any more... when i said that it was when we were building joystiq and autoblog and we spent zero time on SEO.
All that being said, we're being targeted by a small group of folks who want to take us down. we're only going to get stronger from this because our hundreds of contributors are rallying around building out the short pages.
Topix, Kosmix, NYTimes and Zimbio are all making quality topic pages and are not getting attacked over it. not sure why there is some double standard.
regardless.... this is not a material thing for us. we're flushing all these pages and moving them to a different directory going forward so that search engines know where they are located (i.e. /stubs/ ).
thanks for the ass kicking.... having a horrible day today over this.
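For readers outside the SEO weeds, the stub cleanup described above -- flag pages under the word threshold, move them under /stubs/, mark them noindex -- amounts to something like the following sketch. The function, paths, and threshold here are invented for illustration; this is not Mahalo's actual code:

```python
# Hypothetical triage of short pages, per the plan described above:
# under-threshold pages move to /stubs/ and get a noindex robots directive.
WORD_THRESHOLD = 200

def triage(path, original_text, threshold=WORD_THRESHOLD):
    """Return (new_path, robots_directive) for a page based on its word count."""
    if len(original_text.split()) < threshold:
        slug = path.rsplit("/", 1)[-1]
        return ("/stubs/" + slug, "noindex")   # keep out of search indexes
    return (path, "index")                     # normal page, leave indexable
```

Note that the /stubs/ prefix only signals intent; it's the noindex directive (served as a robots meta tag or X-Robots-Tag header) that actually keeps the pages out of the index.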
Jason, did you even read my article? This isn't about the traffic those autogenerated pages get, it's about the fact that through the minuscule amounts of PageRank that they are each capable of grabbing, you are now able to rank your mediocre pages with absolutely zero influence from the rest of the web.
We're not talking about stub pages, it's all the fully automated bullshit that you are generating. They not only need to be deindexed, they need to be nofollowed or removed altogether.
How is it you are out there playing the wounded puppy when apparently you haven't even read the articles or followed the reference links? You can't just skim this one, craft a rebuttal, and think you've addressed the issue. There's a lot of data in those paragraphs you apparently just skimmed over (if even that).
You have over 500,000 pages listed in your XML sitemap, and Google appears to have over 330,000 of them indexed. Click on this link, please, and actually go look at 10 or 12 of the pages we are talking about here: http://tinyurl.com/yzmxq7b
Tell me how long it takes you, just by clicking through, to find even 3 pages that have any human interaction in them whatsoever.
Maybe, just maybe, you really don't have a clue what is happening. I personally don't believe that's the case, but if so then whoever it is you have working for you that set this up knows how to spam like a pro.
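Incidentally, the 500,000-page figure is easy to verify: a sitemap index is just XML listing child sitemaps, each of which lists URLs. A minimal sketch of the tally, parsing inline example.com data here so it runs offline rather than fetching Mahalo's real files:

```python
# Count sitemap entries: a sitemap index lists child sitemaps, and each
# child sitemap lists <url> entries.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_xml):
    """Return the <loc> URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

def count_urls(sitemap_xml):
    """Count the <url> entries in a single sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url", NS))

index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>"""

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1</loc></url>
  <url><loc>http://example.com/page2</loc></url>
</urlset>"""
```

Summing count_urls over every file named in the index gives the total page count claimed in the sitemap.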
It's obviously of interest and importance to a number of people involved in this field or troubled by poor quality material showing up in Google. If you're not one of those people, it's pretty easy to identify these links and not upvote them or visit the articles/comments, etc.
There are countless articles on HN that I have no interest in (e.g., I don't even know what Clojure is), but I just don't click through to them.
A lot of the comments in here seem like more of a personal attack than anything else. You might as well change the title of this post to "Jason Calacanis ruined the Internet" or something to that effect.
Can we stop the drama already? I think we're going to need a hose to control this mob.
w00pla | 16 years ago
Or is it okay as long as they get AdSense money?
mvandemar | 16 years ago
Nice call.
axod | 16 years ago
I see 'Results 1 - 10 of about 2,200,000 from mahalo.com'
Have things stepped up a gear or am I misunderstanding?
mvandemar | 16 years ago
http://www.mahalo.com/sitemapindex.xml
Based on what I saw, 2 million+ looks like a huge overestimate, if what Jason said is true.
Edit: My bad, frederickcook's answer was the right one. I didn't realize you were doing a regular text search.
frederickcook | 16 years ago
Simply "mahalo.com" lists every indexed page with that text on it, such as this one.
tdm911 | 16 years ago
"we're removing (or building out) any page in our system created by our users with under 200 words of original content. This will take a couple of weeks but it's started."
or
"i'm also getting a list of every page under 300 words and having the page managers build them out in 30 days or deleting them."
from: http://news.ycombinator.com/item?id=1143512
is it under 200 words or under 300? are the goal posts moving already?
chintan | 16 years ago
Kosmix always had noindex in their "search results". Now stop whining like a baby and get your ass back to work instead of justifying your mistakes.
mcutts | 16 years ago
[deleted]
icey | 16 years ago
http://news.ycombinator.com/user?id=MattCutts