top | item 39613640


vGPU | 2 years ago

Has it gotten any better recently?

I run a node but I haven’t actually used it as a search engine in a while, as I found the result quality to be exceedingly poor.


Avamander | 2 years ago

No.

Either it picks up too much garbage if you allow any P2P data exchange (AFAIK you can't allow only outgoing), or it only knows about the sites you already know about, which kind of defeats the purpose.

Even if you just want a private index of your own content, it struggles to display useful snippets for the results, which makes sifting through the already poor results really tedious.

If you try to proactively blacklist garbage, which is incredibly tedious because there's no quick "delete from index and add to blocklist" button in the index explorer, you'll soon end up with an unmanageable blocklist, and the admin interface doesn't handle long lists well. At some point (around 160k blocked domains) YaCy simply runs out of heap during startup while trying to load it, which makes the instance unusable.

It also can't really handle being reverse proxied (i.e. having both users and peers access it securely through the same front end).

It also likes to completely exhaust disk space or memory, so both have to be forcefully constrained, but that leaves you with a nonfunctional instance you can't really manage. And it doesn't separate functionality well enough that you could, for example, manually delete just a corrupt index.

Running (z)grep on locally stored web archives works significantly better.
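A minimal sketch of that approach, assuming pages are saved as individual gzipped text/HTML files under a hypothetical archive directory (the directory layout and search phrase are made up for illustration):

```python
import gzip
from pathlib import Path

def search_archive(archive_dir: str, phrase: str) -> list[Path]:
    """Return the paths of gzipped snapshots whose decompressed
    text contains `phrase` (case-insensitive). Equivalent in spirit
    to `find ... -name '*.gz' | xargs zgrep -li phrase`."""
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(archive_dir).rglob("*.gz")):
        # Decompress on the fly; ignore undecodable bytes in saved pages.
        text = gzip.open(path, "rt", errors="ignore").read()
        if needle in text.lower():
            hits.append(path)
    return hits
```

No index to corrupt, no heap to exhaust; the trade-off is a linear scan over the archive on every query, which is fine until the archive gets large.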

bobajeff | 2 years ago

Those are pretty bad issues. I remember using it a long time ago, and all I remember is that the results were bad. I've heard that YaCy could be good for searching sites you've already visited, but it sounds like even that might not be a good use case for it.

I do understand the disk space thing. It's hard to store the text of all your sites without it taking up a lot of space unless you can intelligently determine which text is unique and worth keeping. And unless you're only crawling static pages, it becomes hard to know what needs to be saved or updated.
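The simplest version of "keep only unique text" is a fingerprint of normalized content, so re-crawls of an unchanged page collapse to one stored copy. A sketch (this is a generic technique, not how YaCy does it):

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """Hash of whitespace- and case-normalized text, so trivially
    identical re-crawls of a page map to the same key."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen: set[str] = set()

def should_store(text: str) -> bool:
    """Store a page's text only if identical content wasn't stored before."""
    fp = content_fingerprint(text)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

This only catches exact duplicates after normalization; the harder problem the comment points at (near-duplicates, changed boilerplate, partial updates) needs something like shingling or simhash.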

rahen | 2 years ago

I remember trying it for a while in 2012, but the results were essentially worthless, probably because there were so few nodes/crawlers back then. I guess the more users there are, the better the results.

viraptor | 2 years ago

Alternatively, ignore the public network (it's still useless) and run it as your own crawler. Seed it with your browsing history, some aggregators like HN, your favourite RSS feeds, etc. and you'll be good.
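Turning a browsing-history dump into crawl seeds mostly means collapsing visited URLs down to unique site roots. A sketch, assuming you've already exported your history as a list of URLs (the input list here is hypothetical):

```python
from urllib.parse import urlsplit

def seed_urls(history_urls: list[str]) -> list[str]:
    """Collapse visited URLs to unique http(s) site roots,
    suitable as crawl start points, preserving first-seen order."""
    roots: list[str] = []
    seen: set[str] = set()
    for url in history_urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip ftp:, file:, about: and malformed entries
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots
```

Feed the resulting roots into the crawler one per line, with a shallow depth limit per site to keep the index from ballooning.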

WarOnPrivacy | 2 years ago

> I remember trying it for a while in 2012, but the results were essentially worthless,

I had mine crawling .gov, .mil, etc. sites for pages that Google was starting to delist back then. Inbound requests were heavy with porn until I tweaked... IDK, something.