josefcullhed | 4 years ago
My name is Josef Cullhed. I am the programmer of alexandria.org and one of its two founders. We want to build an open-source, non-profit search engine. Right now we are developing it in our spare time and funding the servers ourselves. We are indexing Common Crawl, and the search engine is at a really early stage.
We would be super happy to find more developers who want to help us.
cocoafleck|4 years ago
josefcullhed|4 years ago
Yes, our documentation is probably pretty confusing. It works like this: the base score for all URLs on a specific domain is that domain's harmonic centrality (HC). We have two indexes, one with URLs and one with links (we index the link text). We first search the links, then the URLs, and then update the score of the URLs based on the links with this formula:
domain_score = expm1(5 * link.m_score) + 0.1;
url_score = expm1(10 * link.m_score) + 0.1;
Then we add the domain and URL scores to url.m_score, where link.m_score is the HC of the source domain.
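To make the formula concrete, here is a rough Python sketch of the link boost described above. It is not taken from the Alexandria code base: the function names are made up for illustration, and it assumes that every matching link contributes both the domain boost and the URL boost to the target URL's score.

```python
from math import expm1


def link_boosts(link_score):
    """Return (domain_score, url_score) for one link.

    link_score is the harmonic centrality of the link's source domain,
    i.e. link.m_score in the formula above.
    """
    domain_score = expm1(5 * link_score) + 0.1
    url_score = expm1(10 * link_score) + 0.1
    return domain_score, url_score


def apply_links(base_score, link_scores):
    """Add the boosts of all matching links to a URL's base score.

    base_score is the harmonic centrality of the URL's own domain.
    Assumption: each matching link adds both boosts to the score.
    """
    score = base_score
    for link_score in link_scores:
        domain_score, url_score = link_boosts(link_score)
        score += domain_score + url_score
    return score
```

Because expm1 grows exponentially, a single link from a domain with high HC adds far more than many links from domains with HC close to zero, which each add roughly 0.2.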
kreeben|4 years ago
I'd consider contributing. Seems you have something here.
josefcullhed|4 years ago
It takes us a couple of days to build the index, but we have been coding this for about a year.
All the indexes are on disk.
Seirdy|4 years ago
1. Do you have any plans to support the parsing of any additional metadata (e.g. semantic HTML, microformats, schema.org structured data, open graph, dublin core, etc)?
2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.
3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.
4. What are some of your favorite search engines? Have you considered collaboration with any?
IvanHall|4 years ago
1. Yes, any structured data could definitely help improve the results. I personally like the Wikidata dataset. It's just a matter of time and resources :)
2. The first step will probably be to handle this in our "post processing". We query several servers when doing a search and often get many more results than we need, so in this step we could quite easily remove identical results.
3. The ranking is currently heavily based on links (same as Google), so we will have similar issues. But hopefully we will find some ways to better determine which sites are actually trustworthy, perhaps with more manually verified sites if enough people want to contribute.
4. I think that Gigablast and Marginalia Search are really cool, and it is interesting to see how much can be done with a very small team.
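The post-processing idea from answer 2 could look roughly like the Python sketch below. This is only an illustration, not the actual Alexandria implementation: the function name, the (url, score) result shape, and the choice to treat results with the same URL as identical are all assumptions.

```python
def merge_results(per_server_results, limit=10):
    """Merge result lists from several index servers and drop duplicates.

    per_server_results: one list per server, each holding (url, score) pairs.
    Assumption: two results are identical when their URLs match; the copy
    with the highest score is kept.
    """
    best = {}
    for results in per_server_results:
        for url, score in results:
            if url not in best or score > best[url]:
                best[url] = score
    # Highest score first, then cut down to the number of results needed.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Keying on the URL only removes exact repeats. Catching syndicated copies of the same page under different URLs would need a different key, for example a hash of the page text.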
badrabbit|4 years ago
schemescape|4 years ago
Apologies if I missed it (and solely out of curiosity), but roughly how much does hosting Alexandria Search cost per month? (I'm assuming you've optimized for cost to avoid spending your own money!)
I have some other questions (around crawlers, parsing, and dependencies), but I need to read the other comments first (to see if my questions have already been answered).
josefcullhed|4 years ago
The active index is running on 4 servers, and we have one server hosting the frontend and the API (the API is what the frontend uses, e.g. https://api.alexandria.org/?q=hacker%20news).
Then we have one fileserver storing raw data to be indexed. The cost for those 6 servers is around 520 USD per month.
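For anyone who wants to try the API from a script, a query URL like the example above can be built as in this small Python sketch. Only the base URL and the q parameter come from the example; the function name is made up, and the response format is not covered here.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.alexandria.org/"


def build_search_url(query):
    """Return the API URL for a search query, percent-encoding spaces as %20."""
    return API_BASE + "?" + urlencode({"q": query}, quote_via=quote)
```

Calling build_search_url("hacker news") gives back the same URL as in the example.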
jw1224|4 years ago
Does this mean we’re not in Common Crawl? Or are there any factors you weight much more heavily than Google might?
phrozbug|4 years ago
josefcullhed|4 years ago
We hope we can become a useful search engine powered by open source and donations instead of ads.
foobarandlmj|4 years ago
linspace|4 years ago
1. How do you plan to finance the project?
2. How will you avoid SEO?
3. What kind of help would be most welcome?
IvanHall|4 years ago
1. We would prefer to be funded with donations like Wikipedia.
2. I don't think we can avoid it completely, but volunteers helping us determine the trustworthiness of websites might reduce it. Do you have any suggestions?
3. I think programmers and people with experience raising money for nonprofits could help the most right now. But if you see some other way you would want to contribute, please let us know!