
josefcullhed|4 years ago

Hello,

My name is Josef Cullhed. I am the programmer of alexandria.org and one of its two founders. We want to build an open-source, non-profit search engine. Right now we are developing it in our spare time and funding the servers ourselves. We are indexing Common Crawl, and the search engine is at a really early stage.

We would be super happy to find more developers who want to help us.


cocoafleck|4 years ago

I was trying to learn more about the ranking algorithm that Alexandria uses, and I was a bit confused by the documentation on GitHub. Would I be correct that it uses "Harmonic Centrality" (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf) for at least part of the algorithm?
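For reference, the harmonic centrality of a node x is the sum over every other node y of 1/d(y, x), where d is the shortest-path distance (the definition from the linked paper); unreachable nodes contribute 0. A minimal Python sketch of that definition (the adjacency-dict representation is my own choice, not Alexandria's):

```python
from collections import deque

def harmonic_centrality(graph: dict[str, list[str]], target: str) -> float:
    """Harmonic centrality of `target`: sum of 1 / d(y, target) over
    all other nodes y, where d is shortest-path distance. Computed by
    BFS on the reversed graph, since we need paths *into* target."""
    # Reverse the edges so a BFS from target follows links pointing at it.
    reversed_graph: dict[str, list[str]] = {n: [] for n in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reversed_graph.setdefault(dst, []).append(src)

    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nbr in reversed_graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)

    # Nodes never reached stay out of `dist` and so contribute nothing.
    return sum(1 / d for node, d in dist.items() if node != target)
```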

josefcullhed|4 years ago

Hi,

Yes, our documentation is probably pretty confusing. It works like this: the base score for all URLs on a specific domain is the harmonic centrality (HC). Then we have two indexes, one with URLs and one with links (we index the link text). We first search the links, then the URLs, and then update the score of the URLs based on the links with this formula:

domain_score = expm1(5 * link.m_score) + 0.1; url_score = expm1(10 * link.m_score) + 0.1;

Then we add the domain and URL scores to url.m_score, where link.m_score is the HC of the source domain.
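The update described above can be sketched in Python. Only the two expm1 formulas and the field names come from the comment; the function shape and the idea of passing in a single base score are my assumptions:

```python
import math

def updated_url_score(base_url_score: float, link_score: float) -> float:
    """Apply the score update from the thread for one matching link.

    `link_score` is the harmonic centrality of the linking (source)
    domain; `base_url_score` is the URL's current m_score.
    """
    domain_score = math.expm1(5 * link_score) + 0.1
    url_score = math.expm1(10 * link_score) + 0.1
    # Both contributions are added onto the URL's m_score.
    return base_url_score + domain_score + url_score
```

Note the asymmetry: the same link score is amplified twice as strongly (exponent 10 vs 5) in the URL term as in the domain term, so exact-URL links dominate.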

kreeben|4 years ago

Thanks for sharing this with the world. Did you manage to include all of Common Crawl in an index? How long did it take to produce such an index? Is your index in memory or on disk?

I'd consider contributing. Seems you have something here.

josefcullhed|4 years ago

The index we are running right now contains all URLs in Common Crawl from 2021, but only URLs with direct links to them. This is mostly because we would need more servers to index more URLs, and that would increase the cost.

It takes us a couple of days to build the index, but we have been coding this for about a year.

All the indexes are on disk.

Seirdy|4 years ago

Oh boy, I have too many questions. I'd appreciate any answers you're able/willing to give:

1. Do you have any plans to support parsing additional metadata (e.g. semantic HTML, microformats, schema.org structured data, Open Graph, Dublin Core, etc.)?

2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.

3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.

4. What are some of your favorite search engines? Have you considered collaboration with any?

IvanHall|4 years ago

Hello, Ivan here (the other founder).

1. Yes, any structured data could definitely help improve the results, I personally like the Wikidata dataset. It's just a matter of time and resources :)

2. The first step will probably be to handle this in our post-processing. We query several servers when doing a search and often get many more results than we need; in this step we could quite easily remove identical results.
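That post-processing step can be sketched as follows. This is a hypothetical illustration, not Alexandria's code: the `url`/`score` field names and the merge-then-dedup order are my assumptions:

```python
def merge_results(server_batches: list[list[dict]], limit: int) -> list[dict]:
    """Merge ranked results from several index servers, drop exact
    duplicates by URL (keeping the highest-scoring copy), and return
    the top `limit` results."""
    # Flatten all batches and sort by score, best first.
    merged = sorted(
        (r for batch in server_batches for r in batch),
        key=lambda r: r["score"],
        reverse=True,
    )
    seen: set[str] = set()
    deduped = []
    for r in merged:
        if r["url"] not in seen:
            seen.add(r["url"])
            deduped.append(r)
    return deduped[:limit]
```

Deduplicating by exact URL only catches identical results; catching syndicated copies on different URLs (as raised above) would need content fingerprinting on top of this.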

3. The ranking is currently heavily based on links (same as Google) so we will have similar issues. But hopefully we will find some ways to better determine what sites are actually trustworthy, perhaps with more manually verified sites if enough people would want to contribute.

4. I think that Gigablast and Marginalia Search are really cool, and it's interesting to see how much can be done by a very small team.

badrabbit|4 years ago

The UI is amazing. Don't change it significantly!

schemescape|4 years ago

Very impressive work so far!

Apologies if I missed it (and this is solely out of curiosity), but roughly how much does hosting Alexandria Search cost per month? (I'm assuming you've optimized for cost, to avoid spending your own money!)

I have some other questions (around crawlers, parsing, and dependencies), but I need to read the other comments first (to see if my questions have already been answered).

josefcullhed|4 years ago

Thanks!

The active index runs on 4 servers, and we have one server hosting the frontend and the API (the API is what the frontend uses, e.g. https://api.alexandria.org/?q=hacker%20news).

Then we have one file server storing raw data to be indexed. The cost for those 6 servers is around 520 USD per month.

jw1224|4 years ago

I searched for a competitive keyword my SaaS business recently reached #1 on Google for. All of our competitors came up, but we were nowhere to be seen (I gave up after page 5).

Does this mean we’re not in Commoncrawl? Or are there any factors you weight much more heavily than Google might?

phrozbug|4 years ago

What will be the USP that makes it a success we are all waiting for? At the moment I'm switching between DDG & Google.

josefcullhed|4 years ago

I just think the timing is right. We are at a point in time where it does not cost billions of dollars to build a search engine like it did 20 years ago. The relevant parts of the internet are probably shrinking, and Moore's Law is making computing exponentially cheaper, so there has to be an inflection point somewhere.

We hope we can become a useful search engine powered by open source and donations instead of ads.

foobarandlmj|4 years ago

Hello! So, I was studying B+ trees today. In the morning I browsed Hacker News and saw alexandria.org, opened the tab, and kept it open. I went about my day, got frustrated with my search results, noticed the Alexandria tab, and tried it. Every result was meaningful. Well done.

linspace|4 years ago

Awesome work.

1. How do you plan to finance?

2. How will you avoid SEO spam?

3. What kind of help would be most welcome?

IvanHall|4 years ago

Hello, Ivan here (the other founder).

1. We would prefer to be funded with donations like Wikipedia.

2. I don't think we can avoid it completely, perhaps with volunteers helping us determine the trustworthiness of websites. Do you have any suggestions?

3. I think programmers and people with experience raising money for nonprofits could help the most right now. But if you see some other way you would want to contribute, please let us know!