top | item 6869598

Check out gensim if you want to do topic modeling or similarity comparisons in Python.

It has good implementations of various algorithms, some of which support streaming or distributed computation, and it can load and dump data in various formats.
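To make "streaming" concrete: as far as I understand it, the idea is that a corpus is just an iterable that yields one document at a time, so the whole collection never has to sit in memory. Below is a minimal stdlib-only sketch of that pattern. The class name `StreamedCorpus` and its `open_source` parameter are made up for illustration and are not part of gensim's API.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus that yields one bag-of-words per line, lazily."""

    def __init__(self, open_source):
        # open_source: zero-argument callable returning a fresh iterable of
        # lines (for example, lambda: open("corpus.txt")), so the corpus can
        # be read more than once.
        self.open_source = open_source

    def __iter__(self):
        for line in self.open_source():
            # Only the current line is held in memory at any time.
            yield Counter(line.lower().split())


# Stand-in for a large file on disk, one document per line.
raw = "human computer interaction\ngraph of trees\ngraph minors survey\n"
corpus = StreamedCorpus(lambda: io.StringIO(raw))

# Each pass reopens the source, so the corpus can be iterated repeatedly.
n_docs = sum(1 for _ in corpus)
graph_count = sum(bow["graph"] for bow in corpus)
```

Taking a callable instead of an open file handle is a deliberate choice here: multi-pass algorithms need to restart the stream from the beginning.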

I've used it to build a content-based recommender using tf-idf, LSI and a similarity index. Once the index is built, queries against it are really fast. It can handle quite large corpora with little memory.
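For anyone wondering what such a pipeline does under the hood, here is a rough stdlib-only sketch of the tf-idf plus cosine-similarity part. It is not gensim's code or API, it skips the LSI step entirely, and the function names (`build_index`, `to_vector`, `query`) and the toy documents are invented for the example.

```python
import math
from collections import Counter


def to_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict, L2-normalised so dot product = cosine."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Return (idf, vectors) for a list of plain-text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(tokenized) / count) for term, count in df.items()}
    return idf, [to_vector(tokens, idf) for tokens in tokenized]


def query(text, idf, vectors):
    """Rank indexed documents by cosine similarity to the query text."""
    q = to_vector(text.lower().split(), idf)
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors and trees",
]
idf, vectors = build_index(docs)
ranked = query("computer interface", idf, vectors)
best_doc_index = ranked[0][0]
```

The expensive work (counting terms, weighting, normalising) happens once at index time, and a query only touches the terms it contains, which is one reason lookups stay fast after the index is built.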

sbrother|12 years ago

Seconded. I'm surprised you don't read more about it here. We use it in production to recommend image search terms based on unstructured text, and it performs better with a few lines of Python code than anything our team could write in a lower-level language in months. It's REALLY fast once you've built an index.

The reason for that speed is a pretty epic list of dependencies (have fun explaining why the prod boxes need a Fortran compiler), but in terms of efficiency and speed of development it's an obvious choice.

Radim|12 years ago

:-)

Hopefully the SciPy & BLAS dependencies will only get easier to install from now on... Continuum Analytics received shit loads of money and some of it is going towards better scientific Python packaging, I believe.

hnriot|12 years ago

gensim is awesome; it abstracts very complex algorithms into extremely simple function calls. The models.HdpModel class is very powerful.