Yes, I am using it on a not-so-small dataset (roughly 1 million docs), and the output is a fairly efficient model. I am using gensim with pre-trained word vectors. New docs can be inferred via .infer_vector().
Overall my approach is less automated than what I have seen in your codebase so it’s likely a bigger investment. I am happy to share more.
The blog post linked on GitHub was a nice walk-through of your method, and I was interested in what you think the hit rate was for getting usable text for embeddings from TFA links. 100K is a good-sized corpus, but I wonder how many got skipped due to paywalls, 404 links, or other problems?
julien040|2 years ago
If you tried it, did you get good results with it? I may use it in future projects.
fewald_net|2 years ago
jimmySixDOF|2 years ago