johnwatson11218 | 3 months ago
Then I use UMAP + DBSCAN to create a 2D projection of my dataset. DBSCAN writes the clusters to a csv file, which I read back in to build the topics and docs2topics join tables. Then I join each topic's docs into a mega doc and, against the original corpus, compute tf-idf using only db functions. This gives me the top 5 or so terms per topic, which serve as useful topic labels.
I can do 30 to 50 docs in a couple of hours. I imported 1100 pdf files and it took all weekend on an old gaming laptop w/ an SSD. I have a GPU, and I think the embedding step would go faster on it, but I'm still doing it all synchronously w/o any parallel processing.