
johnwatson11218 | 3 months ago

I have a pipeline in Docker Compose that starts up PostgreSQL in one container and Python in another. The Python scripts recursively read all the PDF files in a directory and use pdfplumber to parse the text into a Postgres table. Then I use sentence_transformers to take 100-char chunks (w/ 10-char overlap) and embed each one as a 384D vector, which is written back to the db. Then I average all the chunks to create a single embedding for the entire PDF file. I have used numpy as well as built-in Postgres functions to do the averaging, and it's fast either way.

Then I use UMAP + DBSCAN to create a 2D projection of my dataset. DBSCAN writes the clusters to a csv file. I read that back in to create topics and a docs2topics join table. Then I join each topic's docs into a mega doc and, treating those as the corpus, compute tf-idf using only db functions. This gives me the top 5 or so terms per topic, which serve as useful topic labels.

I can do 30 to 50 docs in a couple of hours. I imported 1100 pdf files and it took all weekend on an old gaming laptop w/ an SSD. I have a GPU, and I think the embedding step would go faster, but I'm still doing it all synchronously w/o any parallel processing.
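The two easy wins hinted at above can be sketched like this: batch the chunks before encoding (sentence_transformers' `encode` accepts a `batch_size`, and the model can be constructed with `device="cuda"`), and parse PDFs concurrently instead of one at a time. `batched` is a hypothetical helper, and `len` stands in for the author's pdfplumber parse routine so the snippet runs:

```python
from concurrent.futures import ThreadPoolExecutor

def batched(items, n):
    """Yield successive n-sized slices of items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

# GPU side (assumes sentence_transformers is installed; not run here):
# model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# for batch in batched(all_chunks, 256):
#     vecs = model.encode(batch, batch_size=256)

# CPU side: run the parse step over several files at once.
paths = ["a.pdf", "bb.pdf", "ccc.pdf"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(len, paths))  # stand-in for pool.map(parse_pdf, paths)
```

Since pdfplumber parsing is CPU-bound pure Python, a `ProcessPoolExecutor` would likely beat threads there; threads are shown only to keep the sketch simple.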
