I did something similar whereby I used pdfplumber to extract text from my pdf book collection. I dumped it into postgresql, then chunked the text into 100 char chunks w/ a 10 char overlap. These chunks were directly embedded into a 384D space using python sentence_transformers. Then I simply averaged all chunks for a doc and wrote that single vector back to postgresql. Then I used UMAP + HDBScan to perform dimensionality reduction and clustering. I ended up with a 2D data set that I can plot with plotly to see my clusters. It is very cool to play with this. It takes hours to import 100 pdf files but I can take one folder that contains a mix of programming titles, self-help, math, science fiction etc. After the fully automated analysis you can clearly see the different topic clusters.I just spent time getting it all running on docker compose and moved my web ui from express js to flask. I want to get the code cleaned up and open source it at some point.
johnwatson11218|1 month ago
johnwatson11218|1 month ago
-- -- Name: refresh_topic_tables(); Type: PROCEDURE; Schema: public; Owner: postgres --
CREATE PROCEDURE public.refresh_topic_tables() LANGUAGE plpgsql AS $$ BEGIN -- Drop tables in reverse dependency order DROP TABLE IF EXISTS topic_top_terms; DROP TABLE IF EXISTS topic_term_tfidf; DROP TABLE IF EXISTS term_df; DROP TABLE IF EXISTS term_tf; DROP TABLE IF EXISTS topic_terms;
EXCEPTION WHEN OTHERS THEN RAISE EXCEPTION 'Error refreshing topic tables: %', SQLERRM; END; $$;ct0|1 month ago
hellisad|1 month ago
fittingopposite|1 month ago