btrettel | 8 days ago
https://academia.stackexchange.com/a/173314/31143
https://www.reddit.com/r/datacurator/comments/p75xlu/how_i_o...
I don't read everything I have from start to finish. A lot of this is for future reference.
Since that StackExchange post, I'm now up to about 36.6K PDF files in 4.4K directories, with 14.5K symlinks so I can put files in multiple directories.
I also have a separate version-controlled repo with notes on a bunch of subjects. I'm planning to eventually merge my PDF hierarchy and the notes to have a unified system. It's going to have to be done in stages.
kyboren|7 days ago
I know about Sci-Hub, Anna's Archive, etc., but I'm not so interested in a giant landfill containing all papers ever written. I'm much more interested in a curated collection of useful papers.
btrettel|7 days ago
For copyright reasons I cannot share the entire thing as-is. I have plans to share most of the notes in there and bibliographic data for most directories. Doing so would be a major project in itself, as this was never designed for sharing. There is some information in there that I would prefer to keep private and that will have to be filtered out, and I would like to clean some of it up so it's in a more "presentable" state.
As for how useful you'd find it, I think that depends entirely on the overlap between my interests and yours.
You might be interested in this project of mine: https://github.com/btrettel/specialized-bibs
kamomkoian|7 days ago
I’m curious — when working with such a large collection, how do you typically rediscover material or connect related ideas across different parts of the hierarchy? Do you rely primarily on directory structure, full-text search, or your notes as the main index?
And as you move toward merging the PDFs and notes into a unified system, do you see the notes becoming the central navigation layer, or will the directory structure remain primary?
btrettel|6 days ago
For navigating the directories, I have a Python script called cdref that will search the directory names, which has proved to be very useful. If there's one match, it'll go directly to that directory, and if there are multiple, a TUI will pop up and allow me to select the directory I want.
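To make the behavior described above concrete, here is a minimal sketch of a directory-name search with the "one match goes straight there, several matches ask the user" logic. This is not the actual cdref script, which isn't shown in this thread; the function names (`find_dirs`, `choose`, `prompt_pick`) are hypothetical, case-insensitive substring matching is an assumption, and a plain numbered prompt stands in for the TUI.

```python
import os


def find_dirs(root, query):
    """Return sorted paths of directories under root whose name contains query.

    Matching is a case-insensitive substring test (an assumption; the real
    script may match differently). os.walk does not descend into symlinked
    directories by default, but it still lists them, so they can match.
    """
    query = query.lower()
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if query in name.lower():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def prompt_pick(matches):
    """Stand-in for a TUI: print a numbered list and read a choice from stdin."""
    for i, path in enumerate(matches, 1):
        print(f"{i}: {path}")
    return matches[int(input("Select: ")) - 1]


def choose(matches, pick=None):
    """Return None for no matches, the sole match directly, else ask via pick."""
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    return (pick or prompt_pick)(matches)
```

One design note: a child process can't change the working directory of the shell that launched it, so a script like this would typically print the chosen path and be wrapped in a shell function that runs `cd` on the result.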
I haven't found full-text search of the documents themselves to be particularly useful because terminology varies, what I'm looking for frequently isn't in the text (it could be a figure, for instance), and probably thousands of my documents haven't been OCRed. I think that relying too heavily on full-text search assumes that other people will organize information in a way useful to me, which isn't realistic [1]. Full-text search is still part of my system, but I mostly use it to find things to put in the directories or notes so that I can easily find the documents again without having to remember the right keywords. (Though I also often keep track of useful keywords.)
Often I won't remember where I keep some things or even if I have a directory or note on something at all. So I might accidentally create a redundant directory or note. But frequently I later realize that and use it as an opportunity to increase the connectivity of my directories and notes through symlinks. Then if I go to the "wrong" place, a symlink will send me where I should go. And if something pops into my head as related, I add a symlink or a note in the README file for a particular directory. (The README files in the directories are separate from the version controlled notes but will eventually merge, as I indicated.) Over the years, I've accumulated a lot of connections like this.
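The cross-linking described above can be sketched in a few lines. This is only an illustration under assumptions: the helper name `link_into` is hypothetical, and using relative link targets (so the hierarchy can be moved as a whole) is my choice rather than something stated in this thread. It assumes a POSIX-style filesystem where creating symlinks needs no special privileges.

```python
import os


def link_into(target, directory, name=None):
    """Create a relative symlink to target inside directory; return the link path.

    target may be a file (e.g. a PDF filed elsewhere) or another directory
    (e.g. the "right" place for a topic). If a symlink with that name already
    exists, it is left untouched.
    """
    name = name or os.path.basename(os.path.normpath(target))
    link_path = os.path.join(directory, name)
    if os.path.islink(link_path):
        return link_path
    # Store the target relative to the directory that will hold the link.
    rel_target = os.path.relpath(target, start=directory)
    os.symlink(rel_target, link_path)
    return link_path
```

For example, `link_into("refs/jets/breakup", "refs/sprays", name="see-jet-breakup")` would leave a pointer in the sprays directory that leads to the jet breakup directory (both paths are made up for illustration).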
With all of this said, I think the important thing is to find a system that works for you and that you can slowly scale over time. It doesn't need to look like my system. I've iteratively developed a system that works for me over 10+ years at this point. Scale comes easily if you contribute a bit to your system on a regular basis over a long period of time.
[1] I've also been looking into having a large local bibliographic database, in part as an alternative to online scientific search engines like Google Scholar, because I don't want to assume such services will always be available.