btrettel | 8 days ago

A directory hierarchy works well for me. I've described my setup online before:

https://academia.stackexchange.com/a/173314/31143

https://www.reddit.com/r/datacurator/comments/p75xlu/how_i_o...

I don't read everything I have from start to finish. A lot of this is for future reference.

Since that StackExchange post, I'm now up to about 36.6K PDF files in 4.4K directories, with 14.5K symlinks so I can put files in multiple directories.
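Counts like these could be gathered with a short script. The following is only a sketch, not the script behind the numbers above: the function name `collection_stats` is hypothetical, and it assumes that a symlink is counted as a symlink rather than as a PDF or directory and that the root itself is not counted as a directory.

```python
import os


def collection_stats(root):
    """Count PDF files, directories, and symlinks below root (root excluded)."""
    stats = {"pdf_files": 0, "directories": 0, "symlinks": 0}
    # os.walk does not descend into symlinked directories by default,
    # so nothing is visited twice.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            if os.path.islink(os.path.join(dirpath, name)):
                stats["symlinks"] += 1  # symlink to a directory
            else:
                stats["directories"] += 1
        for name in filenames:
            if os.path.islink(os.path.join(dirpath, name)):
                stats["symlinks"] += 1  # symlink to a file (or a broken link)
            elif name.lower().endswith(".pdf"):
                stats["pdf_files"] += 1
    return stats
```

Calling `collection_stats("/path/to/archive")` returns a dictionary with the three counts.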

I also have a separate version-controlled repo with notes on a bunch of subjects. I'm planning to eventually merge my PDF hierarchy and the notes to have a unified system. It's going to have to be done in stages.

kyboren|7 days ago

How many GB is your PDF collection? Have you considered sharing it more widely?

I know about Sci-Hub, Anna's Archive, etc., but I'm not so interested in a giant landfill containing all papers ever written. I'm much more interested in a curated collection of useful papers.

btrettel|7 days ago

The root directory of the archive is 142 GB. It's not only PDFs, but it's mostly PDFs. It includes many things that were never online and some things that were online at one point but are not online any longer.

For copyright reasons I cannot share the entire thing as-is. I have plans to share most notes in there and bibliographic data for most directories. Doing so would be a major project in itself as this was never designed for that. Some information in there that I would prefer to keep private will have to be filtered out, and I would prefer to clean some of it up to be in a more "presentable" state.

As for how useful you'd find it, I think that depends entirely on the overlap between my interests and yours.

You might be interested in this project of mine: https://github.com/btrettel/specialized-bibs

kamomkoian|7 days ago

That’s an impressive and thoughtfully structured system, especially at that scale. The use of symlinks and a separate version-controlled notes repository makes a lot of sense for long-term archival.

I’m curious: when working with such a large collection, how do you typically rediscover material or connect related ideas across different parts of the hierarchy? Do you rely primarily on directory structure, full-text search, or your notes as the main index?

And as you move toward merging the PDFs and notes into a unified system, do you see the notes becoming the central navigation layer, or will the directory structure remain primary?

btrettel|6 days ago

It's mostly navigating the PDF directories or notes repository, full-text search of my notes, or (less frequently) searching Zotero for bibliographic data. I don't use tagging for this, and I'll address full-text search of the documents in a bit. I can't say that either direct navigation or text search of the notes is dominant, as I do a lot of both. Having multiple ways to find information provides redundancy: if one way fails, you can try another. So I don't think the balanced approach I have will change in the future.

For navigating the directories, I have a Python script called cdref that will search the directory names, which has proved to be very useful. If there's one match, it'll go directly to that directory, and if there are multiple, a TUI will pop up and allow me to select the directory I want.
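The behavior described above can be sketched roughly as follows. This is not the actual cdref script. The function names `find_matching_dirs` and `choose` are hypothetical, matching is assumed to be a case-insensitive substring test on directory names, and a plain numbered prompt stands in for the TUI.

```python
import os
import sys


def find_matching_dirs(root, query):
    """Return sorted paths of directories under root whose name contains query."""
    query = query.lower()
    matches = []
    # Symlinks to directories show up in dirnames, so they can match too,
    # but os.walk does not descend into them by default.
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if query in name.lower():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def choose(matches, ask=input):
    """Return the only match directly, or let the user pick among several."""
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    # The menu goes to stderr so that stdout can carry only the chosen path.
    for number, path in enumerate(matches, start=1):
        print(f"{number}: {path}", file=sys.stderr)
    print("Select a directory number:", file=sys.stderr)
    try:
        index = int(ask()) - 1
    except ValueError:
        return None
    if 0 <= index < len(matches):
        return matches[index]
    return None
```

A child process cannot change its parent shell's working directory, so a tool like this would typically print the chosen path and leave the actual `cd` to a small shell function that wraps it.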

I haven't found full-text search of the documents themselves to be particularly useful because terminology varies, frequently what I'm looking for isn't in the text (could be a figure, for instance), and probably thousands of my documents haven't been OCRed. I think that relying too heavily on full-text search of the documents assumes that other people will organize information in a way useful to me, which isn't realistic [1]. Full-text search of the documents is still part of my system, but it's mostly used to find things to put in the directories or notes so that I can easily find the documents again without having to remember the right keywords. (Though I also often keep track of useful keywords.)

Often I won't remember where I keep some things or even if I have a directory or note on something at all. So I might accidentally create a redundant directory or note. But frequently I later realize that and use it as an opportunity to increase the connectivity of my directories and notes through symlinks. Then if I go to the "wrong" place, a symlink will send me where I should go. And if something pops into my head as related, I add a symlink or a note in the README file for a particular directory. (The README files in the directories are separate from the version-controlled notes but will eventually be merged with them, as I indicated.) Over the years, I've accumulated a lot of connections like this.
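Adding such a cross-reference might look like the sketch below. The function name `add_cross_reference` is hypothetical, and the choice of a relative link target is an assumption made so that the links keep working if the whole hierarchy is moved. It is not necessarily how the links in this archive are created.

```python
import os


def add_cross_reference(canonical_dir, alias_path):
    """Create a relative symlink at alias_path that points to canonical_dir."""
    if os.path.lexists(alias_path):
        # Refuse to overwrite an existing file, directory, or link.
        raise FileExistsError(f"{alias_path} already exists")
    parent = os.path.dirname(os.path.abspath(alias_path))
    os.makedirs(parent, exist_ok=True)
    # Express the target relative to the directory that will hold the link.
    relative_target = os.path.relpath(os.path.abspath(canonical_dir), start=parent)
    os.symlink(relative_target, alias_path, target_is_directory=True)
    return relative_target
```

For example, `add_cross_reference("fluids/jets", "sprays/jets")` would make the "wrong" place `sprays/jets` lead to `fluids/jets`. Both directory names are made up for illustration.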

With all of this said, I think the important thing is to find a system that works for you that you can slowly scale over time. It doesn't need to look like my system. I've iteratively developed mine over 10+ years at this point. Reaching this scale is easy if you contribute a bit to your system on a regular basis over a long period of time.

[1] I've also been looking into having a large local bibliographic database, in part as an alternative to online scientific search engines like Google Scholar, because I don't want to assume such services will always be available.