I made a command line utility called `dedup` a while back to do the same thing. It has a dry-run mode, will “intelligently” choose the best clone source, understands hard links and other clones, preserves metadata, deals with HFS compressed files properly. It hasn’t destroyed any of my own data, but like any file system tool, use at your own risk.
Replying to myself now that I've had a chance to try the scan, but not the deduplication. I work with disc images, program binaries, and intermediate representations in a workspace that's 7.6GB.
A few notes:
* By default it doesn't scan everything. It ignores all files except those in an allow list. The way the allow list is structured, it seems like Hyperspace needs to understand the content of a file. As an end user, I have no idea what the difference between a Text file and a Source Code file would be, or how Hyperspace would know. Hyperspace only found 360MB to dedup. Allowing all files increased that to 842MB.
* It doesn't scan files smaller than 100 KB by default. Disabling the size limit along with allowing all files increased that to 1.1GB.
* With all files and no size limit it scanned 67,309 of 68,874 files. `dedup` scans 67,426.
* It says 29,522 files are eligible, meaning they can be deduped. `dedup` only finds 29,447, a difference of 75. There are 76 already-deduped files, which is an off-by-one, so I'm not sure where the discrepancy comes from.
* Scanning files in Hyperspace took around 50s vs `dedup` at 14s
* It seems to scan the file system, then do a duplicate calculation, then do the deduplication. I'm not sure why the first two aren't done together. I chose to queue filesystem metadata as it was scanned and, in parallel, start calculating duplicates. The vast majority of files can be ruled out by a size mismatch, and size is available from `fts_read` "for free" while traversing the directory.
* Hyperspace found 1.1GB to save, `dedup` finds 1.04GB and 882MB already saved (from previous deduping)
* I'm not going to buy Hyperspace at this time, so I don't know how long it takes to dedup or if it preserves metadata or deals with strange files. `dedup` took 31s to scan and deduplicate.
* After deduping with `dedup`, Hyperspace thinks there are still 2 files that can be deduped.
* Hyperspace seems to understand it can't dedup files with multiple hard links, empty files, and some of the other things `dedup` also checks for.
* I can't test ACLs or any other attribute preservation like that without paying. `strings` suggests those are handled. HFS Compression is a tricky edge case, but I haven't tested how Hyperspace's scan deals with those.
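The size-first filtering described in the notes above can be sketched roughly like this. This is an illustrative Python sketch, not `dedup`'s or Hyperspace's actual implementation; `os.scandir` exposes stat data gathered during traversal, similar to how `fts_read` supplies it in C:

```python
import os
from collections import defaultdict

def size_groups(root):
    """Bucket regular files under `root` by size. Only buckets with two or
    more entries can contain duplicates, so most files are ruled out
    without reading a single byte of content."""
    by_size = defaultdict(list)
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # Size comes with the directory entry, much like
                    # fts_read supplies it during traversal.
                    size = entry.stat(follow_symlinks=False).st_size
                    by_size[size].append(entry.path)
    return {s: p for s, p in by_size.items() if len(p) > 1}
```

Only files that survive this pass need to be hashed or compared byte-by-byte, which is why doing it during traversal is nearly free.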
Just tried it, and it works well! I didn't realize the potential of this technique until I saw just how many dupes there were of certain types of files, especially in node_modules. It wasn't uncommon to see it replace 50 copies of some js file with one, and that was just in a specific subdirectory.
I see it is "pre-release" and sort of low GH stars (== usage?), so I'm curious about the stability since this type of tool is relatively scary if buggy.
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.
After all, how many perfect duplicate files do you accidentally create in a month?
There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.
And you can always rerun it for free to see if you have enough stuff worth paying for again.
For me the value in a dedup app like this isn't so much the space savings, since I just don't generate huge amounts, but the lack of duplicated files, any of which individually (or all in aggregate) may be large. There are some weird scenarios where this occurs, usually from having to reconcile a hard drive recovery with another copy of the files, or a messy download directory with an organized destination.
For example, I discovered my time machine backup kicked out the oldest versions of files I didn't know it had a record of and thought I'd long since lost, but it destroyed the names of the directories and obfuscated the contents somewhat. Thousands of numerically named directories, some of which have files I may want to hang onto, but don't know whether I already have them or not, or where they are since it's completely unstructured. Likewise, many of them may just have one of the same build system text file I can obvs toss away.
Am I really so old that I remember this being the default for most software about 10 years ago? Are people already so used to the subscription trap that they think this is a new model?
it’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards which you can lock for those so if you forget to cancel the charges are blocked)
however has anyone been able to find out from the website how much the license actually costs?
What algorithm does the application use to figure out if two files are identical? There are a lot of interesting algorithms out there: hashes, bit-by-bit comparison, etc. But these techniques have their own disadvantages. What is the best way to do this for a large number of files?
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA-256 hash of the file's contents: it's very fast, not too large (32 bytes), and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA-256 is the de facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
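A hedged sketch of that keying step (the function name is illustrative; reading in chunks keeps memory flat for large files):

```python
import hashlib

def file_key(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in 1 MiB chunks so that
    arbitrarily large files never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.digest()  # 32 bytes, usable directly as a dict key
```

Two files are then duplicate candidates exactly when their keys compare equal.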
This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable storage system that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.
I'd hash the first 1024 bytes of all files and go from there if there are any collisions. That way you don't need to hash whole (large) files, only the ones whose prefix hashes collide.
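That two-stage scheme might look like this (illustrative only: a 1024-byte prefix hash first, then a full-content hash only for files whose prefixes collide):

```python
import hashlib
from collections import defaultdict

def dedup_candidates(paths):
    """Group paths by a hash of their first 1024 bytes, then confirm
    groups with a full-content hash. Returns lists of identical files."""
    by_prefix = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            by_prefix[hashlib.sha256(f.read(1024)).digest()].append(p)

    groups = []
    for same_prefix in by_prefix.values():
        if len(same_prefix) < 2:
            continue  # unique prefix: cannot be a duplicate
        by_full = defaultdict(list)
        for p in same_prefix:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                while chunk := f.read(1 << 20):
                    h.update(chunk)
            by_full[h.digest()].append(p)
        groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

The expensive full-file reads happen only for the (usually small) set of prefix collisions.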
Just want to mention: Apple ships a modified version of the copy command (good old cp) that supports the ability to use the cloning feature of APFS by using the -c flag.
And in case your cp doesn't support it, you could also do it by invoking Python. Something like `import Foundation; Foundation.NSFileManager.defaultManager().copyItemAtPath_toPath_error_(...)`.
> Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).
How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.
I gave it a try on my massive folder of NodeJS projects, but it only found 1GB of savings in an 8.1GB folder.
I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
I tried to scan System and Library but it refused to do so because of permission issues.
I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.
Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.
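A minimal sketch of that approach (not the parent's actual script; all names here are illustrative). It hashes contents and replaces later duplicates with hard links to the first copy. Note this is destructive by design, since hard-linked files share all future edits:

```python
import hashlib
import os

def hardlink_dupes(root):
    """Replace duplicate regular files under `root` with hard links to
    the first file seen with the same content. Destructive: hard-linked
    files are one inode, so an edit to either path changes both."""
    seen = {}  # content digest -> canonical path
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).digest()
            first = seen.setdefault(digest, path)
            if first != path and not os.path.samefile(first, path):
                tmp = path + ".dedup-tmp"
                os.link(first, tmp)    # create the new hard link first,
                os.replace(tmp, path)  # then atomically swap it in
```

Linking to a temporary name and renaming over the original avoids a window where the path doesn't exist.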
Would be nice if git could make use of this on macOS.
Each worktree I usually work on is several gigs of (mostly) identical files.
Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.
(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.
There is also ".git/objects/info/alternates", accessed via the "--shared"/"--reference" options of "git clone", which allows sharing only the object storage and not branches etc., but it has caveats, and I've only used it in some special circumstances.
Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?
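As background for the claim above about git's store: git keys every blob by a hash of its content, so identical files across commits and directories are stored exactly once. A toy content-addressed store illustrating the idea (not git's actual object format):

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: writing the same bytes twice
    stores them once, under their content hash."""
    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()  # git historically keys by SHA-1
        self.objects.setdefault(key, data)    # no-op if already stored
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]
```

A thousand commits containing the same file all reference one stored blob, which is why snapshots stay cheap.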
> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.
Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.
No because it isn't getting rid of the duplicate, it's using a feature of APFS that allows for duplicates to exist separately but share the same internal data.
The author of the software is a file system enthusiast (so much so that on the podcast he's a part of, they have a dedicated sound effect every time "filesystem" comes up), a long-time blogger, and a macOS reviewer. So you'll have to see it in that context: documenting every bit and the technical details behind it is important to him, even if it's longer than a tag line on a landing page.
In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.
Oh wow, what a funny coincidence. I hadn't visited the site in a couple of years but someone linked me "Front and Center" yesterday, so I saw the icon for this app and had no clue it had only appeared there maybe hours earlier.
The idea is not new, of course; I wrote one of these (for Linux, with hardlinks) years ago, but in the end I just deleted all the duplicate files in my mp3 collection and didn't touch the rest of the files on the disk, because not much space was reclaimed.
I wonder for whom this really saves a lot of space. (I saw someone mentioning node_modules, had to chuckle there).
But today I learned about this APFS feature, nice.
Interesting idea, and I like the idea of people getting paid for making useful things.
Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.
> I like the idea of people getting paid for making useful things
> It would be nice if it was open source
> I get a data security itch having a random piece of software from the internet scan every file on an HD
With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.
I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting default fairer or safer.
--
The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.
You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.
Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.
In earlier episodes of ATP when they were musing on possible names, one listener suggested the frankly amazing "Dupe Nukem". I get that this is a potential IP problem, which is why John didn't use it, but surely Duke Nukem is not a zealously-defended brand in 2025. I think interest in that particular name has been stone dead for a while now.
0 - https://github.com/ttkb-oss/dedup
I ran it over my Postgres development directories that have almost identical files. It saved me about 1.7GB.
The project doesn't have any license associated with it. If you don't mind, could you please release it under a license of your choice?
As a gesture of thanks, I have attempted to improve the installation step slightly and have created this pull request: https://github.com/ttkb-oss/dedup/pull/6
I was also really impressed that `make` ran basically instantly.
I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.
I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?
https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...
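To answer the question directly: modifying A_0 leaves A_1 untouched. With copy-on-write clones, a write allocates fresh blocks for the changed range of A_0 only, while A_1 keeps referencing the original blocks. A toy Python model of that behavior (purely illustrative; real APFS clones share storage at the block level inside the filesystem):

```python
class CowFile:
    """Toy copy-on-write 'file': clones share block objects until one
    side writes, at which point only the writer gets a fresh block."""

    def __init__(self, blocks):
        self.blocks = blocks  # list of bytes objects, possibly shared

    def clone(self):
        # A clone copies only the block *references*, not the data.
        return CowFile(list(self.blocks))

    def write_block(self, i, data):
        # Writing replaces one reference; a sibling clone still holds
        # the original block, and all other blocks remain shared.
        self.blocks[i] = bytes(data)

    def read(self):
        return b"".join(self.blocks)
```

So two clones cost roughly one file's worth of storage until they diverge, and then only the diverged blocks are duplicated.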
https://github.com/albertz/system-tools/blob/master/bin/merg...
If it works, it's a no-brainer, so why isn't it the default?
https://learn.microsoft.com/en-us/windows/dev-drive/#dev-dri...
Here’s a question though: how does this work with transparently compressed files on APFS?
In my past experience, using reflinks is fine and using transparent compression is fine, but combining them leads to hard-to-debug file corruption.