
Hyperspace

849 points | tobr | 1 year ago | hypercritical.co

456 comments

[+] jonhohle|1 year ago|reply
I made a command line utility called `dedup` a while back to do the same thing. It has a dry-run mode, will “intelligently” choose the best clone source, understands hard links and other clones, preserves metadata, deals with HFS compressed files properly. It hasn’t destroyed any of my own data, but like any file system tool, use at your own risk.

0 - https://github.com/ttkb-oss/dedup

[+] jonhohle|1 year ago|reply
Replying to myself now that I've had a chance to try the scan, but not the deduplication. I work with disc images, program binaries, and intermediate representations in a workspace that's 7.6G.

A few notes:

* By default it doesn't scan everything. It ignores all files but those in an allow list. The way the allow list is structured, it seems like Hyperspace needs to understand the content of a file. As an end user, I have no idea what the difference between a Text file and a Source Code file would be or how Hyperspace would know. Hyperspace only found 360MB to dedup. Allowing all files increased that to 842MB.

* It doesn't scan files smaller than 100 KB by default. Disabling the size limit along with allowing all files increased that to 1.1GB.

* With all files and no size limit it scanned 67,309 of 68,874 files. `dedup` scans 67,426.

* It says 29,522 files are eligible. Eligible means they can be deduped. `dedup` only finds 29,447. There are 76 already deduped files, which is an off-by-one, so I'm not sure what the difference is.

* Scanning files in Hyperspace took around 50s vs `dedup` at 14s

* It seems to scan the file system, then do a duplicate calculation, then do the deduplication. I'm not sure why the first two aren't done together. I chose to queue filesystem metadata as it was scanned and start calculating duplicates in parallel. The vast majority of the time, files can be ruled out by a size mismatch, and size is available from `fts_read` "for free" while traversing the directory.

* Hyperspace found 1.1GB to save, `dedup` finds 1.04GB and 882MB already saved (from previous deduping)

* I'm not going to buy Hyperspace at this time, so I don't know how long it takes to dedup or if it preserves metadata or deals with strange files. `dedup` took 31s to scan and deduplicate.

* After deduping with `dedup`, Hyperspace thinks there are still 2 files that can be deduped.

* Hyperspace seems to understand it can't dedup files with multiple hard links, empty files, and some of the other things `dedup` also checks for.

* I can't test ACLs or any other attribute preservation like that without paying. `strings` suggests those are handled. HFS Compression is a tricky edge case, but I haven't tested how Hyperspace's scan deals with those.
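
The size-first pass described in the bullets above (ruling out files whose sizes are unique before any hashing happens) can be sketched in Python. This is an illustration of the general technique, not `dedup`'s actual implementation, which uses `fts_read` in C:

```python
import os
import stat
from collections import defaultdict

def group_by_size(root):
    """Bucket regular files under `root` by size. A file whose size is
    unique can never have a duplicate, so it is discarded before any
    (much more expensive) content hashing happens."""
    buckets = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                buckets[st.st_size].append(path)
    # Keep only sizes shared by two or more files.
    return {size: paths for size, paths in buckets.items() if len(paths) > 1}
```

Only the surviving buckets need to be hashed and compared, which is why the size check dominates total scan time on typical trees.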

[+] gurjeet|1 year ago|reply
Thank you for creating and sharing this utility.

I ran it over my Postgres development directories that have almost identical files. It saved me about 1.7GB.

The project doesn't have any license associated with it. If you don't mind, can you please license this project with a license of your choice.

As a gesture of thanks, I have attempted to improve the installation step slightly and have created this pull request: https://github.com/ttkb-oss/dedup/pull/6

[+] LVB|1 year ago|reply
Just tried it, and it works well! I didn't realize the potential of this technique until I saw just how many dupes there were of certain types of files, especially in node_modules. It wasn't uncommon to see it replace 50 copies of some js file with one, and that was just in a specific subdirectory.

I see it is "pre-release" and sort of low GH stars (== usage?), so I'm curious about the stability since this type of tool is relatively scary if buggy.

[+] actinium226|1 year ago|reply
Wow, that's some excellent documentation.

I was also really impressed that `make` ran basically instantly.

[+] bob1029|1 year ago|reply
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.

This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?

[+] petercooper|1 year ago|reply
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
[+] MBCook|1 year ago|reply
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.

After all how many perfect duplicate files do you probably create a month accidentally?

There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.

And you can always rerun it for free to see if you have enough stuff worth paying for again.

[+] brailsafe|1 year ago|reply
For me the value in a dedup app like this isn't as much the space savings, since I just don't generate huge amounts, but it's the lack of duplicated files, some of which or all in aggregate may be large. There are some weird scenarios where this occurs, usually due to having to reconcile a hard drive recovery with another location for the files, or a messy download directory with an organized destination.

For example, I discovered my time machine backup kicked out the oldest versions of files I didn't know it had a record of and thought I'd long since lost, but it destroyed the names of the directories and obfuscated the contents somewhat. Thousands of numerically named directories, some of which have files I may want to hang onto, but don't know whether I already have them or not, or where they are since it's completely unstructured. Likewise, many of them may just have one of the same build system text file I can obvs toss away.

[+] mentalgear|1 year ago|reply
Am I really that old that I remember this being the default for most software about 10 years ago? Are people already so used to the subscription trap that they think this is a new model?
[+] sejje|1 year ago|reply
I also really like this pricing model.

I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.

[+] jedbrooke|1 year ago|reply
it’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards which you can lock for those so if you forget to cancel the charges are blocked)

however has anyone been able to find out from the website how much the license actually costs?

[+] astennumero|1 year ago|reply
What algorithm does the application use to figure out if two files are identical? There's a lot of interesting algorithms out there. Hashes, bit by bit comparison etc. But these techniques have their own disadvantages. What is the best way to do this for a large amount of files?
[+] borland|1 year ago|reply
I don't know exactly what Siracusa is doing here, but I can take an educated guess:

For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.

The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
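
A minimal sketch of that approach in Python, assuming a SHA-256 of the full contents as the key (this is the commenter's educated guess at the technique, not Siracusa's actual code):

```python
import hashlib
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.digest()  # 32-byte key

def find_duplicates(paths):
    """Map content hash -> paths; any group longer than one is a dupe set."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[sha256_of(p)].append(p)
    return [group for group in by_hash.values() if len(group) > 1]
```

In practice you would only hash files that already matched on size, since hashing is the expensive step.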

[+] w4yai|1 year ago|reply
I'd hash the first 1024 bytes of every file, and go from there if there are any collisions. That way you don't need to hash whole (large) files, only the ones whose prefix hashes match.
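
That prefix-then-full-hash idea can be sketched as follows (the 1024-byte prefix size and the helper names are just illustrative):

```python
import hashlib
from collections import defaultdict

PREFIX = 1024  # bytes hashed in the cheap first pass

def prefix_hash(path):
    """Hash only the first PREFIX bytes: cheap, catches most mismatches."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(PREFIX)).digest()

def full_hash(path):
    """Full-content hash, paid only for files whose prefixes collided."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.digest()

def two_stage_duplicates(paths):
    by_prefix = defaultdict(list)
    for p in paths:
        by_prefix[prefix_hash(p)].append(p)
    dupes = []
    for group in by_prefix.values():
        if len(group) < 2:
            continue  # unique prefix: cannot be a duplicate
        by_full = defaultdict(list)
        for p in group:
            by_full[full_hash(p)].append(p)
        dupes.extend(g for g in by_full.values() if len(g) > 1)
    return dupes
```
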
[+] williamsmj|1 year ago|reply
Deleted comment based on a misunderstanding.
[+] alwillis|1 year ago|reply
Just want to mention: Apple ships a modified version of the copy command (good old cp) that supports the ability to use the cloning feature of APFS by using the -c flag.
[+] kccqzy|1 year ago|reply
And in case your cp doesn't support it, you could also do it by invoking Python. Something like `import Foundation; Foundation.NSFileManager.defaultManager().copyItemAtPath_toPath_error_(...)`.
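
For scripting the same clone without PyObjC, here is a hedged sketch that calls the macOS `clonefile(2)` syscall directly through `ctypes`, falling back to a plain copy on other platforms or when cloning fails (the signature `int clonefile(const char *src, const char *dst, int flags)` is taken from the macOS manpage; this is illustrative, not production-hardened):

```python
import ctypes
import ctypes.util
import shutil
import sys

def clone_or_copy(src, dst):
    """Create `dst` as an APFS clone of `src` where possible, else copy.

    A clone shares data blocks with the original until either file is
    modified (copy-on-write), so it costs almost no disk space.
    """
    if sys.platform == "darwin":
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        if hasattr(libc, "clonefile"):
            rc = libc.clonefile(src.encode(), dst.encode(), 0)
            if rc == 0:
                return "cloned"
            # Cloning fails on non-APFS volumes or across volumes;
            # fall through to an ordinary copy.
    shutil.copy2(src, dst)
    return "copied"
```
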
[+] BWStearns|1 year ago|reply
I have file A that's in two places and I run this.

I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?

[+] mattgreenrocks|1 year ago|reply
What jumped out to me:

> Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).

How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.

[+] TrapLord_Rhodo|1 year ago|reply
Start with a story, narrow it down to a problem and show how your solution magically solves that problem. Such a fine example of GREAT marketing.
[+] bhouston|1 year ago|reply
I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings on an 8.1GB folder.

I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)

I tried to scan System and Library but it refused to do so because of permission issues.

I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.

Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.

[+] albertzeyer|1 year ago|reply
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.

My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.

https://github.com/albertz/system-tools/blob/master/bin/merg...
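
The hardlink flavor of this (unlike an APFS clone, where editing one copy leaves the other untouched, a write through any hard link is visible through all of them) can be sketched as below; `link_duplicates` is a hypothetical helper, not the linked script:

```python
import os

def link_duplicates(groups):
    """Replace each duplicate with a hard link to its group's first file.

    `groups` is a list of lists of paths with identical content
    (e.g. the output of a hash-based duplicate scan).
    Returns the number of bytes freed.
    """
    saved = 0
    for group in groups:
        keeper = group[0]
        keeper_st = os.stat(keeper)
        for path in group[1:]:
            st = os.stat(path)
            if st.st_ino == keeper_st.st_ino and st.st_dev == keeper_st.st_dev:
                continue  # already the same inode
            if st.st_dev != keeper_st.st_dev:
                continue  # hard links cannot cross filesystems
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)   # create the new link first...
            os.replace(tmp, path)  # ...then atomically swap it in
            saved += st.st_size
    return saved
```

The link-then-rename dance means a crash mid-operation never leaves the path missing, only possibly a stray `.dedup-tmp` file.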

[+] notatallshaw|1 year ago|reply
uv does this out of the box, I think other tools (poetry, hatch, pdm, etc.) do as well but I have less experience with the details.
[+] jamesfmilne|1 year ago|reply
Would be nice if git could make use of this on macOS.

Each worktree I usually work on is several gigs of (mostly) identical files.

Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.

(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)

[+] theamk|1 year ago|reply
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.

There is also ".git/objects/info/alternates", accessed via the "--shared"/"--reference" options of "git clone", that allows sharing only the object storage and not branches etc... but it has caveats, and I've only used it in some special circumstances.

[+] globular-toast|1 year ago|reply
Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?
[+] diggan|1 year ago|reply
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?
[+] whalesalad|1 year ago|reply
Was pleasantly surprised to see that this is John Siracusa - the original GOAT when it comes to macOS release articles on Ars.
[+] diggan|1 year ago|reply
> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.

Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.

[+] protonbob|1 year ago|reply
No, because it isn't getting rid of the duplicate; it's using a feature of APFS that allows duplicates to exist separately but share the same underlying data.
[+] dewey|1 year ago|reply
The author of the software is a file system enthusiast (so much so that on the podcast he's a part of they have a dedicated sound effect every time "filesystem" comes up), a long time blogger, and a macOS reviewer. So you'll have to see it in that context: documenting every bit and every technical detail behind it is important to him... even if it's longer than a tag line on a landing page.

In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.

[+] wink|1 year ago|reply
Oh wow, what a funny coincidence. I hadn't visited the site in a couple of years but someone linked me "Front and Center" yesterday, so I saw the icon for this app and had no clue it had only appeared there maybe hours earlier.

The idea is not new, of course, and I've written one of these (for Linux, with hardlinks) years ago but in the end just deleted all the duplicate files in my mp3 collection and didn't touch the rest of the files on the disk, because not a lot of size was reclaimed.

I wonder for whom this really saves a lot of space. (I saw someone mentioning node_modules, had to chuckle there).

But today I learned about this APFS feature, nice.

[+] exitb|1 year ago|reply
What are examples of files that make up the "dozens of gigabytes" of duplicated data?
[+] xnx|1 year ago|reply
There are some CUDA files that every local AI app install that take multiple GB.
[+] zerd|1 year ago|reply
.terraform, rust target directory, node_modules.
[+] password4321|1 year ago|reply
iMovie used to copy video files etc. into its "library".
[+] butlike|1 year ago|reply
audio files; renders, etc.
[+] bsimpson|1 year ago|reply
Interesting idea, and I like the idea of people getting paid for making useful things.

Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.

[+] Nevermark|1 year ago|reply
> I like the idea of people getting paid for making useful things

> It would be nice if it was open source

> I get a data security itch having a random piece of software from the internet scan every file on an HD

With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.

I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting default fairer or safer.

--

The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.

[+] benced|1 year ago|reply
You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.

Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.

[+] Analemma_|1 year ago|reply
In earlier episodes of ATP when they were musing on possible names, one listener suggested the frankly amazing "Dupe Nukem". I get that this is a potential IP problem, which is why John didn't use it, but surely Duke Nukem is not a zealously-defended brand in 2025. I think interest in that particular name has been stone dead for a while now.
[+] gcr|1 year ago|reply
I’ve experimented with reflinks and other APFS operations.

Here’s a question though: how does this work with transparently compressed files on APFS?

In my past experience, using reflinks is fine and using transparent compression is fine, but combining them leads to hard-to-debug file corruption.

[+] PStamatiou|1 year ago|reply
Wasn't able to use it on a few directories I tried as they were inside iCloud Drive.