Great to see there's some resistance. What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government. I'm assuming good intentions on the part of Harvard, but keeping this data online against the express will of the government is gonna cost (political) capital. And from what I can see, the archive is hosted by US entities on US-controlled servers on US soil?
This is the same thing that's been bothering me with archive.org lately, by the way. I haven't found a good way to simply (for some reasonable definition of "simple") contribute 10 TiB or so of redundant storage on my (European) home server either. That kind of thing might (have to) serve to ensure tamper-resistance for that data, given the current political climate on both sides of the pond. Any pointers welcome.
> What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government.
Maybe this?
> In addition to the data collection, we are releasing open source software and documentation for replicating our work and creating similar repositories. With these tools, we aim not only to preserve knowledge ourselves but also to empower others to save and access the data that matters to them.
I think a fully distributed storage system must be the way here. There must be some IPFS type system where Harvard could say "we designated a set of data that we can add to as needed but only delete from with a critical mass of storage providers' consent, here are some instructions for you to add your spare capacity to become a storage provider".
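To make the "add freely, delete only with a critical mass of consent" idea concrete, here is a minimal sketch. It is not how IPFS, Filecoin, or Harvard's tooling works; the class name `QuorumArchive`, the provider IDs, and the strict-majority threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class QuorumArchive:
    """Toy model: anyone may add items, deletion needs a provider quorum."""

    providers: Set[str]                 # hypothetical storage-provider IDs
    quorum: float = 0.5                 # assumed threshold: more than half
    items: Set[str] = field(default_factory=set)
    delete_votes: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, item: str) -> None:
        # Additions need no consent.
        self.items.add(item)

    def vote_delete(self, provider: str, item: str) -> bool:
        """Record a delete vote; return True only if the item was removed."""
        if provider not in self.providers or item not in self.items:
            return False
        votes = self.delete_votes.setdefault(item, set())
        votes.add(provider)
        if len(votes) / len(self.providers) > self.quorum:
            self.items.discard(item)
            del self.delete_votes[item]
            return True
        return False
```

With three providers, a single delete vote leaves the item in place and a second vote removes it. A real system would also need signed votes and replicated state, which this sketch leaves out.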
It might not be ironclad, but the ease with which federal workers can fiddle with data they're hosting themselves vs. fiddling with data in Harvard's library is a pretty big difference. And if it ever came to demands for censorship, it wouldn't be Harvard Library's first rodeo.
>how they intend to secure this "vault" against the current government
Is there any risk of the government ordering them to take it down? That seems unlikely to me. The US has strong free speech protection, stronger than European free speech protection.
>keeping this data online against the express will of the government is gonna cost (political) capital.
Costing them political capital (aka the government is unhappy) is different from the government ordering them to take it down. Also, when you say "express will", are you saying the government has explicitly publicly stated that they don't like that Harvard is hosting this data?
From the announcement: "This work is made possible with support from the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund."
> how they intend to secure this "vault" against the current government
Definitely a concern. If they want to harass Harvard and the other universities they could, but I don't think they'll bother. They know the data will be backed up; that's not the point.
Taking it off of data.gov accomplishes two things:
1) Makes it look like they're doing something, playing to the base. Easy to do.
2) Delegitimize any insights the data might have. "Sure you have 'data', is it official data? I don't see it on data.gov. How do we know it's not fraudulent?" It makes it harder to use it to justify policy changes. It adds one more tool to the denial crowd.
If I remember correctly, Harvard has immunity to eminent domain under the Massachusetts constitution. Maybe it has a similar right which would make it immune to such attacks?
>I'm assuming good intentions on the part of Harvard, but keeping this data online against the express will of the government is gonna cost (political) capital.
Harvard, like other liberal institutions, has little to no political capital in a Republican White House in the first place. Why would it cost them any to host data that is in the public domain?
This sort of silly overreaction is part of why people voted for Trump: it is genuinely funny to see people overreact to Trump because they've been told by the legacy media that he is a "threat to democracy" over and over.
Last time he was elected his government also removed a bunch of climate data from govt websites which was quickly mirrored by third parties. Nobody was taken away by the Gestapo then and there is no reason to think things will be any different this time.
Is anyone out there archiving USGS/NOAA datasets? It sounds ridiculous, but this appears to be where we are now. There is a submission about NOAA on the front page now: "Scientists on alert as NOAA restricts contact with foreign nationals" [1]
I find it amusing that the might of the American government -- in trying to take a bunch of data offline -- is being resisted by a digital "militia" of hobbyist archivers and non-profits.
There's something about this that just rings Second Amendment. Personally I think the concept of civilians having weapons to be a check on a nation state is absurd, but in this case it feels pretty empowering.
Well, I wouldn't really call it the "American Government" per se... It's a geriatric former reality TV show host elected to the presidency by offering to do for America what he did for steak or private education. That guy and his cronies really aren't the American Government. They were just elected to be in charge of the American Government.
This is a topic that came up at work today, as we rely on this data and are considering backing up most of the Lidar data from there ourselves (100s of TB probably).
Very happy this is happening. There's a ridiculous amount of incredibly valuable data, scientific documents, etc. "out there" that are at risk.
I haven't had much time to look at this yet and see what all is there, but whether currently included or not, a couple of things I really hope get archived are the contents of the DTIC (Defense Technical Information Center) document repository (lots of really interesting older scientific publications) and the NASA NTRS (NASA Technical Reports Server).
I'm working on my own archive of at least some portion of the DTIC stuff just to be on the safe side. So far everything I've tried to access is still there, but who knows how long that will last.
Honestly a shame it has to come to this. Sure, people elected this administration, and I guess that comes with a bunch of things I disagree with. But the removal of years of scientific research and data from the web (paid for by citizens with their taxes) is absolutely unacceptable. Ravaging CDC data, climate data, etc. is horrendous and unforgivable.
From the post: Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complete archive of federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov.
This is the first release in our new data vault project to preserve and authenticate vital public datasets for academic research, policymaking, and public use.
black_puppydog|1 year ago
lloeki|1 year ago
Maybe this?
> In addition to the data collection, we are releasing open source software and documentation for replicating our work and creating similar repositories. With these tools, we aim not only to preserve knowledge ourselves but also to empower others to save and access the data that matters to them.
https://github.com/harvard-lil/data-vault
And since the data lives here: https://source.coop/repositories/harvard-lil/gov-data/descri...
Combined with this:
> To download an individual dataset by name you can construct its URL, such as:
> https://source.coop/harvard-lil/gov-data/collections/data_go...
> https://source.coop/harvard-lil/gov-data/metadata/data_gov/f...
> To download large numbers of files, we recommend the aws or rclone command line tools:
> aws s3 cp s3://us-west-2.opendata.source.coop/harvard-lil/gov-data/collections/data_gov/<name>/v1.zip . --no-sign-request
So one could "easily" mirror the whole thing, making it distributed.
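For anyone wanting to script that, a rough sketch follows. The bucket and the `<name>/v1.zip` layout come from the quoted instructions; the helper names, the `example-dataset` name, and the idea of syncing the whole `data_gov` prefix in one go are my assumptions, not something the project documents. The functions only build `aws` CLI argument lists; nothing is downloaded until you pass one to `subprocess.run`. The hash helper is there so two mirrors can compare copies.

```python
import hashlib
from typing import List

# Taken from the quoted download instructions.
BUCKET = "us-west-2.opendata.source.coop"
PREFIX = "harvard-lil/gov-data/collections/data_gov"


def dataset_uri(name: str) -> str:
    """S3 URI of one dataset archive, following the quoted <name>/v1.zip layout."""
    return f"s3://{BUCKET}/{PREFIX}/{name}/v1.zip"


def copy_command(name: str, dest_dir: str = ".") -> List[str]:
    """Argument list for an unauthenticated `aws s3 cp` of a single dataset."""
    return ["aws", "s3", "cp", dataset_uri(name), dest_dir, "--no-sign-request"]


def sync_command(dest_dir: str) -> List[str]:
    """Argument list for `aws s3 sync` over the whole prefix (assumed workable)."""
    return ["aws", "s3", "sync", f"s3://{BUCKET}/{PREFIX}/", dest_dir,
            "--no-sign-request"]


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a local file, read in chunks so large zips fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be something like `subprocess.run(copy_command("example-dataset", "/srv/mirror"), check=True)`, then comparing `sha256_of(...)` against the value another mirror reports. At 16TB, a full sync is a bandwidth and storage question more than a scripting one.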
bjackman|1 year ago
tlb|1 year ago
Thorrez|1 year ago
rswail|1 year ago
https://fil.org/
headcanon|1 year ago
zombot|1 year ago
Yup, I was about to ask whether Trump could still force them to delete what he doesn't like. Time will tell, I guess.
EnnEmmEss|1 year ago
lisp2240|1 year ago
milesrout|1 year ago
cyberlimerence|1 year ago
[1] https://news.ycombinator.com/item?id=42970814
fs111|1 year ago
https://wiki.archiveteam.org/index.php/US_Government
Rebuff5007|1 year ago
jppope|1 year ago
p3rls|1 year ago
fnands|1 year ago
This is a topic that came up at work today, as we rely on this data and are considering backing up most of the Lidar data from there ourselves (100s of TB probably).
EDIT: no, looks like it is only the footprints
mindcrime|1 year ago
fredoliveira|1 year ago
zombot|1 year ago
"where they burn books, they will ultimately burn people as well."
Those who delete research will ultimately delete people as well.
https://en.wikiquote.org/wiki/Heinrich_Heine
frontalier|1 year ago
https://youtu.be/5RpPTRcz1no?t=1511
govideo|1 year ago