Soft deletion probably isn't worth it

[+] JohnBooty|3 years ago|reply

I've been a software dev since the 90s and at this point, I've learned to basically do things like audit trails and soft deletion by default, unless there's some reason not to.

Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah. It helps the business, it helps you as developer by giving you debug information as well as helping you to cover your ass when you are blamed for some data loss bug that was really user error.

Soft deletion has obvious drawbacks but is usually far less work than implementing equivalent functionality out-of-stream, with verbose logging or some such.

Retrofitting your app and adding soft deletion and audit trails after the fact is usually an order of magnitude more work. Can always add it pre-launch and leave it turned off.

If performance is a concern, this is usually something that can be mitigated. You can e.g. have a reaper job that runs daily and hard-deletes everything that was soft-deleted more than n days ago, or whatever.

[+] gmiller123456|3 years ago|reply

The author uses the "no one ever undeleted anything" as the primary justification. I think this is the part they miss. I've never undeleted a user either, but there have been many times I've gone back to look at something. Either a complaint finally gets around to me as to why the user wanted their account deleted (e.g. feature not working) and it helps to figure out why. Or they're returning and want things set up like they were. Or someone is taking over their roll and needs to be set up like the last person who's already gone.

Though you really shouldn't be relying on a database for an audit trail. It might help find some issues, but things actually used for security shouldn't be writable so easily.

[+] pradn|3 years ago|reply

Soft deletion is just one way to achieve undeletion. The author's proposed solution of moving the resource to another table works just as well. You can move it back to the non-deleted table to perform the undeletion. You can keep around these deleted objects as long as you want; they work as a subset of a proper audit trail. The cost of course is you have more tables, but that is less of a cost than having to add "deleted=False" predicates in all of your queries.

Also note, if you use a soft-deleted column, indexes need to be keyed by that column as well if you want to access non-deleted objects quickly. That's extra complexity.

[+] efsavage|3 years ago|reply

Also if people know that deletion is reversible, they're more likely to actually do it, which can keep things generally tidier.

I don't actually like using a "deleted" column, my standard table has a status column, and deleted is one of those states, along with active/pending/suspended/etc, as the needs dictate. This way I get soft deletes for basically free both in the schema, but also in the queries (which would generally default to active), so it's not really the spaghetti that the author discusses.

[+] rjzzleep|3 years ago|reply

In rails you get these things for free. What I don't get is why everyone rolls their own framework with node.js. It's basically 90s PHP all over again.

EDIT: Soft delete is a trivial piece of code when the framework has a well defined transaction system for its ORM. It's not really related to Rails per se. Your statement is extremely disingenuous, while trying to look smart. Audit trails _can_ be(but don't have to be) more complex, especially when the framework uses a lot of stored procedures to handle operations. But other than that these frameworks are specifically designed to REDUCE complexity of such operations, dependency costs - which are huge in node.js, specifically because you can mix and match anything into everything.

Node.js people tend to stitch together XSS solutions, random templating solutions based on their frontend work, even basic salting of auth passwords becomes unpredictable because you have 30 options on minimal auth libraries.

But yes nothing is ever free. If you want to use Rails you still have to learn ruby and the framework and a basic understand of how ActiveRecord builds queries if you want to be writing performant code. And the same applies to Laravel, Django, or whatever of the 50 patchwork node.js solutions you want to base your code on.

[+] jesseryoung|3 years ago|reply

I was getting ready to disagree with you - but then I tried to think of any time I've actually pushed code to production with the "DELETE" keyword in it. The problems that I've had to solve in my career very rarely call for deleting something.

"Soft deletion" and "audit trail" are technical terms we developers come up for solutions the business wants but maybe hasn't asked for yet. It's not really a soft deletion it's a "deactivate" or "hide". Likewise, it's not an audit trail it's a "history" or "undo". Most of the time your stakeholders and users actually want these features, but don't ask because they perceive this as more expensive to build then just a "delete" button.

[+] augustl|3 years ago|reply

This is why I don't understand why Datomic isn't more popular. Pretty much every system I've worked on never needed to scale past 100s of writes per second due to hard limits on the system (internal backoffice stuff, fundamenally scoped/shardable to defined regions, etc etc). And since Datomic is built with that in mind, you get the trade-off of full history, first class transactions and being able to query for things like "who changes this attribute to its current value, when, and why" is such as super power!

[+] wnevets|3 years ago|reply

This is something that I was forced to learn the hard way more than once. Literally today I needed to undelete a record because a customer was confused by what the "delete" button did and wanted their record back.

[+] corrral|3 years ago|reply

I'd add tagging, for anything that could conceivably use it, when you're doing DB design. May as well start with support, even if the functionality's initially dormant. Someone will ask for it, directly or indirectly, and it won't take long before they do.

[+] nicoburns|3 years ago|reply

+1 on audit trails. And one should always store audit trails in machine readable format. That way you can not only manually inspect what happened, but you can query it too (and reconstruct the entire state as it existed in the past if necessary).

[+] aboodman|3 years ago|reply

  > Somebody always wants to undelete something
  > or examine it to see why it was deleted
  > or see who changed something, or blah blah blah

People for whom this resonates should look into Dolt: https://dolthub.com/.

It's a mysql-compat database that is versioned and forkable - basically imagine git and mysql had a baby. Every transaction creates an entry in the commit log that lets you see when, how, and why the database changed. And just like git you can `dolt checkout <hash>` to see what the data was like at some point in time, or `dolt checkout -b temp-branch <hash>` to make exploratory changes to some historical version, or `dolt revert` to revert a transaction... etc.

There is a lot more power that comes with making the entire database versioned and forkable by default. For example it makes it much easier to recover from catastrophic bad code pushes, etc.

note: Dolt was forked from Noms, my previous project, but I don't work for Dolt or have a stake. Just a fan.

[+] makeitdouble|3 years ago|reply

> unless there's some reason not to.

Yes.

To note, with GDPR there's now legal reasons to do so regarding user personal data. That can be the moment the devs realize they actually can't delete the data, because they soft deleted for so long, many relations are now interlocked and the data model needs to be changed to give a starting point to the deletion cascade.

My lesson from that was to at least have one test deleting a mock user that spans the maximum data breadth of the service . We caught a bunch of these loops in test at dev time and that was pretty great.

[+] cal85|3 years ago|reply

> Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah.

In my experience this happens “rarely”, not “always”.

It can happen, and in some ultra-rare cases the impact of not being able to recover some data might be huge (company-ending, even), and engineers are good at worrying about such edge cases. That’s why we habe protective measures like soft deleting and event sourcing - because of nightmare edge cases, not because we are always having to actually use them. It’s driven by engineers avoiding their worst nightmare: having to say “I’m sorry, I cannot solve this problem for you. The data is gone.” It’s a peace-of-mind thing, not an everyday-need thing.

[+] femiagbabiaka|3 years ago|reply

Having multiple use cases for data is normal and okay. Treating your primary data store that is designed for one set of use cases and using it for them all is very very bad, even before you hit scale. Know which data store to use and when for a use case is a super power than can allow you to scale to much higher magnitudes than one would intuit.

[+] dcow|3 years ago|reply

What's wrong with the author's proposed compromise/solution?

[+] ReptileMan|3 years ago|reply

The proper thing is database snapshots or something like that.

I don't think soft delete is wrong per se, but it is something that should be native to the database engine.

[+] unknown|3 years ago|reply

[deleted]

[+] cpursley|3 years ago|reply

Can you recommend any best practices, design patterns, etc - for implementing soft delete in SQL systems?

[+] brightball|3 years ago|reply

Agreed. The key to soft deletes is to actually move the record out of the original table to ensure they don’t accidentally end up in a query join.

But there’s always something that needs to be undeleted. You can either have an easy way to do it or restore an entire DB backup and cross query the missing records. Soft deletes are a lot easier.

[+] 4m1rk|3 years ago|reply

It's audit log enough for that? As deletion should also be logged.

[+] xxs|3 years ago|reply

GDPR compliance and all - we encrypt and delete stuff... or at the very least "lose" encryption keys 1st.

[+] pg_1234|3 years ago|reply

This 100%

[+] augustiine|3 years ago|reply

[deleted]

[+] dafelst|3 years ago|reply

Views are a simple solution to this problem. Pretty much all moderns RDBMSs support updatable views, so creating views over your tables with a simple WHERE deleted_at IS NULL solves the majority of the author's problems, including (IIRC) foreign key issues, assuming the deletes are done appropriately.

I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against. Sure, concerns about splitting logic between the DB and app layers are valid, but there are fairly well developed techniques for keeping DB and app states, logic and schemas aligned via migrations and partitioning and whatnot.

[+] pbardea|3 years ago|reply

> assuming the deletes are done appropriately

This is one gripe I have with soft-deletion. Since I can no longer rely on ON DELETE CASCADE relationships, I need to re-defined these relationship between objects at the application layer. This gets more and more difficult as relationships between objects increase.

If the goal is to keep a history of all records for compliance reasons or "just in case", I tend to prefer a CDC stream into a separate historical system of record.

[+] semiquaver|3 years ago|reply

The main problem with views for this use case in practice is that they ossify your schema. Views and matviews are effectively a dependency tree, and many common types of schema evolution become substantially more difficult when the system forces you to wrap your DDL in a series of view drop/recreation steps.

This is merely annoying when dealing with regular views because recreating even a large number of views is fast, but can be catastrophic if you have any matviews in your table dependency tree. A matview can easily turn what should be an instantaneous DDL operation into a partial outage while the matview is being regenerated.

(this is all postgres specific, it may be untrue for other systems)

[+] rodw|3 years ago|reply

Seriously. That "Downsides: Code leakage" point is nonsensical.

``` CREATE OR REPLACE VIEW active_customer AS SELECT * FROM customer WHERE deleted_at IS NULL OR deleted_at <= NOW() ; ```

There, I fixed it.

Just use `active_customer` instead of `customer ... deleted_at IS NULL`.

In fact, since the deleted_at column is a timestamp, the original "leakage" query:

``` SELECT * FROM customer WHERE id = @id AND deleted_at IS NULL; ```

is actually broken. A non-null `deleted_at` timestamp that's in the future implies the record hasn't been deleted yet, right?

I've often had junior devs assert that views are some kind of code smell, but these sorts of "canned query/filter that you want to apply very very often" seem like the perfect use case for a view to me. It's DRY, and the fact that your standard "query" is in the database" means you can change it more readily than trying to make sure you hit all the points it might be embedded in the application code.

> I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against

Early-ish in the JDBC days a senior dev I was working with at the time (as a junior dev myself) made a pretty good case that "the database is part of the application" that's always stuck with me. Full database independence via software level abstractions is a pretty silly goal outside of library code. If you have a service that makes extensive use of the database, don't throw away the database features in the interest of some abstract "we could swap out oracle with mysql without changing anything" objective. If you want it to be generic, use the SQL standard, but don't be afraid to have a few db-specific bits in the app code if that's a subsystem you might replace once a decade or something.

I blame the DBA/Dev divide for a lot of this. A lot of the impedance between these layers is social/procedural. If you can change the DB as easily as the code, there's a lot less fear of using the right tool for the specific job.

[+] aeyes|3 years ago|reply

At least in Postgres, having a huge amount of "dead" data in large tables is problematic because vacuum always has to read the full data set.

Even with conditional indexes where you exclude deleted data you take a significant performance hit reading dead blocks because there is no way to quickly vacuum them. You accumulate hours of bloat until your vacuum finishes.

You can't beat a separate insert only archive table which you never have to vacuum.

[+] jaydub|3 years ago|reply

Based on my experience, I like the author's approach since it makes things pretty clear-cut and optimized the storage in the core table (in my experience as well, deletes happen frequently and the soft deletes are rarely touched). In large, row-oriented tables that that storage can add up and even with views/materialized views there's a cost to using/maintaining those as well.

[+] SoftTalker|3 years ago|reply

A problem (unless something has changed, my context is Oracle from some time ago) is that NULL values are not indexed. So the "WHERE deleted_at IS NULL" could trigger a full table scan. It can also cause row migration when the NULL value is eventually filled in. Unless you explicitly need the deleted date, it's probably better to use a non-nullable Y/N for this.

[+] firloop|3 years ago|reply

Views can really bite you performance wise, at least with Postgres. If you add a WHERE against a query on a view, Postgres (edit: often) won't merge in your queries' predicates with the predicates of the view, often leading to large table scans.

[+] 5e92cb50239222b|3 years ago|reply

This is one of those situations where a good ORM can simplify things greatly. For example, with EF Core you can add a global filter which will filter out soft-deleted rows in all queries automatically (unless you add .IgnoreQueryFilters()).

It couples nicely with some hackery which turns removes into soft-deletes. You can remove objects as usual and they get soft-deleted in the database.

I've used this in a few projects and it's fantastic.

https://docs.microsoft.com/en-us/ef/core/querying/filters

https://www.thereformedprogrammer.net/ef-core-in-depth-soft-...

[+] giantg2|3 years ago|reply

"The concept behind soft deletion is to make deletion safer, and reversible."

That's one part. The other part is that in many industries you have regulatory data retention and audit requirements. This is arguably the most valuable and common reason to perform Logical deletes.

[+] danielrhodes|3 years ago|reply

In a previous place I worked, we were programmatically using Box to store files. One day we were presented with a case study in Murphy's Law: a script went awry and deleted everything (10s of thousands of files). There was no clear way to recover these files, they were gone from what we could see. It was a disaster. We got a Box support person on the phone and described what had happened. There was a pause, some mouse clicking and then: "Ok, those files will be back in your account in an hour."

It was 100% our fault. But soft deletes saved us that day. If you're in a situation where you or your customers could benefit from the same, it's wise to not only embrace them but also make sure they work.

[+] dfee|3 years ago|reply

My experience is that soft-deletes are blunt tools bridging the gap between hard deletes and event sourcing (capturing all the changes against the table, in a replay-worthy stream).

Event sourcing is hard – because the engineers responsible for setting it up and managing it aren't generally well skilled in this domain (myself included) and there aren't a wealth of great tools helping engineers find their way into the pit of success.

The downsides of soft-deletes (as identified in the article) are numerous. The biggest problem is that it appears "simple" at first blush (just add a deleted_at column!), but it rots your data model from the inside out.

[+] scott_w|3 years ago|reply

The example the author gives is… frankly awful.

I can’t think of a single case where you’d want to remove the invoices of a customer you delete. Ever. In fact, the opposite is more likely to be a big problem, accidentally cascading your delete to your financial records!

Using a soft delete, your invoices won’t “disappear” because your app WILL have a view for looking at just the invoices.

Source: I built a bookkeeping system and soft deletes is a necessary feature.

[+] lowercased|3 years ago|reply

"Instead, we rolled forward by creating a new app, and helping them copy environment and data from the deleted app to it. So even where soft deletion was theoretically most useful, we still didn’t use it."

But... weren't you using all those env and data info from the soft-deleted set?

I've typically been using soft-deletes for most projects for years. People have accidentally deleted records, and having a process to undelete them - manually or giving them a screen to review/restore - has usually been great.

Yes, if there's a lot of related artefacts not in the database (files/etc) that were literally deleted, you may not be able to get them back. But that's an ever greater edge case in projects I work in as to not be a huge issue. We probably have some files in a backup somewhere, if it's recent. Trying to 'undelete' a record from years ago - yeah, likely ain't gonna happen.

People are used to 'undo' and 'undelete'. Soft-deletes are one way to provide that functionality for some projects.

[+] vivegi|3 years ago|reply

If you do want to retain the deleted records for any purpose (audit, compliance etc.,) it is better to design a DELETED table to maintain the history (just as suggested in the article towards the end).

Once your main tables start getting to the order of tens of millions of records, the filtering by 'deleted_at is NULL' or 'deleted_at is NOT NULL' gets in the way of query performance.

NULL is also not indexed. So, that throws the spanner in the works sometimes (depending on the query).

[+] Gurgler|3 years ago|reply

There's a very legitimate case that I've seen made for soft-deletion in several different situations: foreign keys related to "created-by" columns. Hard-deleting a user who created an object that remains in use after they're gone would trigger referential integrity complaints on those columns. Without being able to reference a "deactivated" user's primary key in such a situation, you'd have to come up with some counterintuitive system for revisiting such objects. And the result (short of removing the foreign key) would be to give you inaccurate information about who created the object. Maybe one of you smarter people has already thought of an elegant way to handle this, but I've never seen one that satisfies my taste.

[+] jelkand|3 years ago|reply

Soft deletion is certainly very situationally worth it. I've found the most value when 1. it is well supported at the ORM layer and 2. business requirements dictate strong auditability of data. While I have undeleted items on occasion, I've used soft deletes more frequently to debug and build a timeline of events around the data.

For context, I've worked in fintech where I often needed to review backoffice approvals, transactions, offers, etc.

[+] munk-a|3 years ago|reply

I just wanted to touch on the fact that eliding soft-deleted rows from queries is really, really easy - this article makes it out to be a constant headache but here's my suggested approach.

    ALTER TABLE blah ADD COLUMN deleted_at NULL TIMESTAMP;
    ALTER TABLE blah RENAME TO blahwithdeleted;
    CREATE VIEW blah (SELECT * FROM blahwithdeleted WHERE deleted_at IS NULL);

And thus your entire application just needs to keep SELECTing from blah while only a few select pieces of code related to undeleting things (or generating reports including deleted things) need to be shifted to read from blahwithdeleted.

[+] deerIRL|3 years ago|reply

As someone who has done development work with Class A data and specifically in the realm of justice, soft deletes aren't simply a good idea, they are required by law.

Most of these downsides are easily mitigatable issues as well. As many users have stated, something like views solves the issue of forgetting the 'deleted' clause.

Lastly, I'm not sure the issue with foreign keys/stray records really resonates with me. I'd be hard pressed to be comfortable allowing a developer or DBA who isn't fully comfortable with the data model to be hard deleting records, let alone flagging them as soft deleted.

[+] GartzenDeHaes|3 years ago|reply

Personally I like no delete designs, which give you a full audit history of changes. This is similar to generally accepted accounting principles. https://en.wikipedia.org/wiki/Generally_Accepted_Accounting_...

[+] vyrotek|3 years ago|reply

I've found SQL Server Temporal Tables are a good alternative to get the benefits of soft-deletes without some of the drawbacks.

https://docs.microsoft.com/en-us/sql/relational-databases/ta...

[+] adrianmsmith|3 years ago|reply

> Instead of keeping deleted data in the same tables from which it was deleted from, there can be a new relation specifically for storing all deleted data

The disadvantage of this is that if you ever do want to access this "deleted" data, e.g. in admin or compliance tools, you now have to do it in two different ways, one way for the main data and a different way in case the data has been "deleted".

The article asserts you'll never need to "undelete" the data. So they're presenting a solution with that assumption, fair enough. Without that assumption, however, moving the data back from an archive table becomes a pain, and if there are any unique constraints e.g. on username or email address, you'll have a problem if you've moved the data out of the main table and another user has used that username or email address.

[+] willlll|3 years ago|reply

For the control plane part of Crunchy Bridge, on day one I decided to go with the deleted_records table that is mentioned at the end of this post. It's been great. No need to keep around dead data that no one ever looks at.

We don't need to have `where deleted_at is null` on every single query. But the best part though is our actual working data set of records we actually care about is tiny compared to the deleted cruft that would have otherwise been just sticking around forever. Backups and restores take no time at all. It's really cool that postgres lets you have conditional indexes on things, but it's even cooler not to need them.

[+] unknown|3 years ago|reply

[deleted]

[+] jonstaab|3 years ago|reply

If you implement soft delete, you should surface it to your user. That's who is accidentally deleting things, and that's who will want to un-delete them. As for side effects like spinning up/down servers, build that into your data model (of course, in a case like Heroku's that can be prohibitively expensive, so don't).

Source: I write back of house software for resale store owners, and accidental deletes happen occasionally. Being able to restore things instills a lot of confidence for our customers.

[+] radu_floricica|3 years ago|reply

Is nobody using log tables? Pretty much every time I touch something in my db, there's a log call that records who did it, when, IP, URL and a (JSON) snapshot of the changed record, which in a pinch can be used for undelete.

It's surprisingly manageable. I mean, yes, it's definitely the largest table in the db, but:

1. it's well worth it

2. most of the stuff in it isn't the main scenario above (a human does something and I record the change) but various automated processes I also want to track, like API calls. which leads to:

3. it's easy to prune - both in time period kept, and by selectively deleting the automated stuff earlier

But it mostly helps by localizing things. It's just one meta-data log table, and everything related to logging actions is there. Not very elegant to keep adding fluff fields to every table, like "add_date" or "deleted_at". When I decided I want to also track the URL of the request I had to change things in just one place, and now I have it for every action everywhere.

Note: don't fall into the "everything is a nail" mistake. Some other dedicated log tables may be necessary, for high-volume or distinct stuff. I also have a mail_log, a sms_log and a separate table for events coming from mobile users (like location history).

[+] habibur|3 years ago|reply

Which is why I don't add that extra deleted field. Rather duplicate all the tables into a new database called "archive" and then insert there before deleting from main.

That works for updates too, by preserving the old data and showing you a time machine like backlog. But the archive database gets too large over time and you need to purge it periodically. You can create some delete triggers for automating this "save before delete" behavior.

494 comments