I don't get why none of the big RDBMSes (PostgreSQL, MariaDB/MySQL, SQL Server, Oracle, ...) seem to have built-in support for soft deletes, front and center? Something like:
CREATE TABLE ... WITH SOFT DELETES
Where a regular DELETE wouldn't actually get rid of the data; instead you could still query the deleted records, probably with timestamps for everything as a built-in, low-level feature, rather than handling this with a bunch of ORMs and having to remember to put AND deleted_at IS NULL in all of your custom views.
If we like to talk about in-database processing so much, why not put the genuinely common features in the DB, so that toggling them on or off doesn't take a bunch of application code changes, and so that you could even add soft deletes to a legacy app that knows nothing of the concept, on a per-table basis or whatever.
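Today you end up hand-rolling roughly this on every table that needs it (a PostgreSQL-flavoured sketch; the products table and its columns are just invented examples):
    CREATE TABLE products (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name       text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now(),
        deleted_at timestamptz  -- NULL means the row is live
    );
    -- Every "normal" query has to remember the filter...
    SELECT * FROM products WHERE deleted_at IS NULL;
    -- ...a partial index keeps the common live-rows case fast...
    CREATE INDEX products_live_idx ON products (name) WHERE deleted_at IS NULL;
    -- ...and DELETE is really an UPDATE.
    UPDATE products SET deleted_at = now() WHERE id = 42;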
Maybe my intuition is wrong, but to me this sounds like a violation of the principle of least power.
"Soft deletes" is just a name for a regular write operation, with specific semantics.
Adding a layer of magic to the DB for this doesn't seem right to me.
And applications could have many different requirements for soft deletes, like the article points out. For example, the simplest version would be just a boolean "deleted" column, but it could also be "deleted_at", "deleted_by", etc.
All of these cases require a bunch of code changes anyway, and the more complex ones could interfere with an implementation of this feature at the database level: such a transparent implementation couldn't access app-specific concerns such as user data, for example.
Adding soft deletes to a legacy app that knows nothing about it would only work for a boolean flag and maybe a date-time value, unless the DBMS also offered triggers for soft deletes, etc.?
Seems to me that this capability would make a DBMS much more complicated.
> Adding soft deletes to a legacy app
As an aside, I've never found this to be worth it, since you have to change too much and re-test everything for minimal user benefit and time savings. The effort is far worse if the code is not great in the first place. It can be a great decision to make before everything is written.
Maybe it's worth it for files which are hard to reproduce, but you can also rely on DB backups to get those back. If people are regularly deleting things they're not supposed to, you're better off removing the user-facing delete actions, limiting the action to specific users, etc.
I don't think it's all magic, because you have to implement soft deletes in your application layer anyway, and it would be nice to have a little DB support for it. It doesn't have to be some big thing; just having SELECTs and such work only on undeleted rows by default, unless you ask for the deleted ones, would be nice, so it doesn't pollute your code and you don't have to always remember to point the gun away from your foot.
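You can approximate that today in e.g. PostgreSQL with a view plus an INSTEAD OF trigger, though it's per-table boilerplate rather than a built-in feature (sketch only, invented names):
    -- Hide the real table behind a view that only shows live rows.
    CREATE TABLE items_raw (
        id         bigint PRIMARY KEY,
        name       text NOT NULL,
        deleted_at timestamptz
    );
    CREATE VIEW items AS
        SELECT id, name FROM items_raw WHERE deleted_at IS NULL;
    -- Turn DELETE against the view into a soft delete.
    CREATE FUNCTION items_soft_delete() RETURNS trigger AS $$
    BEGIN
        UPDATE items_raw SET deleted_at = now() WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;
    CREATE TRIGGER items_delete
        INSTEAD OF DELETE ON items
        FOR EACH ROW EXECUTE FUNCTION items_soft_delete();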
I'd argue that what SQL needs is better facilities for code reuse, metaprogramming and such; it ought to give you the tools to extend the language yourself, so that you could add
... WITH SOFT UPDATES
and it adds to the table definition, as well as to the schema, something that will cause subsequent statements to be rewritten. There's a lot of interesting logic (in the literal sense) in SQL that is hidden by a strange, irregular syntax and is more obvious in other approaches to databases such as Datalog. I think it was little appreciated outside the hardcore semantic web community that you could compile SPARQL + OWL to SQL and get powerful inference facilities. SQL is a great target for metaprogramming precisely because it is not Turing complete and a code generator doesn't have to think at all about the order in which events are sequenced. It's kinda sad that metaprogramming tools for SQL are almost all pre-Chomsky and pre-dragon-book internal DSLs like jOOQ and SQLAlchemy, which have their charms (jOOQ's excellent integration with Java IDEs) but fall short of what could be done with SQL-to-SQL and SQL-to-X transformations.
Stored procedures are great, but many shops don't use them for various reasons. It doesn't help that they look like a mix of FORTRAN and COBOL, and also come in a few variations, from the (better) set-based PL/SQL of Oracle to the (worse) Transact-SQL-style stored procs of Microsoft SQL Server and PostgreSQL. The other day I talked with Krisztián Szabó of https://schemamap.io/ who developed a compiler that writes stored procs that do database synchronization.
On the other hand, if you've got access to the internals of the frickin' database I think you can do something better than the ordinary application level soft updates. For instance a "customer record" might very well be not just a row in one table but maybe 15 rows in four tables that are inserted in a transaction and you want to be able to undelete them as a unit.
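A hand-rolled sketch of that unit-level soft delete, assuming a customer is spread across several tables (all table and column names invented): tag every affected row with the same deletion id inside one transaction.
    CREATE TABLE deletions (
        id         bigserial PRIMARY KEY,
        deleted_at timestamptz NOT NULL DEFAULT now(),
        deleted_by text NOT NULL
    );
    -- Each table making up the "customer record" gets the same nullable marker.
    ALTER TABLE customers ADD COLUMN deletion_id bigint REFERENCES deletions (id);
    ALTER TABLE addresses ADD COLUMN deletion_id bigint REFERENCES deletions (id);
    ALTER TABLE orders    ADD COLUMN deletion_id bigint REFERENCES deletions (id);
    BEGIN;
    INSERT INTO deletions (deleted_by) VALUES ('support@example.com');
    UPDATE customers SET deletion_id = currval('deletions_id_seq') WHERE id = 42;
    UPDATE addresses SET deletion_id = currval('deletions_id_seq') WHERE customer_id = 42;
    UPDATE orders    SET deletion_id = currval('deletions_id_seq') WHERE customer_id = 42;
    COMMIT;
    -- Undeleting the unit is then one UPDATE per table:
    --   UPDATE customers SET deletion_id = NULL WHERE deletion_id = <id>;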
Because it's too dependent on business logic.
Different products will handle soft deletes differently. Which queries need to include soft-deleted rows and which don't? What about different levels of soft deletes, e.g. done by the user (can be undone by the user) vs. done by an admin (can't be undone by the user)?
Implementing soft deletes yourself isn't hard. Yes you'll have to make a bunch of decisions about how they work in every circumstance, but that's the point.
It's just not bothersome enough to deviate from the standard.
If they did this, nobody would use it. They do lots of more useful things that people don't use because it's not portable.
There's a sibling comment about temporal databases. Those solve a very bothersome problem, so a few people use them. That means that there's a chance soft deletes get adopted as a side effect of a much more complex standard.
Temporal tables in SQL Server fit this use-case[0], I think.
0: https://learn.microsoft.com/en-us/sql/relational-databases/t...
Available on Postgres as an extension. It's a bit jank and doesn't have language-integrated clauses like SQL Server.
These are not decisions that should be taken solely by whoever is programming the backend.
They need to be surfaced to the product owner to decide. There may very well be reasons pieces of data should not be stored. And all of this adds complexity, more things to go wrong.
If the product owner wants to start tracking every change and by who, that can completely change your database requirements.
So have that conversation properly. Then decide either that it's not worth it and don't add any of these "extra" fields you "might" need, or decide that it is and fully spec it out, including how much additional time and effort it will take to do it as a proper feature. But don't do it as some half-built just-in-case "favor" to a future programmer who may very well have to rip it out.
On a personal project, do whatever you want. But on something professional, this stuff needs to be specced out and accounted for. This isn't a programming decision, it's a product decision.
This entirely depends on the company culture. I've worked in teams where every small decision is in the hands of the PO, and I've worked in teams where a software engineer is a respected professional enabled to make their own technical decisions. I found the second option creates higher-quality software faster.
Also, not sure what you mean by additional effort? created_at, updated_at and soft deletes are part of most proper frameworks. In Spring all you need is an annotation; I've been using those in major projects and the implementation cost is a few seconds, with zero maintenance effort in years of development so far. At least those fields are solved problems.
Some things are trivial and nearly free - created_at, updated_at. I don't think engineers need to bring trivialities like this to a "product owner". Own your craft.
When the product you're developing is governed by regulations and standards you need to comply with, owning your craft means doing things by the book, not adding fields on your own because they might be useful later.
So what? I've worked at places with lots of regulation. Part of every development job is learning the product domain. In that case devs become comfortable reading standards/laws/regulations and anticipating when the software implementation might interact with the areas they cover.
Sure, there were people whose job was to offload as much compliance work from everyone else as possible, by turning it into internal requirements, participating in design discussions and specializing in ensuring compliance. But trying to isolate the development team from it is just asking for micromanagers.
> So what?
Think before you act. The machine has no brain. Use yours.
> Part of every development job is learning the product domain.
Yes.
> In that case devs become comfortable reading standards/laws/regulations and anticipating when the software implementation might interact with the areas they cover.
This is what I'm saying, too. A developer needs to think about whether what they're doing is OK under the regulations they're working against. They need to ask for permission by asking themselves: "wait, is this OK under the regulation I'm trying to comply with?"
> But trying to isolate the development team from it is just asking for micromanagers.
Nope, I'm all for taking initiative, and against micromanagement. However, I'm also against the "I need no permission because I'm doing something amazing" attitude. So own your craft, "code responsibly".
Oh, I thought you were disagreeing with hamandcheese's point that every little decision doesn't need to go through a product owner before anything happens.
Although those can be more complicated, and it should be clear what they're for and why they exist. Will this result in an object having an updated_at timestamp elsewhere in a larger application? Is it clear which properties that refers to?
I think the tricky part lies in knowing which things can be done without consulting any product owner. I agree that created_at and updated_at don't cause any harm. deleted_at, on the other hand, cannot be decided by engineers alone (mainly for GDPR reasons: if something is expected to be totally deleted, then that must be it). As usual, these kinds of things are obvious to engineers with years of experience, not so much to newcomers.
I've seen product owners who get blindsided every time by this sort of thing.
On the other hand, in some shops there is a dedicated DBA who is in charge of database schemas and possibly everything else. Before it became fashionable to create a "service layer" where people access the database (now database(s)) through web endpoints, some organizations would put all the database access into a set of stored procedures managed by the DBA. Maybe that's extreme, but in the real world product owners come and go, while the database is forever and deserves to have somebody speaking out for its interests.
> And all of this adds complexity, more things to go wrong
That's a little vague given this specific example, which appears to be about maintaining some form of informative logging; though I don't think it necessarily needs to be in the form of a DB table.
The problem with updated_at and updated_by is that a given record could experience multiple updates by multiple people at multiple times, and you'd only have visibility into the most recent.
The logical conclusion here is to log the updates (and creations and deletions and undeletions and such) themselves:
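For example, something along these lines (a sketch; the exact columns are assumptions):
    -- One row per action; the table is only ever appended to.
    CREATE TABLE foo_log (
        id         bigserial   PRIMARY KEY,
        foo_id     bigint      NOT NULL,
        action     text        NOT NULL,  -- 'create' | 'update' | 'delete' | 'undelete'
        data       jsonb       NOT NULL,  -- full row state as of this action
        actor_id   bigint      NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );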
Technically you don't even need the "foo" table anymore, since you can reconstruct its contents by pulling the most recent transaction for a given foo_id and discarding the reconstructed record if the most recent action on it was a deletion. Probably still a good idea to create a view or somesuch for the sake of convenience, but the point of this is that the log itself becomes the record of truth - and while this approach does cost some disk space (due to duplicated data) and read performance (due to the more complex query involved), it's invaluable for tracking down a record's full lifecycle. Even better if you can enforce append-only access to that table.
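In PostgreSQL, for instance, the "latest entry per foo_id wins, drop deletions" reconstruction could be a view over the hypothetical foo_log above:
    CREATE VIEW foo_current AS
    SELECT *
    FROM (
        SELECT DISTINCT ON (foo_id) *      -- newest log entry per foo_id
        FROM foo_log
        ORDER BY foo_id, created_at DESC
    ) latest
    WHERE latest.action <> 'delete';       -- drop records whose last action was a delete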
This is a pretty typical approach for things like bookkeeping and inventory management (though usually those are tracking the deltas between the old and new states, instead of recording the updated states directly as the above example would imply).
While I like the YAGRI principle very much, I find that adding
- updated_at
- deleted_at (soft deletes)
- created_by etc
- permission used during CRUD
to every table is a weaker solution than having a separate audit log table.
I feel that mixing audit fields with transactional data in the same table is a violation of the separation-of-concerns principle. In the proposed solution, updated_at captures only the last change, a problem that a separate audit log table doesn't have.
An audit log table often takes a huge amount of space compared to simple fields on the records, so there are tradeoffs. Which solution is best depends on how important change logs are.
I kinda agree, but don’t underestimate the power of having things where people are looking.
Put your documentation in doc strings where the function is defined - don’t have a separate file in a separate folder for that. It might separate concerns, but no one is looking there.
Similarly if those fields aren’t nullable, someone trying to add new rows will have to fill in something for those metadata fields - and that something will now very likely be what’s needed, rather than not pushing anything to the audit table.
Obviously your app can outgrow these simple columns, but you’re getting value now.
Event sourcing also works great. You don't need an audit log per se if you already track a history of all commands that introduced changes to your system.
Event sourcing and "the right to be forgotten" are not always easy to marry.
> Event sourcing and "the right to be forgotten" are not always easy to marry.
The absolute basics are to support snapshots and event replay. This is hardly rocket science.
If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
Let's assume we want to remove every message related to user A.
A photo by user B got to be the best of the day because it collected the most upvotes. Without A's vote, it's no longer so. The photo also got to become the best of the month, because it was later voted the top entry among the best-of-the-day entries, and received a prize. Should we now replay the message stream without A's upvote, things are going to end up radically different, or end up in a processing error.
User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
One way is to ignore the inconsistencies; it deprives you of most of the benefits of event sourcing.
Another way is anonymizing: replace messages about user A with messages about some null user, representing the removed users. This can lead to more paradoxes and message replay inconsistencies.
> If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
That's not how snapshots work. You record the state of your system at a point in time, and then you keep all events that occurred after that point. This means you retain the ability to rebuild the current state from that snapshot by replaying all events. I.e., event sourcing's happy flow.
> User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
Not really. That's just your best attempt at reasoning about how the system could work. In the meantime, depending on whether you have a hard requirement to retain messages from removed users, you can either keep them assigned to a deleted user or replace them with deleted messages. This is not a problem caused by event sourcing; it's a problem caused by failing to design a system that meets its requirements.
Yep. But Event Sourcing comes with its own set of other problems.
Also, I don't think this would apply to OP's post: with Event Sourcing you would not even have those DB tables.
The DB tables suggested by OP are akin to snapshots, whereas each event would require a separate data store. OP is trying to shoehorn event history into the snapshots, which hardly makes any sense.
*_at and *_by fields in SQL are just denormalization + pruning patterns consolidated, right?
Do the long walk:
Make the schema fully auditable (one record per edit) and the tables normalized (it will feel weird). Then suffer with it, and discover that normalization leads to a performance decrease.
Then discover that pruned auditing records are a good middle ground. Just the last edit and by whom is often enough (ominous foreshadowing).
Fail miserably by discovering that a single missing auditing record can cost a lot.
Blame database engines for making you choose. Adopt an experimental database with full auditing history. Maybe do incremental backups. Maybe both, since you have grown paranoid by now.
Discover that it is not enough again. Find that no silver bullet exists for auditing.
Now you can make a conscious choice about it. Then you won't need acronyms to remember stuff!
Fair enough, but now your application is relying on 100% uptime of AWS and S3 and no network failures in between. And what happens if your transaction goes through, but the request to AWS doesn’t? What happens if another operation mutates the target meanwhile before you can retry with current state? Your app is also slowing down since it needs to send the events to S3 and guarantee they got there. Now you are reinventing two-stage commits. Unless you aren’t actually making an audit log and don’t care if events are guaranteed to be logged?
So like OP said, no silver bullets exist for auditing.
Correct. This is a system design problem. You want this to be transactional and work at scale? That might be hard to achieve. Maybe if the data can be partitioned, then each node handles its own auditing in a table as part of the transaction. There are many possibilities. Allowing inconsistency might be OK too, depending on what is required.
It is interesting thinking about record changes as a spectrum towards application logs. At some point too much detail is expensive to store, and you must adopt an archival strategy.
If you see it from the pure SQL point of view, you are at the "blame database engines and adopt an experimental solution" stage.
It is the point where you give up modeling the audit as part of the systems tables.
The drawbacks of this choice are often related to retrieval. It depends on the engine.
I once maintained a system that kept a fully working log-replicated instance delayed by 24h, ready for retrieval queries, in addition to regular disk backups (slow, costly retrieval).
I am more developer than DBA, so I can probably speak more about modeling solutions than infra-centric solutions.
The problem with this is that the audit log is only at the CRUD level, which is often too low. Ambiguities can arise. For example, if the question is "who published the article", do you look for a create, or do you look for an update with published=true? It's even worse when you consider that the schema can change over time, so both can be correct but at different points in time. Event sourcing is the way to go if you want to capture business-level events.
Kind of. The WAL in Postgres is effectively an event log, and many people keep replicas of it for backup reasons, which is auditable, kind of; meaning that an EDA/event-sourcing setup is just a shinier version of that?
Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when. Your main table’s created and updated times can be a function of (or a complement to) the edit table.
It is tempting to supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value). This is getting dangerously and excitingly close to a graph database implemented in a relational database! Implement one at your peril — what you gain in schemaless freedom you also lose in terms of having the underlying database engine no longer enforcing consistency on your behalf.
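For concreteness, the shape being warned about would look something like this (a sketch; column types are assumptions):
    -- The "supernormalized" object/edit pair from the comment above.
    CREATE TABLE object (
        id   bigserial PRIMARY KEY,
        type text NOT NULL
    );
    CREATE TABLE edit (
        time      timestamptz NOT NULL DEFAULT now(),
        actor_id  bigint NOT NULL,
        object_id bigint NOT NULL REFERENCES object (id),
        key       text   NOT NULL,
        value     text
    );
    -- Flexible, but the engine can no longer enforce per-type constraints:
    -- every "column" is now a (key, value) row with no types or NOT NULLs.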
> This is getting dangerously and excitingly close to a graph database implemented in a relational database!
This feels like a great unresolved tension in database / backend design - or maybe I'm just not sophisticated enough to notice the solutions?
Is the solution event sourcing and using the relational database as a "read model" only? Is that where the truly sophisticated application developers are at? Is it really overkill for everybody not working in finance? Or is there just not a framework that's made it super easy yet?
Users demand flexible schemas - should we tell them no?
> Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when.
Aren't you describing a non-functional approach to event sourcing? I mean, if the whole point of your system is to track events that caused changes, why isn't your system built around handling events that cause changes?
> supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value)
I frankly hate this sort of thing whenever I see it. Software engineers have a tendency to optimize for the wrong things.
Generic relations reduce the number of tables in the database. But who cares about the number of tables in the database? Are we paying per table? Optimize for the data model actually being understandable and consistently enforced (+ bonus points for ease of querying).
One thing I do quite frequently which is related to this (and possibly is a pattern in rails) is to use times in place of Booleans.
So is_deleted would contain a timestamp to represent the deleted_at time for example. This means you can store more information for a small marginal cost. It helps that rails will automatically let you use it as a Boolean and will interpret a timestamp as true.
I consider booleans a code smell. It's not a bug, but it's a suggestion that I'm considering something wrong. I will probably want to replace it with something more meaningful in the future. It might be an enum, a subclass, a timestamp, refactoring, or millions of other things, but the Boolean was probably the wrong thing to do even if I don't know it yet.
The way I think about it: a boolean is usually an answer to a question about the state, not the state itself.
A light switch doesn't have an atomic state, it has a range of motion. The answer to the question "is the switch on?" is a boolean answer to a question whose input state is a range (e.g. is distance between contacts <= epsilon).
This seems at first like a controversial idea, but the more I think about it the more I like this thought technology. Merely the idea of asking myself if there's a better way to store a fact like that will potentially improve designs.
The enum idea is often wise; also, just for an example that has probably occurred a hundred thousand times across the world in various businesses...
Original design: store a row that needs to be reported to someone, with an is_reported column that is boolean.
Problem: one day for whatever reason the ReporterService turns out to need to run two of these in parallel. Maybe it's that the reporting is the last step after ingestion in a single service and we need to ingest in parallel. Maybe it's that there are too many reports to different people and the reports themselves are parallelizable (grab 5 clients, grab unreported rows that foreign key to them, report those rows... whoops sometimes two processes choose the same client!)... Maybe it's just that these are run in Kubernetes and if the report happens when you're rolling pods then the request gets retried by both the dying pod and the new pod.
Alternative to boolean: unreported and reported records both live in the `foo` table and then a trigger puts a row for any new Foos into the `foo_unreported` table. This table can now store a lock timestamp, a locker UUID, and denormalize any columns you need (client_id) to select them. The reporter UPDATEs a bunch of rows reserving them, SELECTs whatever it has successfully reserved, reports them, then DELETEs them. It reserves rows where the lock timestamp IS NULL or is less than now minus 5 minutes, and the Reporter itself runs with a 5 minute timeout. The DB will do the barest amount of locking to make sure that two UPDATES don't conflict, there is no risk of deadlock, and the Boolean has turned into whether something exists in a set or not.
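A rough PostgreSQL sketch of that reservation step (table and column names invented; FOR UPDATE SKIP LOCKED is just one optional way to keep concurrent reporters from waiting on each other):
    -- Reserve a batch of unreported rows for this worker; anything whose
    -- reservation is older than 5 minutes is considered abandoned.
    UPDATE foo_unreported
    SET    locked_at = now(),
           locker_id = $1                  -- this worker's UUID
    WHERE  id IN (
        SELECT id
        FROM   foo_unreported
        WHERE  locked_at IS NULL
           OR  locked_at < now() - interval '5 minutes'
        LIMIT  100
        FOR UPDATE SKIP LOCKED
    );
    -- Then: SELECT the rows carrying this locker_id, report them,
    -- and DELETE them on success.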
A similar trick is used in the classic Python talk “Stop Writing Classes” by @jackdied where a version of The Game of Life is optimized by saying that instead of holding a big 2D array of true/false booleans on a finite gameboard, we'll hold an infinite gameboard with a set of (x,y) pairs of living cells which will internally be backed by a hashmap.
For me enums win, especially when you consider that you can get help from your environment every time you add/remove stuff. Some languages force you to deal with the changes (e.g. Rust), or you could add linter rules for other languages. But you're more likely to catch a problem before it arises, rather than deal with ever-increasing bool checks. Makes reasoning about states a lot easier.
32-bit UNIX timestamps are often signed so you can actually go before that, but most UNIX timestamps are 64-bit now, which can represent quite a larger range. And SQL datetime types might have a totally different range.
Not that it really matters; deleted_at times for your database records will rarely predate the existence of said database.
In addition to the sibling comment, which is exactly right (you should be using a nullable column here, if you're using SQL, for multiple reasons), I reckon this is a design issue in the programming language that is largely unrelated to how you model the database. It's pretty easy to run into bugs, especially if you compound it with other quirky APIs, like strcmp: `if (strcmp(a, b)) // forgot to do == 0; accidentally backwards!` -- So really, you just don't have much of a choice other than to tread carefully and enable compiler warnings. Personally, in this case I'd use an Optional wrapper around the underlying timestamp type anyway, if I needed to be able to represent the UNIX timestamp at 0 as well as an empty state.
So you're still fine as long as you're not tracking things that were deleted at that exact instant 50 years ago, which is a safe assumption, for instance, for things that happened in an application that has only existed for less time than that. That said, I haven't ever seen this implemented in a way that casts. It's implemented with scopes in the ORM, usually.
MyModel.nondeleted.where(<criteria>)
etc.
which generates a query with "WHERE deleted_at IS NULL"
« Anytime you store Boolean, a kitten dies »
Nobody has ever said that but nobody wants any kitten to die so nobody has ever challenged me anytime I use that statement.
A little while back, I had a conversation with a colleague about sorting entries by "updated at" in the user interface, and to my surprise this was not added by the backend team.
Many of these "we are going to need it"s come from experience. For example in the context of data structures (DS), I have made many "mistakes" that I do correctly a second time. These mistakes made writing algorithms for the DS harder, or made the DS have bad performance.
Sadly, it's hard to transfer this underlying breadth of knowledge and intuition for making good tradeoffs. As such, a one-off tip like this is limited in its usefulness.
Database schemas being perfect out of the gate was replaced by reliable migrations.
If it's not data that's essential to serving the current functionality, just add a column later. `updated_at` doesn't have to be accurate for your entire dataset; just set it to `NOW()` when you run the migration.
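For example, such a migration can be as small as this (assuming a made-up orders table):
    ALTER TABLE orders ADD COLUMN updated_at timestamptz;
    UPDATE orders SET updated_at = now();            -- backfill: "accurate from now on"
    ALTER TABLE orders ALTER COLUMN updated_at SET NOT NULL;
    ALTER TABLE orders ALTER COLUMN updated_at SET DEFAULT now();
    -- The DEFAULT only covers INSERTs; keeping the value fresh on UPDATE
    -- still needs application code or a trigger.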
Sure, migrations are bearable (especially ones that only add columns).
But for the example of the "updated_at" column, or "soft delete" functionality, you only find out you need it because the operations team suddenly discovered they needed that functionality on existing production rows because something weird happened.
In C#-land, we just have it as a standard that ~every table inherits from `ITrackable`, and we wrote a little EF plugin to automatically update the appropriate columns.
“Reliable migrations” almost seems like an oxymoron. Migrations are complicated, difficult and error prone. I think there’s a good takeaway here around good initial schema design practices. The less you have to morph your schema overtime, the less of those risky migrations need to run.
My experience has not been so smooth. Migrations are reasonable, but they're not free and "always keeps backups" sounds like you'd tolerate downtime more than I would.
Even in the best case (e.g. basic column addition), the migration itself can be "noisy neighbors" for other queries. It can cause pressure on downstream systems consuming CDC (and maybe some of those run queries too, and now your load is even higher).
Still depends on what the data represent: you could get yourself in a storm of phone calls from customers if after your latest release there's now a weird note saying their saved document was last updated today.
Somewhat related, but I suggest having both the record updated at, and some kind of "user editing updated at". As I've encountered issues where some data migration ends up touching records and bumping the updated at, which shocks users since they see the UI reshuffle and think they have been hacked when they see the records updated at a time they didn't update them.
I like the heuristics described here.
However if these things aren't making it into a product spec where appropriate, then I smell some dysfunction that goes beyond what's being stored by default.
Product need (expressed as spec, design, etc) should highlight the failure cases where we would expect fields like these to be surfaced.
I'd hope that any given business shouldn't need someone with production database access on hand to explain why/when/how 'thing' was deleted. Really we'd want the user (be it 'boss' or someone else) to be able to access that information in a controlled manner.
"What information do we need when something goes wrong?". Ask it. Drill it. Ask it again.
That said, if you can't get those things, this seems a fine way to be pragmatic.
Yes! Why something happened is incredibly important. Gitlab made this mistake hard. We have a medium sized instance with some complex CI pipelines and often they'll just get cancelled and it doesn't say why or even who by. And anyone can do it! The only option is to ask the entire company "did anyone cancel this?"
Event-sourcing solves this. And with how cheap storage is, it should be more prevalent in the industry. IMO the biggest thing holding it back is that there isn't a framework that's plug-and-play (say like Next.js is to React) that provides people with that ability.
I've been working on one in Typescript (with eventual re-writes in other langs. like Rust and Go), but it's difficult even coming up with conventions.
Event sourcing is an expensive solution, and I don't mean from a storage perspective — it burns engineering cognitive horsepower quickly on things that don't matter. Do it if you're in finance or whatever. Having been burned by my own "let's event source" impulse on data change tracking systems, I now prefer less sophisticated solutions. Figuring out how to deal with slow projections, watching a projection rebuild go from minutes to hours to a few days as a system I expected to handle a few events/minute went to 20 events/second. Fancy caches can't save you if you want to use that vaunted ability to reconstruct from scratch. Event schema evolution also presents difficult tradeoffs: when old events stop having meaning or evolve in meaning, you either end up adding new event subtypes and variants, leaving old cruft to accumulate, or you do migrations and edit history on really large tables.
I'd counsel anyone considering event sourcing to use more "low power" solutions like audit logs or soft deletes (if really necessary) first if possible.
Appreciate your perspective, and it makes me wish there were some kind of online "engineers learning from their mistakes" forum (it's rare to see "I burned myself"). Hearing hard-won knowledge distilled like this is a nice reminder to spend one's complexity budget wisely.
I agree with this as written, and think it's important to have some degree of forethought when building out the DB, to plan for future growth and needs.
That said, the monkey paw of this would be someone reading it and deciding they should capture and save all possible user data, "just in case", which becomes a liability.
Just curious, how do people feel about this general style of soft deletes currently? Do people still use these in production, or prefer to just delete fully, or alternatively move deleted rows to a separate table / schema?
I find the complexity still feels awkward enough that it makes me wonder if deleted_at is worth it. Maybe there are better patterns out there to make this cleaner, like triggers to prevent deletion, or something else?
As for the article, I couldn't agree more on having timestamps / user ids on all actions. I'd even suggest updated_by to add to the list.
In our product, we have different strategies depending on the requirements. Sometimes, we just delete. Sometimes, we do soft delete with timestamps. Sometimes, we have a history table with or without versioned entities. Sometimes, we have versions in the table. Sometimes, we have an audit log. Sometimes, we use event sourcing (although everyone in the team hates it ;-)
Financial world: records have a "close" or "expire" date and are then purged after some period of time. A deletion doesn't just happen; the record is updated to be "closed" or "expired", and some time after that it's deleted.
Something like a loan could live in a production environment for well over a year after closing, while an internal note may last just a month.
I think soft deletes using timestamptz are a good thing.
Deleting rows directly could mean you're breaking references. For example, say you have a product that the seller wants to delete. Well, what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
Your backend can selectively query for products, filter out deleted_at for any customer facing queries, but show all products when looking at purchase history.
There are times when deleting rows makes sense, but that's usually because you have a write-heavy table that needs clearing. Yes, soft deletes require being careful with WHERE statements filtering out deleted rows, but that's a feature, not a bug.
> what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
You might still want to show those customers their purchase history, including what they bought 25 years ago. For example, my ISP no longer has that 10 Mb/s fiber optic product I bought in 2000, because it was superseded by 100 Mb/s products and then by 1 Gb/s ones. It's also not my ISP anymore, but I use it for the SIM in my phone. That also accumulated a number of product changes over the years.
And think about the inventory of e-shops with a zillion products and the archive of past orders. Maybe they keep the last few years, maybe everything, until the DB gets too large.
There can be legal requirements to retain data for a specified time for law enforcement and audits, while at the same time other legal requirements have you requiring to delete data upon customer request.
Doing this with pure 'hard' deletes is not possible, unless you maintain 2 different tables, one of which would still have the soft delete, explicit or implicit. You could argue the full DB log would contain the data for the former requirement, but while academically correct, this does not fly in practice.
If you have a good audit log, it really doesn't matter. You can always restore it if need be.
If you have no audit log (or a bad one), like lots of apps, then you have to care a lot.
Personally, I just implement a good audit log and then I just delete with impunity. Worst case scenario, someone (maybe even me) made a mistake and I have to run undo_log_audit() with the id of the audit log entry I want to put back. Nearly zero hassle.
The upside: when something goes wrong, I can tell you who, what and when. I usually have to infer the why, or go ask a human, but it's usually not even difficult to do that.
That depends on where the data you need to keep track of is and your architecture. The important thing is, you want your audit log to be able to tell you:
* Who
* What
* When
* Ideally Why
For any change in the system. Also, when storing the audit log, take into account that you might need to undo things that happened (not just deletes). For instance, maybe some process went haywire and inserted 100k records it wasn't supposed to. With a good audit log, you should be able to run something like undo_log_audit(rec1, rec100k) and it will do the right thing. I'm not saying that code needs to exist on day 1, but you should take the ability to do that into account when designing it.
Also you need to take into account your regulatory environment. Sometimes it's very, very important that your audit logs are write-once and read-only afterwards, and are stored off-machine, etc. Other times it's just for internal use and you can be a little more lax about the data integrity of your audit logs.
Our app is heavily database-centric. We push into the DB the current unix user, the current PID of the process connecting to the DB, etc. (also, every user has their own login to the DB, so it handles our authentication too). This means our database (Postgres) does all of the audit logging for us. There are plenty of Postgres audit logging extensions. We run 2 of them: one that is trigger-based, creating entries in a log_audit table (which the undo_log_audit() code uses, along with most reporting use cases), and a second one that writes out to syslog (so we can move logs off-machine and keep them read-only). We are in a regulated industry that gets audited regularly, however. Not everyone needs the same level of audit logging.
You need to figure out how you can answer the above questions given your architecture. Normally the "Why" question is hard to answer without talking with a human, but unless you have the who, what and when, it's nearly impossible to even get to the Why part of the question.
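For illustration, a minimal sketch of the trigger-based flavor in PostgreSQL (not the actual extensions mentioned above; table and column names are assumptions):
    CREATE TABLE log_audit (
        id         bigserial   PRIMARY KEY,
        table_name text        NOT NULL,
        operation  text        NOT NULL,               -- INSERT / UPDATE / DELETE
        old_row    jsonb,
        new_row    jsonb,
        db_user    text        NOT NULL DEFAULT current_user,  -- the "who"
        changed_at timestamptz NOT NULL DEFAULT now()          -- the "when"
    );
    CREATE FUNCTION audit_row() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO log_audit (table_name, operation, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
            RETURN NEW;
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO log_audit (table_name, operation, old_row, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
            RETURN NEW;
        ELSE  -- DELETE
            INSERT INTO log_audit (table_name, operation, old_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
            RETURN OLD;
        END IF;
    END;
    $$ LANGUAGE plpgsql;
    -- Attach per audited table, e.g. a hypothetical customers table:
    CREATE TRIGGER customers_audit
        AFTER INSERT OR UPDATE OR DELETE ON customers
        FOR EACH ROW EXECUTE FUNCTION audit_row();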
Always soft-deletion first. Then it gets exported to a separate archive, and only then, after some time, may it be fully deleted from the initial base.
(It Is Probable That While Not Immediately Required The Implementation of Storage of Data In Question May Be Simpler Now Rather Than Later)
I've gone ahead and included additional detail in the acronym in the event that the clarity is required later, as this would be difficult to retrofit into a shorter, more-established acronym.
Five years ago everybody would laugh about "soft deletes" or "marked as deleted". Whoever thought this was a good idea from a data protection perspective?
You're also lying to the faces of your users with such behavior. Shame.
Except almost every database (and most storage devices nowadays) works this way - mark an entry as deleted, then batch delete a lot of entries during garbage collection. It's fundamentally impossible to efficiently erase a record from the middle of a file, except maybe by using an encryption tree, which would still be fairly inefficient.
Actually erasing data is quite hard. Soft deletes doesn't add any new lies, they just move the lies to the upper layer.
Come on. With a manual "marked as deleted" it stays like this forever; it is not deleted and never will be, and the "deleted" data also lands in database backups, is still queryable, and so on. I do not care if the deleted data stays for a while on disk or in memory, as long as it will eventually be deleted by the garbage collector and isn't queryable anymore.
It's a terrible post. What it suggests is to turn your head off and follow an overly generalised principle. I guess when somebody invents yet another acronym, it is a red flag.
Data has its own life cycles in every area it passes through. And it's part of requirements gathering to find those cycles: the dependent systems, the teams, and the questions you need to answer. Mindlessly adding fields won't save you in every situation.
Bonus point: when you start collecting questions while designing your service, you'll discover how mature your colleagues' thinking is.
Agree, although the acronym in the article could be interpreted to mean “you are going to read it, so index it appropriately”, which is sort of bad advice and can lead to overindexing. There is probably something better for “add appropriate and conventional metadata” (the author suggests updated_at, created_at etc)
Not a huge fan of the example of soft delete; I think hard deletes with archive tables (no foreign key enforcement) are a much, much better pattern. It takes away from the main point of the article a bit, but I'm glad the author hinted at deleted_at only being used for soft deletes.
I don't like general advice like this, because it's too general. For many, it's probably good advice. For others, not so much.
Anyone who has worked at a small company selling to large B2B SaaS can attest we get like 20 hits a day on a busy day. Most of that is done by one person in one company, who is probably also the only person from said company you've ever talked to.
From that lens, this is all overkill. It's not bad advice; it's just that it will get quoted for scenarios it doesn't apply to. Which also applies to K8S, or microservices at large even, and most 'do as I say' tech blogs.
Author is very kind! In practice, many times I saw only the CR/CRU of CRUD getting implemented.
For example: as a company aspires to launch its product, one of the first features implemented in any system is to add a new user. But when the day comes when a customer leaves, suddenly you discover no one implemented off-boarding and cleanup of any sort.
Shipped and supported enough startup apps to learn this the hard way: users will delete things they shouldn’t, and you will be asked to explain or undo it. Soft deletes and basic metadata (created_at, deleted_by, etc.) have saved me multiple times — not for some future feature, just for basic operational sanity.
Assuming you're serious, there are two main reasons. You want to regain storage space after you delete things, and sometimes you want to actually delete things (e.g. to be in compliance with regulations).
curious that both YAGNI and YAGRI arguments could realistically be made for the same fields. guess it boils down to whether someone’s YAGRI is stronger than their colleague’s YAGNI ( :
This is good advice except for deleted_at. Soft deletion is rarely smart. Deleted things just accumulate and every time you query that table is a new opportunity to forget to omit deleted things. Query performance suffers a lot. It's just a needless complexity.
Instead, just for the tables where you want to support soft delete, copy the data somewhere else. Make a table like `deleteds (tablename text not null, data jsonb not null default '{}')` into which you can stuff a serialized copy of the rows you delete from other tables (but just the ones you think you want to support soft delete on).
The theory here is: You don't actually want soft delete, you are just being paranoid and you will never go undelete anything. If you actually do want to undelete stuff, you'll end up building a whole feature around it to expose that to the user anyway so that is when you need to actually think through building the feature. In the meantime you can sleep at night, safe in the knowledge that the data you will never go look at anyway is safe in some table that doesn't cause increased runtime cost and development complexity.
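With that schema, the copy-on-delete can even be a single statement in PostgreSQL (the products table and id are made-up):
    CREATE TABLE deleteds (
        tablename text  NOT NULL,
        data      jsonb NOT NULL DEFAULT '{}'
    );
    -- Delete and archive in one statement.
    WITH gone AS (
        DELETE FROM products WHERE id = 42 RETURNING *
    )
    INSERT INTO deleteds (tablename, data)
    SELECT 'products', to_jsonb(gone) FROM gone;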
I have a different way of thinking about this: data loss. If you are throwing away data about who performed a delete it is a data loss situation. You should think about whether that’s OK. It probably isn’t.
Well, in the same vein that we discuss "points" and talk about their merits, it's useful to discuss and understand their counterpoints. I, for one, did not know about this and thought it was insightful when building a product that hasn't fully been scoped out and is more greenfield.
I don't get why all of the big RDBMSes (PostgreSQL, MariaDB/MySQL, SQL Server, Oracle, ...) don't seem to have built in support for soft deletes up front and center?
Where the regular DELETE wouldn't get rid of the data for real but rather you could query the deleted records as well, probably have timestamps for everything as a built in low level feature, vs having to handle this with a bunch of ORMs and having to remember to put AND deleted_at IS NULL in all of your custom views.If we like to talk about in-database processing so much, why don't we just put the actual common features in the DB, so that toggling them on or off doesn't take a bunch of code changes in app, or that you'd even be able to add soft deletes to any legacy app that knows nothing of the concept, on a per table basis or whatever.
Maybe my intuition is wrong, but to me this sounds like a violation of the principle of least power.
"Soft deletes" is just a name for a regular write operation, with specific semantics.
Adding a layer of magic to the DB for this doesn't seem right to me.
And applications could have many different requirements for soft deletes, like the article points out. For example, the simplest version would be just a boolean "deleted" column, but it could also be "deleted_at", "deleted_by", etc.
All of these cases require an bunch of code changes anyway, and the more complex ones could interfere with an implementation of this feature at the database level: such a transparent implementation couldn't access app-specific concerns such as user data, for example.
Adding soft deletes to a legacy app that knows nothing about it would only work for a boolean flag and a maybe date-time value, unless the DBMS would also offer triggers for soft deletes etc?
Seems to me to that this capability would make a DBMS much more complicated.
>Adding soft deletes to a legacy app
As an aside, I've never found this to be worth it since you have to change too much and re-test everything for minimal user benefit and time savings. The effort is way worse if the code is not great in the first place. It can be a great decision to make before everything is written.
Maybe it's worth it for files which are hard to reproduce, but you can also rely on DB backups to get those back. If people are regularly deleting things they're not supposed to, you're better off removing the user-facing delete actions, limiting the action to specific users, etc.
I don't think it's all magic because you have to implement soft deletes in your application layer and it would be nice to have a little DB support for it. It doesn't have to be some big thing, just the ability for selects and such to work only on undeleted elements by default unless you ask for them would be nice so it doesn't pollute your code and make you have to always remember to point the gun away from your foot.
I'd argue that what SQL needs is better facilities for code reuse, metaprogramming and such, it ought to give you the tools that you can make something that lets you add something to the language such that you can add
and it adds to the table definition as well as to the schema that will cause subsequent statements to be rewritten. There's a lot of interesting logic (in the literal sense) in SQL that is hidden by a strange, irregular syntax that is more obvious in other approaches to databases such as Datalog. I think it was little appreciated outside the hardcore semantic web community that you could compile SPARQL + OWL to SQL and get powerful inference facilities. SQL is a great target for metaprogramming precisely because it is not Turing complete and that a code generator doesn't have to think at all about the order that events are sequenced in. It's kinda sad that metaprogramming tools for SQL are almost all pre-Chomsky and pre-dragon book internal DSLs like JooQ and SQLAlchemy which have their charms (JooQ's excellent integration with Java IDEs) but fall short of what could be done with SQL-to-SQL and SQL-to-X transformations.Stored procedures are great but many shops don't use them for various reasons. It doesn't help that they look like a mix of FORTRAN and COBOL and also come in a few variations from the (better) set-based PL/SQL of Oracle to the (worse) Transact-SQL based stored proc of Microsoft SQL and PostgresSQL. The other day I talked with Krisztián Szabó of
https://schemamap.io/
who developed a compiler that writes stored procs that do database synchronization.
On the other hand, if you've got access to the internals of the frickin' database I think you can do something better than the ordinary application level soft updates. For instance a "customer record" might very well be not just a row in one table but maybe 15 rows in four tables that are inserted in a transaction and you want to be able to undelete them as a unit.
Because it's too dependent on business logic.
Different products will handle soft deletes differently. Which queries need to include soft-deleted rows and which don't? What about different levels of soft deletes, e.g. done by the user (can be undone by user) vs. done by an admin (can't be undone by user)?
Implementing soft deletes yourself isn't hard. Yes you'll have to make a bunch of decisions about how they work in every circumstance, but that's the point.
It's just not bothersome enough to deviate from the standard.
If they did this, nobody would use it. They do lots of more useful things that people don't use because it's not portable.
There's a sibling comment about temporal databases. Those solve a very bothersome problem, so a few people use them. That means that there's a chance soft deletes get adopted as a side effect of a much more complex standard.
Temporal tables in SQL server fit this use-case[0], I think.
0: https://learn.microsoft.com/en-us/sql/relational-databases/t...
Available on postgres as an extension. It's a bit jank and doesn't have language integrated clauses like sql server.
These are not decisions that should be taken solely by whoever is programming the backend.
They need to be surfaced to the product owner to decide. There may very well be reasons pieces of data should not be stored. And all of this adds complexity, more things to go wrong.
If the product owner wants to start tracking every change and by who, that can completely change your database requirements.
So have that conversation properly. Then decide it's either not worth it and don't add any of these "extra" fields you "might" need, or decide it is and fully spec it out and how much additional time and effort it will be to do it as a proper feature. But don't do it as some half-built just-in-case "favor" to a future programmer who may very well have to rip it out.
On a personal project, do whatever you want. But on something professional, this stuff needs to be specced out and accounted for. This isn't a programming decision, it's a product decision.
This entirely depends on the company culture. I worked in teams where every small decision is in the hand of the PO and I've worked in teams where a software engineer is a respected professional enabled to make their own technical decisions. I found the second option to create higher quality software faster.
Also not sure what you mean by additional effort? Created_at, updated_at or soft-deletes are part of most proper frameworks. In Spring all you need is an annotation, I've been using those in major projects and implementation cost is around a few seconds with so far zero seconds of maintenance effort in years of development. At least those fields are solved problems.
even better, assume soft deletes and inform them this is how delete works - if there's a hard requirement for hard delete, they will tell you.
the hypothetical future programmer is you in two weeks.
Some things are trivial and nearly free - created_at, updated_at. I don't think engineers need to bring trivialities like this to a "product owner". Own your craft.
I never worked at a place with product owners, but their post made me appreciate my roles where I'm trusted to help design the product myself. Yeesh.
Being unable to even call the shot of whether a database table should have an updated_at or soft-delete sounds like a Dilbertian hellscape to me.
When the product you're developing is governed by regulations and standards you need to comply, owning your craft is doing things by the book, not adding fields on your own because it might be useful later.
So what? I've worked places with lots of regulation. Part of every development job is learning the product domain. In that case devs become comfortable with reading standard/law/regulations and anticipating when software implementation might interact with the areas covered.
Sure there were people who's job was to offload as much compliance work from everyone else; by turning it into internal requirements, participating in design discussion and specializing in ensuring compliance. But trying to isolate the development team from it is just asking for micromanagers.
> So what?
Think before you act. The machine has no brain. Use yours.
> Part of every development job is learning the product domain.
Yes.
> In that case devs become comfortable with reading standard/law/regulations and anticipating when software implementation might interact with the areas covered.
This is what I'm saying, too. A developer needs to think whether what they are doing is OK by the regulation they're flying against. They need to ask for permissions by asking themselves "wait, is this OK by the regulation I'm trying to comply?".
> But trying to isolate the development team from it is just asking for micromanagers.
Nope, I'm all for taking initiatives, and against micromanagement. However, I'm also against "I need no permission because I'm doing something amazing" attitude. So own your craft, "code responsibly".
Oh, I thought you were disagreeing with hamandcheese's point that every little decision doesn't need to go through a product owner before anything happens.
No, not at all. by "the book", I meant regulations, not the management. :)
Although those can be more complicated, and it should be clear what they're for and why they exist. Will this result in an object having an updated_at timestamp elsewhere in a larger application? Is it clear which properties that refers to?
I think the tricky part lies on knowing which things can be done without consulting any product owner. I agree that created_at and updated_at don’t cause any harm. deleted_at on the other hand cannot be decided by engineers only (mainly because of GDPR reasons: if something is expected to be totally deleted, then that must be it). As usual, these kind of things are obvious to engineers with years of experience , not so much to newcomers.
A soft delete might not be, for compliance reasons (GDPR and the like). Otherwise I agree.
I've seen product owners who get blindsided every time by this sort of thing.
On the other hand, in some shops there is a dedicated DBA who is in charge of database schemas and possibly everything else. Before it became fashionable to create a "service layer" where people access the database (now database(s)) throw web endpoints, some organizations would put all the database access into a set of stored procedures managed by the DBA. Maybe that's extreme, but in the real world product owners come and go but the database is forever and deserves to have somebody speaking out for its interests.
> don't do it as some half-built just-in-case "favor" to a future programmer who may very well have to rip it out.
in other words - YAGNI !
A product owner may (probably does not) have these things on their radar, it's up to a developer to inform them of industry best practices like these.
Yes. Adding fields to a DB is not a risk-free thing a programmer should just do.
> And all of this adds complexity, more things to go wrong
That's a little vague given this specific example, which appears to be about maintaining some form of informative logging; though I don't think it necessarily needs to be in the form of an DB table.
The problem with updated_at and updated_by is that a given record could experience multiple updates by multiple people at multiple times, and you'd only have visibility into the most recent.
The logical conclusion here is to log the updates (and creations and deletions and undeletions and such) themselves:
Technically you don't even need the "foo" table anymore, since you can reconstruct its contents by pulling the most recent transaction for a given foo_id and discarding the reconstructed record if the most recent action on it was a deletion. Probably still a good idea to create a view or somesuch for the sake of convenience, but the point of this is that the log itself becomes the record of truth - and while this approach does cost some disk space (due to duplicated data) and read performance (due to the more complex query involved), it's invaluable for tracking down a record's full lifecycle. Even better if you can enforce append-only access to that table.This is a pretty typical approach for things like bookkeeping and inventory management (though usually those are tracking the deltas between the old and new states, instead of recording the updated states directly as the above example would imply).
While I like the YAGRI principle very much, I find that adding
- updated_at
- deleted_at (soft deletes)
- created_by etc
- permission used during CRUD
to every table is a solution weaker than having a separate audit log table.
I feel that mixing audit fields with transactional data in the same table is a violation of the separation of concerns principle.
In the proposed solution, updated_at captures only the last change, a problem that a separate audit log table does not have.
An audit log table often takes a huge amount of space compared to simple fields on the records so there are tradeoffs. Which solution is best depends on how important change logs are.
I kinda agree, but don’t underestimate the power of having things where people are looking.
Put your documentation in doc strings where the function is defined - don’t have a separate file in a separate folder for that. It might separate concerns, but no one is looking there.
Similarly if those fields aren’t nullable, someone trying to add new rows will have to fill in something for those metadata fields - and that something will now very likely be what’s needed, rather than not pushing anything to the audit table.
Obviously your app can outgrow these simple columns, but you’re getting value now.
Event sourcing also works great. You don't need an audit log per se if you already track a history of all commands that introduced changes to your system.
Event sourcing and "the right to be forgotten" are not always easy to marry.
> Event sourcing and "the right to be forgotten" are not always easy to marry.
The absolute basics are supporting snapshots and event replay. This is hardly rocket science.
If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
Let's assume we want to remove every message related to user A.
A photo by user B got to be the best of the day because it collected the most upvotes. Without A's vote, it's no longer so. The photo also got to become the best of the month because it was later voted the top of the best-of-the-day entries, and received a prize. Should we now replay the message stream without A's upvote, things are going to end up radically different, or end up in a processing error.
User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
One way is to ignore the inconsistencies; it deprives you of most of the benefits of event sourcing.
Another way is anonymizing: replace messages about user A with messages about some null user, representing the removed users. This can lead to more paradoxes and message replay inconsistencies.
> If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
That's not how snapshots work. You record the state of your system at a point in time, and then you keep all events that occurred after that point. This means you retain the ability to rebuild the current state from that snapshot by replaying all events. I.e., event sourcing's happy flow.
> User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
Not really. That's just your best attempt at reasoning about how the system could work. In the meantime, depending on whether you have a hard requirement on retaining messages from removed users, you can either keep them assigned to a deleted user or replace them with deleted messages. This is not a problem caused by event sourcing; it's a problem caused by failing to design a system that meets its requirements.
Yep. But Event Sourcing comes with its own set of other problems. Also, I don't think this would apply to OP's post: with Event Sourcing you would not even have those DB tables.
The DB tables suggested by OP are akin to snapshots, whereas each event would require a separate data store. OP is trying to shoehorn event history into the snapshots, which hardly makes any sense.
*_at and *_by fields in SQL are just denormalization + pruning patterns consolidated, right?
Do the long walk:
Make the schema fully auditable (one record per edit) and the tables normalized (it will feel weird). Then suffer with it, discover that normalization leads to performance decrease.
Then discover that pruned auditing records is a good middle ground. Just the last edit and by whom is often enough (ominous foreshadowing).
Fail miserably by discovering that a single missing auditing record can cost a lot.
Blame database engines for making you choose. Adopt an experimental database with full auditing history. Maybe do incremental backups. Maybe both, since you have grown paranoid by now.
Discover that it is not enough again. Find that no silver bullet exists for auditing.
Now you can make a conscious choice about it. Then you won't need acronyms to remember stuff!
Another option is audit info could go to another table or datastore entirely.
If you never use it, that data can be dumped to s3 glacier periodically (e.g. after 90 days).
By losing the foreign key you gain flexibility in what you audit. Maybe audit the operation and not the 20 writes it causes.
Fair enough, but now your application is relying on 100% uptime of AWS and S3 and no network failures in between. And what happens if your transaction goes through, but the request to AWS doesn't? What happens if another operation mutates the target in the meantime, before you can retry with the current state? Your app is also slowing down, since it needs to send the events to S3 and guarantee they got there. Now you are reinventing two-phase commits. Unless you aren't actually making an audit log and don't care whether events are guaranteed to be logged?
So like OP said, no silver bullets exist for auditing.
Correct. This is a system design problem. You want this to be transactional and work at scale? That might be hard to achieve. Maybe if the data can be partitioned, then each node handles its own auditing in a table as part of the transaction. There are many possibilities. Allowing inconsistency might be OK too, depending on what is required.
My current approach is to have the database be the current state, and use logical replication (CDC) to keep the log of changes in case you need it.
It is interesting thinking about record changes as a spectrum towards application logs. At some point too much detail is expensive to store, and you must adopt an archival strategy.
Really depends on the app. If you have a low throughput line-of-business kind of application you can probably get away with storing everything.
If you see it from the pure SQL point of view, you are at the "blame database engines and adopt an experimental solution" stage.
It is the point where you give up modeling the audit as part of the system's tables.
The drawbacks of this choice are often related to retrieval. It depends on the engine.
I once maintained a system that kept a fully working log-replicated instance delayed by 24h, ready for retrieval queries, in addition to regular disk backups (slow, costly retrieval).
I am more developer than DBA, so I can probably speak more about modeling solutions than infra-centric solutions.
Yeah 100% giving up on pure SQL to solve the problem, mainly from the perspective that doing full versioning etc. in SQL is really damn hard.
The problem with this is the audit log is only at the CRUD level which is often too low. Ambiguities can arise. For example if the question is "who published the article" do you look for a create or do you look for an update with published=true? It's even worse when you consider the schema can change over time, so both can be correct but at different points in time. Event sourcing is the way if you want to capture business-level events.
But wait, there's Event Driven Architectures and Event Sourcing, meaning that the events are your log of edits!
Doesn't that also fall under "blame the database engines and go for an experimental solution"?
I'm not saying databases are blameless. It's just that experiencing the issues they have by yourself is rewarding!
There is also a walk before the long walk of databases. Store things in text files and use basic tools (cat, sed, sh...).
The event driven stuff (like Kafka) reminds me of that. I am not very familiar with it though, just played a little bit with it once or twice.
Kind of. The WAL in Postgres is effectively an event log, and many people keep replicas of it for backup reasons, which is auditable - kind of meaning that EDA/event sourcing is just a shinier version of that?
Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when. Your main table’s created and updated times can be a function of (or a complement to) the edit table.
It is tempting to supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value). This is getting dangerously and excitingly close to a graph database implemented in a relational database! Implement one at your peril — what you gain in schemaless freedom you also lose in terms of having the underlying database engine no longer enforcing consistency on your behalf.
> This is getting dangerously and excitingly close to a graph database implemented in a relational database!
This feels like a great unresolved tension in database / backend design - or maybe I'm just not sophisticated enough to notice the solutions?
Is the solution event sourcing and using the relational database as a "read model" only? Is that where the truly sophisticated application developers are at? Is it really overkill for everybody not working in finance? Or is there just not a framework that's made it super easy yet?
Users demand flexible schemas - should we tell them no?
> Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when.
Aren't you describing a non-functional approach to event sourcing? I mean, if the whole point of your system is to track events that caused changes, why isn't your system built around handling events that cause changes?
> supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value)
I frankly hate this sort of thing whenever I see it. Software engineers have a tendency to optimize for the wrong things.
Generic relations reduce the number of tables in the database. But who cares about the number of tables in the database? Are we paying per table? Optimize for the data model actually being understandable and consistently enforced (+ bonus points for ease of querying).
One thing I do quite frequently which is related to this (and possibly is a pattern in rails) is to use times in place of Booleans.
So is_deleted would contain a timestamp to represent the deleted_at time for example. This means you can store more information for a small marginal cost. It helps that rails will automatically let you use it as a Boolean and will interpret a timestamp as true.
I consider booleans a code smell. It's not a bug, but it's a suggestion that I'm considering something wrong. I will probably want to replace it with something more meaningful in the future. It might be an enum, a subclass, a timestamp, refactoring, or millions of other things, but the Boolean was probably the wrong thing to do even if I don't know it yet.
The way I think about it: a boolean is usually an answer to a question about the state, not the state itself.
A light switch doesn't have an atomic state, it has a range of motion. The answer to the question "is the switch on?" is a boolean answer to a question whose input state is a range (e.g. is distance between contacts <= epsilon).
This seems at first like a controversial idea, but the more I think about it the more I like this thought technology. Merely the idea of asking myself if there's a better way to store a fact like that will potentially improve designs.
The enum idea is often wise; also: for just an example that has probably occurred a hundred thousand times across the world in various businesses...
Original design: store a row that needs to be reported to someone, with an is_reported column that is boolean.
Problem: one day for whatever reason the ReporterService turns out to need to run two of these in parallel. Maybe it's that the reporting is the last step after ingestion in a single service and we need to ingest in parallel. Maybe it's that there are too many reports to different people and the reports themselves are parallelizable (grab 5 clients, grab unreported rows that foreign key to them, report those rows... whoops sometimes two processes choose the same client!)... Maybe it's just that these are run in Kubernetes and if the report happens when you're rolling pods then the request gets retried by both the dying pod and the new pod.
Alternative to boolean: unreported and reported records both live in the `foo` table and then a trigger puts a row for any new Foos into the `foo_unreported` table. This table can now store a lock timestamp, a locker UUID, and denormalize any columns you need (client_id) to select them. The reporter UPDATEs a bunch of rows reserving them, SELECTs whatever it has successfully reserved, reports them, then DELETEs them. It reserves rows where the lock timestamp IS NULL or is less than now minus 5 minutes, and the Reporter itself runs with a 5 minute timeout. The DB will do the barest amount of locking to make sure that two UPDATES don't conflict, there is no risk of deadlock, and the Boolean has turned into whether something exists in a set or not.
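A rough sketch of that reserve/report/delete cycle (column names like locked_at and locked_by are placeholders, and :worker_uuid stands for the reporter's own ID):

    -- reserve unreported rows whose lock is absent or stale
    UPDATE foo_unreported
    SET locked_at = now(), locked_by = :worker_uuid
    WHERE locked_at IS NULL OR locked_at < now() - interval '5 minutes';

    -- read back what this worker actually managed to reserve
    SELECT * FROM foo_unreported WHERE locked_by = :worker_uuid;

    -- ... report them, then drop the reported rows from the queue ...
    DELETE FROM foo_unreported WHERE locked_by = :worker_uuid;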
A similar trick is used in the classic Python talk “Stop Writing Classes” by @jackdied where a version of The Game of Life is optimized by saying that instead of holding a big 2D array of true/false booleans on a finite gameboard, we'll hold an infinite gameboard with a set of (x,y) pairs of living cells which will internally be backed by a hashmap.
Booleans also force the true/false framing.
E.g. a field called userCannotLoginWithoutOTP.
Then in code "if not userCannotLoginWithoutOTP or otpPresent then..."
This may seem easy until you have a few flags to combine and check.
An enum called LoginRequirements with values Password, PasswordAndOTP is one less negation and easier to read.
For me enums win especially when you consider that you can get help from your environment every time you add/remove stuff. Some languages force you to deal with the changes (i.e. rust) or you could add linter rules for other languages. But you're more likely to catch a problem before it arises, rather than deal with ever increasing bool checks. Makes reasoning about states a lot easier.
Tangential: I was recently wishing that bitwise flags had better support in Postgres. For now, bools are just easier to work with
This is all well and good until you need to represent something that happened on Jan 1 1970 00:00 UTC.
32-bit UNIX timestamps are often signed so you can actually go before that, but most UNIX timestamps are 64-bit now, which can represent quite a larger range. And SQL datetime types might have a totally different range.
Not that it really matters; deleted_at times for your database records will rarely predate the existence of said database.
It's not about the scale, it's that `if (0)` will evaluate to `false` in many languages.
In addition to the sibling comment, which is exactly right (you should be using a nullable column here, if you're using SQL, for multiple reasons) I reckon this a design issue in the programming language that is largely unrelated to how you model the database. It's pretty easy to run into bugs especially if you compound it with other quirky APIs, like strcmp: `if (strcmp(a, b)) // forgot to do == 0; accidentally backwards!` -- So really, you just don't have much of a choice other than to tread carefully and enable compiler warnings. Personally in this case I'd use an Optional wrapper around the underlying timestamp type anyways, if I needed to be able to represent the UNIX timestamp at 0 as well as an empty state.
So you're still fine as long as you're not tracking things that were deleted on that exact instant 50 years ago, a safe assumption, for instance, for things that happened in your application that has only existed for less time than that. That said, I haven't ever seen this implemented in a way that casts. It's implemented with scopes in the ORM, usually.
For example, a default scope which generates a query with "WHERE deleted_at IS NULL".
1-1-1970 is fine.
Leave it null for non-deleted items.
« Anytime you store a Boolean, a kitten dies. » Nobody has ever said that, but nobody wants any kitten to die, so nobody has ever challenged me anytime I use that statement.
This one little change alone can bring such huge benefits later.
A little while back, I had a conversation with a colleague about sorting entries by "updated at" in the user interface, and to my surprise this was not added by the backend team.
Many of these "we are going to need it"s come from experience. For example in the context of data structures (DS), I have made many "mistakes" that I do correctly a second time. These mistakes made writing algorithms for the DS harder, or made the DS have bad performance.
Sadly, it's hard to transfer this underlying breadth of knowledge and intuition for making good tradeoffs. As such, a one-off tip like this is limited in its usefulness.
Database schemas being perfect out of the gate was replaced by reliable migrations.
If it's not data that's essential to serving the current functionality, just add a column later. `updated_at` doesn't have to be accurate for your entire dataset; just set it to `NOW()` when you run the migration.
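For example, in PostgreSQL (with a hypothetical foo table), adding the column with a default backfills existing rows with the migration time:

    ALTER TABLE foo
        ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();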
Sure, migrations are bearable (especially ones that only add columns).
But for the example of the "updated_at" column, or "soft delete" functionality, you only find out you need it because the operations team suddenly discovered they needed that functionality on existing production rows because something weird happened.
In C#-land, we just have it as a standard that ~every table inherits from `ITrackable`, and we wrote a little EF plugin to automatically update the appropriate columns.
public interface ITrackable
{
    DateTime CreatedOn { get; set; }
    DateTime ModifiedOn { get; set; }
}
Saves so much time and hassle.
“Reliable migrations” almost seems like an oxymoron. Migrations are complicated, difficult and error prone. I think there’s a good takeaway here around good initial schema design practices. The less you have to morph your schema overtime, the less of those risky migrations need to run.
My experience over the last decade has been different.
Use a popular framework. Run it against your test database. Always keep backups in case something unforseen happens.
Something especially trivial like adding additional columns is a solved problem.
My experience has not been so smooth. Migrations are reasonable, but they're not free and "always keeps backups" sounds like you'd tolerate downtime more than I would.
Even in the best case (e.g. basic column addition), the migration itself can be "noisy neighbors" for other queries. It can cause pressure on downstream systems consuming CDC (and maybe some of those run queries too, and now your load is even higher).
Still depends on what the data represent: you could get yourself in a storm of phone calls from customers if after your latest release there's now a weird note saying their saved document was last updated today.
"HOW DARE YOU MODIFY MY DOCUMENTS WITHOUT MY..."
Somewhat related, but I suggest having both the record's updated_at and some kind of "user-editing updated_at". I've encountered issues where some data migration ends up touching records and bumping the updated_at, which shocks users: they see the UI reshuffle and think they've been hacked when records show as updated at a time they didn't update them.
I mean this is what audit logs are for I'd say: generally speaking you want to know what was changed, by who and why.
So really you probably just want a reference to the tip of the audit log chain.
The perils of UI design wagging the horse.
I like the heuristics described here. However if these things aren't making it into a product spec where appropriate, then I smell some dysfunction that goes beyond what's being stored by default.
Product need (expressed as spec, design, etc) should highlight the failure cases where we would expect fields like these to be surfaced.
I'd hope that any given business shouldn't need someone with production database access on hand to explain why/when/how 'thing' was deleted. Really we'd want the user (be it 'boss' or someone else) to be able to access that information in a controlled manner.
"What information do we need when something goes wrong?". Ask it. Drill it. Ask it again.
That said, if you can't get those things, this seems a fine way to be pragmatic.
"Wagging the horse" is a great turn of phrase, better than "putting the cart before the dog."
Yes! Why something happened is incredibly important. Gitlab made this mistake hard. We have a medium sized instance with some complex CI pipelines and often they'll just get cancelled and it doesn't say why or even who by. And anyone can do it! The only option is to ask the entire company "did anyone cancel this?"
Event-sourcing solves this. And with how cheap storage is, it should be more prevalent in the industry. IMO the biggest thing holding it back is that there isn't a framework that's plug-and-play (say like Next.js is to React) that provides people with that ability.
I've been working on one in Typescript (with eventual re-writes in other langs. like Rust and Go), but it's difficult even coming up with conventions.
Event sourcing is an expensive solution and I don't mean from a storage perspective — it burns engineering cognitive horsepower quickly on things that don't matter. Do it if you're in finance or whatever. Having been burned by my own "let's event source" impulse on data change tracking systems, I now prefer less sophisticated solutions. Figuring out how to deal with slow projections, watching a projection rebuild go from minutes to hours to a few days as a system I expected to handle a few events/minute grew to 20 events/second. Fancy caches can't save you if you want to use that vaunted ability to reconstruct from scratch. Event schema evolution also presents difficult tradeoffs: when old events stop having meaning or evolve in meaning, you either end up adding new event subtypes and variants, leaving old cruft to accumulate, or you do migrations and edit history on really large tables.
I'd counsel anyone considering event sourcing to use more "low power" solutions like audit logs or soft deletes (if really necessary) first if possible.
Appreciate your perspective, and it makes me wish there was some kind of online 'engineers learning from their mistakes' forum (rare to see "I burned myself"). To hear hard won knowledge distilled like this is a nice reminder to spend ones complexity budget wisely.
Care to share what you've been working on?
I agree with this as written, as think it's important to have some degree of forethought when building out the DB to plan for future growth and needs.
That said, the monkey's paw of this would be someone reading it and deciding they should capture and save all possible user data, "just in case", which becomes a liability.
Counter point: "Soft Deletion Probably Isn't Worth It" https://brandur.org/soft-deletion
But auditing probably is.
Just curious, how do people feel about this general style of soft deletes currently? Do people still use these in production, or prefer to just delete fully, or alternatively move deleted rows to a separate table / schema?
I find the complexity still awkward enough that it makes me wonder if deleted_at is worth it. Maybe there are better patterns out there to make this cleaner, like triggers to prevent deletion, or something else?
As for the article, I couldn't agree more on having timestamps / user ids on all actions. I'd even suggest updated_by to add to the list.
In our product, we have different strategies depending on the requirements. Sometimes, we just delete. Sometimes, we do soft delete with timestamps. Sometimes, we have a history table with or without versioned entities. Sometimes, we have versions in the table. Sometimes, we have an audit log. Sometimes, we use event sourcing (although everyone in the team hates it ;-)
Financial world: records have a "close" or "expire" date which is then purged after some period of time. A deletion doesn't just happen, the record is updated to be "closed" or "expired" and some time after that it's deleted.
Something like a loan could live in a production environment for well over a year after closing, while an internal note may last just a month.
I think soft deletes using timestamptz are a good thing.
Deleting rows directly could mean you're breaking references. For example, say you have a product that the seller wants to delete. Well, what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
Your backend can selectively query for products, filter out deleted_at for any customer facing queries, but show all products when looking at purchase history.
There are times when deleting rows makes sense, but that's usually because you have a write-heavy table that needs clearing. Yes, soft-deletes requires being careful with WHERE statements filtering out deleted rows, but that's a feature not a bug.
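One way to keep that WHERE clause from leaking everywhere is a view for the customer-facing side; a sketch with hypothetical tables:

    -- customer-facing queries go through a view that hides soft-deleted products
    CREATE VIEW active_products AS
        SELECT * FROM products WHERE deleted_at IS NULL;

    -- purchase history still joins against the full table, deleted or not
    SELECT o.id, p.name
    FROM orders o
    JOIN products p ON p.id = o.product_id
    WHERE o.customer_id = 42;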
> what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
You might still want to show those customers their purchase history, including what they bought 25 years ago. For example, my ISP no longer has the 10 Mb/s fiber optic product I bought in 2000, because it was superseded by 100 Mb/s products and then by 1 Gb/s ones. It's also not my ISP anymore, but I use it for the SIM in my phone. That, too, accumulated a number of product changes over the years.
And think about the inventory of eshops with a zillion products and the archive of past orders. Maybe they keep the last few years, maybe everything until the db gets too large.
There can be legal requirements to retain data for a specified time for law enforcement and audits, while at the same time other legal requirements have you requiring to delete data upon customer request.
Doing this with pure 'hard' deletes is not possible, unless you maintain 2 different tables, one of which would still have the soft delete, explicit or implicit. You could argue the full db log would contain the data for the former requirement, but while academically correct, this does not fly in practice.
> Maybe there are better patterns out there to make this cleaner
SQL:2011 temporal tables are worth a look.
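MariaDB and SQL Server, for example, implement system-versioned tables; a MariaDB-flavored sketch:

    CREATE TABLE product (
        id   BIGINT PRIMARY KEY,
        name VARCHAR(255)
    ) WITH SYSTEM VERSIONING;

    -- updated and deleted rows remain queryable as history
    SELECT * FROM product
    FOR SYSTEM_TIME AS OF TIMESTAMP'2024-01-01 00:00:00';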
If you have a good audit log, it really doesn't matter. You can always restore it if need be.
If you have no audit log (or a bad one), like lots of apps, then you have to care a lot.
Personally, I just implement a good audit log and then I just delete with impunity. Worst case scenario, someone(maybe even me) made a mistake and I have to run undo_log_audit() with the id of the audit log entry I want to put back. Nearly zero hassle.
The upside, when something goes wrong, I can tell you who, what and when. I usually have to infer the why, or go ask a human, but it's not usually even difficult to do that.
Can you share more about what makes a good audit log? My company doesn’t currently have one and I’m a little lost on where to start.
Should this be at the application code level, or the ORM, or the database itself?
That depends on where the data you need to keep track of lives and on your architecture. The important thing is, you want your audit log to be able to tell you who made a change, what exactly changed, when it happened, and ideally why, for any change in the system.
Also, when storing the audit log, take into account that you might need to undo things that happened (not just deletes). For instance, maybe some process went haywire and inserted 100k records it wasn't supposed to. With a good audit log, you should be able to run something like undo_log_audit(rec1, rec100k) and it will do the right thing. I'm not saying that code needs to exist day 1, but you should take the ability to do that into account when designing it.
Also you need to take into account your regulatory environment. Sometimes it's very, very important that your audit logs are write-once and read-only afterwards, and are stored off machine, etc. Other times it's just for internal use and you can be a little more lax about the data integrity of your audit logs.
Our app is heavily database centric. We push into the DB the current unix user, the current PID of the process connecting to the DB, etc(also every user has their own login to the DB so it handles our authentication too). This means our database(Postgres) does all of the audit logging for us. There are plenty of Postgres audit logging extensions. We run 2 of them. One that is trigger based creating entries in a log_audit table(which the undo_log_audit() code uses along with most reporting use cases) and a second one that writes out to syslog(so we can move logs off machine and keep them read only). We are in a regulated industry that gets audited regularly however. Not everyone needs the same level of audit logging.
You need to figure out how you can answer the above questions given your architecture. Normally the "Why" question is hard to answer without talking with a human, but unless you have the who, what and when, it's nearly impossible to even get to the Why part of the question.
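Not any particular extension, just a rough Postgres-flavored sketch of what the trigger-based approach can look like (table and trigger names are made up):

    CREATE TABLE log_audit (
        id          bigserial   PRIMARY KEY,
        table_name  text        NOT NULL,
        action      text        NOT NULL,   -- INSERT / UPDATE / DELETE
        old_row     jsonb,
        new_row     jsonb,
        db_user     text        NOT NULL DEFAULT current_user,
        occurred_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION log_audit_trigger() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO log_audit (table_name, action, old_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
            RETURN OLD;
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO log_audit (table_name, action, old_row, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
            RETURN NEW;
        ELSE  -- INSERT
            INSERT INTO log_audit (table_name, action, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- attach it to whichever tables need auditing (PostgreSQL 11+ syntax)
    CREATE TRIGGER orders_audit
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION log_audit_trigger();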
It is Postgres specific, but I’ve gotten a lot of mileage out of the advice in this article:
https://supabase.com/blog/postgres-audit
Probably application level in most cases as those other levels probably don’t have all the information you want to include.
Always soft-deletion first. Then it gets exported to a separate archive, and only then, after some time, may it be fully deleted from the initial base.
As an acronym, it's easy to be misremembered as "You ARENT gonna read it" (based on the popularity of yagni) - and have the opposite advice spread..
How about: IIPTWNIRTIoSoDIQMBSNRTL
(It Is Probable That While Not Immediately Required The Implementation of Storage of Data In Question May Be Simpler Now Rather Than Later)
I've gone ahead and included additional detail in the acronym in the event that the clarity is required later, as this would be difficult to retrofit into a shorter, more-established acronym.
Why do I feel like that should be a cheat code in a FPS game?
YRIS: you'll read it someday
> But I've never heard someone complain about a table having too many timestamps.
I do. Each one is 8 bytes. At the billions of rows scale, that adds up. Disk is cheap, but not free; more importantly, memory is not cheap at all.
Five years ago everybody would laugh about "soft deletes" or "marked as deleted". Whoever thought this was a good idea from a data protection perspective? You're also lying to your users' faces with such behavior. Shame.
Except almost every database (and most storage devices nowadays) works this way - mark an entry as deleted, then batch delete a lot of entries during garbage collection. It's fundamentally impossible to efficiently erase a record from the middle of a file, except maybe by using an encryption tree, which would still be fairly inefficient.
Actually erasing data is quite hard. Soft deletes doesn't add any new lies, they just move the lies to the upper layer.
Come on. With a manual "marked as deleted" it stays like this forever; it is not deleted and never will be, and the "deleted" data also lands in database backups, is still query-able, and so on. I do not care if the deleted data stays for a while on disk or in memory, as long as it will eventually be deleted by the garbage collector and isn't query-able anymore.
It's a terrible post. What it suggests is to turn your brain off and follow an overly generalised principle. I guess when somebody invents yet another acronym, it is a red flag.
Data has its own life cycles in every area it passes through. And it's part of requirements gathering to find those cycles: the dependent systems, the teams, and the questions you need to answer. Mindlessly adding fields won't save you in every situation.
Bonus point: when you start collecting questions while designing your service, you'll discover how mature your colleagues' thinking is.
For `updated_at` and `deleted_at` making them nullable and null until touched is incredibly useful.
Answering queries like how many of these were never updated? Or how many of these were never cancelled?
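For example (hypothetical table):

    -- how many rows were never updated / never cancelled
    SELECT count(*) FROM orders WHERE updated_at IS NULL;
    SELECT count(*) FROM orders WHERE cancelled_at IS NULL;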
Agree, although the acronym in the article could be interpreted to mean “you are going to read it, so index it appropriately”, which is sort of bad advice and can lead to overindexing. There is probably something better for “add appropriate and conventional metadata” (the author suggests updated_at, created_at etc)
Not a huge fan of the example of soft delete; I think hard deletes with archive tables (no foreign key enforcement) is a much, much better pattern. Takes away from the main point of the article a bit, but glad the author hinted at deleted_at only being used for soft deletes.
I don't like general advice like this, because it's too general. For many, it's probably good advice. For others, not so much.
Anyone who has worked at a small company selling to large B2B SaaS can attest we get like 20 hits a day on a busy day. Most of that is done by one person in one company, who is probably also the only person from said company you've ever talked to.
From that lens, this is all overkill. It's not bad advice, it's just that it will get quoted for scenarios it doesn't apply. Which also apply to K8S, or microservices at large even, and most 'do as I say' tech blogs.
>I don't like general advice like this, because it's too general. For many, it's probably good advice. For others, not so much.
That's true for any other good advice you may have heard of.
Author is very kind! In practice, many times I saw only the CR/CRU of CRUD getting implemented.
For example: as a company aspires to launch its product, one of the first features implemented in any system is to add a new user. But when the day comes when a customer leaves, suddenly you discover no one implemented off-boarding and cleanup of any sort.
Shipped and supported enough startup apps to learn this the hard way: users will delete things they shouldn’t, and you will be asked to explain or undo it. Soft deletes and basic metadata (created_at, deleted_by, etc.) have saved me multiple times — not for some future feature, just for basic operational sanity.
Why can't databases just remember stuff we delete, like a trash can?
Assuming you're serious, there are two main reasons. You want to regain storage space after you delete things, and sometimes you want to actually delete things (e.g. to be in compliance with regulations).
Related from a few years ago: PAGNIs, for Probably Are Gonna Need Its - https://simonwillison.net/2021/Jul/1/pagnis/
OT As the great Alone ("Last Psychiatrist") said, "if you read it, it's for you" - IYRIIFY
Literally just have a good audit log and then you get all of this for free and more.
curious that both YAGNI and YAGRI arguments could realistically be made for the same fields. guess it boils down to whether someone’s YAGRI is stronger than their colleague’s YAGNI ( :
How do you distinguish from "you aren't gonna read it"? The acronym is poorly designed.
This is good advice, except for deleted_at. Soft deletion is rarely smart. Deleted things just accumulate, and every time you query that table is a new opportunity to forget to omit deleted things. Query performance suffers a lot. It's just needless complexity.
Instead, just for the tables where you want to support soft delete, copy the data somewhere else. Make a table like `deleteds (tablename text not null, data jsonb not null default '{}')` that you can stuff a serialized copy of the rows you delete from other tables (but just the ones you think you want to support soft delete on).
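In Postgres, the delete-and-archive can even be a single statement; a sketch, assuming a hypothetical products table:

    WITH gone AS (
        DELETE FROM products WHERE id = 123 RETURNING *
    )
    INSERT INTO deleteds (tablename, data)
    SELECT 'products', to_jsonb(gone) FROM gone;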
The theory here is: You don't actually want soft delete, you are just being paranoid and you will never go undelete anything. If you actually do want to undelete stuff, you'll end up building a whole feature around it to expose that to the user anyway so that is when you need to actually think through building the feature. In the meantime you can sleep at night, safe in the knowledge that the data you will never go look at anyway is safe in some table that doesn't cause increased runtime cost and development complexity.
YAGRI proponents organized themselves into a community to develop their...YAGRIculture.
I'll show myself out.
I have a different way of thinking about this: data loss. If you are throwing away data about who performed a delete it is a data loss situation. You should think about whether that’s OK. It probably isn’t.
To every point there is a counterpoint. So what?
Well, in the same vein that we discuss "points" and talk about their merits, it's useful to discuss and understand their counterpoints. I for one did not know about this, and thought it was insightful when building a product that hasn't fully been scoped out and is more greenfield.