May 28, 2026

Post-Mortem Action Items Don't Ship: Here's Why

Shyam Kapdi
Author

Divya Kathiriya
Contributor

Shailesh Davara
Reviewer

Why Your Post-Mortem Action Items Never Ship

Every time a system goes down, someone writes a post-mortem. The team sits together, walks through what happened, agrees on what needs to change, and writes down a list of action items. Then everyone goes back to work.

Six months later, the same thing breaks again.

We have seen this happen at companies with excellent engineers, mature processes, and leadership that genuinely cares about reliability. The post-mortem is not the problem. The action item list is not the problem either. The problem is what happens, or doesn’t happen, between that document being written and production actually changing.

This is not a people problem. It is a system problem. And until you fix the system, incidents will keep repeating.

1. The Backlog Is Where Post-Mortem Action Items Go to Die

When an incident ends, there is a short window where everyone feels urgency. The pain is fresh, the customer complaints are still coming in, and the team wants to fix things. During that window, action items get written. Good ones, often.

Then that window closes.

The action items go into a backlog, the same backlog that holds feature requests, tech debt, integrations, and a dozen other things. The product roadmap does not move. Sprint priorities don’t change. The reliability fixes sit there, week after week, never quite making it to the top.

This is not because teams are lazy or don’t care. It is because the system sends a clear signal: shipping features is what gets noticed. Reliability fixes are invisible when they work, and only visible when they don’t.

The backlog is where post-mortem action items go to age quietly until the next incident.

We have reviewed hundreds of post-mortems across our careers. The pattern is almost always the same. Good diagnosis. Clear action items. No follow-through. Not because the team forgot, but because there was no path from “we agreed to fix this” to “this is actually fixed.”

2. Restoring Service and Fixing the Problem Are Not the Same Thing

There is a difference between stopping the bleeding and fixing the wound.

When an incident happens, the immediate goal is to restore service. Teams are good at this. They roll back a deploy, restart a service, or reroute traffic. The system comes back up. Customers stop complaining. Engineers go to sleep.

That is fixing the incident.

Fixing the system means asking: what made this failure possible in the first place? Was it a missing alert? A process that depends on one person knowing something? A default configuration that should have been changed two years ago?

Most organizations stop at fixing the incident. Fixing the system requires changing something that has worked “well enough” until now… See how we eliminated a single point of failure and automated disaster recovery to stop recurring incidents for a high-volume platform.

You cannot fix a recurring incident by getting better at responding to it. You have to make it stop happening.

The hard truth is that most teams confuse incident response improvements (better runbooks, faster escalation, on-call rotations) with system improvements. Both matter, but only one of them prevents the next incident from happening.

3. How to Classify Post-Mortem Outputs as Real Work

The reason post-mortem action items don’t ship is that they are treated as a separate category of work, something alongside the “real” work, not part of it.

The fix is straightforward, though not easy: reliability work needs to be classified, tracked, and prioritized the same way product work is.

That means when a post-mortem produces an action item, someone has to make a decision:

  • Is this a patch — a one-time fix for a specific failure?

  • Is this a platform change — something that needs to be built or changed in the underlying infrastructure?

  • Is this a process gap — something that requires a team behavior to change?

Each of these needs to go somewhere specific. Not into a general reliability backlog. Into a sprint, with an owner, a deadline, and a review date.
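The classification above can be sketched as a small data model. This is only an illustration under stated assumptions, not a prescribed tool: the names `ActionKind`, `ActionItem`, `missing_fields`, and `is_schedulable` are hypothetical, and the sample item is made up. The one rule it encodes comes from the text: an item is ready for a sprint only when it has an owner, a deadline, and a review date.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class ActionKind(Enum):
    """The three categories a post-mortem output can fall into."""
    PATCH = "patch"        # one-time fix for a specific failure
    PLATFORM = "platform"  # change to the underlying infrastructure
    PROCESS = "process"    # change to a team behavior


@dataclass
class ActionItem:
    """Hypothetical record for one post-mortem action item."""
    title: str
    kind: ActionKind
    owner: Optional[str] = None
    due: Optional[date] = None
    review: Optional[date] = None

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still unset."""
        required = {"owner": self.owner, "due": self.due, "review": self.review}
        return [name for name, value in required.items() if value is None]

    def is_schedulable(self) -> bool:
        """An item may enter a sprint only when nothing is missing."""
        return not self.missing_fields()


# Made-up example: an item with an owner but no dates yet.
item = ActionItem("Add disk-usage alert", ActionKind.PATCH, owner="dana")
print(item.missing_fields())   # ['due', 'review']
print(item.is_schedulable())   # False
```

Keeping the category as a required field forces the classification decision at the moment the item is written down, rather than leaving it to whoever grooms the backlog later.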

If your engineering organization cannot answer “what happened to the action items from our last three post-mortems,” that is the gap you need to close before anything else.

Track post-mortem action items the same way you track product delivery. If it doesn’t have an owner and a due date, it won’t ship.

4. Build a Feedback Loop That Actually Works

The goal of a post-mortem is not to document what went wrong. The goal is to change something so it doesn’t go wrong the same way again.

This sounds obvious. But most organizations run post-mortems as documentation exercises, not change exercises.

A feedback loop that actually works looks like this:

  • Incident happens. Service is restored.

  • Post-mortem identifies root cause and produces specific, actionable outputs, not vague recommendations.

  • Each output is classified and assigned to a team with a clear deadline.

  • One person is responsible for tracking whether these are closed before the next incident review.

  • Leadership reviews open post-mortem actions in the same meeting where they review delivery progress.

That last point matters more than most people realize. If reliability fixes are not in the same room where shipping decisions are made, they will always lose the prioritization argument.

We made it a point to ask about open post-mortem actions in quarterly reviews. Not to pressure teams, but to signal that this work is visible at the top. Engineers notice what leadership pays attention to. If no one above the engineering manager ever asks about reliability work, the message, whether intended or not, is that it is optional.

A feedback loop only works if there is a closed cycle. Open action items are a broken loop.
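One way to bring open actions into a leadership review is a per-post-mortem summary of what is still open and what is overdue. The following is a rough sketch, assuming action items can be exported as simple records. The `Action` fields, the `open_actions_report` function, the post-mortem identifiers, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Optional


class Action(NamedTuple):
    """Hypothetical exported record for one action item."""
    postmortem: str                   # identifier of the originating post-mortem
    title: str
    owner: str
    due: date
    closed_on: Optional[date] = None  # None means the item is still open


def open_actions_report(actions: List[Action], today: date) -> Dict[str, dict]:
    """Count total, open, and overdue actions for each post-mortem."""
    report: Dict[str, dict] = defaultdict(
        lambda: {"total": 0, "open": 0, "overdue": []}
    )
    for action in actions:
        row = report[action.postmortem]
        row["total"] += 1
        if action.closed_on is None:
            row["open"] += 1
            if action.due < today:
                row["overdue"].append(f"{action.title} ({action.owner})")
    return dict(report)


# Made-up sample data for illustration only.
actions = [
    Action("PM-1", "Add replica lag alert", "asha", date(2026, 3, 1), date(2026, 2, 20)),
    Action("PM-1", "Automate failover", "ravi", date(2026, 4, 1)),
    Action("PM-2", "Document rollback steps", "mei", date(2026, 7, 1)),
]
report = open_actions_report(actions, today=date(2026, 5, 28))
for postmortem, row in report.items():
    print(postmortem, row)
```

In this sketch a loop counts as closed only when a post-mortem's `open` count reaches zero. Any entry in `overdue` is an item to raise in the same meeting where delivery progress is reviewed.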

5. A Recurring Incident Is an Ownership Problem, Not a Technical One

When the same incident happens twice, it is a reliability problem.

When it happens three times, it is an ownership problem.

A recurring incident means that no one with authority decided that fixing it was their job. It may have been discussed. It may even have been assigned. But at some point, the ball was dropped, and no one picked it up.

In most platform engineering organizations, ownership is clear for features and unclear for reliability. Teams own services, but they do not own the conditions that make those services fail. When something breaks at the seam between two teams, it often takes a second or third incident before anyone accepts ownership.

The way to fix this is not to restructure teams every time there is an ambiguous incident. It is to establish clear ownership at the platform level: who owns the alert, who owns the recovery, and who owns the underlying fix.

If you cannot name the person who owns preventing a specific type of failure, you have an ownership gap. That gap will produce incidents until it is filled.

Recurring incidents do not mean your engineers are failing. They mean your system does not have a clear owner for prevention, only for response.

From where I sit, the most important question after any recurring incident is not “what failed,” it is “who is responsible for making sure this category of failure doesn’t happen again?” If the answer is “the team” or “everyone” or “we’re looking into it,” you already know the next incident is coming.
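The ownership question can be made checkable with a simple map from failure category to a named person for each of the three roles: alert, recovery, and underlying fix. This is a minimal sketch under assumptions. The `ownership_gaps` function, the set of non-answers, and the sample categories and names are hypothetical. It treats vague answers such as "the team" or "everyone" as unowned.

```python
from typing import Dict, List, Tuple

# The three ownership roles named in the text.
ROLES = ("alert", "recovery", "fix")

# Assumption: answers that do not name a specific person count as no owner.
NON_OWNERS = {"", "the team", "everyone"}


def ownership_gaps(owners: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (failure_category, role) pairs that lack a named person."""
    gaps = []
    for category, roles in owners.items():
        for role in ROLES:
            name = roles.get(role, "").strip().lower()
            if name in NON_OWNERS:
                gaps.append((category, role))
    return gaps


# Made-up sample map for illustration only.
owners = {
    "database failover": {"alert": "Asha", "recovery": "Ravi", "fix": "the team"},
    "certificate expiry": {"alert": "Mei"},
}
for category, role in ownership_gaps(owners):
    print(f"No named owner for {role!r} of {category!r}")
```

Each pair the check returns is an ownership gap in the sense described above: a category of failure that has someone to respond but no one named to prevent it.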

Conclusion

Post-mortems work. The problem is everything that happens after.

Action items need owners. They need deadlines. They need to be treated as production-priority work, not as aspirational notes in a document that no one reads.

The platform does not change by running better post-mortems. It changes when the outputs of those post-mortems are given the same weight as feature delivery.

Until that happens, the incidents will repeat, not because your team isn’t learning, but because the system has no path from learning to change.

That is a leadership problem, not an engineering one. If your team is struggling to bridge the gap between incident response and platform improvement, let’s talk. Our team can help you design automated, self-healing systems that keep your developers shipping.

Written by

Shyam Kapdi

Shyam Kapdi is a Business Development Executive at Improwised Technologies, focused on driving growth through client engagement and platform engineering solutions. He helps align business needs with open-source, scalable technologies.
