Chaos Engineering Game Day Plan and Report

Plan a chaos engineering game day, capture injected failures and observed outcomes, and turn resilience gaps into tracked action items. Use it to document what was tested, what broke, and what to fix next.

Built for: SaaS · Fintech · E-commerce · Healthcare Tech · DevOps

Overview

This template is for planning and reporting a chaos engineering game day. It gives you a place to define the objective, scope, participants, injected failures, expected behavior, observed outcomes, decisions, blockers, and the action items that come out of the exercise.

Use it when you want a controlled resilience test that is easy to review afterward. It works well for service-level experiments such as instance loss, dependency latency, queue buildup, failover, or partial network disruption. The structure helps the facilitator keep the session focused while giving service owners enough context to understand what was tested and why.

Do not use this template for an unplanned production incident, a broad architecture review, or a generic meeting note. It is most useful when there is a specific system under test and a clear plan for what failure will be injected. If the exercise has no defined success criteria, no rollback plan, or no owner for follow-up work, the report will be hard to act on.

The value of the template is in the handoff from experiment to remediation. By separating context, outcome, and action items, it makes resilience gaps visible and trackable instead of leaving them buried in chat logs or scattered notes.

Standards & compliance context

  • Use this template alongside your organization’s change-management and approval process when game days affect production or production-like systems.
  • If the exercise touches regulated workloads, document the environment, scope, and rollback plan so the test can be reviewed as a controlled activity.
  • Do not record secrets, credentials, or sensitive customer data in the notes; reference systems and masked identifiers instead.
  • If your organization requires incident-style evidence retention, keep the report in the approved system of record with access controls applied.
  • For healthcare, financial services, or other regulated industries, confirm that the injected failure and monitoring plan do not violate operational or data-handling policies.

General regulatory context, for orientation only. Verify current requirements with counsel or the relevant agency before relying on this template for compliance.

How to use this template

  1. Start by writing the game day objective, the service or environment in scope, and the success criteria so everyone knows what the exercise is meant to prove.
  2. List the facilitator, note-taker, service owner, observers, and approvers, and assign a clear owner for each injected failure and follow-up item.
  3. Document each planned injection with the trigger, expected system behavior, rollback plan, and any guardrails that must stay in place during the test.
  4. Run the session by recording agenda items, discussion, decisions, blockers, and observed outcomes as they happen rather than reconstructing them later.
  5. Close the report by converting every gap into an action item with an owner and due date, then schedule the next time the scenario should be retested.
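As a rough illustration of steps 1 to 3, the plan can be captured as structured data and checked for gaps before the session starts. This is a minimal sketch, not part of the template itself: the class names, field names, and the plan_gaps helper are all hypothetical, and a real plan would likely carry more fields (participants, approvers, guardrails).

```python
from dataclasses import dataclass, field


@dataclass
class Injection:
    """One planned failure injection (field names are illustrative)."""
    name: str
    trigger: str = ""
    expected_behavior: str = ""
    rollback_plan: str = ""
    owner: str = ""


@dataclass
class GameDayPlan:
    """Top-level plan: objective, scope, success criteria, injections."""
    objective: str = ""
    scope: str = ""
    success_criteria: list[str] = field(default_factory=list)
    injections: list[Injection] = field(default_factory=list)


def plan_gaps(plan: GameDayPlan) -> list[str]:
    """Return readable gaps; an empty list means nothing required is missing."""
    gaps = []
    if not plan.objective.strip():
        gaps.append("objective is missing")
    if not plan.scope.strip():
        gaps.append("scope is missing")
    if not plan.success_criteria:
        gaps.append("no success criteria defined")
    if not plan.injections:
        gaps.append("no injections planned")
    for inj in plan.injections:
        # Every injection needs these four fields filled in before the run.
        for attr in ("trigger", "expected_behavior", "rollback_plan", "owner"):
            if not getattr(inj, attr).strip():
                gaps.append(f"injection '{inj.name}': {attr} is missing")
    return gaps
```

A facilitator could run a check like this during plan review and hold the session until the list comes back empty.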

Best practices

  • Define one primary objective per game day so the team can tell whether the exercise actually answered the question you set out to test.
  • Write the expected outcome before the injection starts, or you will end up debating whether the system behaved correctly after the fact.
  • Assign a single owner to each injected failure and each action item so accountability is obvious when follow-up work begins.
  • Capture the exact time, trigger, and observed response for each failure so the report can be used as a reference during future exercises.
  • Record blockers separately from outcomes so unresolved issues do not get mistaken for successful resilience behavior.
  • Keep the scope narrow enough to isolate one dependency chain at a time, especially when testing failover or latency scenarios.
  • Include a rollback or stop condition for every injection so the facilitator can end the exercise safely if the impact exceeds the plan.
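Two of the practices above, stop conditions and exact-time capture, lend themselves to a small sketch. The example below assumes the stop condition is a simple upper limit on a named metric; the metric names, the StopCondition and Observation types, and the choice to treat a missing reading as a breach are illustrative assumptions rather than anything the template prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StopCondition:
    """Abort threshold for one injection (metric names are hypothetical)."""
    metric: str
    max_value: float


@dataclass
class Observation:
    """One timestamped entry for the report."""
    at: datetime
    trigger: str
    response: str


def breached(conditions: list[StopCondition], readings: dict[str, float]) -> list[str]:
    """Return metrics over their limit; a missing reading counts as a breach."""
    hits = []
    for cond in conditions:
        value = readings.get(cond.metric)
        # Fail safe: if monitoring cannot report the metric, treat it as exceeded.
        if value is None or value > cond.max_value:
            hits.append(cond.metric)
    return hits


def observe(log: list[Observation], trigger: str, response: str) -> Observation:
    """Append a UTC-timestamped observation so the report keeps exact times."""
    entry = Observation(datetime.now(timezone.utc), trigger, response)
    log.append(entry)
    return entry
```

Treating a missing reading as a breach is a deliberately conservative choice: if the monitoring that guards the experiment goes quiet, the facilitator is prompted to stop rather than continue blind.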

What this template typically catches

Teams running this template most often surface the following issues:

  • A dependency fails over more slowly than the team expected, causing user-facing latency before any alert fires.
  • Monitoring shows the symptom but not the root cause, making it hard to tell which injected failure triggered the degradation.
  • A service recovers, but manual steps are required because the runbook is incomplete or outdated.
  • The team discovers that ownership of a downstream dependency is unclear, which delays the response and the follow-up work.
  • Rollback works in staging but is too risky or too slow in the target environment.
  • The game day reveals that alerts are noisy, missing, or routed to the wrong on-call group.
  • An action item is identified but not assigned a due date, so the resilience gap persists after the exercise.
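The last issue in this list, action items left without an owner or due date, is easy to check mechanically when the report is closed. The sketch below is one hypothetical way to do it; the ActionItem type and the untracked helper are illustrative names, not part of the template.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A follow-up item from the game day report."""
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def untracked(items: list[ActionItem]) -> dict[str, list[str]]:
    """Map each incomplete action item title to the fields it still lacks."""
    problems = {}
    for item in items:
        missing = []
        if not item.owner:
            missing.append("owner")
        if item.due is None:
            missing.append("due date")
        if missing:
            problems[item.title] = missing
    return problems
```

An empty result means every gap has an owner and a date; anything else is a list of items to resolve before the report is filed.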

Common use cases

Platform SRE running a Kubernetes node-loss drill

A platform team uses the template to plan a node failure injection, capture pod rescheduling behavior, and record whether service health checks and alerts behaved as expected. The report becomes the source of truth for remediation work on autoscaling, readiness probes, and runbooks.

Fintech service owner testing database failover

A service owner documents a controlled primary-database failover to verify application retry logic, connection pooling behavior, and recovery time. The template keeps the discussion tied to observed outcomes and the action items needed before the next release.

E-commerce team simulating a third-party API outage

An engineering team uses the report to test how checkout behaves when a payment or shipping API becomes unavailable. The notes capture fallback behavior, customer-impact risk, and the follow-up work needed to improve degradation paths.

Healthcare tech group reviewing alerting and escalation

A reliability lead runs a game day to see whether paging, escalation, and incident communication work when a critical service degrades. The template helps separate operational context from outcome and keeps the next-time plan visible.

Frequently asked questions

What is this template used for?

This template documents a chaos engineering game day from planning through post-run reporting. It captures the objective, scope, roles, injected failures, observed outcomes, and the action items that follow. Use it when you need a repeatable record of what was tested and what resilience gaps were found.

Is this for a planned game day or an incident review?

It is for a planned resilience exercise, not a live incident postmortem. You can use it to record a controlled failure injection, the system response, and the follow-up work. If you are documenting an outage after the fact, a separate incident report template is usually a better fit.

Who should run and fill out the template?

The game day is usually run by an SRE, platform engineer, or reliability lead, with support from service owners and observers. The facilitator should capture the agenda, injected failures, decisions, blockers, and action items with owner and due date. A note-taker can help keep the report factual and complete.

How often should a chaos game day be scheduled?

Schedule game days on a cadence that matches your release risk and operational maturity, for example monthly or quarterly, with additional sessions after major architecture changes. The template works best when the same structure is reused each time so results are comparable. The right cadence is the one your teams can actually review and act on.

What should be included in the scope?

Scope should name the service, environment, dependencies, and the specific failure modes being tested. Keep it narrow enough that the team can observe cause and effect clearly. If the scope is too broad, it becomes hard to tell which injected failure caused which outcome.

How does this template help with compliance or audit needs?

It creates a clear record of what was intentionally tested, who approved it, what happened, and what follow-up work was assigned. That can support internal risk reviews and operational governance. It should not replace formal change management or approval workflows where those are required.

What are the most common mistakes when using a game day plan?

Common mistakes include vague objectives, missing rollback criteria, unclear ownership, and action items without due dates. Another frequent issue is recording only the failure injection and not the observed outcome or blocker. This template is designed to keep the plan and report tied together so the exercise produces usable follow-up work.

Can this be customized for different systems or teams?

Yes. You can adapt the injected failure section for Kubernetes, databases, queues, network dependencies, or third-party APIs, and you can add service-specific success criteria. The template is also easy to tailor for different team structures by changing roles, approvers, and follow-up owners.

How does it compare with ad-hoc notes in a doc or chat thread?

Ad-hoc notes are easy to start but usually lose the connection between the plan, the observed results, and the follow-up actions. This template gives you a consistent structure for agenda, discussion, decisions, and action items so the report is easier to review later. It also makes it simpler to compare one game day to the next.
