Root Cause Analyses

After an outage, security incident, or other service disruption, a root cause analysis (RCA) SHOULD be performed.

The RCA SHOULD be done as soon as possible after the incident, as a collaboration between engineers, managers, and affected parties.

RCAs are blameless. We're all human and make mistakes. At NYPL we want to encourage an engineering culture of experimentation, learning, continuous improvement, and making room for mistakes.

RCAs SHOULD be saved in a shared directory where they are accessible by other team members.

RCAs MUST be shared with the relevant Product Manager for approval and, once approved within the team, sent to Garvita Kapur for review.

More learning about RCAs

Sample RCA Template

Here is the information that SHOULD be included in an RCA. The complexity and length of each answer MAY be determined by the severity of the incident. Each RCA MUST include:

  • A timeline of the event
  • An analysis of the causes of the event
  • A proposed solution/resolution that addresses the issue in a way that should prevent recurrence
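
The three required items above could be laid out as a short Markdown skeleton. This is only a minimal sketch: the heading names, the summary line, and the placeholder text are assumptions for illustration, not the official NYPL template.

```markdown
# RCA: <short incident title>

<!-- Hypothetical skeleton; section names are illustrative, not prescribed. -->

## Summary
One or two sentences on what happened and who was affected.

## Timeline of the event
- <time>: <what happened or what was observed>
- <time>: <action taken>
- <time>: <service restored>

## Analysis of causes
What caused the event, described in terms of systems and processes
rather than individuals (RCAs are blameless).

## Proposed solution/resolution
What will change to address the issue and prevent recurrence.
```

How much detail goes under each heading can scale with the severity of the incident.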

Sample post-mortem template