Show
Page 2So you want to be an Incident Commander (IC)? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern!) Purpose#If you could boil down the purpose of an Incident Commander to one sentence, it would be:
The Incident Commander is the decision maker during a major incident; Delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. They become the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. Your job as an Incident Commander is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. You should not be performing any actions or remediations, checking graphs, or investigating logs. Those tasks should be delegated. An IC should also be considering next steps and backup plans at every opportunity, in an effort to avoid getting stuck without any clear options to proceed and to keep things moving towards resolution. Prerequisites#Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!
Responsibilities#Read up on our Different Roles for Incidents to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with. Training Process#The process is fairly loose for now. Here's a list of things you can do to train though,
Graduation#What's the difference between an IC in training and an IC? (This isn't a set up to a joke). Simple, an IC puts themselves on the schedules. Also, don't forget to announce yourself in the IC Slack channel, and get yourself added to our IC mailing list. Handling Incidents#Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one. The language used in each step is discussed in more detail in the "Procedures and Lingo" section below. Size-Up#Sizing-up involves getting an idea of what's going on, and how much impact it's having. This is an information gathering step that will allow you to make good decisions later.
Stabilize#Next step is to stabilize the incident. We need to determine what we can do to fix it, and then execute those actions.
While remediation steps are being carried out, it's important to provide status updates, not just to responders, but other stakeholders within the organization.
Verify#Once remediation actions have been performed, we need to verify that they have been successful or not, and proceed with a backup plan if not.
Deputy#The Deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander’s position if required. Communication Responsibilities#Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands. When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks. After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary. Incident Call Procedures and Lingo#The Steps for Incident Commander provide a detailed description of what you should be doing during an incident. Additionally, aside from following the usual incident call etiquette, there a few extra etiquette guidelines you should follow as IC:
Here are some examples of phrases and patterns you should use during incident calls. Start of Call Announcement#At the start of any major incident call, the Incident Commander should announce the following,
This establishes to everyone on the call what your name is, and that you are now the commander. Identify yourself by name and state that you are the "Incident Commander" and not "IC", as newcomers may not be familiar with the terminology yet. The word "commander" makes it very clear that you're in charge. Start of Incident, IC Not Present#If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following,
If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure) Checking if SME's are Present#During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your Deputy can page one if no one answers.
Assigning Tasks#When you need to give out an assignment or task, you should follow these three steps,
Can someone... Never say "Can someone..." as this leads to the bystander effect. Tasks should always be assigned directly to an individual, and never just thrown out with the hope that someone will pick it up.
Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings. Gaining Consensus (Polling During a Decision)#If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like "I knew that would happen". An IC will use very specific language to be sure that doesn't happen, and to gain implicit consensus of everyone on the response.
If you were to ask "Does everyone agree?", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter. It also means that the information you care about the most (objections to proceeding) are heard loud and clearly. It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page.
Reducing Scope#Once you've identified the cause of an incident, you can take some time to reduce the scope of your call. For example, if you've identified that a bad deploy is the cause, there's no need to keep your network engineering responder on the call. Responders will usually appreciate not having to stick around for something that doesn't involve them, especially when it's 3am. Generally, it's best to list out the people you want to remain on the call (rather than listing those that can leave), as this not only re-affirms who is required, but makes sure you won't forget about anyone who can leave.
There's no need to forcibly remove anyone from the call, leave the choice open. Sometimes responders prefer to remain to see how the incident eventually resolves, since they're already awake anyway. When handling complex incidents, it will sometimes be necessary to spin off a sub-team (or multiple sub-teams) to investigate specific issues in more detail before reporting back. This is to ensure that you can maintain an effective span of control. To do this, you should assign a team leader, give them a specific task (time-boxed in the usual way), and re-affirm that they are your primary contact and that all communication from their team should come via the leader. Use our pre-defined team names of Alpha, Bravo, and Charlie to avoid confusion when creating the teams.
You do not need to prescribe who their team consists of. Either they will have a pre-existing team structure they can utilize, or they should take the initiative to assemble a team on their own. You should pick your team leaders accordingly. Transfer of Command#Transfer of command, involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place,
Never feel like you are not doing your job properly by handing over. Handovers are encouraged. In order to handover, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. Then announce on the call,
The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander. Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command. End of Call Sign-Off#At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where follow-up discussion can take place. It's also customary to thank everyone.
Handling Problems#Things don't always go smoothly on incident response calls, so as an Incident Commander you need to be prepared for instances where the conversation gets derailed, either intentionally or unintentionally. Here are some procedures and lingo you can follow when things get disruptive, in order to get things back on track. Maintaining Order#Often times on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often times you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long.
Getting Straight Answers#You may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This can often times be because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed.
Executive Swoop - Overriding the Incident Commander#
This is an extreme example, but illustrates the concept of "Executive Swoop", whereby someone who would be senior to you during peacetime comes on the call and starts overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. This is rare, but can cripple the response process if it happens. There is a simple question you can ask as an Incident Commander to get things back on track, "Do you wish to take command?",
This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call. Executive Swoop - Anti-Motivation#
It's rare for an executive to maliciously derail an incident response call, usually it is done with the best of intentions. However, these good intentions can still derail your response process and demotivate responders. As an Incident Commander you will need to recognize and respond to these situations. In the case above, it seems motivational, however it assumes responders aren't already working as hard as possible to solve the problem, and adds no value to the response process. You can respond to this by reminding the commenter that these things should be kept until after the incident is over.
The most common case of executive swoop is a request for more information. Unfortunately, when in the middle of an incident, you typically cannot spare the resources to gather such information. As an Incident Commander you should remind the executive of this, and that the incident takes priority.
Note that this isn't phrased as a question, you've already made the decision as Incident Commander, you're just informing the executive of that decision. Executive Swoop - Questioning Severity#
Our severity levels determine the scale of response we give to an incident. Conversations on what severity an incident is can very quickly consume the entire call and doesn't change the fact that there is an incident on-going. We do not discuss incident severity during an incident call, as we treat an incident as the highest severity we think it could be. We can downgrade the severity during the postmortem, however we cannot waste time litigating severities on an incident call. So simply remind folks of this in order to get things back on track:
Sometimes you will have a responder who does not follow instructions and/or is being actively disruptive to your response call. Perhaps this is being done intentionally, or it could even be unintentional (an un-muted microphone while in a loud environment, etc). In either case, you need to resolve the situation and get back to the incident at hand. State the fact that the individual is being disruptive, provide them a way to save face, but also state what will happen if they don't stop. No second chances, if they don't follow through, remove them from the call.
Examples From Pop Culture#PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work. Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note:
Another clip from Apollo 13. Things to note:
|