There are several ways that DevOps engineers can be notified of an incident:
- VictorOps notification escalation.
- Customer success notification from site usage.
- Customer success notification from a customer complaint.
- Developers or DevOps engineers noticing outliers in metrics.
Types of incidents
Minor Service Disruption
When you notice an incident that can be fixed in a couple of minutes, resolve it on your own and confirm that the fix was adequate by checking that the VictorOps alarm has stopped. If the cause was something uncommon, communicate it in the #DevOps channel.
Vendor (AWS) Incident
If our cloud hosting provider suffers an isolated incident affecting one of our instances, we should remain online thanks to the High Availability setups we've configured. If a single Availability Zone inside the Ireland Region goes down, we should also remain online. If the whole Ireland Region suffers a widespread outage, we will go offline until the upstream service is restored. In any case, you should:
- Confirm the issue really is the upstream provider.
- Monitor the status updates of the upstream provider and communicate internally so we can inform users.
- If service disruption takes longer than a couple of hours, start assessing the possibility of migrating the affected service to another Region.
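The expectations above can be summarized in a small decision sketch. It is illustrative only: the names `OutageScope` and `vendor_outage_plan` are hypothetical, and the 2-hour default is an assumed reading of "a couple of hours", not an agreed threshold.

```python
from enum import Enum


class OutageScope(Enum):
    """How much of the upstream provider is affected (hypothetical names)."""
    INSTANCE = "instance"
    AVAILABILITY_ZONE = "availability_zone"
    REGION = "region"


def vendor_outage_plan(scope, hours_elapsed, migrate_after_hours=2.0):
    """Return (expect_online, steps) for a confirmed vendor outage.

    migrate_after_hours is an assumption standing in for
    "a couple of hours"; adjust it to whatever the team agrees on.
    """
    # Only a Region-wide outage is expected to take us offline.
    expect_online = scope is not OutageScope.REGION
    steps = [
        "Confirm the issue really is the upstream provider.",
        "Monitor upstream status updates and communicate internally.",
    ]
    if hours_elapsed > migrate_after_hours:
        steps.append("Assess migrating the affected service to another Region.")
    return expect_online, steps
```

For example, `vendor_outage_plan(OutageScope.REGION, 3)` reports that we are expected to be offline and adds the migration assessment to the list of steps.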
Security Breach
If you notice a security breach of any kind, you should:
- Communicate the issue internally so it can be escalated and communicated to users. We might need to contact customers directly to inform them.
- Collect evidence that made you classify this as a security breach.
In case of affected instances:
- Turn them off and create snapshots for future investigation.
- Rotate any credential that might have been present in the instances.
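As a minimal sketch of the "turn off and snapshot" step, assuming the affected instances are EC2 instances with EBS volumes and that you have a boto3 EC2 client at hand: the function name `isolate_instance` and the `incident_ref` label are hypothetical, and pagination of `describe_volumes` is ignored for brevity.

```python
def isolate_instance(ec2, instance_id, incident_ref):
    """Stop an instance and snapshot its attached EBS volumes.

    `ec2` is expected to be a boto3 EC2 client. Instance-store
    volumes are not captured by EBS snapshots.
    """
    # Stop (do not terminate) so the disks are preserved as evidence.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Find every EBS volume currently attached to the instance.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    snapshot_ids = []
    for volume in volumes:
        snapshot = ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"{incident_ref}: evidence from {instance_id}",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

A client for the Ireland Region could be created with `boto3.client("ec2", region_name="eu-west-1")`. Record the returned snapshot IDs in the activity log so they can be found during the investigation.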
In case of affected credentials, like email phishing or other:
- Rotate any credential that might have been compromised.
- Assume more things have been compromised and investigate other possible affected targets.
Security breaches include, but are not limited to:
- Loss or theft of personal computing devices used to store or access Hotjar systems.
- Breaches of any Hotjar systems.
- Unintended disclosure of Hotjar sensitive information.
Reacting to Incidents
- Ensure the whole team knows by announcing it on the Hotjar Team and Dev Updates channels. Use @all to attract everyone’s attention.
- Try to identify which services are affected. If this takes more than a couple of minutes, coordinate with other online engineers and ask for help. This might mean starting a Hangouts chat where you can discuss your findings throughout the incident without stopping the actual remediation efforts.
- When you've identified the affected services, decide on the severity of the incident:
  - Was there a security breach?
  - Is customer data affected?
  - Is the incident part of a larger vendor (AWS) outage?
  - Will a reliable fix be easy to produce?
  - Can you do it on your own?
  - How long will it take you to deploy it?
  - Do you need someone to review your fix before and after you deploy it?
  - Do we need to go into maintenance mode in the meantime?
  - Are you sure what you are fixing is the actual root cause of the problem?
- Make sure the DevOps team is aware of the issue. If none of them are online, contact them immediately by phone. They will most likely know about the issue before anyone else, but it's better to verify if you're unsure.
- Create an activity log to track what changes are being made and what is known about the outage. This could mean writing small updates in a HipChat channel like #DevOps or in a Google Docs document. The log is very useful for hand-overs and for writing the post-mortem.
- Discuss in the Hotjar Team channel if we should enter maintenance mode. Maintenance mode should be used if the outage is expected to take more than a few minutes. If it's decided that we should enter maintenance mode, a developer should immediately do so.
- If users contact us on Intercom, use the Incident Reply - Maintenance Mode saved reply if Hotjar is in maintenance mode and Incident Reply - Not in Maintenance Mode saved reply if Hotjar is NOT in maintenance mode.
- Log into status.hotjar.com and create a new incident. Use the Generic Incident Report template and customize the messages as you see fit. Update the component statuses accordingly. Also, ensure Post this to Twitter is ticked.
- Duplicate the Incident Report Template in-app Intercom auto message inside the Maintenance/Incidents folder and save it as Incident Report - Summary - XXXX-XX-XX. Customize the message and turn it on. The message should start to appear to anyone who tries to log in to Hotjar.
- Update the team on the Hotjar Team and Dev Updates channels by sharing a link to the status report and the name of the Report Template you created.
- As we learn more about the incident, it is important we keep updating statuspage.io as well as the Intercom message.
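The activity log mentioned in the steps above does not need special tooling. If you prefer to keep it in a script rather than a chat channel or document, a minimal sketch could look like the following. The `ActivityLog` class and its fields are hypothetical, not an existing Hotjar tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActivityLog:
    """Append-only record of what was changed and what is known."""
    incident: str
    entries: list = field(default_factory=list)

    def record(self, author, note, when=None):
        # Default to the current time in UTC so hand-overs across
        # time zones read consistently.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, author, note))

    def render(self):
        """Return the log as plain text, oldest entry first."""
        lines = [f"Activity log: {self.incident}"]
        for when, author, note in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{when:%Y-%m-%d %H:%M} UTC  {author}: {note}")
        return "\n".join(lines)
```

Calling `log.record("alice", "Enabled maintenance mode")` after each change and pasting `log.render()` into the hand-over message or the post-mortem gives a ready-made timeline.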
After the Incident is Solved
- Verify that the incident has indeed been resolved.
- Add a "Resolved" update to statuspage.io.
- Turn off the Intercom auto message.
- Update the team on the Hotjar Team and Dev Updates channels.
- Make sure we've left maintenance mode if it was enabled.
- If the maintenance took much longer than planned, prepare an email explaining what happened.
- Verify that monitoring is in place to detect this issue in the future.
- Assess the possibility of creating a Runbook to allow faster fixing in the future.
- If the incident was long in duration or broad in affected services, create a post-mortem analysis with a detailed timeline so we can better understand the root cause and improve the process in the future. VictorOps has an interface specifically for this.