What is an incident?
An incident is declared if we discover an unplanned interruption to, or reduction in quality of, normal service which has lasted for longer than 15 minutes and is affecting at least 5 users. When an incident is declared, we activate our Incident Response Plan.
You can find examples of previous incidents and our response to them on our Status page.
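The declaration criteria above can be written as a simple check. This is a minimal sketch, not Hotjar's actual tooling: the `ServiceDisruption` record and the `should_declare_incident` function are hypothetical names introduced for illustration. Only the two thresholds (longer than 15 minutes, at least 5 users) come from the definition above.

```python
from dataclasses import dataclass

# Thresholds taken from the definition above.
MIN_DURATION_MINUTES = 15  # the disruption must have lasted longer than this
MIN_AFFECTED_USERS = 5     # at least this many users must be affected


@dataclass
class ServiceDisruption:
    """Hypothetical record of an unplanned interruption or quality reduction."""
    duration_minutes: float
    affected_users: int


def should_declare_incident(disruption: ServiceDisruption) -> bool:
    """Return True only when both declaration criteria are met."""
    return (
        disruption.duration_minutes > MIN_DURATION_MINUTES
        and disruption.affected_users >= MIN_AFFECTED_USERS
    )
```

Note that the duration test is strict (a disruption of exactly 15 minutes does not qualify), while the user count is inclusive (exactly 5 users does).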
Hotjar's 9-step Incident Response Plan
Our Incident Response Plan gives us a defined process to rapidly identify the root cause of an incident and deploy a fix. It also helps us manage communication with our users. To tackle incidents effectively, Hotjar has a team of Incident Response Managers (IRMs).
As soon as monitoring detects an incident, our on-call engineers are notified. An IRM will then take responsibility for executing our Incident Response Plan (IRP). For incidents that start outside business hours, the on-call engineers contain the issue and follow a simplified version of the IRP, calling an IRM only if the severity requires it. Our aim is to resolve issues quickly and support our users through transparent communication. The process is as follows:
Step 1: Set up the incident
The IRM will appoint a technical and a support liaison. These three people then constitute the incident response team and will join a video conference call created by the IRM. If the incident is suspected to be a data breach or security breach, then Hotjar’s Data Protection Officer (DPO) will be asked to join the incident response team.
Once the nature of the incident is identified, the team assesses its severity as Critical, Major, or Minor. The level depends on the extent of the features affected by the failure, the percentage of customers affected, and any impact on data security.
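To illustrate how the three factors could combine into a severity level, here is a hedged sketch. The source names the factors and the levels but not the rules, so everything else is an assumption: the function name `assess_severity`, the 10% and 50% cut-offs, and the choice to treat any data security risk as Critical are all hypothetical and do not describe Hotjar's actual criteria.

```python
from enum import Enum


class Severity(Enum):
    MINOR = "Minor"
    MAJOR = "Major"
    CRITICAL = "Critical"


def assess_severity(
    features_affected_pct: float,
    customers_affected_pct: float,
    data_security_at_risk: bool,
) -> Severity:
    """Illustrative only: the rules below are assumptions, not Hotjar's."""
    # Assumption: any risk to data security is treated as Critical.
    if data_security_at_risk:
        return Severity.CRITICAL
    # Assumption: the worse of the two percentages drives the level.
    impact = max(features_affected_pct, customers_affected_pct)
    if impact >= 50:  # hypothetical cut-off
        return Severity.CRITICAL
    if impact >= 10:  # hypothetical cut-off
        return Severity.MAJOR
    return Severity.MINOR
```

Because the severity is reassessed as more information arrives (Step 4), a function like this would be called again whenever the inputs change rather than once at setup.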
Step 2: Announce that an incident has occurred and is being investigated
The incident details are shared within Hotjar itself (via Slack).
Once the incident response team judges that we have enough accurate information, an incident notification is published on our public Status page. The team will update this page as we progress through steps 5-7 below. The incident status is also posted to our @hotjar_status Twitter account.
Step 3: Prepare a support plan for affected users
The support liaison will work with the Hotjar support team to support users affected by the incident. Preparations could include creating documentation for users to assist them with working around the problem. The Hotjar support team will also consider reaching out to affected users directly.
Step 4: Work to identify and resolve the incident
The technical liaison will work with the engineering team to identify the scope of the incident and, if possible, its cause. The immediate focus is to mitigate the situation (i.e. restore service as soon as possible).
As more information about the incident is discovered, the incident response team will reassess the severity.
Step 5: Announce the incident as identified
Once the scope and cause of the incident have been identified, the support liaison will announce this both internally to Hotjar and externally via our Status page (if the incident was announced to the public).
Step 6: Announce the incident as monitoring
Once a fix has been identified and applied, the support liaison will announce that the team is now monitoring the situation to ensure that the fix has worked. This announcement is made both internally to Hotjar and externally via our Status page (if the incident was announced to the public).
Step 7: Announce the incident as resolved
When the incident response team is satisfied that the incident is no longer affecting users, they will announce that the incident is resolved. This announcement is made both internally to Hotjar and externally via our Status page (if the incident was announced to the public).
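The announcements in Steps 2 and 5 to 7 follow a fixed order, which can be sketched as a small state progression. This is an illustration under stated assumptions: the `Status` enum, the `next_status` and `announcement_channels` functions, and the channel labels `"internal"` and `"status_page"` are hypothetical names. Only the order of the statuses and the rule that external updates are posted only when the incident was announced publicly come from the steps above.

```python
from enum import Enum


class Status(Enum):
    INVESTIGATING = "investigating"  # Step 2
    IDENTIFIED = "identified"        # Step 5
    MONITORING = "monitoring"        # Step 6
    RESOLVED = "resolved"            # Step 7


# Order in which the statuses are announced in the steps above.
_ORDER = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def next_status(current: Status) -> Status:
    """Return the status that follows `current`; RESOLVED is terminal."""
    if current is Status.RESOLVED:
        raise ValueError("a resolved incident has no further status")
    return _ORDER[_ORDER.index(current) + 1]


def announcement_channels(announced_publicly: bool) -> list[str]:
    """Internal updates always go out; the Status page only if the incident is public."""
    channels = ["internal"]  # hypothetical label for the internal announcement
    if announced_publicly:
        channels.append("status_page")  # hypothetical label for the public Status page
    return channels
```

For example, `next_status(Status.IDENTIFIED)` returns `Status.MONITORING`, and an incident that was never made public yields only the internal channel at every step.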
Step 8: Complete customer-facing communications
If the incident was announced externally, via the Status page, the incident response team will now write a public report with more details about the incident. This is referred to on the Status page as the "Postmortem" report.
In some circumstances, the support liaison will work with the Hotjar support team to contact users directly affected by the incident.
Step 9: Follow-up actions
The incident response team will schedule an internal incident postmortem, held (if possible) within 24 hours of the incident's resolution. The postmortem will likely identify further actions for Hotjar to take to prevent the incident from recurring.
The incident response team will update the report/postmortem on our Status page if further details are discovered in the internal postmortem.
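The 24-hour target for the internal postmortem in Step 9 can be expressed as a small date calculation. This is a minimal sketch with hypothetical function names (`postmortem_due_by`, `is_postmortem_on_time`); only the 24-hour window comes from the text, and the text treats it as a goal ("if possible") rather than a hard rule.

```python
from datetime import datetime, timedelta, timezone

# "within 24 hours of resolution", as stated in Step 9.
POSTMORTEM_WINDOW = timedelta(hours=24)


def postmortem_due_by(resolved_at: datetime) -> datetime:
    """Latest time the internal postmortem should be held."""
    return resolved_at + POSTMORTEM_WINDOW


def is_postmortem_on_time(resolved_at: datetime, held_at: datetime) -> bool:
    """True if the postmortem was held within the 24-hour window."""
    return held_at <= postmortem_due_by(resolved_at)


# Usage: an incident resolved at 14:30 UTC should be reviewed by 14:30 UTC the next day.
resolved = datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)
due = postmortem_due_by(resolved)
```

Using timezone-aware datetimes avoids ambiguity when the response team spans several time zones.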