Incident Response

When Something Breaks, Here's Exactly What Happens.

Detection in seconds. Human acknowledgment within SLA. Resolution in hours, not days. A 10-step process that runs every time, without exception.


The Process

10 steps. Every incident, every time.

This is not a marketing document. This is the actual process our engineers follow. Every step has a clear owner, a timing expectation, and a defined output.

Detect · < 30 seconds

Monitoring checks fire every 30 seconds from 5 global locations. The moment a threshold is breached — uptime, error rate, response time, security event — a potential incident is flagged and recorded. This happens without any human in the loop.

↳ Automated
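A minimal sketch of the kind of threshold check described in this step. The four signals (uptime, error rate, response time, security event) come from the text above; the class names, field names, and limit values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; real values would be tuned per service.
    max_error_rate: float = 0.05      # fraction of failed requests
    max_response_ms: float = 2000.0   # response time ceiling in milliseconds


@dataclass(frozen=True)
class CheckResult:
    location: str          # which probe location produced this result
    is_up: bool
    error_rate: float
    response_ms: float
    security_event: bool = False


def breached(check: CheckResult, limits: Thresholds) -> list[str]:
    """Return the name of every threshold this check result breaches."""
    reasons = []
    if not check.is_up:
        reasons.append("uptime")
    if check.error_rate > limits.max_error_rate:
        reasons.append("error_rate")
    if check.response_ms > limits.max_response_ms:
        reasons.append("response_time")
    if check.security_event:
        reasons.append("security_event")
    return reasons


def flag_incident(checks, limits=Thresholds()):
    """Map each probe location to its breaches; an empty dict means healthy."""
    return {c.location: r for c in checks if (r := breached(c, limits))}
```

Any non-empty result from `flag_incident` would be recorded as a potential incident and handed to the classification step.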

Classify · < 60 seconds

Automated classification rules assign a priority. P1 means the service is down or data is at risk. P2 means major degradation. P3 means a significant problem for which a workaround exists. P4 means a minor issue. Priority directly determines the speed of everything that follows.

↳ Automated + On-call review
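The priority definitions above can be read as an ordered rule list where the first match wins. The sketch below assumes four boolean signals as inputs; those parameter names are hypothetical, and the real rules are likely richer than this.

```python
from enum import IntEnum


class Priority(IntEnum):
    P1 = 1  # service down or data at risk
    P2 = 2  # major degradation
    P3 = 3  # significant, but a workaround exists
    P4 = 4  # minor


def classify(
    *,
    service_down: bool = False,
    data_at_risk: bool = False,
    major_degradation: bool = False,
    significant_with_workaround: bool = False,
) -> Priority:
    """Apply the rules from most to least severe; the first match wins."""
    if service_down or data_at_risk:
        return Priority.P1
    if major_degradation:
        return Priority.P2
    if significant_with_workaround:
        return Priority.P3
    return Priority.P4
```

Checking the most severe condition first means an incident that matches several rules always gets the highest applicable priority.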

Alert · Immediate on P1/P2

P1 and P2 incidents simultaneously trigger a Slack notification, an SMS to the on-call mobile, and an email to the incident channel. You receive the same alert we do, at the same time. There is no 'internal first, client later' delay.

↳ You + On-call engineer notified

Acknowledge · Within SLA window

The on-call engineer confirms the incident is being actively worked. You receive a second notification with the engineer's name and an initial assessment. This is the point where your SLA response time clock stops — not when the fix is deployed.

↳ Engineer assigned

Investigate · Variable, typically < 10 min

Root cause analysis begins. Because our team already knows your codebase, infrastructure, and recent change history, investigation is significantly faster than for someone seeing the system for the first time. We check recent deployments, database changes, and traffic patterns first.

↳ Active investigation

Communicate · Continuous throughout

Plain-English updates to you at every stage change: we know what it is, we know how to fix it, fix is in staging, fix is deploying, monitoring recovery. No silence. No "still looking." The team that communicates worst during incidents is the one clients leave.

↳ You receive updates at every stage

Fix · Variable: hours, not days

The code, configuration, database, or infrastructure fix is implemented in a staging environment first. Even under pressure, we do not skip testing. Some fixes are applied in minutes. Complex root causes can take hours. We communicate timeline estimates honestly.

↳ Staging fix implemented

Test · Before every production deploy

The fix is verified in staging against the conditions that caused the incident. For P1/P2 events, a second engineer reviews the fix before it ships. We verify the error condition is resolved, not just that the service starts.

↳ Second-engineer review on P1/P2

Deploy & Verify · 30-min monitoring window

The fix is deployed to production. We open a 30-minute active monitoring window — watching all 8 monitoring layers for any sign of recurrence or unexpected side effects. The incident is not closed until monitoring is clean for 30 consecutive minutes.

↳ Active post-deploy watch
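The closing rule in this step (30 consecutive clean minutes) can be sketched as a streak check. The sketch assumes one boolean sample per minute since the deploy, where `True` means every monitoring layer was clean for that minute; the sampling granularity and function names are assumptions, not a description of the actual tooling.

```python
def clean_streak(samples: list[bool]) -> int:
    """Count how many of the most recent samples in a row were clean."""
    streak = 0
    for clean in reversed(samples):
        if not clean:
            break
        streak += 1
    return streak


def can_close(samples: list[bool], window_minutes: int = 30) -> bool:
    """An incident may close only after a full unbroken clean window."""
    return clean_streak(samples) >= window_minutes
```

A single unclean minute resets the streak, so the window restarts rather than resuming where it left off.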

Postmortem · Within 48 hours (P1/P2 only)

For P1 and P2 incidents, we write a postmortem covering what happened, the root cause, how we responded, and what we're changing to prevent recurrence. You receive this as a document within 48 hours. This is how we get better, and how you stay informed.

↳ You receive full written report

"The most important thing during an incident is not the fix — it's the communication. A client who knows what's happening and what we're doing about it can handle a 2-hour outage. A client who gets silence for 20 minutes cannot."

Kamrul Hasan, CTO, SocioFi Technology

Response Time Guarantees

How fast we respond, by priority and plan.

Response time is measured from incident creation to the first meaningful human response. Automated notifications don't count.

Priority	Description	Essential	Growth	Scale
P1	Service down, data at risk	<8 hrs (business hours)	<4 hrs (business hours)	<15 min (24/7)
P2	Major feature broken, significant degradation	<8 hrs (business hours)	<4 hrs (business hours)	<1 hr (24/7)
P3	Minor bug, workaround available	Next business day	<8 hrs (business hours)	<4 hrs (business hours)
P4	Cosmetic issue, minor UX problem	Next maintenance window	Next business day	Next business day
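For readers who want to check a response against the table, the sketch below encodes the targets as a lookup and evaluates the 24/7 ones. The table values are taken from above; the `Target` type and `met_target` function are hypothetical. Business-hours targets would need a working calendar (time zone, holidays), which the table does not specify, so the sketch deliberately refuses to evaluate them.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Target(NamedTuple):
    limit: Optional[timedelta]  # None when the target is not a fixed duration
    coverage: str               # "24/7", "business hours", or a descriptive label


TARGETS = {
    ("Essential", "P1"): Target(timedelta(hours=8), "business hours"),
    ("Essential", "P2"): Target(timedelta(hours=8), "business hours"),
    ("Essential", "P3"): Target(None, "next business day"),
    ("Essential", "P4"): Target(None, "next maintenance window"),
    ("Growth", "P1"): Target(timedelta(hours=4), "business hours"),
    ("Growth", "P2"): Target(timedelta(hours=4), "business hours"),
    ("Growth", "P3"): Target(timedelta(hours=8), "business hours"),
    ("Growth", "P4"): Target(None, "next business day"),
    ("Scale", "P1"): Target(timedelta(minutes=15), "24/7"),
    ("Scale", "P2"): Target(timedelta(hours=1), "24/7"),
    ("Scale", "P3"): Target(timedelta(hours=4), "business hours"),
    ("Scale", "P4"): Target(None, "next business day"),
}


def met_target(plan: str, priority: str,
               created: datetime, first_human_response: datetime) -> bool:
    """Check a 24/7 target; other coverage types need a calendar we do not model."""
    target = TARGETS[(plan, priority)]
    if target.coverage != "24/7" or target.limit is None:
        raise NotImplementedError(
            f"{target.coverage!r} targets need a working-calendar calculation"
        )
    # The table uses strict "<", so hitting the limit exactly does not count.
    return first_human_response - created < target.limit
```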

Real Example

What incident response looks like in practice.

Anonymized from a real incident, with identifying details changed. This is a typical P2 resolution on the Growth plan.

Incident · Growth Plan · P2

Database connection pool exhausted during traffic spike

A Growth plan client's application began returning slow responses and occasional 503 errors during their morning peak traffic window. The root cause was a connection leak in a background job that had been introduced in a release 3 days earlier.
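The incident's actual code is not shown here, so the following is only an illustration of the general class of bug: a job that acquires a pooled connection and fails before releasing it. `TinyPool` is a toy stand-in for a real database pool, and every name in the sketch is hypothetical.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Toy connection pool used only to illustrate a leak."""

    def __init__(self, size: int):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float = 0.01):
        # Raises queue.Empty once the pool is exhausted.
        return self._free.get(timeout=timeout)

    def release(self, conn) -> None:
        self._free.put(conn)

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)  # runs even if the job raises

    @property
    def available(self) -> int:
        return self._free.qsize()


def failing_work(conn):
    raise RuntimeError("simulated job failure")


def leaky_job(pool: TinyPool) -> None:
    conn = pool.acquire()
    failing_work(conn)   # raises, so the release below never runs
    pool.release(conn)


def safe_job(pool: TinyPool) -> None:
    with pool.connection() as conn:
        failing_work(conn)
```

With the leaky version, each failed run permanently removes one connection from the pool, which is why a leak like this can stay unnoticed at low volume and only surface when traffic peaks.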

Resolution timeline