Business Continuity Plan | Maelstrom AI Trust Centre

Purpose

This Business Continuity Plan (BCP) sets out how Maelstrom AI aims to continue delivering zero-knowledge age verification services during significant disruptions, including infrastructure failures, cyber attacks, or other disasters.

Scope

This plan covers:

Maelstrom AI platform services (provii-verifier, provii-issuer)
Development and deployment capabilities
Customer support and communication
Team member availability and coordination

Out of Scope:

End-user devices (customer responsibility)
Relying party integrations (customer responsibility)
Physical offices (we’re fully remote)

Business Impact Analysis

Critical Services

Service: https://verify.provii.app

Business Impact if Down:

Age verification requests fail for all relying parties
User experience degradation (can’t prove age)
Revenue impact if SLA-based contracts exist

Maximum Tolerable Downtime: 4 hours
Recovery Time Objective (RTO): 1 hour
Recovery Point Objective (RPO): 0 (stateless, no data loss)

Dependencies:

Cloudflare Workers platform
Cloudflare KV (CONFIG, BANS)
Cloudflare KV (challenge and nonce storage with TTL-based expiry)
JWKS on CDN (for verification keys)

Service: https://issuer.provii.app

Business Impact if Down:

New credentials cannot be issued
Existing credentials still valid (verifier unaffected)
Officer-initiated issuance blocked

Maximum Tolerable Downtime: 8 hours (less critical than verifier)
Recovery Time Objective (RTO): 2 hours
Recovery Point Objective (RPO): 5 minutes (session data in KV)

Dependencies:

Cloudflare Workers platform
Cloudflare KV (multiple namespaces)
Signing keys in KV
YubiKey challenge system

Capability: Ability to deploy fixes and updates

Business Impact if Down:

Cannot deploy security patches
Cannot respond to incidents with code changes
Development velocity impacted

Maximum Tolerable Downtime: 24 hours
Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 0 (code in Git)

Dependencies:

GitHub (source control, CI/CD)
Developer workstations
Cloudflare API access (for deployments)

Non-Critical Services

Service	RTO	Impact if Down
Documentation site	24 hours	Inconvenience, but services operate
CDN (static assets)	4 hours	Moderate impact on new integrations
Developer tools	48 hours	Delays development, no customer impact

Disaster Scenarios

Scenario 1: Cloudflare Regional Outage

Likelihood: Low (Cloudflare has >300 PoPs globally)

Impact: Medium (automatic failover to other regions)

Response:

Detection (< 5 min)

Cloudflare monitoring alerts
Customer reports
Status check at status.cloudflare.com

Assessment (< 10 min)

Confirm regional vs. global outage
Estimate impact scope (which endpoints affected)
Check Cloudflare status page for ETR (estimated time to resolution)

Communication (< 15 min)

Post status update (use non-Cloudflare channel if needed)
Direct communication to key customers
Internal team notification

Recovery (automatic)

Cloudflare Workers automatically fail over to healthy regions
Durable Objects migrate to available locations
No manual intervention required in most cases

Prevention: Accept this risk (documented in Risk Register - beyond our control)

Scenario 2: Cloudflare Global Platform Outage

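Likelihood: Very Rare (platform-wide Cloudflare outages have been infrequent; notable incidents occurred in July 2020, June 2022, and November 2025)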

Impact: Severe (complete service unavailability)

Response:

Immediate actions (< 15 min)

Confirm via status.cloudflare.com and multiple monitoring sources
Activate communication plan (external status page via GitHub Pages)
Notify customers via email (templated message)
ISMS Owner reviews options and documents decision rationale

Wait and monitor

No technical recovery possible - we are fully dependent on Cloudflare

Actions:

Monitor Cloudflare status updates
Provide hourly communication to customers
Document outage for post-incident review
Estimate business impact

Post-recovery (< 2 hours after restoration)

Verify all services operational
Check data integrity in KV/Durable Objects
Run smoke tests on critical endpoints
Conduct lessons learned review
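
The smoke-test step could be scripted along the following lines. This is a minimal sketch, not the team's actual tooling: the function names are hypothetical, and it assumes each service answers HTTP 200 on a `/health` path when healthy. Only the verifier health URL appears in this plan; the issuer path is an assumption.

```shell
# Post-recovery smoke test sketch (function names are hypothetical).

# Print the HTTP status code for a URL ("000" if the request fails outright).
http_status() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1" || true
}

# Return 0 only if every URL passed as an argument answers with HTTP 200.
smoke_test() {
  local url status failed=0
  for url in "$@"; do
    status=$(http_status "$url")
    if [ "$status" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url (HTTP $status)"
      failed=1
    fi
  done
  return "$failed"
}
```

A run after restoration might look like `smoke_test https://verify.provii.app/health https://issuer.provii.app/health`, with a non-zero exit code signalling that at least one endpoint still needs attention.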

Long-Term Mitigation (considered but not implemented):

Multi-cloud deployment (AWS Lambda + Cloudflare Workers)
Decision: Not implemented; cost and complexity exceed the benefit given Cloudflare’s reliability
Review: Annually during the BCP review

Scenario 3: Signing Key Compromise

Likelihood: Low (strong controls in place)

Impact: Severe (trust model compromised)

Response: See Incident Response - Signing Key Compromise section

Summary:

Immediate (< 15 min)

Revoke compromised key, update key status to Disabled in KV

Short-term (< 1 hour)

Generate new key pair, update JWKS

Communication (< 2 hours)

Notify relying parties, publish advisory

Recovery (< 24 hours)

Monitor for fraudulent credentials, assess damage

Prevention: Keys stored in Cloudflare Workers secrets; HSM may be evaluated as key management requirements grow

Scenario 4: GitHub Outage or Account Compromise

Likelihood: Low for outage, Medium for account compromise attempts

Impact: High (cannot deploy, CI/CD blocked)

Response:

If GitHub is Down:

Wait for restoration of CI/CD (there is no alternative automated pipeline)
Use local development environments
Deploy urgent fixes manually via wrangler deploy (requires Cloudflare credentials)
Communicate delay to stakeholders

If Account Compromised:

Immediate (< 10 min)

Revoke all Personal Access Tokens
Reset password, enforce MFA
Review recent commits for malicious changes
Disable compromised account if needed

Investigation (< 1 hour)

Review audit logs
Check for unauthorised repository access
Identify blast radius (which repos accessed)
Assess if secrets were exposed

Recovery (< 4 hours)

Revert malicious commits
Rotate any exposed secrets
Re-enable account with enhanced security
Deploy clean code state
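
As an illustration of the commit review step, the sketch below lists recent commits with their signature status and flags author emails that are not on an expected list. It relies only on `git log` with standard format placeholders (`%h`, `%G?`, `%ae`, `%s`); the function names and the allow-list file are hypothetical, and the sketch assumes a local clone is available.

```shell
# List commits on all branches since a given date, one per line:
# short hash, signature status (%G?: G = good signature, N = none),
# author email, subject.
recent_commits() {
  local repo="$1" since="$2"
  git -C "$repo" log --all --since="$since" --format='%h %G? %ae %s'
}

# Print author emails seen since the given date that are NOT listed in the
# allow-list file (one expected email per line). Prints nothing if all match.
unexpected_authors() {
  local repo="$1" since="$2" allowed="$3"
  git -C "$repo" log --all --since="$since" --format='%ae' \
    | sort -u \
    | grep -v -x -F -f "$allowed" || true
}
```

An empty result from `unexpected_authors` does not prove the history is clean, since author metadata can be forged; it is only a quick first filter before reading the diffs themselves.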

Prevention: MFA required, hardware security keys planned

Scenario 5: Loss of Key Personnel

Likelihood: Low but possible

Impact: Medium (knowledge loss, capability reduction)

Response:

Sudden Unavailability (illness, accident):

Maelstrom AI has a single operator (the ISMS Owner), so the bus factor is one; this is an acknowledged risk
Documentation enables continuity (this ISMS, code comments, README files, deployment runbooks)
Access recovery procedures are documented, including account recovery paths for Cloudflare and GitHub
Critical procedures documented (deployment, incident response, key rotation)

Planned Departure:

Transition period (4+ weeks)

Knowledge transfer sessions
Documentation review and updates
Shadow on-call rotations
Access credential handover

Access management

Revoke personal credentials on last day
Transfer ownership of critical resources
Update contact information

Mitigation:

Public documentation of all processes (reduces knowledge silos)
Infrastructure as code (no manual “tribal knowledge”)
All critical access held by the ISMS Owner (sole operator); access recovery is addressed in this Business Continuity Plan
Standard tools and practices (not bespoke systems)

Scenario 6: Supply Chain Attack

Likelihood: Medium (attacks on npm/crates.io have occurred)

Impact: Severe (compromised artifacts distributed)

Response: See SLSA Level 3 protections

If Attack Detected:

Immediate (< 30 min)

Halt all deployments
Identify compromised dependency
Assess if production artifacts affected
Roll back to known-good version if needed

Investigation (< 2 hours)

Determine attack vector
Check if artifacts were signed correctly
Review provenance attestations
Scan all builds for malicious code

Recovery (< 8 hours)

Remove or update compromised dependency
Rebuild and re-sign all artifacts
Deploy clean versions
Notify customers if distributed artifacts affected

Prevention:

Hermetic builds (isolated from the network and undeclared inputs, which makes tampering harder)
Artifact signing (detects tampering)
Security scanning in CI/CD
Dependency pinning (Cargo.lock, package-lock.json)
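
Pinning only helps if an unexpected lockfile change is noticed. The sketch below records a SHA-256 digest of a lockfile and compares it later. It assumes GNU coreutils `sha256sum` is available, and the function names are hypothetical. Separately, `cargo build --locked` and `npm ci` are the standard ways to make a build fail rather than silently drift from `Cargo.lock` or `package-lock.json`.

```shell
# Print the SHA-256 digest of a file (GNU coreutils sha256sum assumed).
lock_digest() {
  sha256sum "$1" | cut -d' ' -f1
}

# Compare a lockfile against a previously recorded digest.
# Returns 0 if the file is unchanged, 1 if it differs from the recorded state.
lockfile_unchanged() {
  local file="$1" expected="$2"
  [ "$(lock_digest "$file")" = "$expected" ]
}
```

For example, the digest of `Cargo.lock` could be recorded at release time and checked with `lockfile_unchanged Cargo.lock "$recorded"` before a rebuild during an investigation.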

Recovery Procedures

Service Recovery

For provii-verifier or provii-issuer:

Manual Redeployment

If Workers need manual redeployment:

# Ensure you have latest code
git checkout main
git pull

# Navigate to service directory
cd provii-verifier  # or provii-issuer/worker

# Deploy with wrangler
wrangler deploy --env production

# Verify deployment
curl https://verify.provii.app/health  # or issuer endpoint

Prerequisites:

Cloudflare API token with Workers deployment permissions
wrangler CLI installed locally
Access to source code repository

KV Data Recovery

If KV data is lost or corrupted:

CONFIG namespace:

Restore from the latest weekly KV export (exports are planned; see RISK-2025-M005)
Or rebuild from infrastructure-as-code templates
Apply via KV API or Cloudflare dashboard

Audit logs (IS_AUDIT_LOG):

Accept loss if backup unavailable (operational data, not critical)
Focus on restoring service functionality first

Signing keys (IS_KEYS):

CRITICAL. Restore from offline backup (secure location)
If backup unavailable: Emergency key generation + rotation procedure
Notify all relying parties of key change

# Restore KV value example
wrangler kv:key put --binding=CONFIG \
  "config_key" "config_value" \
  --env production
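
The weekly export that such a restore depends on could look roughly like the sketch below. It is an assumption-laden illustration rather than an implemented procedure: it uses the same `wrangler kv:key` command family as the restore example (recent Wrangler releases may spell it `wrangler kv key`), reads key names from standard input, and writes one file per key. The function name and output layout are hypothetical.

```shell
# Export the values of the given keys from a KV binding, one file per key.
# Key names are read from stdin, one per line. Assumes key names are safe to
# use as file names (no slashes); the output directory layout is hypothetical.
export_kv_keys() {
  local binding="$1" outdir="$2" key
  mkdir -p "$outdir"
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    # Redirect stdin so wrangler cannot consume the remaining key names.
    wrangler kv:key get "$key" --binding="$binding" --env production \
      < /dev/null > "$outdir/$key" || return 1
  done
}
```

For instance, `printf 'config_key\n' | export_kv_keys CONFIG ./kv-backup` would write the value of `config_key` to `./kv-backup/config_key`. Signing keys would need an encrypted destination rather than a plain directory.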

Durable Objects Recovery

If Durable Objects unavailable:

Automatic: Cloudflare migrates Durable Objects to healthy infrastructure

Manual intervention rarely needed, but if required:

KV entries for challenges and nonces have TTL-based auto-deletion
Loss is acceptable - challenges will expire naturally
New challenges created on-demand
No persistent data recovery needed

Development Recovery

If development environment compromised:

Isolate

Disconnect compromised machine from network

Assess

Determine if credentials were compromised
Identify malware or unauthorised access

Rotate

Rotate all credentials accessible from compromised machine:

GitHub Personal Access Tokens
Cloudflare API tokens
SSH keys
Any other API keys or secrets

Rebuild

Reinstall operating system if needed
Restore code from GitHub (known-good state)
Verify integrity of local repositories
Re-clone dependencies

Communication Plan

Internal Communication

Emergency Contact Method: ISMS Owner (sole operator) is the primary responder; contact via Signal or direct phone

Escalation Chain:

ISMS Owner (first responder, technical response, and major decisions)
Customer communication handled directly by the ISMS Owner

Communication Channels:

Primary. Signal (encrypted, mobile)
Secondary. Email (for non-urgent coordination)
Tertiary. Direct phone calls (true emergencies)

Customer Communication

Status Communication:

Primary. Status page (if implemented) or GitHub Pages
Backup. Email to registered contacts
Social Media. X/Twitter @proviiwallet for major outages

Communication Templates:

Subject: [Provii Status] Service Degradation - [Service Name]

We are currently experiencing degraded performance on [service name].

Impact: [description]
Estimated Resolution: [timeframe or "investigating"]
Updates: [frequency]

We will provide updates every [30min/1hr] until resolved.

For real-time updates, follow @proviiwallet or check status.provii.app

We apologise for any inconvenience.

Subject: [Provii Status] Resolved - [Service Name]

The issue affecting [service name] has been resolved as of [time] UTC.

Impact Duration: [X hours/minutes]
Root Cause: [brief, non-technical explanation]
Prevention: [what we're doing to prevent recurrence]

All services are now operating normally. We apologise for the disruption.

A detailed post-mortem will be published within 5 business days.

Subject: [Provii Security Advisory] [Brief Description]

We are issuing this security advisory regarding [issue].

Impact: [who is affected]
Action Required: [what customers need to do]
Timeline: [by when]

Details: [technical but understandable explanation]

For questions, contact security@maelstrom.au

We take security seriously and apologise for any inconvenience.
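
Filling the placeholders by script can reduce hand-editing errors during an incident. The sketch below renders the service degradation notice from four arguments; the function name and argument order are hypothetical, and the wording follows the first template above.

```shell
# Render the service degradation notice.
# Arguments: service name, impact description, estimated resolution,
# update interval (for example "30 minutes").
render_degradation_notice() {
  local service="$1" impact="$2" eta="$3" interval="$4"
  cat <<EOF
Subject: [Provii Status] Service Degradation - $service

We are currently experiencing degraded performance on $service.

Impact: $impact
Estimated Resolution: $eta
Updates: every $interval

We will provide updates every $interval until resolved.

For real-time updates, follow @proviiwallet or check status.provii.app

We apologise for any inconvenience.
EOF
}
```

A call such as `render_degradation_notice "verify.provii.app" "Verification requests are failing" "investigating" "30 minutes"` prints a message ready to paste into an email or status post.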

Testing and Maintenance

BCP Testing

Annual Full Test (Q4 each year):

Simulate major outage scenario
Test communication procedures
Verify recovery procedures work
Update documentation based on findings

Quarterly Table-Top Exercises:

Walk through scenarios with team
No actual system changes
Verify contact information current
Practice decision-making

Ad-Hoc Testing:

After real incidents (lessons learned)
After significant infrastructure changes
When new team members are onboarded

Backup Procedures

What We Back Up:

Source Code

Location: GitHub (primary), local clones (secondary)
Frequency: Continuous (every commit)
Retention: Indefinite (Git history)
Recovery: git clone https://github.com/...

Configuration

Location: Infrastructure as code (in Git)
Frequency: Every change
Retention: Version controlled
Recovery: Apply from wrangler.toml or KV API

Signing Keys

Location: Cloudflare KV (primary), offline backup (secondary, encrypted)
Frequency: At generation, after rotation
Retention: Until rotated + 1 year
Recovery: Restore from offline backup

KV Data

Location: Cloudflare KV (replicated)
Frequency: Weekly exports (planned - see RISK-2025-M005)
Retention: 90 days
Recovery: KV bulk upload via API

What We Don’t Back Up:

Ephemeral state (Durable Objects, challenges)
Audit logs (retained 90 days, then discarded; critical security event logs are retained for up to 365 days)
Analytics data (acceptable loss)

Maintenance Windows

Planned Maintenance: Rare (serverless architecture auto-scales)

If Maintenance Required:

Announce: 72 hours advance notice
Schedule: Off-peak hours (02:00-04:00 UTC)
Duration: Target < 30 minutes
Rollback: Prepared before starting
Communicate: Start, progress, completion
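
The 72-hour notice and the off-peak window are simple to check mechanically. The sketch below computes the latest announcement time for a planned start and tests whether a UTC time of day falls inside 02:00-04:00. It assumes GNU `date` (the `-d` option; BSD/macOS `date` differs), treats the window end as exclusive, and uses hypothetical function names.

```shell
# Print the latest time at which maintenance must be announced:
# 72 hours before the window opens. Goes via epoch seconds to avoid
# relative-date parsing pitfalls. GNU date assumed.
announce_deadline() {
  local start="$1" start_epoch
  start_epoch=$(date -u -d "$start" +%s) || return 1
  date -u -d "@$((start_epoch - 72 * 3600))" '+%Y-%m-%d %H:%M UTC'
}

# Return 0 if a UTC time of day (HH:MM) is inside the 02:00-04:00 window.
in_offpeak_window() {
  local hhmm="$1" h m mins
  h=${hhmm%%:*}
  m=${hhmm##*:}
  # Strip one leading zero so values like "08" are not read as octal.
  mins=$(( ${h#0} * 60 + ${m#0} ))
  [ "$mins" -ge 120 ] && [ "$mins" -lt 240 ]
}
```

For a window opening at `2026-06-10 02:00 UTC`, `announce_deadline` prints `2026-06-07 02:00 UTC`.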

Recovery Time Objectives (RTO) / Recovery Point Objectives (RPO)

System/Service	RTO	RPO	Justification
Verifier API	1 hour	0	Critical service, stateless
Issuer API	2 hours	5 min	Important but less critical, minimal state
Development Environment	4 hours	0	All code in Git
Signing Keys	2 hours	0	Recovery from offline backup
KV Configuration	4 hours	7 days	Weekly backups planned
Documentation	24 hours	0	Hosted on GitHub
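
During an outage it can help to state explicitly where elapsed downtime sits relative to these targets. The sketch below compares elapsed minutes with a service's RTO and its Maximum Tolerable Downtime from the Business Impact Analysis; the function name and the three status labels are hypothetical.

```shell
# Classify an outage by comparing elapsed downtime with the service's RTO
# and Maximum Tolerable Downtime (MTD). All three values are in minutes.
downtime_status() {
  local elapsed="$1" rto="$2" mtd="$3"
  if [ "$elapsed" -le "$rto" ]; then
    echo "within-rto"
  elif [ "$elapsed" -le "$mtd" ]; then
    echo "rto-breached"
  else
    echo "mtd-breached"
  fi
}
```

For the Verifier API (RTO 60 minutes, MTD 240 minutes), `downtime_status 90 60 240` prints `rto-breached`, which would be a natural point to escalate customer communication.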

Continuous Improvement

After Every Incident:

Review BCP effectiveness
Update procedures based on lessons learned
Test new procedures

Annual Review (Q1):

Update contact information
Review RTO/RPO targets
Update disaster scenarios based on threat surface
Validate backup and recovery procedures
Test full plan with tabletop exercise

Next Scheduled Review: 2026-11-21

Related Documents

Incident Response Policy - Responding to security incidents
Risk Register - Business continuity risks
Information Security Policy - Overarching security principles

Document Information

Version. 1.1
Effective Date. 2025-01-13
Last Updated. 2026-05-21
Owner. ISMS Owner
Maintained By. Security Lead
Review Frequency. Annually, after major incidents
Next Review. 2026-11-21
Classification. Public
