Business Continuity Plan

Ensuring service continuity during disruptions and disasters

Public

Purpose

This Business Continuity Plan (BCP) sets out how Maelstrom AI aims to continue delivering zero-knowledge age verification services during significant disruptions, including infrastructure failures, cyber attacks, or other disasters.

Scope

This plan covers:

  • Maelstrom AI platform services (provii-verifier, provii-issuer)
  • Development and deployment capabilities
  • Customer support and communication
  • Team member availability and coordination

Out of Scope:

  • End-user devices (customer responsibility)
  • Relying party integrations (customer responsibility)
  • Physical offices (we’re fully remote)

Business Impact Analysis

Critical Services

Service: https://verify.provii.app

Business Impact if Down:

  • Age verification requests fail for all relying parties
  • User experience degradation (can’t prove age)
  • Revenue impact if SLA-based contracts exist

Maximum Tolerable Downtime: 4 hours

Recovery Time Objective (RTO): 1 hour

Recovery Point Objective (RPO): 0 (stateless, no data loss)

Dependencies:

  • Cloudflare Workers platform
  • Cloudflare KV (CONFIG, BANS)
  • Cloudflare KV (challenge and nonce storage with TTL-based expiry)
  • JWKS on CDN (for verification keys)

Service: https://issuer.provii.app

Business Impact if Down:

  • New credentials cannot be issued
  • Existing credentials still valid (verifier unaffected)
  • Officer-initiated issuance blocked

Maximum Tolerable Downtime: 8 hours (less critical than verifier)

Recovery Time Objective (RTO): 2 hours

Recovery Point Objective (RPO): 5 minutes (session data in KV)

Dependencies:

  • Cloudflare Workers platform
  • Cloudflare KV (multiple namespaces)
  • Signing keys in KV
  • YubiKey challenge system

Capability: Ability to deploy fixes and updates

Business Impact if Down:

  • Cannot deploy security patches
  • Cannot respond to incidents with code changes
  • Development velocity impacted

Maximum Tolerable Downtime: 24 hours

Recovery Time Objective (RTO): 4 hours

Recovery Point Objective (RPO): 0 (code in Git)

Dependencies:

  • GitHub (source control, CI/CD)
  • Developer workstations
  • Cloudflare API access (for deployments)
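The Maximum Tolerable Downtime and RTO targets above lend themselves to a simple check during an incident. The following is a minimal sketch, not part of the plan itself: the function name and the three status labels are hypothetical, and all durations are passed in minutes.

```shell
# Classify an outage against the RTO and Maximum Tolerable Downtime (MTD).
# Usage: classify_downtime ELAPSED_MIN RTO_MIN MTD_MIN
# The function name and output labels are illustrative, not from the plan.
classify_downtime() (
  elapsed=$1
  rto=$2
  mtd=$3
  if [ "$elapsed" -gt "$mtd" ]; then
    echo "mtd-breached"
  elif [ "$elapsed" -gt "$rto" ]; then
    echo "rto-breached"
  else
    echo "within-rto"
  fi
)
```

For the verifier (RTO 1 hour, Maximum Tolerable Downtime 4 hours), `classify_downtime 90 60 240` prints `rto-breached`.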

Non-Critical Services

Service               RTO        Impact if Down
Documentation site    24 hours   Inconvenience, but services operate
CDN (static assets)   4 hours    Moderate impact on new integrations
Developer tools       48 hours   Delays development, no customer impact

Disaster Scenarios

Scenario 1: Cloudflare Regional Outage

Likelihood: Low (Cloudflare has >300 PoPs globally)

Impact: Medium (automatic failover to other regions)

Response:

Detection (< 5 min)

  • Cloudflare monitoring alerts
  • Customer reports
  • Status check at status.cloudflare.com

Assessment (< 10 min)

  • Confirm regional vs. global outage
  • Estimate impact scope (which endpoints affected)
  • Check Cloudflare status page for ETR (estimated time to resolution)

Communication (< 15 min)

  • Post status update (use non-Cloudflare channel if needed)
  • Direct communication to key customers
  • Internal team notification

Recovery (automatic)

  • Cloudflare Workers automatically fail over to healthy regions
  • Durable Objects migrate to available locations
  • No manual intervention required in most cases

Prevention: Accept this risk (documented in Risk Register - beyond our control)
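The detection and assessment steps above include checking Cloudflare's status page. A rough sketch of automating that check follows. It assumes the status page exposes the standard Statuspage JSON summary at the URL shown, which should be confirmed before relying on it; the function names and the suggested-action labels are hypothetical.

```shell
# ASSUMPTION: Cloudflare's status page serves a Statuspage-style summary here.
CF_STATUS_URL="https://www.cloudflarestatus.com/api/v2/status.json"

# Extract the "indicator" field from a status.json document on stdin.
# Text-based extraction; good enough for a quick check, not a JSON parser.
parse_indicator() {
  tr -d '\n' | sed -n 's/.*"indicator"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Map an indicator value to a suggested next step (labels are illustrative).
suggest_action() {
  case "$1" in
    none)           echo "no-cloudflare-incident" ;;
    minor)          echo "monitor" ;;
    major|critical) echo "activate-communication-plan" ;;
    *)              echo "unknown-check-manually" ;;
  esac
}

# Fetch and classify. -f fails on HTTP errors; --max-time bounds the wait.
# A failed fetch yields an empty indicator, reported as unknown.
check_cloudflare_status() {
  indicator=$(curl -fsS --max-time 10 "$CF_STATUS_URL" | parse_indicator)
  suggest_action "$indicator"
}
```

Running `check_cloudflare_status` from a machine outside Cloudflare's network gives a quick first signal; the status page in a browser remains the authoritative source.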


Scenario 2: Cloudflare Global Platform Outage

Likelihood: Very Rare (last major outage: 2020)

Impact: Severe (complete service unavailability)

Response:

Immediate actions (< 15 min)

  • Confirm via status.cloudflare.com and multiple monitoring sources
  • Activate communication plan (external status page via GitHub Pages)
  • Notify customers via email (templated message)
  • ISMS Owner reviews options and documents decision rationale

Wait and monitor

No technical recovery possible - we are fully dependent on Cloudflare

Actions:

  • Monitor Cloudflare status updates
  • Provide hourly communication to customers
  • Document outage for post-incident review
  • Estimate business impact

Post-recovery (< 2 hours after restoration)

  • Verify all services operational
  • Check data integrity in KV/Durable Objects
  • Run smoke tests on critical endpoints
  • Conduct lessons learned review
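The smoke tests in the post-recovery checklist above could be scripted roughly as follows. This is a sketch under assumptions: it treats only HTTP 200 as healthy, and the function names are hypothetical. The verifier health URL appears in the recovery procedures of this plan; any issuer health URL passed in would be an assumption to confirm.

```shell
# Succeed only for an HTTP 200 status code.
is_healthy() {
  [ "$1" = "200" ]
}

# Print the HTTP status code for a URL (curl reports 000 if the request fails).
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" || true
}

# Check each URL given as an argument; print one line per endpoint and
# return non-zero if any endpoint is unhealthy.
smoke_test() {
  failures=0
  for url in "$@"; do
    code=$(http_status "$url")
    if is_healthy "$code"; then
      echo "OK   $url"
    else
      echo "FAIL $url (status: ${code:-none})"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

Example: `smoke_test https://verify.provii.app/health` prints one OK or FAIL line and sets the exit status accordingly.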

Long-Term Mitigation (considered but not implemented):

  • Multi-cloud deployment (AWS Lambda + Cloudflare Workers)
  • Decision. Cost and complexity exceed benefit given Cloudflare’s reliability
  • Reviewed. Annually during BCP review

Scenario 3: Signing Key Compromise

Likelihood: Low (strong controls in place)

Impact: Severe (trust model compromised)

Response: See Incident Response - Signing Key Compromise section

Summary:

Immediate (< 15 min)

Revoke compromised key, update key status to Disabled in KV

Short-term (< 1 hour)

Generate new key pair, update JWKS

Communication (< 2 hours)

Notify relying parties, publish advisory

Recovery (< 24 hours)

Monitor for fraudulent credentials, assess damage

Prevention: Keys stored in Cloudflare Workers secrets; HSM may be evaluated as key management requirements grow
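After the JWKS is updated, it is worth confirming that the compromised key ID is no longer published. A minimal sketch follows. The JWKS URL is not given in this plan, so it is passed in by the operator; the function names are hypothetical, and the match is a rough text search rather than real JSON parsing.

```shell
# Succeed if the JWKS document on stdin still advertises the given key ID.
# Rough text match: assumes the kid contains no regex metacharacters.
jwks_has_kid() {
  grep -Eq "\"kid\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage: check_revoked_kid JWKS_URL KID
# Returns 0 if the kid is gone, 1 if still published, 2 if the fetch failed.
check_revoked_kid() {
  body=$(curl -fsS --max-time 10 "$1") || {
    echo "fetch failed: $1" >&2
    return 2
  }
  if printf '%s' "$body" | jwks_has_kid "$2"; then
    echo "STILL PUBLISHED: $2"
    return 1
  fi
  echo "not published: $2"
}
```

Because the JWKS is served from a CDN, a cached copy may outlive the update, so the check may need repeating until the old key ID disappears.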


Scenario 4: GitHub Outage or Account Compromise

Likelihood: Low for outage, Medium for account compromise attempts

Impact: High (cannot deploy, CI/CD blocked)

Response:

If GitHub is Down:

  • Wait for restoration of CI/CD (no alternative automated pipeline)
  • Use local development environments
  • Manual deployment possible via wrangler deploy (requires Cloudflare credentials)
  • Communicate delay to stakeholders

If Account Compromised:

Immediate (< 10 min)

  • Revoke all Personal Access Tokens
  • Reset password, enforce MFA
  • Review recent commits for malicious changes
  • Disable compromised account if needed

Investigation (< 1 hour)

  • Review audit logs
  • Check for unauthorised repository access
  • Identify blast radius (which repos accessed)
  • Assess if secrets were exposed

Recovery (< 4 hours)

  • Revert malicious commits
  • Rotate any exposed secrets
  • Re-enable account with enhanced security
  • Deploy clean code state

Prevention: MFA required, hardware security keys planned


Scenario 5: Loss of Key Personnel

Likelihood: Low but possible

Impact: Medium (knowledge loss, capability reduction)

Response:

Sudden Unavailability (illness, accident):

  • Maelstrom AI operates with a single operator (ISMS Owner); there is an acknowledged single-operator bus factor
  • Documentation enables continuity (this ISMS, code comments, README files, deployment runbooks)
  • Access recovery procedures are documented, including account recovery paths for Cloudflare and GitHub
  • Critical procedures documented (deployment, incident response, key rotation)

Planned Departure:

Transition period (4+ weeks)

  • Knowledge transfer sessions
  • Documentation review and updates
  • Shadow on-call rotations
  • Access credential handover

Access management

  • Revoke personal credentials on last day
  • Transfer ownership of critical resources
  • Update contact information

Mitigation:

  • Public documentation of all processes (reduces knowledge silos)
  • Infrastructure as code (no manual “tribal knowledge”)
  • All critical access held by the ISMS Owner (sole operator); access recovery is addressed in this Business Continuity Plan
  • Standard tools and practices (not bespoke systems)

Scenario 6: Supply Chain Attack

Likelihood: Medium (attacks on npm/crates.io have occurred)

Impact: Severe (compromised artifacts distributed)

Response: See SLSA Level 3 protections

If Attack Detected:

Immediate (< 30 min)

  • Halt all deployments
  • Identify compromised dependency
  • Assess if production artifacts affected
  • Roll back to known-good version if needed

Investigation (< 2 hours)

  • Determine attack vector
  • Check if artifacts were signed correctly
  • Review provenance attestations
  • Scan all builds for malicious code

Recovery (< 8 hours)

  • Remove or update compromised dependency
  • Rebuild and re-sign all artifacts
  • Deploy clean versions
  • Notify customers if distributed artifacts affected

Prevention:

  • Hermetic builds (isolated from undeclared inputs, tamper-resistant)
  • Artifact signing (detects tampering)
  • Security scanning in CI/CD
  • Dependency pinning (Cargo.lock, package-lock.json)
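Dependency pinning only helps if the lockfiles are actually present. The sketch below, with a hypothetical function name, reports which lockfile a project directory lacks; it assumes a Rust project is marked by Cargo.toml and a Node project by package.json.

```shell
# Print the lockfiles a project directory is missing; print nothing if pinned.
# Usage: missing_lockfiles [DIR]   (defaults to the current directory)
missing_lockfiles() (
  dir=${1:-.}
  [ -f "$dir/Cargo.toml" ] && [ ! -f "$dir/Cargo.lock" ] && echo "Cargo.lock"
  [ -f "$dir/package.json" ] && [ ! -f "$dir/package-lock.json" ] && echo "package-lock.json"
  true
)
```

In CI, building with `cargo build --locked` and installing with `npm ci` makes the build fail when a lockfile is missing or out of date, rather than silently resolving new versions.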

Recovery Procedures

Service Recovery

For provii-verifier or provii-issuer:

Manual Redeployment

If Workers need manual redeployment:

# Ensure you have latest code
git checkout main
git pull

# Navigate to service directory
cd provii-verifier  # or provii-issuer/worker

# Deploy with wrangler
wrangler deploy --env production

# Verify deployment
curl https://verify.provii.app/health  # or issuer endpoint
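Before a manual redeployment, the prerequisites for this procedure (the wrangler CLI, repository access via git, and a Cloudflare API token) can be checked with a short preflight. This is a sketch with hypothetical function names; it assumes the token is supplied through the `CLOUDFLARE_API_TOKEN` environment variable, which Wrangler reads, rather than an interactive `wrangler login` session.

```shell
# Print each required command that is not found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Check the deployment prerequisites; return non-zero if anything is missing.
preflight() {
  missing=$(missing_commands git wrangler curl)
  # ASSUMPTION: the API token is provided via the environment.
  [ -n "${CLOUDFLARE_API_TOKEN:-}" ] || missing="$missing CLOUDFLARE_API_TOKEN"
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "missing:" $missing >&2
    return 1
  fi
  echo "preflight ok"
}
```

Running `preflight && wrangler deploy --env production` stops before deployment if a tool or the token is absent.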

Prerequisites:

  • Cloudflare API token with Workers deployment permissions
  • wrangler CLI installed locally
  • Access to source code repository

KV Data Recovery

If KV data is lost or corrupted:

CONFIG namespace:

  • Restore from backup (weekly KV exports)
  • Or rebuild from infrastructure-as-code templates
  • Apply via KV API or Cloudflare dashboard

Audit logs (IS_AUDIT_LOG):

  • Accept loss if backup unavailable (operational data, not critical)
  • Focus on restoring service functionality first

Signing keys (IS_KEYS):

  • CRITICAL. Restore from offline backup (secure location)
  • If backup unavailable: Emergency key generation + rotation procedure
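  • Notify all relying parties of key change

# Restore KV value example
# Note: newer Wrangler releases use the space-separated form
# "wrangler kv key put"; the colon form below is the older syntax.
wrangler kv:key put --binding=CONFIG \
  "config_key" "config_value" \
  --env production

Durable Objects Recovery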

If Durable Objects unavailable:

Automatic: Cloudflare migrates Durable Objects to healthy infrastructure

Manual intervention rarely needed, but if required:

  • KV entries for challenges and nonces have TTL-based auto-deletion
  • Loss is acceptable - challenges will expire naturally
  • New challenges created on-demand
  • No persistent data recovery needed

Development Recovery

If development environment compromised:

Isolate

Disconnect compromised machine from network

Assess

  • Determine if credentials were compromised
  • Identify malware or unauthorised access

Rotate

Rotate all credentials accessible from compromised machine:

  • GitHub Personal Access Tokens
  • Cloudflare API tokens
  • SSH keys
  • Any other API keys or secrets

Rebuild

  • Reinstall operating system if needed
  • Restore code from GitHub (known-good state)
  • Verify integrity of local repositories
  • Re-clone dependencies

Communication Plan

Internal Communication

Emergency Contact Method: ISMS Owner (sole operator) is the primary responder; contact via Signal or direct phone

Escalation Chain:

  1. ISMS Owner (first responder, technical response, and major decisions)
  2. Customer communication handled directly by the ISMS Owner

Communication Channels:

  • Primary. Signal (encrypted, mobile)
  • Secondary. Email (for non-urgent coordination)
  • Tertiary. Direct phone calls (true emergencies)

Customer Communication

Status Communication:

  • Primary. Status page (if implemented) or GitHub Pages
  • Backup. Email to registered contacts
  • Social Media. X/Twitter @proviiwallet for major outages

Communication Templates:

Subject: [Provii Status] Service Degradation - [Service Name]

We are currently experiencing degraded performance on [service name].

Impact: [description]
Estimated Resolution: [timeframe or "investigating"]
Updates: [frequency]

We will provide updates every [30min/1hr] until resolved.

For real-time updates, follow @proviiwallet or check status.provii.app

We apologise for any inconvenience.

Subject: [Provii Status] Resolved - [Service Name]

The issue affecting [service name] has been resolved as of [time] UTC.

Impact Duration: [X hours/minutes]
Root Cause: [brief, non-technical explanation]
Prevention: [what we're doing to prevent recurrence]

All services are now operating normally. We apologise for the disruption.

A detailed post-mortem will be published within 5 business days.

Subject: [Provii Security Advisory] [Brief Description]

We are issuing this security advisory regarding [issue].

Impact: [who is affected]
Action Required: [what customers need to do]
Timeline: [by when]

Details: [technical but understandable explanation]

For questions, contact security@maelstrom.au

We take security seriously and apologise for any inconvenience.

Testing and Maintenance

BCP Testing

Annual Full Test (Q4 each year):

  • Simulate major outage scenario
  • Test communication procedures
  • Verify recovery procedures work
  • Update documentation based on findings

Quarterly Table-Top Exercises:

  • Walk through scenarios with team
  • No actual system changes
  • Verify contact information current
  • Practice decision-making

Ad-Hoc Testing:

  • After real incidents (lessons learned)
  • When significant infrastructure changes occur
  • When new team members onboard

Backup Procedures

What We Back Up:

Source Code

  • Location: GitHub (primary), local clones (secondary)
  • Frequency: Continuous (every commit)
  • Retention: Indefinite (Git history)
  • Recovery: git clone https://github.com/...

Configuration

  • Location: Infrastructure as code (in Git)
  • Frequency: Every change
  • Retention: Version controlled
  • Recovery: Apply from wrangler.toml or KV API

Signing Keys

  • Location: Cloudflare KV (primary), offline backup (secondary, encrypted)
  • Frequency: At generation, after rotation
  • Retention: Until rotated + 1 year
  • Recovery: Restore from offline backup

KV Data

  • Location: Cloudflare KV (replicated)
  • Frequency: Weekly exports (planned - see RISK-2025-M005)
  • Retention: 90 days
  • Recovery: KV bulk upload via API

What We Don’t Back Up:

  • Ephemeral state (Durable Objects, challenges)
  • Audit logs (retained 90 days, then discarded; critical security event logs are retained for up to 365 days)
  • Analytics data (acceptable loss)
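The 90-day retention for KV exports could be enforced with a small housekeeping step once the planned weekly exports exist. The sketch below is an illustration only: the plan does not specify where exports are stored, so the single-directory layout, the `*.json` pattern, and the function name are all assumptions.

```shell
# List export files older than a retention period (default 90 days).
# Usage: expired_exports DIR [DAYS]
# ASSUMPTION: exports are *.json files kept in one directory.
# Note: -mtime +N matches files last modified more than N full days ago.
expired_exports() {
  find "$1" -type f -name '*.json' -mtime +"${2:-90}"
}
```

Listing first and deleting second keeps the step reviewable: inspect the output of `expired_exports /path/to/exports`, then remove the listed files once the list looks right.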

Maintenance Windows

Planned Maintenance: Rare (serverless architecture auto-scales)

If Maintenance Required:

  1. Announce: 72 hours advance notice
  2. Schedule: Off-peak hours (02:00-04:00 UTC)
  3. Duration: Target < 30 minutes
  4. Rollback: Prepared before starting
  5. Communicate: Start, progress, completion
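The notice period and off-peak window above can be validated before a maintenance announcement goes out. This is a minimal sketch with hypothetical function names; times are Unix timestamps in seconds, which are UTC by definition.

```shell
# Succeed if the announcement precedes the start by at least 72 hours.
# Usage: notice_ok ANNOUNCE_EPOCH START_EPOCH
notice_ok() {
  [ $(( $2 - $1 )) -ge $(( 72 * 3600 )) ]
}

# Succeed if a start time falls inside the 02:00-04:00 UTC off-peak window.
# Usage: in_offpeak_window START_EPOCH
in_offpeak_window() {
  secs_into_day=$(( $1 % 86400 ))
  [ "$secs_into_day" -ge $(( 2 * 3600 )) ] && [ "$secs_into_day" -lt $(( 4 * 3600 )) ]
}
```

For example, `notice_ok "$(date -u +%s)" "$start_epoch" && in_offpeak_window "$start_epoch"` succeeds only when a window announced now would meet both rules, where `start_epoch` is the planned start time.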

Recovery Time Objectives (RTO) / Recovery Point Objectives (RPO)

System/Service            RTO        RPO      Justification
Verifier API              1 hour     0        Critical service, stateless
Issuer API                2 hours    5 min    Important but less critical, minimal state
Development Environment   4 hours    0        All code in Git
Signing Keys              2 hours    0        Recovery from offline backup
KV Configuration          4 hours    7 days   Weekly backups planned
Documentation             24 hours   0        Hosted on GitHub

Continuous Improvement

After Every Incident:

  • Review BCP effectiveness
  • Update procedures based on lessons learned
  • Test new procedures

Annual Review (Q1):

  • Update contact information
  • Review RTO/RPO targets
  • Update disaster scenarios based on threat surface
  • Validate backup and recovery procedures
  • Test full plan with tabletop exercise

Next Scheduled Review: 2026-05-15


Related Documents

  1. Incident Response Policy - Responding to security incidents
  2. Risk Register - Business continuity risks
  3. Information Security Policy - Overarching security principles

Document Information

  • Version. 1.1
  • Effective Date. 2025-01-13
  • Last Updated. 2026-05-21
  • Owner. ISMS Owner
  • Maintained By. Security Lead
  • Review Frequency. Annually, after major incidents
  • Next Review. 2026-11-21
  • Classification. Public