Purpose
This Business Continuity Plan (BCP) sets out how Maelstrom AI aims to continue delivering zero-knowledge age verification services during significant disruptions, including infrastructure failures, cyber attacks, or other disasters.
Scope
This plan covers:
- Maelstrom AI platform services (provii-verifier, provii-issuer)
- Development and deployment capabilities
- Customer support and communication
- Team member availability and coordination
Out of Scope:
- End-user devices (customer responsibility)
- Relying party integrations (customer responsibility)
- Physical offices (we’re fully remote)
Business Impact Analysis
Critical Services
Service: https://verify.provii.app
Business Impact if Down:
- Age verification requests fail for all relying parties
- User experience degradation (can’t prove age)
- Revenue impact if SLA-based contracts exist
Maximum Tolerable Downtime: 4 hours
Recovery Time Objective (RTO): 1 hour
Recovery Point Objective (RPO): 0 (stateless, no data loss)
Dependencies:
- Cloudflare Workers platform
- Cloudflare KV (CONFIG, BANS)
- Cloudflare KV (challenge and nonce storage with TTL-based expiry)
- JWKS on CDN (for verification keys)
Service: https://issuer.provii.app
Business Impact if Down:
- New credentials cannot be issued
- Existing credentials still valid (verifier unaffected)
- Officer-initiated issuance blocked
Maximum Tolerable Downtime: 8 hours (less critical than verifier)
Recovery Time Objective (RTO): 2 hours
Recovery Point Objective (RPO): 5 minutes (session data in KV)
Dependencies:
- Cloudflare Workers platform
- Cloudflare KV (multiple namespaces)
- Signing keys in KV
- YubiKey challenge system
Capability: Ability to deploy fixes and updates
Business Impact if Down:
- Cannot deploy security patches
- Cannot respond to incidents with code changes
- Development velocity impacted
Maximum Tolerable Downtime: 24 hours
Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 0 (code in Git)
Dependencies:
- GitHub (source control, CI/CD)
- Developer workstations
- Cloudflare API access (for deployments)
Non-Critical Services
| Service | RTO | Impact if Down |
|---|---|---|
| Documentation site | 24 hours | Inconvenience, but services operate |
| CDN (static assets) | 4 hours | Moderate impact on new integrations |
| Developer tools | 48 hours | Delays development, no customer impact |
Disaster Scenarios
Scenario 1: Cloudflare Regional Outage
Likelihood: Low (Cloudflare has >300 PoPs globally)
Impact: Medium (automatic failover to other regions)
Response:
Detection (< 5 min)
- Cloudflare monitoring alerts
- Customer reports
- Status check at status.cloudflare.com
Assessment (< 10 min)
- Confirm regional vs. global outage
- Estimate impact scope (which endpoints affected)
- Check Cloudflare status page for ETR (estimated time to resolution)
Communication (< 15 min)
- Post status update (use non-Cloudflare channel if needed)
- Direct communication to key customers
- Internal team notification
Recovery (automatic)
- Cloudflare Workers automatically fail over to healthy regions
- Durable Objects migrate to available locations
- No manual intervention required in most cases
Prevention: Accept this risk (documented in Risk Register - beyond our control)
Scenario 2: Cloudflare Global Platform Outage
Likelihood: Very Rare (last major outage: 2020)
Impact: Severe (complete service unavailability)
Response:
Immediate actions (< 15 min)
- Confirm via status.cloudflare.com and multiple monitoring sources
- Activate communication plan (external status page via GitHub Pages)
- Notify customers via email (templated message)
- ISMS Owner reviews options and documents decision rationale
Wait and monitor
No technical recovery possible - we are fully dependent on Cloudflare
Actions:
- Monitor Cloudflare status updates
- Provide hourly communication to customers
- Document outage for post-incident review
- Estimate business impact
Post-recovery (< 2 hours after restoration)
- Verify all services operational
- Check data integrity in KV/Durable Objects
- Run smoke tests on critical endpoints
- Conduct lessons learned review
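The smoke-test step above can be sketched as a small script. This is a minimal illustration, not the established procedure: the issuer `/health` URL is an assumption (only the verifier health URL appears in this plan), and treating any 2xx response as a pass is also an assumption.

```shell
#!/usr/bin/env bash
# Post-recovery smoke test sketch.
# ASSUMPTION: the issuer exposes /health like the verifier does.
ENDPOINTS="https://verify.provii.app/health https://issuer.provii.app/health"

# Map an HTTP status code to a verdict (any 2xx passes).
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo "PASS" ;;
    *)           echo "FAIL" ;;
  esac
}

# Print the HTTP status code for a URL; curl prints 000 if it cannot connect.
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Check every endpoint; return non-zero if any check fails.
run_smoke_tests() {
  local failed=0 url code verdict
  for url in $ENDPOINTS; do
    code=$(http_status "$url") || true
    verdict=$(classify_status "$code")
    echo "$verdict $code $url"
    [ "$verdict" = "PASS" ] || failed=1
  done
  return "$failed"
}
```

Calling `run_smoke_tests` after restoration prints one line per endpoint and exits non-zero if any endpoint fails, which makes it suitable for attaching to the post-incident record.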
Long-Term Mitigation (considered but not implemented):
- Multi-cloud deployment (AWS Lambda + Cloudflare Workers)
- Decision: Cost and complexity exceed benefit given Cloudflare’s reliability
- Reviewed: Annually during BCP review
Scenario 3: Signing Key Compromise
Likelihood: Low (strong controls in place)
Impact: Severe (trust model compromised)
Response: See Incident Response - Signing Key Compromise section
Summary:
Immediate (< 15 min)
Revoke compromised key, update key status to Disabled in KV
Short-term (< 1 hour)
Generate new key pair, update JWKS
Communication (< 2 hours)
Notify relying parties, publish advisory
Recovery (< 24 hours)
Monitor for fraudulent credentials, assess damage
Prevention: Keys stored in Cloudflare Workers secrets; HSM may be evaluated as key management requirements grow
Scenario 4: GitHub Outage or Account Compromise
Likelihood: Low for outage, Medium for account compromise attempts
Impact: High (cannot deploy, CI/CD blocked)
Response:
If GitHub is Down:
- Wait for restoration where possible (there is no alternative CI/CD pipeline)
- Use local development environments
- For urgent fixes, deploy manually via wrangler deploy (requires Cloudflare credentials)
- Communicate delay to stakeholders
If Account Compromised:
Immediate (< 10 min)
- Revoke all Personal Access Tokens
- Reset password, enforce MFA
- Review recent commits for malicious changes
- Disable compromised account if needed
Investigation (< 1 hour)
- Review audit logs
- Check for unauthorised repository access
- Identify blast radius (which repos accessed)
- Assess if secrets were exposed
Recovery (< 4 hours)
- Revert malicious commits
- Rotate any exposed secrets
- Re-enable account with enhanced security
- Deploy clean code state
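Reviewing recent commits for malicious changes could start from a listing like the one sketched below. The function names and the default 7-day window are illustrative choices, and the `.github/workflows` path in the usage note assumes the CI/CD pipeline runs on GitHub Actions, which this plan does not state explicitly.

```shell
#!/usr/bin/env bash
# Commit review sketch; run from inside a local clone of the repository.

# List commits on all branches since a given date, one per line:
# short hash | author name | author email | ISO 8601 author date | subject
list_recent_commits() {
  local since="${1:-7 days ago}"
  git log --all --since="$since" --format='%h|%an|%ae|%aI|%s'
}

# Same listing, restricted to commits that touched a given path.
commits_touching() {
  local path="$1" since="${2:-7 days ago}"
  git log --all --since="$since" --format='%h|%an|%ae|%aI|%s' -- "$path"
}
```

For example, `commits_touching .github/workflows "14 days ago"` narrows the review to pipeline changes, which are a common target because they can expose deployment secrets. Author names and emails in Git are self-declared, so the listing is a starting point for review rather than proof of who made a change.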
Prevention: MFA required, hardware security keys planned
Scenario 5: Loss of Key Personnel
Likelihood: Low but possible
Impact: Medium (knowledge loss, capability reduction)
Response:
Sudden Unavailability (illness, accident):
- Maelstrom AI operates with a single operator (ISMS Owner); there is an acknowledged single-operator bus factor
- Documentation enables continuity (this ISMS, code comments, README files, deployment runbooks)
- Access recovery procedures are documented, including account recovery paths for Cloudflare and GitHub
- Critical procedures documented (deployment, incident response, key rotation)
Planned Departure:
Transition period (4+ weeks)
- Knowledge transfer sessions
- Documentation review and updates
- Shadow on-call rotations
- Access credential handover
Access management
- Revoke personal credentials on last day
- Transfer ownership of critical resources
- Update contact information
Mitigation:
- Public documentation of all processes (reduces knowledge silos)
- Infrastructure as code (no manual “tribal knowledge”)
- All critical access held by the ISMS Owner (sole operator); access recovery is addressed in this Business Continuity Plan
- Standard tools and practices (not bespoke systems)
Scenario 6: Supply Chain Attack
Likelihood: Medium (attacks on npm/crates.io have occurred)
Impact: Severe (compromised artifacts distributed)
Response: See SLSA Level 3 protections
If Attack Detected:
Immediate (< 30 min)
- Halt all deployments
- Identify compromised dependency
- Assess if production artifacts affected
- Roll back to known-good version if needed
Investigation (< 2 hours)
- Determine attack vector
- Check if artifacts were signed correctly
- Review provenance attestations
- Scan all builds for malicious code
Recovery (< 8 hours)
- Remove or update compromised dependency
- Rebuild and re-sign all artifacts
- Deploy clean versions
- Notify customers if distributed artifacts affected
Prevention:
- Hermetic builds (tamper-resistant)
- Artifact signing (detects tampering)
- Security scanning in CI/CD
- Dependency pinning (Cargo.lock, package-lock.json)
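When assessing whether production artifacts were affected, a plain digest comparison against a known-good value is a useful first check. The sketch below is a SHA-256 checksum comparison, which is simpler than the signature and provenance verification referred to above and does not replace it. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Artifact digest check sketch (checksum comparison, not signature verification).

# Print the SHA-256 hex digest of a file (GNU sha256sum, else shasum).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Return 0 only if the file's digest equals the expected digest.
verify_artifact() {
  local file="$1" expected="$2" actual
  actual=$(sha256_of "$file")
  if [ "$actual" = "$expected" ]; then
    echo "OK $file"
  else
    echo "MISMATCH $file (expected $expected, got $actual)" >&2
    return 1
  fi
}
```

The expected digest should come from a source the attacker could not have modified, such as a provenance attestation recorded at build time.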
Recovery Procedures
Service Recovery
For provii-verifier or provii-issuer:
Manual Redeployment
If Workers need manual redeployment:
# Ensure you have latest code
git checkout main
git pull
# Navigate to service directory
cd provii-verifier # or provii-issuer/worker
# Deploy with wrangler
wrangler deploy --env production
# Verify deployment
curl https://verify.provii.app/health # or issuer endpoint
Prerequisites:
- Cloudflare API token with Workers deployment permissions
- wrangler CLI installed locally
- Access to source code repository
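The prerequisites above can be checked before starting a manual redeployment. The following is a minimal sketch under stated assumptions: it expects the API token in the `CLOUDFLARE_API_TOKEN` environment variable and does not detect an interactive `wrangler login` session, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Manual-deploy preflight sketch.

# Print each named command that is not on PATH, one per line.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Return 0 if the tools and token needed for a manual deploy appear present.
preflight() {
  local missing
  missing=$(missing_commands git wrangler curl)
  if [ -n "$missing" ]; then
    echo "Missing tools: $(echo "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
  # ASSUMPTION: token-based auth via environment variable.
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "CLOUDFLARE_API_TOKEN is not set" >&2
    return 1
  fi
  echo "Preflight checks passed"
}
```

Running `preflight` during tabletop exercises, not only during an incident, helps confirm that the recovery workstation stays deployment-ready.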
KV Data Recovery
If KV data is lost or corrupted:
CONFIG namespace:
- Restore from backup (weekly KV exports)
- Or rebuild from infrastructure-as-code templates
- Apply via KV API or Cloudflare dashboard
Audit logs (IS_AUDIT_LOG):
- Accept loss if backup unavailable (operational data, not critical)
- Focus on restoring service functionality first
Signing keys (IS_KEYS):
- CRITICAL. Restore from offline backup (secure location)
- If backup unavailable: Emergency key generation + rotation procedure
- Notify all relying parties of key change
# Restore KV value example
wrangler kv:key put --binding=CONFIG \
"config_key" "config_value" \
--env production
Durable Objects Recovery
If Durable Objects unavailable:
Automatic: Cloudflare migrates Durable Objects to healthy infrastructure
Manual intervention rarely needed, but if required:
- KV entries for challenges and nonces have TTL-based auto-deletion
- Loss is acceptable - challenges will expire naturally
- New challenges created on-demand
- No persistent data recovery needed
Development Recovery
If development environment compromised:
Isolate
Disconnect compromised machine from network
Assess
- Determine if credentials were compromised
- Identify malware or unauthorised access
Rotate
Rotate all credentials accessible from compromised machine:
- GitHub Personal Access Tokens
- Cloudflare API tokens
- SSH keys
- Any other API keys or secrets
Rebuild
- Reinstall operating system if needed
- Restore code from GitHub (known-good state)
- Verify integrity of local repositories
- Re-clone dependencies
Communication Plan
Internal Communication
Emergency Contact Method: ISMS Owner (sole operator) is the primary responder; contact via Signal or direct phone
Escalation Chain:
- ISMS Owner (first responder, technical response, and major decisions)
- Customer communication handled directly by the ISMS Owner
Communication Channels:
- Primary: Signal (encrypted, mobile)
- Secondary: Email (for non-urgent coordination)
- Tertiary: Direct phone calls (true emergencies)
Customer Communication
Status Communication:
- Primary: Status page (if implemented) or GitHub Pages
- Backup: Email to registered contacts
- Social Media: X/Twitter @proviiwallet for major outages
Communication Templates:
Subject: [Provii Status] Service Degradation - [Service Name]
We are currently experiencing degraded performance on [service name].
Impact: [description]
Estimated Resolution: [timeframe or "investigating"]
Updates: [frequency]
We will provide updates every [30min/1hr] until resolved.
For real-time updates, follow @proviiwallet or check status.provii.app
We apologize for any inconvenience.
Subject: [Provii Status] Resolved - [Service Name]
The issue affecting [service name] has been resolved as of [time] UTC.
Impact Duration: [X hours/minutes]
Root Cause: [brief, non-technical explanation]
Prevention: [what we're doing to prevent recurrence]
All services are now operating normally. We apologize for the disruption.
A detailed post-mortem will be published within 5 business days.
Subject: [Provii Security Advisory] [Brief Description]
We are issuing this security advisory regarding [issue].
Impact: [who is affected]
Action Required: [what customers need to do]
Timeline: [by when]
Details: [technical but understandable explanation]
For questions, contact security@maelstrom.au
We take security seriously and apologize for any inconvenience.
Testing and Maintenance
BCP Testing
Annual Full Test (Q4 each year):
- Simulate major outage scenario
- Test communication procedures
- Verify recovery procedures work
- Update documentation based on findings
Quarterly Table-Top Exercises:
- Walk through scenarios with team
- No actual system changes
- Verify contact information current
- Practice decision-making
Ad-Hoc Testing:
- After real incidents (lessons learned)
- When significant infrastructure changes occur
- When new team members are onboarded
Backup Procedures
What We Back Up:
Source Code
Location: GitHub (primary), local clones (secondary)
Frequency: Continuous (every commit)
Retention: Indefinite (Git history)
Recovery: git clone https://github.com/...
Configuration
Location: Infrastructure as code (in Git)
Frequency: Every change
Retention: Version controlled
Recovery: Apply from wrangler.toml or KV API
Signing Keys
Location: Cloudflare KV (primary), offline backup (secondary, encrypted)
Frequency: At generation, after rotation
Retention: Until rotated + 1 year
Recovery: Restore from offline backup
KV Data
Location: Cloudflare KV (replicated)
Frequency: Weekly exports (planned - see RISK-2025-M005)
Retention: 90 days
Recovery: KV bulk upload via API
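Once weekly exports are in place, their freshness and retention can be checked mechanically. The sketch below assumes exports are stored as `*.json` files in a single local directory, which is a hypothetical layout. The 7-day and 90-day defaults follow the frequency and retention stated above.

```shell
#!/usr/bin/env bash
# KV export freshness and retention sketch.
# ASSUMPTION: exports are *.json files kept in one directory.

# Return 0 if DIR holds at least one export modified within MAX_DAYS days.
has_recent_export() {
  local dir="$1" max_days="${2:-7}" recent
  recent=$(find "$dir" -type f -name '*.json' -mtime "-$max_days" 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}

# List exports older than the retention period, as candidates for pruning.
expired_exports() {
  local dir="$1" retention_days="${2:-90}"
  find "$dir" -type f -name '*.json' -mtime "+$retention_days"
}
```

A failing `has_recent_export` check means the effective recovery point for KV configuration has drifted beyond the weekly export interval.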
What We Don’t Back Up:
- Ephemeral state (Durable Objects, challenges)
- Audit logs (retained 90 days, then discarded; critical security event logs are retained for up to 365 days)
- Analytics data (acceptable loss)
Maintenance Windows
Planned Maintenance: Rare (serverless architecture auto-scales)
If Maintenance Required:
- Announce: 72 hours advance notice
- Schedule: Off-peak hours (02:00-04:00 UTC)
- Duration: Target < 30 minutes
- Rollback: Prepared before starting
- Communicate: Start, progress, completion
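The scheduling rules above reduce to two small checks, sketched here with hypothetical function names. The window check takes a two-digit UTC hour and treats 02:00 up to, but not including, 04:00 as in-window. The notice calculation works in epoch seconds.

```shell
#!/usr/bin/env bash
# Maintenance window sketch.

# Return 0 if a two-digit UTC hour ("00".."23") is inside 02:00-04:00 UTC.
in_maintenance_window() {
  case "$1" in
    02|03) return 0 ;;
    *)     return 1 ;;
  esac
}

# Earliest permissible start (epoch seconds) for an announcement made at
# the given epoch time, applying the 72-hour notice period.
earliest_start_epoch() {
  echo $(( $1 + 72 * 3600 ))
}
```

For example, `in_maintenance_window "$(date -u +%H)"` checks the current hour, and `earliest_start_epoch "$(date -u +%s)"` gives the earliest start for an announcement made now.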
Recovery Time Objectives (RTO) / Recovery Point Objectives (RPO)
| System/Service | RTO | RPO | Justification |
|---|---|---|---|
| Verifier API | 1 hour | 0 | Critical service, stateless |
| Issuer API | 2 hours | 5 min | Important but less critical, minimal state |
| Development Environment | 4 hours | 0 | All code in Git |
| Signing Keys | 2 hours | 0 | Recovery from offline backup |
| KV Configuration | 4 hours | 7 days | Weekly backups planned |
| Documentation | 24 hours | 0 | Hosted on GitHub |
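During an incident, the RTO targets in the table above can be compared against elapsed downtime. The following is a minimal sketch: the short service identifiers are hypothetical labels for the table rows, and elapsed time is supplied by the operator in minutes.

```shell
#!/usr/bin/env bash
# RTO tracking sketch; targets in minutes, taken from the table above.

# Print the RTO in minutes for a service label (labels are illustrative).
rto_minutes() {
  case "$1" in
    verifier)      echo 60 ;;
    issuer)        echo 120 ;;
    signing-keys)  echo 120 ;;
    development)   echo 240 ;;
    kv-config)     echo 240 ;;
    documentation) echo 1440 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Report whether an outage of ELAPSED minutes is within the RTO for SERVICE.
rto_status() {
  local service="$1" elapsed="$2" rto
  rto=$(rto_minutes "$service") || return 2
  if [ "$elapsed" -le "$rto" ]; then
    echo "WITHIN_RTO ($elapsed/$rto min)"
  else
    echo "RTO_BREACHED ($elapsed/$rto min)"
  fi
}
```

For example, `rto_status verifier 45` reports that a 45-minute verifier outage is still within its 60-minute target, while `rto_status issuer 121` reports a breach of the issuer's 120-minute target.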
Continuous Improvement
After Every Incident:
- Review BCP effectiveness
- Update procedures based on lessons learned
- Test new procedures
Annual Review (Q1):
- Update contact information
- Review RTO/RPO targets
- Update disaster scenarios based on threat surface
- Validate backup and recovery procedures
- Test full plan with tabletop exercise
Next Scheduled Review: 2026-05-15
Related Documents
- Incident Response Policy - Responding to security incidents
- Risk Register - Business continuity risks
- Information Security Policy - Overarching security principles
Document Information
- Version: 1.1
- Effective Date: 2025-01-13
- Last Updated: 2026-05-21
- Owner: ISMS Owner
- Maintained By: Security Lead
- Review Frequency: Annually, after major incidents
- Next Review: 2026-11-21
- Classification: Public