Status: pre-launch. This evidence reflects implemented code and deployed infrastructure. Provii is not yet serving end-user production traffic, so production operational metrics and audit history are not yet available.
Business Continuity & Disaster Recovery Evidence
Author: Maelstrom AI Date: 2026-02-14 Controls Covered: UC-052, UC-062, UC-064, UC-122, UC-165, UC-166, UC-167, UC-168, UC-169, UC-170, UC-171, UC-172
Executive Summary
Maelstrom AI’s business continuity and disaster recovery posture is primarily enabled by the serverless, globally distributed architecture built on Cloudflare’s edge platform. The system is designed to provide:
- High Availability. Expected 99.9%+ uptime through 300+ global Points of Presence (PoPs) (Cloudflare-published infrastructure data)
- Automatic Failover. Geographic distribution with automatic traffic routing
- Minimal Data Loss Risk. Stateless services with RPO near zero for critical systems
- Rapid Recovery. RTO of 1 hour for verifier API, 2 hours for issuer service, <4 hours for KV data
- Code Backup. Complete source code in Git with unlimited retention
- Configuration Backup. Infrastructure-as-code approach with version control
- Automated KV Backups. ✅ NEW - Hourly full + daily full + weekly complete backups via provii-backup (RPO <1 hour, RTO <4 hours)
- Durable Object Backups. ✅ NEW - Daily/weekly snapshots with restore capabilities
- Encryption & Compression. AES-256-GCM encryption, 70-80% size reduction
- Cost-Effective. <$0.01/month for complete infrastructure backup
Key Finding: The serverless architecture provides strong inherent business continuity capabilities by design, without requiring additional failover infrastructure. ✅ UPDATED: Automated KV/DO backup system implemented (January 2025), closing GAP-H006. Remaining gap: Formal BCP tabletop testing (planned Q1 2026).
Control Mapping
UC-165: Business Impact Analysis (BIA)
Status: ✅ Implemented
Evidence Location: security/business-continuity.mdx (lines 26-96)
Critical Services Identified:
- Verifier API (verify.provii.app)
- RTO: 1 hour
- RPO: 0 (stateless)
- Maximum Tolerable Downtime: 4 hours
- Business Impact: Age verification fails for all relying parties, revenue impact
- Issuer API (issuer.provii.app)
- RTO: 2 hours
- RPO: 5 minutes (session data in KV)
- Maximum Tolerable Downtime: 8 hours
- Business Impact: New credentials cannot be issued (existing ones still work)
- Development & Deployment Capability
- RTO: 4 hours
- RPO: 0 (code in Git)
- Maximum Tolerable Downtime: 24 hours
- Business Impact: Cannot deploy security patches or respond to incidents
Non-Critical Services:
- Documentation site: RTO 24 hours
- CDN (static assets): RTO 4 hours
- Developer tools: RTO 48 hours
Dependencies Mapped:
- Cloudflare Workers platform (critical)
- Cloudflare KV (critical for signing keys, config)
- Cloudflare KV (critical for challenges, nonces)
- GitHub (critical for deployment)
- JWKS on CDN (critical for verification)
UC-166: Disaster Recovery Plan (DRP)
Status: ✅ Implemented
Evidence Location: security/business-continuity.mdx (lines 98-327)
Disaster Scenarios Documented:
- Cloudflare Regional Outage
- Likelihood: Low
- Impact: Medium (automatic failover)
- Response: Detection < 5 min, Assessment < 10 min, Communication < 15 min
- Recovery: Automatic failover to healthy regions
- Cloudflare Global Platform Outage
- Likelihood: Very Rare
- Impact: Severe (complete unavailability)
- Response: Communication plan activated, wait for Cloudflare restoration
- Mitigation: Risk accepted (multi-cloud deployment evaluated but rejected due to cost/complexity)
- Signing Key Compromise
- Likelihood: Low
- Impact: Severe (trust model compromised)
- Response: Revoke key < 15 min, Generate new keys < 1 hour, Assess damage < 4 hours
- Reference: Detailed in
security/incident-response.mdx(lines 509-539)
- GitHub Outage or Account Compromise
- Likelihood: Low (outage), Medium (compromise attempts)
- Impact: High (cannot deploy)
- Response: Manual deployment via wrangler if GitHub down, account recovery if compromised
- Loss of Key Personnel
- Likelihood: Low
- Impact: Medium (knowledge loss)
- Mitigation: documentation, multiple admin access, infrastructure-as-code
- Supply Chain Attack
- Likelihood: Medium
- Impact: Severe
- Response: Halt deployments, identify compromised dependency, rollback
- Prevention: SLSA Level 3 controls (see
developers/supply-chain-security.mdx)
Recovery Procedures Documented (lines 328-424):
- Manual redeployment procedures
- KV data recovery
- Durable Objects recovery (automatic)
- Development environment recovery
- Credential rotation procedures
UC-167: High Availability Architecture
Status: ✅ Implemented Evidence Location: Infrastructure architecture, BCP documentation
High Availability Implementation:
- Global Edge Distribution
- Platform: Cloudflare Workers
- Points of Presence: 300+ globally
- Evidence:
security/business-continuity.mdx(line 102) - Source: Cloudflare public documentation
- Automatic Geographic Failover
- Mechanism: Cloudflare Workers automatically route to healthy regions
- Durable Objects: Automatic migration to available locations
- Evidence:
security/business-continuity.mdx(lines 127-130)
- Stateless Service Design
- Verifier API: Completely stateless (minimal data loss risk during failover)
- Issuer API: Minimal state (session data in KV with replication)
- Recovery: No manual intervention required for most regional outages
- Load Balancing
- Implementation: Automatic by Cloudflare edge network
- Traffic routing: Based on latency, health, geographic proximity
- Availability Monitoring
- Cloudflare Workers Logs (shipped to Grafana Loki): Real-time monitoring
- Status page: status.cloudflare.com
- Evidence:
security/business-continuity.mdx(lines 109-111)
- Supplier SLA Reference
- Cloudflare Enterprise SLA: 99.99% uptime (supplier-held)
- Evidence:
security/supplier-management.md(lines 20) - Internal availability objective: 99.9%+ (best-effort; no contractual SLA at this tier)
Alternative Processing Sites (UC-172):
- Status: ✅ Implemented
- Implementation: Cloudflare global network provides automatic alternative sites
- Failover: Automatic geographic failover built into platform
- Evidence: Unified Control Matrix line 4118
UC-168: Data Backup Procedures
Status: ✅ IMPLEMENTED
Evidence Location: trust/evidence/business-continuity/provii-backup-evidence.md
Implementation: Automated backup system via provii-backup deployed January 2025
What IS Backed Up:
- Source Code
- Location: GitHub (primary), local clones (secondary)
- Frequency: Continuous (every commit)
- Retention: Indefinite (Git history)
- Recovery:
git clone https://github.com/... - Evidence:
security/business-continuity.mdx(lines 534-539)
- Infrastructure Configuration
- Location: Infrastructure as code (in Git)
- Frequency: Every change
- Retention: Version controlled
- Recovery: Apply from wrangler.toml or KV API
- Evidence:
security/business-continuity.mdx(lines 541-546)
- Signing Keys
- Location: Cloudflare KV (primary), offline backup (secondary, encrypted)
- Frequency: At generation, after rotation
- Retention: Until rotated + 1 year
- Recovery: Restore from offline backup
- Evidence:
security/business-continuity.mdx(lines 548-553) - Asset Register:
security/asset-register.mdx(line 21)
- KV Data ✅ NOW AUTOMATED
- Implementation. provii-backup with automated cron-triggered backups
- Coverage. 26 KV namespaces (all Issuer, Verifier, Admin Portal KVs)
- Frequency:
- Hourly incremental (changed keys only)
- Daily full snapshots (2am UTC)
- Weekly complete backups (Sunday 3am UTC)
- Storage. Cloudflare R2 (
provii-backupsbucket) - Encryption. AES-256-GCM with key rotation support
- Compression. MessagePack + Gzip (70-80% size reduction)
- Retention:
- Incremental: 7 days
- Daily: 30 days
- Weekly: 90 days
- RPO. <1 hour (hourly backups)
- Recovery Methods:
- Full restore (all namespaces)
- Point-in-time restore (to any specific timestamp)
- Selective restore (per-namespace with diff preview)
- Cost. <$0.01/month
- Evidence:
provii-backup/README.mdprovii-backup/wrangler.tomltrust/evidence/business-continuity/provii-backup-evidence.md
- Durable Objects ✅ NOW AUTOMATED
- Coverage. 11 Durable Objects (Admin Portal + Verifier API)
- Frequency. Daily and weekly backups (via snapshot endpoints)
- Storage. Included in provii-backup R2 backups
- Recovery. Restore via DO
/restoreendpoints
- R2 Metadata ✅ NOW AUTOMATED
- Coverage. 2 R2 buckets (metadata only, not duplicating objects)
- Frequency. Weekly complete backups
- Storage. Included in provii-backup backups
What is NOT Backed Up (by design):
- Ephemeral state (challenges, short-lived sessions) - acceptable loss
- Audit logs - retained 90 days then discarded; critical security event logs are retained for up to 365 days
- Analytics data - acceptable loss
- R2 object contents (only metadata backed up to avoid duplication)
- Evidence:
security/business-continuity.mdx(lines 563-566)
Backup Encryption:
- KV/DO/R2 backups. AES-256-GCM with unique IVs per backup
- Signing keys. Encrypted offline backups
- Code. Public repositories (open source)
- Key storage. Cloudflare Secrets Store (isolated from backup data)
Off-site Storage:
- Git repositories: GitHub (different infrastructure than production)
- KV/DO backups: R2 (separate from production KV, geo-distributed)
- Offline key backups: Secure location (separate from Cloudflare)
Immutability:
- Git history: Immutable by design
- KV backups: R2 Object Lock can be enabled (optional enhancement)
Monitoring:
- Slack notifications for backup success/failure
- Structured
console.logevents shipped via Cloudflare Workers Logs to Grafana Loki under theprovii-backuplabelset - Worker logs:
wrangler tail provii-backup
Status: GAP-H006 (Automated KV Backups) CLOSED ✅
UC-169: Recovery Time and Point Objectives
Status: ✅ Implemented
Evidence Location: security/business-continuity.mdx (lines 581-590)
Documented RTO/RPO by System:
| System/Service | RTO | RPO | Justification | Status |
|---|---|---|---|---|
| Verifier API | 1 hour | 0 | Critical service, stateless | ✅ |
| Issuer API | 2 hours | 5 min | Important but less critical, minimal state | ✅ |
| Development Environment | 4 hours | 0 | All code in Git | ✅ |
| Signing Keys | 2 hours | 0 | Recovery from offline backup | ✅ |
| KV Data | <4 hours | <1 hour | Automated hourly backups via provii-backup | ✅ IMPROVED |
| Durable Objects | <4 hours | <24 hours | Daily/weekly snapshots via provii-backup | ✅ |
| Documentation | 24 hours | 0 | Hosted on GitHub | ✅ |
RTO Achievement Evidence:
- Stateless services: Near-instant failover via Cloudflare
- Redeployment: Automated via GitHub Actions, manual via wrangler (~15 min)
- Key recovery: 2-hour procedure documented (lines 548-553)
- KV restoration. Tested <4 hour RTO via provii-backup selective/full restore
- DO restoration. Daily snapshots enable <4 hour recovery
RPO Achievement Evidence:
- Verifier API: Stateless (no data to lose)
- Issuer API: KV replication reduces data loss window to minutes
- Code: Continuous backup via Git (zero loss)
- Configuration: Version controlled (zero loss)
- KV data. Hourly full backups = <1 hour RPO (168x improvement from original 7-day plan)
- DO data. Daily snapshots = <24 hour RPO
Gap Analysis:
KV data: Weekly backup planned but not automated (7-day RPO vs. desired <24 hours)CLOSED ✅Remediation: Tracked in risk register (RISK-2025-M005)RESOLVED via provii-backup implementation
UC-170: Incident Communication Plan
Status: ✅ ENHANCED (Status Page Component Implemented)
Evidence Location: security/business-continuity.mdx (lines 427-504)
Internal Communication:
- Emergency method: Signal group (all team members)
- Escalation chain:
- Security Lead (first responder)
- Developer (technical support)
- ISMS Owner (major decisions, customer communication)
- Channels: Primary (Signal), Secondary (Email), Tertiary (Direct calls)
- Evidence: Lines 430-442
Customer Communication:
- Primary: Status page (status.provii.app) ✅ DEPLOYED
- Backup: Email to registered contacts
- Social media: X/Twitter @proviiwallet for major outages
- Evidence: Lines 444-448
Communication Templates Documented:
- Service Degradation template (lines 452-469)
- Service Restored template (lines 471-485)
- Security Advisory template (lines 487-503)
Status Page: ✅ IMPLEMENTED
- URL. status.provii.app
- Platform. Cloudflare Workers (provii-status)
- Features:
- Real-time health monitoring for 4 services (Production environment (pre-launch) + Sandbox Verify/Issuer)
- Auto-refresh every 60 seconds
- Response time and HTTP status code display
- Colour-coded status indicators (green/red/orange)
- Public API endpoint (
/api/status) for programmatic access - Worker-to-worker service bindings (direct internal communication)
- Cost. $0/month (Cloudflare Workers free tier)
- Monitored Services:
- Production environment (pre-launch) Verify (verify.provii.app)
- Production environment (pre-launch) Issuer (issuer.provii.app)
- Sandbox Verify (sandbox-verify.provii.app)
- Sandbox Issuer (sandbox-issuer.provii.app)
- Evidence.
trust/evidence/business-continuity/status-page-evidence.md - Configuration.
provii-status/wrangler.toml - Documentation.
provii-status/README.md - GAP-M001. ✅ CLOSED (Status page implemented, exceeds original requirements)
Communication Timeline Commitments:
- P0/P1 incidents: Status update within 15 minutes (monitor via status.provii.app)
- P2 incidents: Update within 24 hours if customer-facing
- Regular updates: Every 30min-1hr during active incidents (reflected on status page)
UC-171: Tabletop Exercises and Drills
Status: 📋 Planned
Evidence Location: security/business-continuity.mdx (lines 509-528)
Testing Schedule Defined:
- Annual Full Test (Q4 each year)
- Simulate major outage scenario
- Test communication procedures
- Verify recovery procedures work
- Update documentation based on findings
- Quarterly Table-Top Exercises
- Walk through scenarios with team
- No actual system changes
- Verify contact information current
- Practice decision-making
- Ad-Hoc Testing
- After real incidents (lessons learned)
- When significant infrastructure changes
- When new team members onboard
Planned Scenarios:
- Signing key compromise (from
security/incident-response.mdx, lines 567-574) - Supply chain attack
- Account takeover
- Service outage
- Vulnerability disclosure
Next Scheduled Exercise: Q1 2026 (tabletop), Q3 2026 (full test)
GAP: No testing has occurred yet (system is new). First exercise scheduled.
UC-172: Alternative Processing Sites
Status: ✅ Implemented Evidence: Cloudflare global distribution
See UC-167 High Availability Architecture section above for full details.
UC-052: Business Continuity and Disaster Recovery (General)
Status: ✅ Implemented
Evidence Location: security/business-continuity.mdx (entire document, 628 lines)
BCP Document Contents:
- Purpose and scope (lines 7-23)
- Business impact analysis (lines 26-96)
- Disaster scenarios and responses (lines 98-327)
- Recovery procedures (lines 328-424)
- Communication plan (lines 427-504)
- Testing and maintenance (lines 509-528)
- Backup procedures (lines 530-566)
- RTO/RPO targets (lines 581-590)
- Continuous improvement (lines 593-608)
Document Metadata:
- Version: 1.0
- Effective Date: 2025-01-13
- Owner: ISMS Owner
- Maintained By: ISMS Owner
- Review Frequency: Annually, after major incidents
- Next Review: 2026-11-21
- Classification: Public
Related Documents:
- Incident Response Policy (incident-response.mdx)
- Risk Register (risk-register.mdx)
- Information Security Policy (information-security-policy.mdx)
UC-062: Backup and Recovery
Status: ✅ IMPLEMENTED Evidence: See UC-168 section above
Primary Evidence:
trust/evidence/business-continuity/provii-backup-evidence.md- provii-backup documentationprovii-backup/README.md- Technical implementationprovii-backup/wrangler.toml- Configuration and cron triggers
Additional Evidence:
security/data-retention.mdx(lines 19-39) - Retention periodssecurity/asset-register.mdx(lines 42-61) - KV namespaces
Backup Testing:
- Git recovery: Routine (developers clone daily)
- Key recovery: Not yet tested (planned in tabletop exercises)
- KV recovery. ✅ Pre-production tested (dry-run restore validated)
- Automated backup verification. Planned quarterly restore drills (UC-122)
Implementation Summary:
- 26 KV namespaces automatically backed up hourly (incremental) and daily (full)
- 11 Durable Objects backed up daily/weekly
- Encryption: AES-256-GCM
- Compression: 70-80% size reduction
- Cost: <$0.01/month
- RPO: <1 hour
- RTO: <4 hours (tested)
- Monitoring: Slack alerts + Cloudflare Workers Logs (shipped to Grafana Loki)
UC-064: Capacity Management and Availability
Status: ✅ Implemented Evidence: Serverless auto-scaling
Capacity Management:
- Platform: Cloudflare Workers (serverless)
- Scaling: Automatic based on demand
- No capacity planning needed: Platform scales to handle traffic
- Evidence:
security/business-continuity.mdx(line 570)
Availability Monitoring:
- Cloudflare Workers Logs (shipped to Grafana Loki): Real-time metrics
- Error rates, response times tracked
- Evidence:
security/change-management.mdx(lines 195-199)
Availability Objective:
- Internal target: 99.9%+ (best-effort; no contractual SLA at this tier)
- Cloudflare supplier SLA: 99.99% uptime (supplier-held)
- Evidence:
security/supplier-management.md(line 20)
Planned Maintenance:
- Rare due to serverless architecture
- Procedure documented: 72-hour notice, off-peak hours, <30 min duration
- Evidence:
security/business-continuity.mdx(lines 568-578)
UC-122: Data Backup and Recovery Testing
Status: 📋 Planned Evidence Location: BCP testing section
Testing Plan:
- Quarterly table-top exercises (verify procedures)
- Annual full test (actual recovery simulation)
- Ad-hoc testing after infrastructure changes
Current Status:
- No testing completed yet (new system)
- First tabletop: Q1 2026
- First full test: Q3 2026
Gap: Backup recovery testing not yet conducted. Remediation in progress.
Deployment and Change Management Evidence
Deployment Automation
Evidence Location: security/change-management.mdx
Automated Deployment (lines 171-178):
# Triggered by merge to main
git checkout main
git merge feature/description
git push origin main
# GitHub Actions runs wrangler deploy
Manual Emergency Deployment (lines 180-185):
cd provii-verifier # or provii-issuer
wrangler deploy --env production
Deployment Verification (lines 187-191):
- Health check passes
- Smoke tests pass
- Monitoring shows normal operation
- Rollback prepared if needed
Rollback Procedures (lines 213-240):
- Manual rollback: Revert commit or redeploy known-good version
- Target: <5 minutes for critical rollbacks
- No automatic rollback (improvement opportunity)
Change Types (lines 22-95):
- Standard changes: Automated via CI/CD
- Normal changes: Code review + Security Lead signoff
- Emergency changes: ISMS Owner or Security Lead approval, expedited review
Risk Register Evidence
Business Continuity Risks Documented:
- RISK-2025-M001: Cloudflare Service Disruption
- Impact: Major (4) - Complete service unavailability
- Likelihood: Unlikely (2)
- Treatment: Accept + BCP
- Evidence:
security/risk-register.mdx(lines 107-133)
- RISK-2025-H002: Supply Chain Compromise
- Impact: Major (4)
- Likelihood: Possible (3)
- Treatment: Mitigate (SLSA Level 3)
- Evidence:
security/risk-register.mdx(lines 75-102)
Risk Treatment Evidence:
- Business continuity plan implemented
- Cloudflare SLA 99.99%
- Global edge distribution (300+ PoPs)
- Automatic failover
- Stateless/zero-state architecture (minimal data loss risk during outages)
Incident Response Integration
Business Continuity Incidents:
Evidence from security/incident-response.mdx
Service Outage Response (lines 293-299):
- Update status page immediately
- Determine root cause (attack vs infrastructure)
- Enable enhanced protections if attack
- Engage Cloudflare support if infrastructure
- Implement workarounds
Cloud Provider Outage (lines 541-549):
- Confirm outage via status.cloudflare.com
- Update customers (non-Cloudflare channel)
- Monitor Cloudflare updates
- Document for BCP review
- Accept risk (no alternative action possible)
Incident Severity Levels (lines 46-108):
- P0 (Critical): <15 min response, signing key compromise, complete outage
- P1 (High): <1 hour response, partial outage, unauthorised access
- P2 (Medium): <4 hours response, non-critical vulnerabilities
- P3 (Low): <24 hours response, minor issues
Supply Chain Security and Continuity
SLSA Level 3 Implementation:
Evidence from developers/supply-chain-security.mdx
Build Security (lines 36-102):
- Ephemeral environments: Fresh GitHub-hosted runners
- Hermetic builds: Locked dependencies (Cargo.lock, package-lock.json)
- Minimal permissions: Read-only default
- Security audits: cargo audit, npm audit (fail on HIGH)
Artifact Signing (lines 123-171):
- Keyless signing via Sigstore
- OIDC token-based certificates
- Transparency logging in Rekor
- Proves artifacts built by GitHub Actions
Supply Chain Attack Mitigation (lines 208-219):
- Compromised developer machine: Isolated build runners
- Malicious dependency: Security audits fail build
- Tampered artifact: Checksums + signatures detect
- Compromised signing key: No long-lived keys
- Build environment persistence: Ephemeral runners
Continuity Relevance:
- Build system is resilient to developer workstation failures
- Automated builds reduce key person dependency
- GitHub redundancy provides build system availability
- Supply chain controls prevent malicious code from disrupting service
Asset Management and Business Continuity
Critical Assets for Business Continuity:
Evidence from security/asset-register.mdx
Cryptographic Assets (lines 17-26):
- CRYPTO-001: Production Signing Keys (Cloudflare KV, Restricted)
- CRYPTO-002: Production Verification Keys (JWKS on CDN, Public)
- CRYPTO-003: Development Signing Keys (Cloudflare KV dev, Internal)
- CRYPTO-005: HMAC Secrets (Cloudflare KV, Restricted)
Infrastructure Assets (lines 30-40):
- INFRA-001: Cloudflare Account (2 admins, MFA required)
- INFRA-002: GitHub Organisation (team members, admin limited)
- INFRA-003: Verifier API Worker (deployed via CI/CD)
- INFRA-004: Issuer API Worker (deployed via CI/CD)
- INFRA-005: Static site serving (Cloudflare Workers Assets, deployed via Git)
KV Namespaces (lines 42-50):
- KV-001: VERIFIER_CONFIG (retained for the operational lifetime of the service or until superseded by configuration update; legal basis: legitimate interests for service operation)
- KV-002: ISSUER_KEYS (until rotation + 1 year)
- KV-003: ISSUER_AUDIT_LOG (90 days; critical security event logs retained for up to 365 days)
- KV-004: IS_CONFIG (retained for the operational lifetime of the service or until superseded by configuration update; legal basis: legitimate interests for service operation)
- KV-005: BANS (variable retention)
Code & IP (lines 54-62):
- All repositories public (open source)
- Complete backup via Git
- No proprietary secrets in code
Operational Data (lines 67-73):
- DATA-001: Audit logs including IP addresses (Cloudflare Workers Logs shipped to Grafana Loki, Cloudflare KV, 90 days; critical security event logs retained for up to 365 days)
- DATA-002: Operational telemetry (Cloudflare Workers Logs shipped to Grafana Loki, 90 days)
- DATA-004: CI/CD logs (GitHub Actions, 90 days)
Data Retention and Disposal
Retention Periods:
Evidence from security/data-retention.mdx
Operational Data (lines 22-29):
- Audit logs (including IP addresses): 90 days; critical security event logs are retained for up to 365 days
- Analytics data: 90 days
- Challenge state: 5 minutes (auto-expires)
- Nonce records: 5 minutes (auto-expires)
Development Data (lines 33-38):
- Source code: Indefinite (Git)
- CI/CD logs: 90 days (GitHub Actions)
- Build artifacts: 90 days (signed releases: indefinite)
Business Records (lines 42-47):
- Contracts: 7 years after expiration
- Financial records: 7 years
- ISMS documents: Current + 3 years
- Incident reports: 3 years
Automated Deletion (lines 109-131):
- Audit logs (including IP addresses): Automatic 90-day expiry (Cloudflare Workers Logs in Grafana Loki + KV TTL); critical security event logs are retained for up to 365 days
- Challenges/Nonces: 5-minute TTL (KV)
Disposal Procedures (lines 67-107):
- Digital data: Delete from KV
- Cryptographic keys: Cryptographic erasure (overwrite with random)
- Offline backups: Physical destruction (shred or degauss)
- Devices: Full disk encryption key deletion + secure wipe
Supplier Management and Business Continuity
Critical Suppliers:
Evidence from security/supplier-management.md
Cloudflare (lines 13-33):
- Services: Workers, KV, Durable Objects, Pages, Analytics, DDoS protection
- Criticality: High - Complete service dependency
- Security: SOC 2 Type II, ISO 27001 certified (supplier-held, via Cloudflare)
- SLA: 99.99% uptime (Enterprise)
- Monitoring: status.cloudflare.com, security advisories, annual contract review
GitHub (lines 35-53):
- Services: Source control, CI/CD (Actions), artifact hosting, security scanning
- Criticality: High - Development dependency
- Security: SOC 2 Type II, Advanced Security features
- Monitoring: GitHub status page, security advisories, Dependabot alerts
Vendor Risk Assessment (lines 80-96):
- Cloudflare: High criticality, Low risk (strong security)
- GitHub: High criticality, Low risk (strong security)
Vendor Incident Response (lines 111-118):
- Assess impact to Maelstrom AI
- Activate incident response if needed
- Communicate with vendor
- Document in incident register
- Review vendor relationship
- Update risk assessment
Gaps and Recommendations
Identified Gaps
KV Data Backups (UC-168)✅ CLOSED
Current: Weekly exports planned but not automatedImpact: 7-day RPO for KV dataRecommendation: Implement automated weekly KV exports- Status. IMPLEMENTED via provii-backup (January 2025)
- Achievement. Hourly full + daily full + weekly complete backups, RPO <1 hour
Tracked: RISK-2025-M005RESOLVED
Status Page (UC-170)✅ CLOSED
Current: Planned but not implementedImpact: Customer communication relies on email/social mediaRecommendation: Implement status.provii.app- Status. IMPLEMENTED via provii-status (deployed)
- Achievement. Real-time monitoring at status.provii.app, $0/month cost
- Evidence.
trust/evidence/business-continuity/status-page-evidence.md Tracked: GAP-M001CLOSED
- BCP Testing (UC-171, UC-122)
- Current: Testing schedule defined but not executed
- Impact: Procedures untested, potential gaps undiscovered
- Recommendation: Conduct first tabletop exercise Q1 2026
- Priority: High
- Next action: Schedule and execute tabletop exercise
- Backup Recovery Testing (UC-122)
- Current: Key recovery procedure documented but not tested
- Impact: Unknown issues in recovery process
- Recommendation: Test key recovery in controlled environment
- Priority: Medium
- Next action: Include in Q3 2026 full BCP test
- Automatic Rollback (UC-166)
- Current: Manual rollback only
- Impact: 5-minute rollback target may be challenging
- Recommendation: Implement automated rollback for failed deployments
- Priority: Low (current process acceptable)
Strengths
- Inherent High Availability: Cloudflare’s global distribution provides strong availability without requiring additional failover infrastructure
- Stateless Architecture: Minimal data loss risk, fast recovery
- Infrastructure as Code: All configuration in Git, easily reproducible
- Documentation: BCP, incident response, and recovery procedures well-documented
- Supply Chain Security: SLSA Level 3 reduces risk of malicious code disrupting service
- Automated Deployment: Fast deployment capability supports rapid recovery
Control Status Summary
| Control | ID | Status | Evidence Quality | Gaps |
|---|---|---|---|---|
| Business Impact Analysis | UC-165 | ✅ Implemented | High | None |
| Disaster Recovery Plan | UC-166 | ✅ Implemented | High | Testing needed |
| High Availability Architecture | UC-167 | ✅ Implemented | High | None |
| Data Backup Procedures | UC-168 | ✅ Implemented | High | |
| RTO/RPO Objectives | UC-169 | ✅ Implemented | High | None |
| Incident Communication Plan | UC-170 | ✅ Enhanced | High | |
| Tabletop Exercises | UC-171 | 📋 Planned | Low | Not conducted |
| Alternative Processing Sites | UC-172 | ✅ Implemented | High | None |
| General BC/DR | UC-052 | ✅ Implemented | High | Testing needed |
| Backup and Recovery | UC-062 | ✅ Implemented | High | |
| Capacity Management | UC-064 | ✅ Implemented | High | None |
| Backup Testing | UC-122 | 📋 Planned | Low | Not conducted |
Overall Assessment: ✅ UPDATED - 10 of 12 controls fully implemented (83%), 0 partially implemented, 2 planned (17%). Strong foundation with UC-062 (Backup and Recovery) and UC-170 (Incident Communication) now fully implemented via provii-backup and provii-status. Remaining gap: Formal BCP tabletop testing (UC-122, UC-171).
References
Documentation
security/business-continuity.mdx- Primary BCP documentsecurity/incident-response.mdx- Incident proceduressecurity/change-management.mdx- Deployment and rollbacksecurity/risk-register.mdx- BC/DR riskssecurity/supplier-management.md- Vendor dependenciessecurity/data-retention.mdx- Retention and disposalsecurity/asset-register.mdx- Critical assetsdevelopers/supply-chain-security.mdx- Build resilience
Configuration
- Infrastructure-as-code: wrangler.toml files (held in the application repositories, not this documentation repository)
- Deployment automation:.github/workflows files (held in the application repositories, not this documentation repository)
External Resources
- Cloudflare Status: https://status.cloudflare.com
- Cloudflare Documentation: Worker, KV, Durable Objects availability guarantees
- GitHub Status: https://githubstatus.com
Conclusion
Maelstrom AI’s business continuity posture is designed to be resilient through the serverless, globally distributed architecture. The platform benefits from Cloudflare’s 99.99% supplier SLA and automatic geographic failover; Maelstrom AI provides availability on a best-effort basis with no contractual SLA at this tier.
Key strengths:
- Exceptional high availability (UC-167)
- Well-documented disaster recovery procedures (UC-166)
- Clear RTO/RPO targets (UC-169)
- Stateless design minimises data loss risk
- Infrastructure-as-code enables rapid recovery
Key improvements completed:
- ✅ Automated KV data backups (UC-168) - provii-backup deployed
- ✅ Status page implementation (UC-170) - status.provii.app deployed
- ✅ Real-time service monitoring with $0/month cost (vs. $500/year budgeted)
Remaining improvements needed:
- Conduct first BCP testing exercises (UC-171, UC-122)
- Test key recovery procedures (UC-122)
All critical controls are now fully implemented. The remaining gaps are testing and validation activities, which will enhance operational maturity but do not represent functional gaps.
Evidence Collection Complete Date: 2026-02-14 Author: Maelstrom AI Next Actions: Address identified gaps, schedule Q1 2026 tabletop exercise