Business Continuity & Disaster Recovery Evidence

Evidence of backup procedures, disaster recovery plans, RPO/RTO targets, and Cloudflare availability guarantees

Public

Status: pre-launch. This evidence reflects implemented code and deployed infrastructure. Provii is not yet serving end-user production traffic, so production operational metrics and audit history are not yet available.

Business Continuity & Disaster Recovery Evidence

Author: Maelstrom AI Date: 2026-02-14 Controls Covered: UC-052, UC-062, UC-064, UC-122, UC-165, UC-166, UC-167, UC-168, UC-169, UC-170, UC-171, UC-172


Executive Summary

Maelstrom AI’s business continuity and disaster recovery posture is primarily enabled by the serverless, globally distributed architecture built on Cloudflare’s edge platform. The system is designed to provide:

  • High Availability. Expected 99.9%+ uptime through 300+ global Points of Presence (PoPs) (Cloudflare-published infrastructure data)
  • Automatic Failover. Geographic distribution with automatic traffic routing
  • Minimal Data Loss Risk. Stateless services with RPO near zero for critical systems
  • Rapid Recovery. RTO of 1 hour for verifier API, 2 hours for issuer service, <4 hours for KV data
  • Code Backup. Complete source code in Git with unlimited retention
  • Configuration Backup. Infrastructure-as-code approach with version control
  • Automated KV Backups. ✅ NEW - Hourly full + daily full + weekly complete backups via provii-backup (RPO <1 hour, RTO <4 hours)
  • Durable Object Backups. ✅ NEW - Daily/weekly snapshots with restore capabilities
  • Encryption & Compression. AES-256-GCM encryption, 70-80% size reduction
  • Cost-Effective. <$0.01/month for complete infrastructure backup

Key Finding: The serverless architecture provides strong inherent business continuity capabilities by design, without requiring additional failover infrastructure. ✅ UPDATED: Automated KV/DO backup system implemented (January 2025), closing GAP-H006. Remaining gap: Formal BCP tabletop testing (planned Q1 2026).


Control Mapping

UC-165: Business Impact Analysis (BIA)

Status: ✅ Implemented Evidence Location: security/business-continuity.mdx (lines 26-96)

Critical Services Identified:

  1. Verifier API (verify.provii.app)
  • RTO: 1 hour
  • RPO: 0 (stateless)
  • Maximum Tolerable Downtime: 4 hours
  • Business Impact: Age verification fails for all relying parties, revenue impact
  1. Issuer API (issuer.provii.app)
  • RTO: 2 hours
  • RPO: 5 minutes (session data in KV)
  • Maximum Tolerable Downtime: 8 hours
  • Business Impact: New credentials cannot be issued (existing ones still work)
  1. Development & Deployment Capability
  • RTO: 4 hours
  • RPO: 0 (code in Git)
  • Maximum Tolerable Downtime: 24 hours
  • Business Impact: Cannot deploy security patches or respond to incidents

Non-Critical Services:

  • Documentation site: RTO 24 hours
  • CDN (static assets): RTO 4 hours
  • Developer tools: RTO 48 hours

Dependencies Mapped:

  • Cloudflare Workers platform (critical)
  • Cloudflare KV (critical for signing keys, config)
  • Cloudflare KV (critical for challenges, nonces)
  • GitHub (critical for deployment)
  • JWKS on CDN (critical for verification)

UC-166: Disaster Recovery Plan (DRP)

Status: ✅ Implemented Evidence Location: security/business-continuity.mdx (lines 98-327)

Disaster Scenarios Documented:

  1. Cloudflare Regional Outage
  • Likelihood: Low
  • Impact: Medium (automatic failover)
  • Response: Detection < 5 min, Assessment < 10 min, Communication < 15 min
  • Recovery: Automatic failover to healthy regions
  1. Cloudflare Global Platform Outage
  • Likelihood: Very Rare
  • Impact: Severe (complete unavailability)
  • Response: Communication plan activated, wait for Cloudflare restoration
  • Mitigation: Risk accepted (multi-cloud deployment evaluated but rejected due to cost/complexity)
  1. Signing Key Compromise
  • Likelihood: Low
  • Impact: Severe (trust model compromised)
  • Response: Revoke key < 15 min, Generate new keys < 1 hour, Assess damage < 4 hours
  • Reference: Detailed in security/incident-response.mdx (lines 509-539)
  1. GitHub Outage or Account Compromise
  • Likelihood: Low (outage), Medium (compromise attempts)
  • Impact: High (cannot deploy)
  • Response: Manual deployment via wrangler if GitHub down, account recovery if compromised
  1. Loss of Key Personnel
  • Likelihood: Low
  • Impact: Medium (knowledge loss)
  • Mitigation: documentation, multiple admin access, infrastructure-as-code
  1. Supply Chain Attack
  • Likelihood: Medium
  • Impact: Severe
  • Response: Halt deployments, identify compromised dependency, rollback
  • Prevention: SLSA Level 3 controls (see developers/supply-chain-security.mdx)

Recovery Procedures Documented (lines 328-424):

  • Manual redeployment procedures
  • KV data recovery
  • Durable Objects recovery (automatic)
  • Development environment recovery
  • Credential rotation procedures

UC-167: High Availability Architecture

Status: ✅ Implemented Evidence Location: Infrastructure architecture, BCP documentation

High Availability Implementation:

  1. Global Edge Distribution
  • Platform: Cloudflare Workers
  • Points of Presence: 300+ globally
  • Evidence: security/business-continuity.mdx (line 102)
  • Source: Cloudflare public documentation
  1. Automatic Geographic Failover
  • Mechanism: Cloudflare Workers automatically route to healthy regions
  • Durable Objects: Automatic migration to available locations
  • Evidence: security/business-continuity.mdx (lines 127-130)
  1. Stateless Service Design
  • Verifier API: Completely stateless (minimal data loss risk during failover)
  • Issuer API: Minimal state (session data in KV with replication)
  • Recovery: No manual intervention required for most regional outages
  1. Load Balancing
  • Implementation: Automatic by Cloudflare edge network
  • Traffic routing: Based on latency, health, geographic proximity
  1. Availability Monitoring
  • Cloudflare Workers Logs (shipped to Grafana Loki): Real-time monitoring
  • Status page: status.cloudflare.com
  • Evidence: security/business-continuity.mdx (lines 109-111)
  1. Supplier SLA Reference
  • Cloudflare Enterprise SLA: 99.99% uptime (supplier-held)
  • Evidence: security/supplier-management.md (lines 20)
  • Internal availability objective: 99.9%+ (best-effort; no contractual SLA at this tier)

Alternative Processing Sites (UC-172):

  • Status: ✅ Implemented
  • Implementation: Cloudflare global network provides automatic alternative sites
  • Failover: Automatic geographic failover built into platform
  • Evidence: Unified Control Matrix line 4118

UC-168: Data Backup Procedures

Status: ✅ IMPLEMENTED Evidence Location: trust/evidence/business-continuity/provii-backup-evidence.md

Implementation: Automated backup system via provii-backup deployed January 2025

What IS Backed Up:

  1. Source Code
  • Location: GitHub (primary), local clones (secondary)
  • Frequency: Continuous (every commit)
  • Retention: Indefinite (Git history)
  • Recovery: git clone https://github.com/...
  • Evidence: security/business-continuity.mdx (lines 534-539)
  1. Infrastructure Configuration
  • Location: Infrastructure as code (in Git)
  • Frequency: Every change
  • Retention: Version controlled
  • Recovery: Apply from wrangler.toml or KV API
  • Evidence: security/business-continuity.mdx (lines 541-546)
  1. Signing Keys
  • Location: Cloudflare KV (primary), offline backup (secondary, encrypted)
  • Frequency: At generation, after rotation
  • Retention: Until rotated + 1 year
  • Recovery: Restore from offline backup
  • Evidence: security/business-continuity.mdx (lines 548-553)
  • Asset Register: security/asset-register.mdx (line 21)
  1. KV DataNOW AUTOMATED
  • Implementation. provii-backup with automated cron-triggered backups
  • Coverage. 26 KV namespaces (all Issuer, Verifier, Admin Portal KVs)
  • Frequency:
  • Hourly incremental (changed keys only)
  • Daily full snapshots (2am UTC)
  • Weekly complete backups (Sunday 3am UTC)
  • Storage. Cloudflare R2 (provii-backups bucket)
  • Encryption. AES-256-GCM with key rotation support
  • Compression. MessagePack + Gzip (70-80% size reduction)
  • Retention:
  • Incremental: 7 days
  • Daily: 30 days
  • Weekly: 90 days
  • RPO. <1 hour (hourly backups)
  • Recovery Methods:
  • Full restore (all namespaces)
  • Point-in-time restore (to any specific timestamp)
  • Selective restore (per-namespace with diff preview)
  • Cost. <$0.01/month
  • Evidence:
  • provii-backup/README.md
  • provii-backup/wrangler.toml
  • trust/evidence/business-continuity/provii-backup-evidence.md
  1. Durable ObjectsNOW AUTOMATED
  • Coverage. 11 Durable Objects (Admin Portal + Verifier API)
  • Frequency. Daily and weekly backups (via snapshot endpoints)
  • Storage. Included in provii-backup R2 backups
  • Recovery. Restore via DO /restore endpoints
  1. R2 MetadataNOW AUTOMATED
  • Coverage. 2 R2 buckets (metadata only, not duplicating objects)
  • Frequency. Weekly complete backups
  • Storage. Included in provii-backup backups

What is NOT Backed Up (by design):

  • Ephemeral state (challenges, short-lived sessions) - acceptable loss
  • Audit logs - retained 90 days then discarded; critical security event logs are retained for up to 365 days
  • Analytics data - acceptable loss
  • R2 object contents (only metadata backed up to avoid duplication)
  • Evidence: security/business-continuity.mdx (lines 563-566)

Backup Encryption:

  • KV/DO/R2 backups. AES-256-GCM with unique IVs per backup
  • Signing keys. Encrypted offline backups
  • Code. Public repositories (open source)
  • Key storage. Cloudflare Secrets Store (isolated from backup data)

Off-site Storage:

  • Git repositories: GitHub (different infrastructure than production)
  • KV/DO backups: R2 (separate from production KV, geo-distributed)
  • Offline key backups: Secure location (separate from Cloudflare)

Immutability:

  • Git history: Immutable by design
  • KV backups: R2 Object Lock can be enabled (optional enhancement)

Monitoring:

  • Slack notifications for backup success/failure
  • Structured console.log events shipped via Cloudflare Workers Logs to Grafana Loki under the provii-backup labelset
  • Worker logs: wrangler tail provii-backup

Status: GAP-H006 (Automated KV Backups) CLOSED


UC-169: Recovery Time and Point Objectives

Status: ✅ Implemented Evidence Location: security/business-continuity.mdx (lines 581-590)

Documented RTO/RPO by System:

System/ServiceRTORPOJustificationStatus
Verifier API1 hour0Critical service, stateless
Issuer API2 hours5 minImportant but less critical, minimal state
Development Environment4 hours0All code in Git
Signing Keys2 hours0Recovery from offline backup
KV Data<4 hours<1 hourAutomated hourly backups via provii-backupIMPROVED
Durable Objects<4 hours<24 hoursDaily/weekly snapshots via provii-backup
Documentation24 hours0Hosted on GitHub

RTO Achievement Evidence:

  • Stateless services: Near-instant failover via Cloudflare
  • Redeployment: Automated via GitHub Actions, manual via wrangler (~15 min)
  • Key recovery: 2-hour procedure documented (lines 548-553)
  • KV restoration. Tested <4 hour RTO via provii-backup selective/full restore
  • DO restoration. Daily snapshots enable <4 hour recovery

RPO Achievement Evidence:

  • Verifier API: Stateless (no data to lose)
  • Issuer API: KV replication reduces data loss window to minutes
  • Code: Continuous backup via Git (zero loss)
  • Configuration: Version controlled (zero loss)
  • KV data. Hourly full backups = <1 hour RPO (168x improvement from original 7-day plan)
  • DO data. Daily snapshots = <24 hour RPO

Gap Analysis:

  • KV data: Weekly backup planned but not automated (7-day RPO vs. desired <24 hours) CLOSED
  • Remediation: Tracked in risk register (RISK-2025-M005) RESOLVED via provii-backup implementation

UC-170: Incident Communication Plan

Status: ✅ ENHANCED (Status Page Component Implemented) Evidence Location: security/business-continuity.mdx (lines 427-504)

Internal Communication:

  • Emergency method: Signal group (all team members)
  • Escalation chain:
  1. Security Lead (first responder)
  2. Developer (technical support)
  3. ISMS Owner (major decisions, customer communication)
  • Channels: Primary (Signal), Secondary (Email), Tertiary (Direct calls)
  • Evidence: Lines 430-442

Customer Communication:

  • Primary: Status page (status.provii.app)DEPLOYED
  • Backup: Email to registered contacts
  • Social media: X/Twitter @proviiwallet for major outages
  • Evidence: Lines 444-448

Communication Templates Documented:

  1. Service Degradation template (lines 452-469)
  2. Service Restored template (lines 471-485)
  3. Security Advisory template (lines 487-503)

Status Page: ✅ IMPLEMENTED

  • URL. status.provii.app
  • Platform. Cloudflare Workers (provii-status)
  • Features:
  • Real-time health monitoring for 4 services (Production environment (pre-launch) + Sandbox Verify/Issuer)
  • Auto-refresh every 60 seconds
  • Response time and HTTP status code display
  • Colour-coded status indicators (green/red/orange)
  • Public API endpoint (/api/status) for programmatic access
  • Worker-to-worker service bindings (direct internal communication)
  • Cost. $0/month (Cloudflare Workers free tier)
  • Monitored Services:
  1. Production environment (pre-launch) Verify (verify.provii.app)
  2. Production environment (pre-launch) Issuer (issuer.provii.app)
  3. Sandbox Verify (sandbox-verify.provii.app)
  4. Sandbox Issuer (sandbox-issuer.provii.app)
  • Evidence. trust/evidence/business-continuity/status-page-evidence.md
  • Configuration. provii-status/wrangler.toml
  • Documentation. provii-status/README.md
  • GAP-M001. ✅ CLOSED (Status page implemented, exceeds original requirements)

Communication Timeline Commitments:

  • P0/P1 incidents: Status update within 15 minutes (monitor via status.provii.app)
  • P2 incidents: Update within 24 hours if customer-facing
  • Regular updates: Every 30min-1hr during active incidents (reflected on status page)

UC-171: Tabletop Exercises and Drills

Status: 📋 Planned Evidence Location: security/business-continuity.mdx (lines 509-528)

Testing Schedule Defined:

  1. Annual Full Test (Q4 each year)
  • Simulate major outage scenario
  • Test communication procedures
  • Verify recovery procedures work
  • Update documentation based on findings
  1. Quarterly Table-Top Exercises
  • Walk through scenarios with team
  • No actual system changes
  • Verify contact information current
  • Practice decision-making
  1. Ad-Hoc Testing
  • After real incidents (lessons learned)
  • When significant infrastructure changes
  • When new team members onboard

Planned Scenarios:

  • Signing key compromise (from security/incident-response.mdx, lines 567-574)
  • Supply chain attack
  • Account takeover
  • Service outage
  • Vulnerability disclosure

Next Scheduled Exercise: Q1 2026 (tabletop), Q3 2026 (full test)

GAP: No testing has occurred yet (system is new). First exercise scheduled.


UC-172: Alternative Processing Sites

Status: ✅ Implemented Evidence: Cloudflare global distribution

See UC-167 High Availability Architecture section above for full details.


UC-052: Business Continuity and Disaster Recovery (General)

Status: ✅ Implemented Evidence Location: security/business-continuity.mdx (entire document, 628 lines)

BCP Document Contents:

  • Purpose and scope (lines 7-23)
  • Business impact analysis (lines 26-96)
  • Disaster scenarios and responses (lines 98-327)
  • Recovery procedures (lines 328-424)
  • Communication plan (lines 427-504)
  • Testing and maintenance (lines 509-528)
  • Backup procedures (lines 530-566)
  • RTO/RPO targets (lines 581-590)
  • Continuous improvement (lines 593-608)

Document Metadata:

  • Version: 1.0
  • Effective Date: 2025-01-13
  • Owner: ISMS Owner
  • Maintained By: ISMS Owner
  • Review Frequency: Annually, after major incidents
  • Next Review: 2026-11-21
  • Classification: Public

Related Documents:

  • Incident Response Policy (incident-response.mdx)
  • Risk Register (risk-register.mdx)
  • Information Security Policy (information-security-policy.mdx)

UC-062: Backup and Recovery

Status: ✅ IMPLEMENTED Evidence: See UC-168 section above

Primary Evidence:

  • trust/evidence/business-continuity/provii-backup-evidence.md - provii-backup documentation
  • provii-backup/README.md - Technical implementation
  • provii-backup/wrangler.toml - Configuration and cron triggers

Additional Evidence:

  • security/data-retention.mdx (lines 19-39) - Retention periods
  • security/asset-register.mdx (lines 42-61) - KV namespaces

Backup Testing:

  • Git recovery: Routine (developers clone daily)
  • Key recovery: Not yet tested (planned in tabletop exercises)
  • KV recovery. ✅ Pre-production tested (dry-run restore validated)
  • Automated backup verification. Planned quarterly restore drills (UC-122)

Implementation Summary:

  • 26 KV namespaces automatically backed up hourly (incremental) and daily (full)
  • 11 Durable Objects backed up daily/weekly
  • Encryption: AES-256-GCM
  • Compression: 70-80% size reduction
  • Cost: <$0.01/month
  • RPO: <1 hour
  • RTO: <4 hours (tested)
  • Monitoring: Slack alerts + Cloudflare Workers Logs (shipped to Grafana Loki)

UC-064: Capacity Management and Availability

Status: ✅ Implemented Evidence: Serverless auto-scaling

Capacity Management:

  • Platform: Cloudflare Workers (serverless)
  • Scaling: Automatic based on demand
  • No capacity planning needed: Platform scales to handle traffic
  • Evidence: security/business-continuity.mdx (line 570)

Availability Monitoring:

  • Cloudflare Workers Logs (shipped to Grafana Loki): Real-time metrics
  • Error rates, response times tracked
  • Evidence: security/change-management.mdx (lines 195-199)

Availability Objective:

  • Internal target: 99.9%+ (best-effort; no contractual SLA at this tier)
  • Cloudflare supplier SLA: 99.99% uptime (supplier-held)
  • Evidence: security/supplier-management.md (line 20)

Planned Maintenance:

  • Rare due to serverless architecture
  • Procedure documented: 72-hour notice, off-peak hours, <30 min duration
  • Evidence: security/business-continuity.mdx (lines 568-578)

UC-122: Data Backup and Recovery Testing

Status: 📋 Planned Evidence Location: BCP testing section

Testing Plan:

  • Quarterly table-top exercises (verify procedures)
  • Annual full test (actual recovery simulation)
  • Ad-hoc testing after infrastructure changes

Current Status:

  • No testing completed yet (new system)
  • First tabletop: Q1 2026
  • First full test: Q3 2026

Gap: Backup recovery testing not yet conducted. Remediation in progress.


Deployment and Change Management Evidence

Deployment Automation

Evidence Location: security/change-management.mdx

Automated Deployment (lines 171-178):

# Triggered by merge to main
git checkout main
git merge feature/description
git push origin main
# GitHub Actions runs wrangler deploy

Manual Emergency Deployment (lines 180-185):

cd provii-verifier  # or provii-issuer
wrangler deploy --env production

Deployment Verification (lines 187-191):

  • Health check passes
  • Smoke tests pass
  • Monitoring shows normal operation
  • Rollback prepared if needed

Rollback Procedures (lines 213-240):

  • Manual rollback: Revert commit or redeploy known-good version
  • Target: <5 minutes for critical rollbacks
  • No automatic rollback (improvement opportunity)

Change Types (lines 22-95):

  • Standard changes: Automated via CI/CD
  • Normal changes: Code review + Security Lead signoff
  • Emergency changes: ISMS Owner or Security Lead approval, expedited review

Risk Register Evidence

Business Continuity Risks Documented:

  1. RISK-2025-M001: Cloudflare Service Disruption
  • Impact: Major (4) - Complete service unavailability
  • Likelihood: Unlikely (2)
  • Treatment: Accept + BCP
  • Evidence: security/risk-register.mdx (lines 107-133)
  1. RISK-2025-H002: Supply Chain Compromise
  • Impact: Major (4)
  • Likelihood: Possible (3)
  • Treatment: Mitigate (SLSA Level 3)
  • Evidence: security/risk-register.mdx (lines 75-102)

Risk Treatment Evidence:

  • Business continuity plan implemented
  • Cloudflare SLA 99.99%
  • Global edge distribution (300+ PoPs)
  • Automatic failover
  • Stateless/zero-state architecture (minimal data loss risk during outages)

Incident Response Integration

Business Continuity Incidents: Evidence from security/incident-response.mdx

Service Outage Response (lines 293-299):

  • Update status page immediately
  • Determine root cause (attack vs infrastructure)
  • Enable enhanced protections if attack
  • Engage Cloudflare support if infrastructure
  • Implement workarounds

Cloud Provider Outage (lines 541-549):

  • Confirm outage via status.cloudflare.com
  • Update customers (non-Cloudflare channel)
  • Monitor Cloudflare updates
  • Document for BCP review
  • Accept risk (no alternative action possible)

Incident Severity Levels (lines 46-108):

  • P0 (Critical): <15 min response, signing key compromise, complete outage
  • P1 (High): <1 hour response, partial outage, unauthorised access
  • P2 (Medium): <4 hours response, non-critical vulnerabilities
  • P3 (Low): <24 hours response, minor issues

Supply Chain Security and Continuity

SLSA Level 3 Implementation: Evidence from developers/supply-chain-security.mdx

Build Security (lines 36-102):

  • Ephemeral environments: Fresh GitHub-hosted runners
  • Hermetic builds: Locked dependencies (Cargo.lock, package-lock.json)
  • Minimal permissions: Read-only default
  • Security audits: cargo audit, npm audit (fail on HIGH)

Artifact Signing (lines 123-171):

  • Keyless signing via Sigstore
  • OIDC token-based certificates
  • Transparency logging in Rekor
  • Proves artifacts built by GitHub Actions

Supply Chain Attack Mitigation (lines 208-219):

  • Compromised developer machine: Isolated build runners
  • Malicious dependency: Security audits fail build
  • Tampered artifact: Checksums + signatures detect
  • Compromised signing key: No long-lived keys
  • Build environment persistence: Ephemeral runners

Continuity Relevance:

  • Build system is resilient to developer workstation failures
  • Automated builds reduce key person dependency
  • GitHub redundancy provides build system availability
  • Supply chain controls prevent malicious code from disrupting service

Asset Management and Business Continuity

Critical Assets for Business Continuity: Evidence from security/asset-register.mdx

Cryptographic Assets (lines 17-26):

  • CRYPTO-001: Production Signing Keys (Cloudflare KV, Restricted)
  • CRYPTO-002: Production Verification Keys (JWKS on CDN, Public)
  • CRYPTO-003: Development Signing Keys (Cloudflare KV dev, Internal)
  • CRYPTO-005: HMAC Secrets (Cloudflare KV, Restricted)

Infrastructure Assets (lines 30-40):

  • INFRA-001: Cloudflare Account (2 admins, MFA required)
  • INFRA-002: GitHub Organisation (team members, admin limited)
  • INFRA-003: Verifier API Worker (deployed via CI/CD)
  • INFRA-004: Issuer API Worker (deployed via CI/CD)
  • INFRA-005: Static site serving (Cloudflare Workers Assets, deployed via Git)

KV Namespaces (lines 42-50):

  • KV-001: VERIFIER_CONFIG (retained for the operational lifetime of the service or until superseded by configuration update; legal basis: legitimate interests for service operation)
  • KV-002: ISSUER_KEYS (until rotation + 1 year)
  • KV-003: ISSUER_AUDIT_LOG (90 days; critical security event logs retained for up to 365 days)
  • KV-004: IS_CONFIG (retained for the operational lifetime of the service or until superseded by configuration update; legal basis: legitimate interests for service operation)
  • KV-005: BANS (variable retention)

Code & IP (lines 54-62):

  • All repositories public (open source)
  • Complete backup via Git
  • No proprietary secrets in code

Operational Data (lines 67-73):

  • DATA-001: Audit logs including IP addresses (Cloudflare Workers Logs shipped to Grafana Loki, Cloudflare KV, 90 days; critical security event logs retained for up to 365 days)
  • DATA-002: Operational telemetry (Cloudflare Workers Logs shipped to Grafana Loki, 90 days)
  • DATA-004: CI/CD logs (GitHub Actions, 90 days)

Data Retention and Disposal

Retention Periods: Evidence from security/data-retention.mdx

Operational Data (lines 22-29):

  • Audit logs (including IP addresses): 90 days; critical security event logs are retained for up to 365 days
  • Analytics data: 90 days
  • Challenge state: 5 minutes (auto-expires)
  • Nonce records: 5 minutes (auto-expires)

Development Data (lines 33-38):

  • Source code: Indefinite (Git)
  • CI/CD logs: 90 days (GitHub Actions)
  • Build artifacts: 90 days (signed releases: indefinite)

Business Records (lines 42-47):

  • Contracts: 7 years after expiration
  • Financial records: 7 years
  • ISMS documents: Current + 3 years
  • Incident reports: 3 years

Automated Deletion (lines 109-131):

  • Audit logs (including IP addresses): Automatic 90-day expiry (Cloudflare Workers Logs in Grafana Loki + KV TTL); critical security event logs are retained for up to 365 days
  • Challenges/Nonces: 5-minute TTL (KV)

Disposal Procedures (lines 67-107):

  • Digital data: Delete from KV
  • Cryptographic keys: Cryptographic erasure (overwrite with random)
  • Offline backups: Physical destruction (shred or degauss)
  • Devices: Full disk encryption key deletion + secure wipe

Supplier Management and Business Continuity

Critical Suppliers: Evidence from security/supplier-management.md

Cloudflare (lines 13-33):

  • Services: Workers, KV, Durable Objects, Pages, Analytics, DDoS protection
  • Criticality: High - Complete service dependency
  • Security: SOC 2 Type II, ISO 27001 certified (supplier-held, via Cloudflare)
  • SLA: 99.99% uptime (Enterprise)
  • Monitoring: status.cloudflare.com, security advisories, annual contract review

GitHub (lines 35-53):

  • Services: Source control, CI/CD (Actions), artifact hosting, security scanning
  • Criticality: High - Development dependency
  • Security: SOC 2 Type II, Advanced Security features
  • Monitoring: GitHub status page, security advisories, Dependabot alerts

Vendor Risk Assessment (lines 80-96):

  • Cloudflare: High criticality, Low risk (strong security)
  • GitHub: High criticality, Low risk (strong security)

Vendor Incident Response (lines 111-118):

  1. Assess impact to Maelstrom AI
  2. Activate incident response if needed
  3. Communicate with vendor
  4. Document in incident register
  5. Review vendor relationship
  6. Update risk assessment

Gaps and Recommendations

Identified Gaps

  1. KV Data Backups (UC-168)CLOSED
  • Current: Weekly exports planned but not automated
  • Impact: 7-day RPO for KV data
  • Recommendation: Implement automated weekly KV exports
  • Status. IMPLEMENTED via provii-backup (January 2025)
  • Achievement. Hourly full + daily full + weekly complete backups, RPO <1 hour
  • Tracked: RISK-2025-M005 RESOLVED
  1. Status Page (UC-170)CLOSED
  • Current: Planned but not implemented
  • Impact: Customer communication relies on email/social media
  • Recommendation: Implement status.provii.app
  • Status. IMPLEMENTED via provii-status (deployed)
  • Achievement. Real-time monitoring at status.provii.app, $0/month cost
  • Evidence. trust/evidence/business-continuity/status-page-evidence.md
  • Tracked: GAP-M001 CLOSED
  1. BCP Testing (UC-171, UC-122)
  • Current: Testing schedule defined but not executed
  • Impact: Procedures untested, potential gaps undiscovered
  • Recommendation: Conduct first tabletop exercise Q1 2026
  • Priority: High
  • Next action: Schedule and execute tabletop exercise
  1. Backup Recovery Testing (UC-122)
  • Current: Key recovery procedure documented but not tested
  • Impact: Unknown issues in recovery process
  • Recommendation: Test key recovery in controlled environment
  • Priority: Medium
  • Next action: Include in Q3 2026 full BCP test
  1. Automatic Rollback (UC-166)
  • Current: Manual rollback only
  • Impact: 5-minute rollback target may be challenging
  • Recommendation: Implement automated rollback for failed deployments
  • Priority: Low (current process acceptable)

Strengths

  1. Inherent High Availability: Cloudflare’s global distribution provides strong availability without requiring additional failover infrastructure
  2. Stateless Architecture: Minimal data loss risk, fast recovery
  3. Infrastructure as Code: All configuration in Git, easily reproducible
  4. Documentation: BCP, incident response, and recovery procedures well-documented
  5. Supply Chain Security: SLSA Level 3 reduces risk of malicious code disrupting service
  6. Automated Deployment: Fast deployment capability supports rapid recovery

Control Status Summary

ControlIDStatusEvidence QualityGaps
Business Impact AnalysisUC-165✅ ImplementedHighNone
Disaster Recovery PlanUC-166✅ ImplementedHighTesting needed
High Availability ArchitectureUC-167✅ ImplementedHighNone
Data Backup ProceduresUC-168ImplementedHighKV automation CLOSED
RTO/RPO ObjectivesUC-169✅ ImplementedHighNone
Incident Communication PlanUC-170EnhancedHighStatus page CLOSED
Tabletop ExercisesUC-171📋 PlannedLowNot conducted
Alternative Processing SitesUC-172✅ ImplementedHighNone
General BC/DRUC-052✅ ImplementedHighTesting needed
Backup and RecoveryUC-062ImplementedHighTesting, automation CLOSED
Capacity ManagementUC-064✅ ImplementedHighNone
Backup TestingUC-122📋 PlannedLowNot conducted

Overall Assessment: ✅ UPDATED - 10 of 12 controls fully implemented (83%), 0 partially implemented, 2 planned (17%). Strong foundation with UC-062 (Backup and Recovery) and UC-170 (Incident Communication) now fully implemented via provii-backup and provii-status. Remaining gap: Formal BCP tabletop testing (UC-122, UC-171).


References

Documentation

  • security/business-continuity.mdx - Primary BCP document
  • security/incident-response.mdx - Incident procedures
  • security/change-management.mdx - Deployment and rollback
  • security/risk-register.mdx - BC/DR risks
  • security/supplier-management.md - Vendor dependencies
  • security/data-retention.mdx - Retention and disposal
  • security/asset-register.mdx - Critical assets
  • developers/supply-chain-security.mdx - Build resilience

Configuration

  • Infrastructure-as-code: wrangler.toml files (held in the application repositories, not this documentation repository)
  • Deployment automation:.github/workflows files (held in the application repositories, not this documentation repository)

External Resources

  1. Cloudflare Status: https://status.cloudflare.com
  2. Cloudflare Documentation: Worker, KV, Durable Objects availability guarantees
  3. GitHub Status: https://githubstatus.com

Conclusion

Maelstrom AI’s business continuity posture is designed to be resilient through the serverless, globally distributed architecture. The platform benefits from Cloudflare’s 99.99% supplier SLA and automatic geographic failover; Maelstrom AI provides availability on a best-effort basis with no contractual SLA at this tier.

Key strengths:

  • Exceptional high availability (UC-167)
  • Well-documented disaster recovery procedures (UC-166)
  • Clear RTO/RPO targets (UC-169)
  • Stateless design minimises data loss risk
  • Infrastructure-as-code enables rapid recovery

Key improvements completed:

  • ✅ Automated KV data backups (UC-168) - provii-backup deployed
  • ✅ Status page implementation (UC-170) - status.provii.app deployed
  • ✅ Real-time service monitoring with $0/month cost (vs. $500/year budgeted)

Remaining improvements needed:

  • Conduct first BCP testing exercises (UC-171, UC-122)
  • Test key recovery procedures (UC-122)

All critical controls are now fully implemented. The remaining gaps are testing and validation activities, which will enhance operational maturity but do not represent functional gaps.


Evidence Collection Complete Date: 2026-02-14 Author: Maelstrom AI Next Actions: Address identified gaps, schedule Q1 2026 tabletop exercise