Critical computer systems are those where failure must be either prevented entirely or tightly controlled. These systems support functions that are essential to **safety**, **business continuity**, **security**, or other high-impact domains. Ensuring the reliability of such systems requires proven engineering approaches that are not just theoretical but time-tested in real-world conditions. These methods follow a clear principle: **Safety is not added later; it is built in from the start and carried through the entire project.**

## Eliminating Hazards

It's best to eliminate hazards by design. In circuit (a), a switch failure could short-circuit the battery and risk a fire. In (b), the switch is moved to the motor side: if it fails, it only shorts the motor, which is far less dangerous. This simple change prevents the hazard instead of just reacting to it.

![[Pasted image 20251111225350.png]]
*Circuit example*

Designing out hazards from the start prevents costly failures and dangerous situations down the line.

## NASA's Mars Climate Orbiter Disaster

Take the Mars Climate Orbiter failure in 1999: after **10 months of travel** and a cost of **125 million dollars**, the probe **burned up in the Martian atmosphere** because two **systems used different units**. One piece of software reported thruster impulse in pound-force seconds, while the other expected metric units (newton-seconds). No validation checks caught the mismatch, and the navigation error kept accumulating.

This failure shows a lack of proper hazard analysis: NASA and Lockheed Martin each assumed the other had handled it. What should have been done: **a detailed hazard review**, **clear agreement on units**, **error detection mechanisms**, and **fallback controls to catch and mitigate failures early**.

## STPA (Systems-Theoretic Process Analysis)

Another way to design out hazards is STPA (Systems-Theoretic Process Analysis). Traditionally, when designing for safety, the approach is straightforward: "If component A might fail, we'll add Control X to prevent it." The cause of an accident is assumed to be the component itself, so engineers look at every component separately.

STPA goes further. It encourages looking at the whole system and the interactions between controls, and it treats incidents as control failures, not component failures. Even Control X, when viewed through the lens of the whole system's communication, might fail, be delayed, be incorrect, or be missing entirely. STPA also focuses on how different controls interact with each other: Control X might work perfectly, but when it interacts with Control Y, the two might create unexpected behaviors or conflicts.

Instead of relying solely on controls working perfectly in isolation, STPA encourages designing with safety constraints, additional checks, and fallback mechanisms so the system remains safe even when controls don't function as expected or when their interactions create unforeseen problems. A minimal sketch of this style of analysis follows the summary below.

TL;DR: While traditional methods assume accidents are caused by component failures, STPA assumes accidents come from control failures and control interactions, including:

- Unsafe interactions between functioning components
- Control-to-control interactions that create unexpected behaviors
- Emergent behaviors from system complexity
- Control structure relationships and hierarchies
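To make this concrete, here is a minimal, hypothetical Python sketch of how the first STPA steps can be written down as data: each control action is examined against the four standard ways a control action can become unsafe (not provided, provided unsafely, wrong timing or order, stopped too soon or applied too long), and each unsafe case yields a safety constraint the design must enforce. The pressure-tank scenario, class names, and constraints are illustrative assumptions, not part of any standard STPA tooling.

```python
# Hypothetical sketch of STPA-style reasoning in code; names and scenario are illustrative.
from dataclasses import dataclass
from enum import Enum, auto


class UCAType(Enum):
    """The four standard ways a control action can become unsafe in STPA."""
    NOT_PROVIDED = auto()                 # needed but never issued
    PROVIDED_UNSAFELY = auto()            # issued in a context where it causes a hazard
    WRONG_TIMING_OR_ORDER = auto()
    STOPPED_TOO_SOON_OR_TOO_LONG = auto()


@dataclass
class ControlAction:
    controller: str   # who issues the command
    action: str       # the command itself
    process: str      # the controlled process that receives it


@dataclass
class UnsafeControlAction:
    control_action: ControlAction
    uca_type: UCAType
    context: str      # the system state that makes the action hazardous
    constraint: str   # the safety constraint derived from this unsafe case


# Example control structure: a controller commanding "open relief valve" on a tank.
open_valve = ControlAction("PressureController", "open relief valve", "Tank")

# STPA step: for each control action, ask how it could be unsafe in context,
# then derive a safety constraint the design must enforce.
ucas = [
    UnsafeControlAction(
        open_valve, UCAType.NOT_PROVIDED,
        context="tank pressure above safe limit",
        constraint="Relief valve must open within 2 s of an over-pressure reading."),
    UnsafeControlAction(
        open_valve, UCAType.PROVIDED_UNSAFELY,
        context="maintenance crew working on the outlet line",
        constraint="Valve commands must be inhibited while a maintenance lockout is active."),
    UnsafeControlAction(
        open_valve, UCAType.WRONG_TIMING_OR_ORDER,
        context="opened before the downstream line is vented",
        constraint="Venting must be confirmed before the relief valve may open."),
]

if __name__ == "__main__":
    for uca in ucas:
        ca = uca.control_action
        print(f"[{uca.uca_type.name}] {ca.controller} -> '{ca.action}' on {ca.process}")
        print(f"  hazardous context: {uca.context}")
        print(f"  derived constraint: {uca.constraint}\n")
```

Note that nothing here depends on component failure rates: the analysis starts from hazardous contexts and ends with constraints on behavior, which is exactly the shift in perspective STPA asks for.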
---

## Enter STPA-Sec! Applied to Autonomous Drone Delivery

Imagine an autonomous drone designed to deliver packages in a dense urban environment. The goal is to avoid collisions, deliver to the correct location, and land safely even if systems fail. We are going to apply STPA not only for safety, but also with an eye on losses and hazards related to security.

### Hazard

The drone drops the package in the wrong location or crashes due to a misinterpreted sensor signal.

### Misalignments That Can Cause the Hazard

* **GPS signal is weak** → *Drone perception system*: trusts faulty GPS data without cross-checking
* **Sensor reports malfunction** → *Control logic*: makes decisions without verifying sensor health
* **Operator updates mission mid-flight** → *Human operator*: assumes the drone can immediately adapt safely
* **Communication drops in flight** → *Wireless comm module*: drone continues without updated commands
* **Package latch reports "locked" incorrectly** → *Hardware module*: drone proceeds to release based on incorrect state

### STPA-Inspired Constraints and Safety Measures

1. **Sensor Cross-Validation**: Require at least two independent sensors (e.g., GPS + vision) to agree before location is trusted.
2. **Fallback Modes**: If sensors are degraded, the drone hovers and signals for human intervention.
3. **Operator Confirmation**: The interface requires explicit confirmation before mission changes take effect.
4. **Comms Redundancy**: The drone automatically returns to base if communication is lost for more than 10 seconds.
5. **Package Lock Verification**: The drone will not release the package unless physical and software confirmation both succeed.

(A minimal code sketch of how constraints like these might be enforced appears at the end of this note.)

### Narrative Flow

* The drone takes off with a package and heads toward its target.
* Mid-flight, urban interference corrupts GPS. The drone detects the inconsistency between GPS and visual landmarks.
* Instead of proceeding, it enters a hover-safe state and pings the operator.
* Communication drops briefly; the drone holds position using onboard logic.
* The operator reconnects, reviews the error, and authorizes a safe return to base.
* Throughout the event, the hazard is avoided not because any one system was perfect, but because the design anticipated misalignment and responded with layered safety constraints.

### Key Insight

The hazard didn't emerge from one failure; it emerged from **interacting assumptions across systems**. STPA helps identify where these assumptions might misalign and guides the design of **constraints, feedback, and failsafes** to eliminate hazards before they lead to loss.

---

To build even more resilient systems, we must also consider how failures can still occur despite these safeguards; this is where [[Fault Trees and Threat Trees use cases in Security|Fault Trees and Threat Trees]] come in.

---

### Sources

- [Security Engineering by Ross Anderson](https://www.cl.cam.ac.uk/archive/rja14/book.html)
- [Safeware: System Safety and Computers](http://sunnyday.mit.edu/book.html)
- [MIT Partnership for Systems Approaches to Safety and Security (PSASS)](https://psas.scripts.mit.edu/home/)
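---

### Appendix: A Sketch of the Drone's Safety Constraints

As promised above, here is a hypothetical Python sketch of how a subset of the STPA-inspired constraints (sensor cross-validation, the hover fallback, the 10-second comms timeout, and dual package-lock confirmation) could be expressed as explicit checks. All state fields, thresholds, and function names are illustrative assumptions, not a real flight stack.

```python
# Hypothetical sketch of the drone constraints above; thresholds and names are assumptions.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple
import math


class Action(Enum):
    PROCEED = auto()
    HOVER_AND_ALERT = auto()   # fallback mode: hold position, signal the operator
    RETURN_TO_BASE = auto()    # comms-loss fallback


@dataclass
class DroneState:
    gps_position: Optional[Tuple[float, float]]     # (lat, lon) or None if no fix
    vision_position: Optional[Tuple[float, float]]  # estimate from visual landmarks
    sensors_healthy: bool             # self-test result of the sensor suite
    seconds_since_last_comms: float   # time since the last ground-station message
    latch_switch_closed: bool         # physical confirmation of the package latch
    latch_software_locked: bool       # software confirmation of the package latch


MAX_POSITION_DISAGREEMENT = 0.0005   # assumed agreement threshold (degrees)
COMMS_TIMEOUT_SECONDS = 10.0         # constraint 4: return to base after 10 s of silence


def positions_agree(a, b) -> bool:
    """Constraint 1: two independent position sources must roughly agree."""
    if a is None or b is None:
        return False
    return math.dist(a, b) <= MAX_POSITION_DISAGREEMENT


def next_action(state: DroneState) -> Action:
    """Decide the drone's next action by checking safety constraints in order."""
    # Constraint 4: comms loss handling.
    if state.seconds_since_last_comms > COMMS_TIMEOUT_SECONDS:
        return Action.RETURN_TO_BASE
    # Constraints 1 and 2: cross-validate position, fall back if degraded.
    if not state.sensors_healthy or not positions_agree(
            state.gps_position, state.vision_position):
        return Action.HOVER_AND_ALERT
    return Action.PROCEED


def may_release_package(state: DroneState) -> bool:
    """Constraint 5: release only if physical AND software confirmations agree."""
    return state.latch_switch_closed and state.latch_software_locked


if __name__ == "__main__":
    # GPS and vision disagree: the drone should hover and alert, not proceed.
    degraded = DroneState(
        gps_position=(40.0, -74.0), vision_position=(40.01, -74.0),
        sensors_healthy=True, seconds_since_last_comms=2.0,
        latch_switch_closed=True, latch_software_locked=True)
    print(next_action(degraded))          # Action.HOVER_AND_ALERT
    print(may_release_package(degraded))  # True
```

The point is not the specific thresholds but that each constraint in the list becomes an explicit, testable check, so a single optimistic assumption (for example, trusting a lone GPS fix) cannot silently put the drone into a hazardous state.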