Stateful Recovery on VxWorks 7 RTPs: Checkpointing, Certification, and Partition-Aware Design
In safety-critical embedded systems, restart is easy.
Recovery is hard.
VxWorks 7 fundamentally improved robustness with Real-Time Processes (RTPs), MMU isolation, and certification-ready kernels. Yet one gap remains unchanged since VxWorks 6:
When an RTP restarts, all application state is lost.
This article revisits a legacy VxWorks checkpointing and task recovery mechanism and performs a full, explicit redesign for VxWorks 7 RTP-based systems, extending it with:
- Certification arguments (DO-178C / IEC 61508)
- A concrete middleware architecture
- A direct comparison with ARINC 653 partition restart semantics
The goal is not theory, but deployable, auditable recovery.
Why RTP Restart Is Not Enough
VxWorks 7 RTPs provide:
- Address-space isolation
- Fault containment
- Deterministic restart
But the restart model is stateless.
| Failure Event | Native VxWorks 7 Outcome |
|---|---|
| Task exception | Task terminated |
| RTP fault | RTP restarted |
| Application state | Lost |
| Control continuity | Broken |
For flight control, power systems, robotics, or long-running edge AI pipelines, this is often unacceptable.
The legacy VxWorks checkpointing design fills this gap by introducing task-level state persistence, without kernel modification.
Overall Architecture: RTP-Scoped Self-Recovery
The redesigned system embeds the checkpoint mechanism inside each RTP.
+------------------------------------------------+
|                      RTP                       |
|                                                |
|              +------------------+              |
|              |   Application    |              |
|              |      Tasks       |              |
|              +------------------+              |
|                                                |
|              +------------------+              |
|              |    Checkpoint    |              |
|              |    Middleware    |              |
|              | - Memory registry|              |
|              | - Object pools   |              |
|              | - Recovery FSM   |              |
|              +------------------+              |
|                                                |
+------------------------------------------------+
| VxWorks 7 Kernel (MMU, Scheduler, Health Mon.) |
+------------------------------------------------+
This creates a stateful RTP:
- Tasks can fail and resume
- RTPs can restart and restore
- Kernel remains untouched
Checkpoint Content (RTP-Aware)
The original five checkpoint categories remain intact, but are reinterpreted under RTP ownership rules.
1. Task Control Blocks (TCB)
Tasks still use WIND_TCB, but:
- IDs are RTP-local
- Kernel queue pointers are invalid after restart
Checkpoint Strategy
- Store logical task state only
- Reset kernel-managed fields
- Normalize restored tasks to READY
Restore flow:
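/* Argument list abbreviated: the full taskInitExcStk() call also takes the task name, priority, options, exception stack, and entry arguments. */
taskInitExcStk(tcb, entry, stackBase, stackSize);   /* rebuild the task on its restored stacks */
taskActivate(taskId);                               /* release the rebuilt task to the scheduler */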
This avoids undefined kernel references while preserving execution context.
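The checkpoint strategy above can be sketched in portable C. This is only an illustration under stated assumptions: `CkptTaskRecord`, `CkptTaskState`, and `ckptTaskNormalize` are hypothetical names, and the record holds only fields an application could recreate itself, not the real `WIND_TCB` layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical task states; not the kernel's internal encoding. */
typedef enum {
    CKPT_TASK_READY,
    CKPT_TASK_PENDED,
    CKPT_TASK_DELAYED,
    CKPT_TASK_SUSPENDED
} CkptTaskState;

/* Hypothetical checkpoint record for one task (assumption, not WIND_TCB). */
typedef struct {
    char          name[16];    /* logical identity, stable across restarts   */
    int           priority;    /* logical scheduling attribute               */
    size_t        stackSize;   /* needed to re-create the task               */
    CkptTaskState state;       /* state observed at checkpoint time          */
    void         *pendQueue;   /* kernel-managed reference: stale on restore */
    uintptr_t     liveTaskId;  /* RTP-local handle: stale on restore         */
} CkptTaskRecord;

/* Clears every kernel-managed field and normalizes the task to READY,
 * leaving the logical fields (name, priority, stack size) untouched. */
void ckptTaskNormalize(CkptTaskRecord *rec)
{
    rec->pendQueue  = NULL;
    rec->liveTaskId = 0;
    rec->state      = CKPT_TASK_READY;
}
```

A restore path would call `ckptTaskNormalize()` on each record before re-creating the task, so that no restored record carries a pointer into the previous RTP instance.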
2. Execution and Exception Stacks
In VxWorks 7:
- RTP stacks are MMU-protected
- Addresses are deterministic per RTP instance
Stacks are:
- Allocated from RTP memory pools
- Copied to persistent storage at checkpoint
- Restored before task activation
This works cleanly on both 32-bit and 64-bit RTPs.
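A minimal sketch of the copy-out and copy-back steps, assuming a downward-growing stack and a fixed-size image buffer. `CkptStackImage`, `ckptStackSave`, `ckptStackRestore`, and `CKPT_STACK_MAX` are hypothetical names; a real implementation would obtain the stack bounds and stack pointer from the task's saved context.

```c
#include <stddef.h>
#include <string.h>

#define CKPT_STACK_MAX 4096u   /* hypothetical upper bound on a saved image */

typedef struct {
    size_t        used;                   /* live bytes at checkpoint time */
    unsigned char bytes[CKPT_STACK_MAX];  /* copy of the live region       */
} CkptStackImage;

/* Saves the live part of a downward-growing stack occupying
 * [low, low + size). sp points at the lowest live byte. */
int ckptStackSave(const unsigned char *low, size_t size,
                  const unsigned char *sp, CkptStackImage *img)
{
    if (sp < low || sp > low + size)
        return -1;                        /* sp outside the stack region */
    size_t used = (size_t)((low + size) - sp);
    if (used > CKPT_STACK_MAX)
        return -1;                        /* image buffer too small */
    memcpy(img->bytes, sp, used);
    img->used = used;
    return 0;
}

/* Copies the image back to the top of a stack of the same size.
 * Returns the restored stack pointer, or NULL if the image does not fit. */
unsigned char *ckptStackRestore(unsigned char *low, size_t size,
                                const CkptStackImage *img)
{
    if (img->used > size)
        return NULL;
    unsigned char *sp = low + size - img->used;
    memcpy(sp, img->bytes, img->used);
    return sp;
}
```

Only the live region is copied, so checkpoint cost scales with stack depth at the safe point rather than with the allocated stack size.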
3. Global and Dynamic Memory
RTP ownership simplifies checkpointing.
Global Variables
Explicitly registered:
addGlobalVar(&systemState, sizeof(systemState));
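One way such a registry could work is sketched below. The article only shows the call site, so the `int` return type, the fixed table sizes, and the helpers `ckptSaveGlobals` and `ckptRestoreGlobals` are assumptions; the static `arena` stands in for the Flash/NVRAM backend.

```c
#include <stddef.h>
#include <string.h>

#define CKPT_MAX_REGIONS 32u     /* hypothetical registry capacity */
#define CKPT_ARENA_SIZE  4096u   /* hypothetical snapshot capacity */

typedef struct {
    void  *addr;     /* registered variable                */
    size_t size;     /* its size in bytes                  */
    size_t offset;   /* where its copy lives in the arena  */
} CkptRegion;

static CkptRegion    regions[CKPT_MAX_REGIONS];
static size_t        regionCount;
static size_t        arenaUsed;
static unsigned char arena[CKPT_ARENA_SIZE];  /* stand-in for persistent storage */

/* Registers one global for checkpointing. Returns 0 on success, -1 on error. */
int addGlobalVar(void *addr, size_t size)
{
    if (addr == NULL || size == 0)
        return -1;
    if (regionCount == CKPT_MAX_REGIONS || size > CKPT_ARENA_SIZE - arenaUsed)
        return -1;
    regions[regionCount].addr   = addr;
    regions[regionCount].size   = size;
    regions[regionCount].offset = arenaUsed;
    arenaUsed += size;
    regionCount++;
    return 0;
}

/* Copies every registered variable into the arena. */
void ckptSaveGlobals(void)
{
    for (size_t i = 0; i < regionCount; i++)
        memcpy(arena + regions[i].offset, regions[i].addr, regions[i].size);
}

/* Copies every saved value back to its registered address. */
void ckptRestoreGlobals(void)
{
    for (size_t i = 0; i < regionCount; i++)
        memcpy(regions[i].addr, arena + regions[i].offset, regions[i].size);
}
```

Because registration is explicit and the table is statically sized, the full set of checkpointed state can be enumerated during review, which helps the audit argument.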
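Dynamic Memory
Wrapped allocator (adapted from the 2013 original so that the size header keeps the payload aligned on 64-bit targets):
#include <stddef.h>
#include <stdlib.h>
/* Size header padded to max_align_t so the payload stays aligned on 32- and 64-bit RTPs. */
void *myMalloc(size_t size) {
    unsigned char *p = malloc(size + sizeof(max_align_t));
    if (p == NULL) return NULL;
    *(size_t *)p = size;               /* recorded so the checkpoint knows the block size */
    return p + sizeof(max_align_t);
}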
On restore, memory is rebuilt identically inside the RTP.
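For the checkpoint to find every live block, the wrapper also needs some form of bookkeeping. The sketch below extends the size-header idea with an intrusive list; this list, and the names `ckptMalloc`, `ckptFree`, and `ckptHeapWalk`, are assumptions for illustration and are not taken from the original design.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Block header: size plus a link to the next live block. The union with
 * max_align_t pads the header so the payload stays suitably aligned. */
typedef union CkptHdr {
    struct {
        size_t         size;
        union CkptHdr *next;
    } h;
    max_align_t align;
} CkptHdr;

static CkptHdr *liveBlocks;   /* head of the list of live allocations */

void *ckptMalloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(CkptHdr))
        return NULL;                       /* size + header would overflow */
    CkptHdr *hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL)
        return NULL;
    hdr->h.size = size;
    hdr->h.next = liveBlocks;
    liveBlocks  = hdr;
    return hdr + 1;                        /* payload starts after the header */
}

void ckptFree(void *p)
{
    if (p == NULL)
        return;
    CkptHdr *hdr = (CkptHdr *)p - 1;
    for (CkptHdr **link = &liveBlocks; *link != NULL; link = &(*link)->h.next) {
        if (*link == hdr) {                /* unlink, then release */
            *link = hdr->h.next;
            free(hdr);
            return;
        }
    }
}

/* Visits every live block (for example to copy it to the checkpoint store)
 * and returns the total payload size in bytes. visit may be NULL. */
size_t ckptHeapWalk(void (*visit)(void *payload, size_t size, void *ctx), void *ctx)
{
    size_t total = 0;
    for (CkptHdr *hdr = liveBlocks; hdr != NULL; hdr = hdr->h.next) {
        if (visit != NULL)
            visit(hdr + 1, hdr->h.size, ctx);
        total += hdr->h.size;
    }
    return total;
}
```

Note that this sketch is not thread-safe; in a multi-task RTP the list would need a mutex, and pointers stored inside checkpointed blocks remain valid only if blocks are restored at the same addresses.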
4. Kernel Objects (Semaphores, Queues)
Object pools are mandatory for this design in RTP systems.
Pre-allocation at RTP startup
SEM_ID semPool[MAX_SEM];
for (int i = 0; i < MAX_SEM; i++)
    semPool[i] = semBCreate(SEM_Q_FIFO, SEM_EMPTY);
Checkpoint stores:
- Logical state (empty/full)
- Queue depth
- Ownership relationships
Restore replays state, not creation.
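"Replay state, not creation" can be sketched as follows for binary semaphores. To keep the example runnable on a host, the semaphore operations are injected through a hypothetical `CkptSemOps` table; on VxWorks the two callbacks would presumably wrap `semGive(id)` and `semTake(id, NO_WAIT)`. All names here are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEM 8   /* hypothetical pool size */

/* Injected operations on one pooled binary semaphore (assumption). */
typedef struct {
    int (*give)(void *sem);         /* 0 on success                     */
    int (*takeNoWait)(void *sem);   /* 0 if taken, nonzero if empty     */
} CkptSemOps;

/* Logical state captured at checkpoint time. */
typedef struct {
    bool full[MAX_SEM];
} CkptSemImage;

/* Forces each pre-allocated semaphore to its recorded logical state.
 * No semaphore is created or deleted here. */
int ckptSemReplay(void *pool[], size_t count,
                  const CkptSemImage *img, const CkptSemOps *ops)
{
    if (count > MAX_SEM)
        return -1;
    for (size_t i = 0; i < count; i++) {
        (void)ops->takeNoWait(pool[i]);   /* drain; "already empty" is fine */
        if (img->full[i] && ops->give(pool[i]) != 0)
            return -1;
    }
    return 0;
}

/* Host-side fake semaphore, used only to exercise the replay logic. */
typedef struct { bool full; } FakeSem;

static int fakeGive(void *s)
{
    ((FakeSem *)s)->full = true;
    return 0;
}

static int fakeTake(void *s)
{
    FakeSem *f = s;
    if (!f->full)
        return -1;
    f->full = false;
    return 0;
}

const CkptSemOps fakeSemOps = { fakeGive, fakeTake };
```

Draining first and then giving makes the replay idempotent: running it twice against the same image leaves the pool in the same state.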
5. Files and Devices
RTPs isolate file descriptor tables.
Checkpoint records:
- Path
- Flags
- Offset
Restore logic:
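fd = open(path, flags & ~(O_CREAT | O_TRUNC | O_EXCL), 0);  /* reopen only; never re-create or truncate */
if (fd >= 0)
    lseek(fd, offset, SEEK_SET);                             /* resume at the checkpointed offset */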
Devices remain kernel-resident; RTP only replays configuration.
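A slightly fuller sketch of descriptor replay, using only POSIX calls. `CkptFileRecord` and `ckptFileReplay` are hypothetical names. The one design choice worth noting is that creation and truncation flags are masked out on replay; replaying the original `O_TRUNC` would destroy the data the offset refers to.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-descriptor checkpoint record. */
typedef struct {
    char  path[128];   /* path the file was opened with    */
    int   flags;       /* original open() flags            */
    off_t offset;      /* file position at checkpoint time */
} CkptFileRecord;

/* Reopens the file and seeks to the recorded offset.
 * Returns the new descriptor, or -1 on failure. */
int ckptFileReplay(const CkptFileRecord *rec)
{
    /* Reopen only: never re-create, truncate, or fail on existence. */
    int flags = rec->flags & ~(O_CREAT | O_TRUNC | O_EXCL);
    int fd = open(rec->path, flags, 0);
    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor number may differ from the original one, so application code should look descriptors up through the middleware rather than caching raw integers across a restore.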
Middleware Design (Concrete Layout)
A minimal, auditable middleware layout:
checkpoint/
├── ckpt_core.c # checkpoint FSM
├── ckpt_mem.c # memory registry
├── ckpt_task.c # task snapshot/restore
├── ckpt_ipc.c # sem/msgQ pools
├── ckpt_file.c # fd replay
├── ckpt_storage.c # Flash/NVRAM backend
├── ckpt_api.h
└── ckpt_config.h
Key APIs
ckptInit();
ckptRegisterTask(taskId);
ckptRegisterGlobal(void* addr, size_t size);
ckptCheckpointNow();
ckptRestore();
No kernel hooks. No private symbols. Certification-friendly.
Checkpoint Timing and Health Monitoring
Checkpointing is cooperative.
Safe points:
- Control-loop boundaries
- After IPC receive
- State machine transitions
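Cooperative checkpointing at a control-loop boundary could look like the sketch below. The every-N-iterations policy, `CkptSchedule`, and `ckptSafePoint` are assumptions chosen for illustration; the article does not prescribe a checkpoint interval.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checkpoint schedule: one checkpoint every `interval`
 * safe points (for example, every N control-loop iterations). */
typedef struct {
    uint32_t interval;    /* safe points between checkpoints        */
    uint32_t sinceLast;   /* safe points seen since last checkpoint */
    uint32_t taken;       /* total checkpoints requested so far     */
} CkptSchedule;

/* Call once at each safe point. Returns true when a checkpoint is due;
 * the caller would then invoke ckptCheckpointNow(). An interval of 0 or 1
 * requests a checkpoint at every safe point. */
bool ckptSafePoint(CkptSchedule *s)
{
    if (++s->sinceLast < s->interval)
        return false;
    s->sinceLast = 0;
    s->taken++;
    return true;
}
```

Because the decision is a pure function of the iteration count, the checkpoint instants are reproducible from run to run, which is what makes worst-case timing analysis tractable.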
In VxWorks 7:
- Health Monitor detects anomaly
- Middleware decides rollback vs restart
- Deterministic recovery path
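The rollback-versus-restart decision can be kept deterministic with bounded retry budgets. The sketch below is one possible policy, not the article's: the escalation order (rollback, then RTP restart, then safe state), the budgets, and the names `CkptPolicy` and `ckptDecide` are all assumptions.

```c
#include <stdbool.h>

typedef enum {
    CKPT_ACT_ROLLBACK,      /* restore the last checkpoint in place      */
    CKPT_ACT_RESTART_RTP,   /* restart the RTP, then restore             */
    CKPT_ACT_SAFE_STATE     /* give up recovery and enter the safe state */
} CkptAction;

/* Hypothetical bounded-retry policy. */
typedef struct {
    unsigned rollbackBudget;  /* rollbacks allowed between restarts */
    unsigned restartBudget;   /* RTP restarts allowed in total      */
    unsigned rollbacks;       /* rollbacks used since last restart  */
    unsigned restarts;        /* restarts used so far               */
} CkptPolicy;

/* Called when the health monitor reports an anomaly. Escalates from
 * rollback to restart to safe state as the budgets are exhausted. */
CkptAction ckptDecide(CkptPolicy *p, bool checkpointValid)
{
    if (checkpointValid && p->rollbacks < p->rollbackBudget) {
        p->rollbacks++;
        return CKPT_ACT_ROLLBACK;
    }
    if (p->restarts < p->restartBudget) {
        p->restarts++;
        p->rollbacks = 0;     /* a fresh RTP gets a fresh rollback budget */
        return CKPT_ACT_RESTART_RTP;
    }
    return CKPT_ACT_SAFE_STATE;
}
```

With finite budgets, the number of recovery attempts before reaching the safe state is bounded and can be stated in the safety case.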
Certification Alignment
DO-178C (Avionics)
| DO-178C Objective | Mapping |
|---|---|
| Determinism | Explicit checkpoint points |
| Error containment | RTP isolation |
| No unintended functionality | No kernel modification |
| Verification | Simics + replayable recovery |
Supports Levels B through A when combined with partitioning.
IEC 61508 (Industrial Safety)
| IEC 61508 Concept | Mapping |
|---|---|
| Fault detection | Health monitor + exceptions |
| Fault reaction | Task rollback |
| Safe state | Checkpoint-defined |
| Diagnostic coverage | Explicit state capture |
The middleware qualifies as an application-level safety mechanism.
RTP Checkpointing vs ARINC 653 Partition Restart
| Aspect | ARINC 653 | RTP + Checkpoint |
|---|---|---|
| Recovery unit | Partition | Task or RTP |
| Restart type | Cold or warm start | Stateful |
| State retention | None on cold start; implementation-defined on warm start | Registered state |
| Flexibility | Low | High |
| Certification | Strong | Strong (with argument) |
Key Insight: ARINC 653 guarantees isolation. RTP checkpointing guarantees continuity.
They are complementary, not competing.
Validation and Tooling
- Original PowerPC prototype: millisecond-level recovery
- VxWorks 7 adds:
  - MMU safety
  - RTP restart hooks
  - Simics full-system checkpoints
Simics checkpoints validate runtime checkpoint correctness under controlled fault injection.
Known Constraints (Engineering Reality)
- No transparent socket rollback
- No ISR rollback
- Cooperative checkpointing required
- Runtime overhead ~10–20%
All constraints are explicit, documentable, and certifiable.
Final Takeaway
VxWorks 7 gave us isolation. The legacy VxWorks design gave us memory of the past.
Combined, they enable something rare in embedded systems:
A system that fails, heals, and continues, without forgetting who it was.
This is not a workaround. This is stateful resilience by design.
If your system must survive faults without losing its mission, this architecture is no longer optional; it's inevitable.