Stateful Recovery on VxWorks 7 RTPs: Checkpointing, Certification, and Partition-Aware Design

In safety-critical embedded systems, restart is easy.

Recovery is hard.

VxWorks 7 improved robustness by combining Real-Time Processes (RTPs, first introduced in VxWorks 6), MMU isolation, and certification-ready kernels. Yet one gap remains unchanged since VxWorks 6:

When an RTP restarts, all application state is lost.

This article revisits a legacy VxWorks checkpointing and task recovery mechanism and performs a full, explicit redesign for VxWorks 7 RTP-based systems, extending it with:

Certification arguments (DO-178C / IEC 61508)
A concrete middleware architecture
A direct comparison with ARINC 653 partition restart semantics

The goal is not theory, but deployable, auditable recovery.

🧠 Why RTP Restart Is Not Enough

VxWorks 7 RTPs provide:

Address-space isolation
Fault containment
Deterministic restart

But the restart model is stateless.

Failure Event	Native VxWorks 7 Outcome
Task exception	Task terminated
RTP fault	RTP restarted
Application state	Lost
Control continuity	Broken

For flight control, power systems, robotics, or long-running edge AI pipelines, this is often unacceptable.

The legacy VxWorks checkpointing design fills this gap by introducing task-level state persistence, without kernel modification.

🧩 Overall Architecture: RTP-Scoped Self-Recovery

The redesigned system embeds the checkpoint mechanism inside each RTP.

+------------------------------------------------+
| RTP                                            |
|                                                |
|  +---------------------+                       |
|  | Application         |                       |
|  | Tasks               |                       |
|  +---------------------+                       |
|                                                |
|  +---------------------+                       |
|  | Checkpoint          |                       |
|  | Middleware          |                       |
|  |  - Memory registry  |                       |
|  |  - Object pools     |                       |
|  |  - Recovery FSM     |                       |
|  +---------------------+                       |
|                                                |
+------------------------------------------------+
| VxWorks 7 Kernel (MMU, Scheduler, Health Mon.) |
+------------------------------------------------+

This creates a stateful RTP:

Tasks can fail and resume
RTPs can restart and restore
Kernel remains untouched

🧱 Checkpoint Content (RTP-Aware)

The original five checkpoint categories remain intact, but are reinterpreted under RTP ownership rules.

1. Task Control Blocks (TCB)

Tasks still use WIND_TCB, but:

IDs are RTP-local
Kernel queue pointers are invalid after restart

Checkpoint Strategy

Store logical task state only
Reset kernel-managed fields
Normalize restored tasks to READY

Restore flow:

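/* Schematic: argument lists are abbreviated; see the taskLib reference for the full signatures. */
taskInitExcStk(tcb, entry, stackBase, stackSize);  /* re-initialize the TCB over the restored stacks */
taskActivate(taskId);                              /* restored task becomes READY */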

This avoids undefined kernel references while preserving execution context.
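
"Store logical task state only" can be sketched as a small record type. The structure and function names below (`CkptTaskRecord`, `ckptTaskCapture`, `ckptTaskNormalize`) are hypothetical and not part of VxWorks; the point is only that kernel queue pointers and task IDs are never captured, and that every record is forced to READY before restore.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical logical task states; not the kernel's internal encoding. */
typedef enum {
    CKPT_TASK_READY,
    CKPT_TASK_PENDED,
    CKPT_TASK_DELAYED,
    CKPT_TASK_SUSPENDED
} CkptTaskState;

/* Hypothetical record: only fields that stay meaningful after an RTP restart. */
typedef struct {
    char          name[16];
    int           priority;
    void        (*entry)(void);
    size_t        stackSize;
    CkptTaskState state;
} CkptTaskRecord;

/* Capture logical state; kernel queue pointers and task IDs are deliberately omitted. */
void ckptTaskCapture(CkptTaskRecord *rec, const char *name, int priority,
                     void (*entry)(void), size_t stackSize, CkptTaskState state)
{
    memset(rec, 0, sizeof(*rec));
    strncpy(rec->name, name, sizeof(rec->name) - 1);  /* always NUL-terminated */
    rec->priority  = priority;
    rec->entry     = entry;
    rec->stackSize = stackSize;
    rec->state     = state;
}

/* Normalize before restore: whatever the task was waiting on no longer exists. */
void ckptTaskNormalize(CkptTaskRecord *rec)
{
    rec->state = CKPT_TASK_READY;
}
```

Storing the entry point as a raw function pointer assumes the RTP image loads at the same address after restart; if that does not hold, an index into a table of entry points would be the safer choice.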

2. Execution and Exception Stacks

In VxWorks 7:

RTP stacks are MMU-protected
Addresses are deterministic per RTP instance

Stacks are:

Allocated from RTP memory pools
Copied to persistent storage at checkpoint
Restored before task activation

This works cleanly on both 32-bit and 64-bit RTPs.
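
A minimal sketch of the copy and restore steps, assuming a downward-growing stack and a hypothetical `CkptStack` descriptor that the middleware fills in when it allocates the stack from its pool. Only the live portion of the stack is copied. Nothing here is a VxWorks API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one task stack inside the RTP. */
typedef struct {
    uint8_t *low;    /* lowest address of the stack region */
    size_t   size;   /* total region size in bytes */
    size_t   used;   /* bytes in use, measured down from the top (stack grows down) */
} CkptStack;

/* Copy the live part of the stack into `out`. Returns 0 on success, -1 on error. */
int ckptStackSave(const CkptStack *s, uint8_t *out, size_t outCap, size_t *outLen)
{
    if (s->used > s->size || s->used > outCap)
        return -1;
    memcpy(out, s->low + (s->size - s->used), s->used);
    *outLen = s->used;
    return 0;
}

/* Put the saved bytes back at the same offsets. Must run before the task is activated. */
int ckptStackRestore(CkptStack *s, const uint8_t *in, size_t len)
{
    if (len > s->size)
        return -1;
    memcpy(s->low + (s->size - len), in, len);
    s->used = len;
    return 0;
}
```

Because saved frames contain absolute addresses, this only works if the stack region sits at the same address in the restarted RTP, which is what the deterministic-address property above is relied on for.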

3. Global and Dynamic Memory

RTP ownership simplifies checkpointing.

Global Variables

Explicitly registered:

ckptRegisterGlobal(&systemState, sizeof(systemState));
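
One way the registration call could be backed is a fixed-size table that is walked at checkpoint and restore time. This is a sketch under assumptions: the table capacity, the `int` return convention, and the `ckptSaveGlobals` / `ckptRestoreGlobals` helpers are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define CKPT_MAX_REGIONS 32   /* assumed fixed capacity; no allocation at run time */

typedef struct { void *addr; size_t size; } CkptRegion;

static CkptRegion ckptRegions[CKPT_MAX_REGIONS];
static size_t     ckptRegionCount;

/* Register a global for checkpointing. Returns 0 on success, -1 if invalid or full. */
int ckptRegisterGlobal(void *addr, size_t size)
{
    if (addr == NULL || size == 0 || ckptRegionCount == CKPT_MAX_REGIONS)
        return -1;
    ckptRegions[ckptRegionCount].addr = addr;
    ckptRegions[ckptRegionCount].size = size;
    ckptRegionCount++;
    return 0;
}

/* Serialize all regions back to back. Returns bytes written, or 0 if `cap` is too small. */
size_t ckptSaveGlobals(unsigned char *buf, size_t cap)
{
    size_t off = 0;
    for (size_t i = 0; i < ckptRegionCount; i++) {
        if (ckptRegions[i].size > cap - off)
            return 0;
        memcpy(buf + off, ckptRegions[i].addr, ckptRegions[i].size);
        off += ckptRegions[i].size;
    }
    return off;
}

/* Copy a saved image back into the registered regions, in registration order. */
int ckptRestoreGlobals(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    for (size_t i = 0; i < ckptRegionCount; i++) {
        if (ckptRegions[i].size > len - off)
            return -1;
        memcpy(ckptRegions[i].addr, buf + off, ckptRegions[i].size);
        off += ckptRegions[i].size;
    }
    return 0;
}
```

The image carries no per-region tags, so restore depends on the restarted RTP registering the same globals in the same order before it calls the restore helper.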

Dynamic Memory

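Wrapped allocator (same approach as the 2013 design, with a NULL check and an alignment-safe header):

#include <stdlib.h>

#define CKPT_HDR 16  /* header size; a multiple of the strictest alignment, so the payload stays aligned */

/* Allocate size bytes and record the size in a header in front of the payload. */
void* myMalloc(size_t size) {
    char* p = malloc(size + CKPT_HDR);
    if (p == NULL) return NULL;
    *(size_t*)p = size;
    return p + CKPT_HDR;
}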

On restore, memory is rebuilt identically inside the RTP.
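
For the checkpoint to find every live allocation, the size header alone is not enough; the blocks also have to be enumerable. The sketch below extends the wrapper idea with a linked list. The names `CkptBlock`, `ckptMalloc`, `ckptFree`, and `ckptHeapWalk` are hypothetical, and it only enumerates blocks. Putting them back at the same addresses, so that saved pointers stay valid, is assumed to come from a deterministic RTP memory pool.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every tracked block. */
typedef struct CkptBlock {
    struct CkptBlock *next;
    size_t            size;   /* payload size in bytes */
} CkptBlock;

static CkptBlock *ckptBlocks;   /* singly linked list of live blocks */

/* Allocate a tracked block and link it into the live list. */
void *ckptMalloc(size_t size)
{
    CkptBlock *b = malloc(sizeof(CkptBlock) + size);
    if (b == NULL)
        return NULL;
    b->size = size;
    b->next = ckptBlocks;
    ckptBlocks = b;
    return b + 1;               /* payload starts right after the header */
}

/* Unlink and release a tracked block; unknown pointers are ignored. */
void ckptFree(void *p)
{
    if (p == NULL)
        return;
    CkptBlock *b = (CkptBlock *)p - 1;
    for (CkptBlock **pp = &ckptBlocks; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == b) {
            *pp = b->next;
            free(b);
            return;
        }
    }
}

/* Visit every live block (e.g. to copy payloads into the image); returns total payload bytes. */
size_t ckptHeapWalk(void (*visit)(void *payload, size_t size, void *ctx), void *ctx)
{
    size_t total = 0;
    for (CkptBlock *b = ckptBlocks; b != NULL; b = b->next) {
        if (visit != NULL)
            visit(b + 1, b->size, ctx);
        total += b->size;
    }
    return total;
}
```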

4. Kernel Objects (Semaphores, Queues)

Object pools are mandatory in RTP systems.

Pre-allocation at RTP startup

SEM_ID semPool[MAX_SEM];
for (int i = 0; i < MAX_SEM; i++)
    semPool[i] = semBCreate(SEM_Q_FIFO, SEM_EMPTY);

Checkpoint stores:

Logical state (empty/full)
Queue depth
Ownership relationships

Restore replays state, not creation.
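
"Replays state, not creation" can be illustrated with a host-side model. The `FakeSem` type and its two helpers are stand-ins so the sketch runs without VxWorks; on a target they would map to `semTake(semId, NO_WAIT)` and `semGive(semId)` on the pre-created pool. The image layout is an assumption and covers only the empty/full flag, not queue depth or ownership.

```c
#include <stddef.h>

#define MAX_SEM 4

/* Host-side stand-in for a binary semaphore so the sketch runs without VxWorks. */
typedef struct { int full; } FakeSem;

static FakeSem semPool[MAX_SEM];   /* pre-created once at startup */

static void fakeGive(FakeSem *s)
{
    s->full = 1;
}

static int fakeTryTake(FakeSem *s)
{
    if (!s->full)
        return -1;                 /* already empty, nothing to take */
    s->full = 0;
    return 0;
}

/* Checkpointed logical state: one flag per pool slot (1 = full, 0 = empty). */
typedef struct { unsigned char full[MAX_SEM]; } CkptSemImage;

void ckptSemCapture(CkptSemImage *img)
{
    for (size_t i = 0; i < MAX_SEM; i++)
        img->full[i] = (unsigned char)semPool[i].full;
}

/* Replay: drive each existing semaphore to its recorded state; nothing is created. */
void ckptSemReplay(const CkptSemImage *img)
{
    for (size_t i = 0; i < MAX_SEM; i++) {
        (void)fakeTryTake(&semPool[i]);   /* force to empty first */
        if (img->full[i])
            fakeGive(&semPool[i]);        /* then give if it was full */
    }
}
```

Forcing each semaphore to empty before giving makes the replay idempotent: running it twice leaves the pool in the same state.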

5. Files and Devices

RTPs isolate file descriptor tables.

Checkpoint records:

Path
Flags
Offset

Restore logic:

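fd = open(path, flags & ~(O_CREAT | O_TRUNC), 0);  /* reopen without recreating or truncating the file */
if (fd >= 0)
    lseek(fd, offset, SEEK_SET);                   /* return to the checkpointed position */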

Devices remain kernel-resident; RTP only replays configuration.
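
The record and replay steps might look like the following, using only POSIX calls. `CkptFileRecord` and the two function names are hypothetical, and the 256-byte path limit is an arbitrary assumption. Creation-time flags are masked on replay so that reopening cannot truncate or recreate the file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-descriptor record: the three fields the checkpoint stores. */
typedef struct {
    char  path[256];
    int   flags;
    off_t offset;
} CkptFileRecord;

/* Capture path, flags, and the current offset of an open descriptor. */
int ckptFileCapture(CkptFileRecord *rec, int fd, const char *path, int flags)
{
    off_t pos = lseek(fd, 0, SEEK_CUR);
    if (pos == (off_t)-1 || strlen(path) >= sizeof(rec->path))
        return -1;
    strcpy(rec->path, path);
    rec->flags  = flags;
    rec->offset = pos;
    return 0;
}

/* Reopen and seek. Returns the new descriptor, or -1 on failure. */
int ckptFileReplay(const CkptFileRecord *rec)
{
    /* Mask creation-time flags so replay cannot destroy existing data. */
    int fd = open(rec->path, rec->flags & ~(O_CREAT | O_TRUNC | O_EXCL), 0);
    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The replayed descriptor number may differ from the original, so the middleware would also need a mapping from the old descriptor to the new one for any task that cached it.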

🛠️ Middleware Design (Concrete Layout)

A minimal, auditable middleware layout:

checkpoint/
├── ckpt_core.c        # checkpoint FSM
├── ckpt_mem.c         # memory registry
├── ckpt_task.c        # task snapshot/restore
├── ckpt_ipc.c         # sem/msgQ pools
├── ckpt_file.c        # fd replay
├── ckpt_storage.c     # Flash/NVRAM backend
├── ckpt_api.h
└── ckpt_config.h

Key APIs

ckptInit();
ckptRegisterTask(taskId);
ckptRegisterGlobal(void* addr, size_t size);
ckptCheckpointNow();
ckptRestore();

No kernel hooks. No private symbols. Certification-friendly.

⏱️ Checkpoint Timing and Health Monitoring

Checkpointing is cooperative.

Safe points:

Control-loop boundaries
After IPC receive
State machine transitions

In VxWorks 7:

Health Monitor detects anomaly
Middleware decides rollback vs restart
Deterministic recovery path
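
The rollback-versus-restart decision can be kept deterministic by making it a pure function of a small state record. The fault classes, the escalation limit of three, and all names below are assumptions chosen for illustration, not the article's actual policy.

```c
/* Hypothetical fault classes reported to the middleware. */
typedef enum {
    FAULT_TASK_EXCEPTION,
    FAULT_DEADLINE_MISS,
    FAULT_MEMORY_CORRUPTION
} CkptFault;

typedef enum {
    ACTION_ROLLBACK_TASK,   /* restore the last checkpoint and resume the task */
    ACTION_RESTART_RTP      /* give up on rollback and restart the whole RTP */
} CkptAction;

#define CKPT_MAX_ROLLBACKS 3   /* assumed policy limit before escalating */

typedef struct {
    unsigned rollbacks;        /* consecutive rollbacks since the last restart */
    int      checkpointValid;  /* nonzero if a usable checkpoint exists */
} CkptRecoveryState;

/* Same state and fault always yield the same action, which keeps recovery auditable. */
CkptAction ckptDecide(CkptRecoveryState *st, CkptFault fault)
{
    if (!st->checkpointValid ||
        fault == FAULT_MEMORY_CORRUPTION ||
        st->rollbacks >= CKPT_MAX_ROLLBACKS) {
        st->rollbacks = 0;
        return ACTION_RESTART_RTP;
    }
    st->rollbacks++;
    return ACTION_ROLLBACK_TASK;
}
```

Bounding the number of consecutive rollbacks prevents a task that faults immediately after every restore from looping forever.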

📜 Certification Alignment

DO-178C (Avionics)

DO-178C Objective	Mapping
Determinism	Explicit checkpoint points
Error containment	RTP isolation
No unintended functionality	No kernel modification
Verification	Simics + replayable recovery

Can support a Level A or Level B certification argument when combined with partitioning.

IEC 61508 (Industrial Safety)

IEC 61508 Concept	Mapping
Fault detection	Health monitor + exceptions
Fault reaction	Task rollback
Safe state	Checkpoint-defined
Diagnostic coverage	Explicit state capture

The middleware qualifies as an application-level safety mechanism.

🆚 RTP Checkpointing vs ARINC 653 Partition Restart

Aspect	ARINC 653	RTP + Checkpoint
Recovery unit	Partition	Task
Restart type	Cold	Stateful
State retention	None	Full
Flexibility	Low	High
Certification	Strong	Strong (with argument)

Key Insight: ARINC 653 guarantees isolation. RTP checkpointing guarantees continuity.

They are complementary—not competing.

🧪 Validation and Tooling

Original PowerPC prototype: millisecond-level recovery

VxWorks 7 adds:

MMU safety
RTP restart hooks
Simics full-system checkpoints

Simics checkpoints validate runtime checkpoint correctness under controlled fault injection.

⚠️ Known Constraints (Engineering Reality)

No transparent socket rollback
No ISR rollback
Cooperative checkpointing required
Runtime overhead ~10–20%

All constraints are explicit, documentable, and certifiable.

🏁 Final Takeaway

VxWorks 7 gave us isolation. The legacy VxWorks design gave us memory of the past.

Combined, they enable something rare in embedded systems:

A system that fails, heals, and continues — without forgetting who it was.

This is not a workaround. This is stateful resilience by design.

If your system must survive faults without losing its mission, this architecture is no longer optional — it’s inevitable.
