Stateful Recovery on VxWorks 7 RTPs: Checkpointing, Certification, and Partition-Aware Design

VxWorks RTOS Fault Tolerance RTP Safety-Critical

In safety-critical embedded systems, restart is easy.

Recovery is hard.

VxWorks 7 improved robustness with its Real-Time Process (RTP) model, MMU isolation, and certification-ready kernels. Yet one gap remains unchanged since VxWorks 6, where RTPs first appeared:

When an RTP restarts, all application state is lost.

This article revisits a legacy VxWorks checkpointing and task recovery mechanism and performs a full, explicit redesign for VxWorks 7 RTP-based systems, extending it with:

  • Certification arguments (DO-178C / IEC 61508)
  • A concrete middleware architecture
  • A direct comparison with ARINC 653 partition restart semantics

The goal is not theory, but deployable, auditable recovery.


🧠 Why RTP Restart Is Not Enough

VxWorks 7 RTPs provide:

  • Address-space isolation
  • Fault containment
  • Deterministic restart

But the restart model is stateless.

| Failure Event      | Native VxWorks 7 Outcome |
|--------------------|--------------------------|
| Task exception     | Task terminated          |
| RTP fault          | RTP restarted            |
| Application state  | Lost                     |
| Control continuity | Broken                   |

For flight control, power systems, robotics, or long-running edge AI pipelines, this is often unacceptable.

The legacy VxWorks checkpointing design fills this gap by introducing task-level state persistence, without kernel modification.


🧩 Overall Architecture: RTP-Scoped Self-Recovery

The redesigned system embeds the checkpoint mechanism inside each RTP.

+------------------------------------------------+
| RTP                                            |
|                                                |
|  +-------------------+                         |
|  | Application       |                         |
|  | Tasks             |                         |
|  +-------------------+                         |
|                                                |
|  +-------------------+                         |
|  | Checkpoint        |                         |
|  | Middleware        |                         |
|  |  - Memory registry|                         |
|  |  - Object pools   |                         |
|  |  - Recovery FSM   |                         |
|  +-------------------+                         |
|                                                |
+------------------------------------------------+
| VxWorks 7 Kernel (MMU, Scheduler, Health Mon.) |
+------------------------------------------------+

This creates a stateful RTP:

  • Tasks can fail and resume
  • RTPs can restart and restore
  • Kernel remains untouched

🧱 Checkpoint Content (RTP-Aware)

The original five checkpoint categories remain intact, but are reinterpreted under RTP ownership rules.


1. Task Control Blocks (TCB)

Tasks still use WIND_TCB, but:

  • IDs are RTP-local
  • Kernel queue pointers are invalid after restart

Checkpoint Strategy

  • Store logical task state only
  • Reset kernel-managed fields
  • Normalize restored tasks to READY

Restore flow:

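/* Abbreviated call: the full taskInitExcStk() signature also takes the task name, priority, options, and exception-stack arguments. */
taskInitExcStk(tcb, entry, stackBase, stackSize);
taskActivate(taskId);   /* taskId identifies the task initialized above */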

This avoids undefined kernel references while preserving execution context.
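
As a rough illustration of "store logical task state only", the sketch below keeps a small per-task record and normalizes it before restore. Every name here (`ckpt_task_rec_t`, `ckptTaskNormalize`, the state enum) is hypothetical and not part of the VxWorks API; the real `WIND_TCB` layout is deliberately not touched.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical logical task states; these are not VxWorks constants. */
typedef enum {
    CKPT_TASK_READY,
    CKPT_TASK_PENDED,
    CKPT_TASK_DELAYED,
    CKPT_TASK_SUSPENDED
} ckpt_task_state_t;

/* What the checkpoint keeps per task: enough to recreate it. */
typedef struct {
    char              name[16];
    int               priority;
    size_t            stackSize;
    ckpt_task_state_t state;
    void             *pendQueue;  /* kernel-managed reference, stale after restart */
} ckpt_task_rec_t;

/* Prepare a saved record for restore: drop kernel references, force READY. */
void ckptTaskNormalize(ckpt_task_rec_t *rec)
{
    rec->pendQueue = NULL;
    rec->state = CKPT_TASK_READY;
}
```

A restore path would call `ckptTaskNormalize()` on each record before handing it to whatever routine recreates the task.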


2. Execution and Exception Stacks

In VxWorks 7:

  • RTP stacks are MMU-protected
  • Addresses are deterministic per RTP instance, which a raw stack copy relies on because saved frames hold absolute pointers

Stacks are:

  • Allocated from RTP memory pools
  • Copied to persistent storage at checkpoint
  • Restored before task activation

This works cleanly on both 32-bit and 64-bit RTPs.
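
The copy-and-restore step can be sketched in portable C. This is a minimal sketch, assuming a downward-growing stack and a fixed snapshot capacity; `ckpt_stack_snap_t`, `ckptStackSave`, and `ckptStackRestore` are hypothetical names, and a real implementation would write to persistent storage rather than a RAM buffer.

```c
#include <stddef.h>
#include <string.h>

#define CKPT_STACK_CAP 4096  /* hypothetical snapshot capacity in bytes */

typedef struct {
    size_t        used;                  /* bytes in use at checkpoint time */
    unsigned char data[CKPT_STACK_CAP];
} ckpt_stack_snap_t;

/* Save the live region [sp, base) of a downward-growing stack.
 * Returns 0 on success, -1 if the region is invalid or too large. */
int ckptStackSave(ckpt_stack_snap_t *snap,
                  const unsigned char *sp, const unsigned char *base)
{
    if (sp > base)
        return -1;
    size_t used = (size_t)(base - sp);
    if (used > sizeof snap->data)
        return -1;
    memcpy(snap->data, sp, used);
    snap->used = used;
    return 0;
}

/* Copy the snapshot back below `base`. Returns the restored stack
 * pointer, or NULL if the snapshot does not fit in the new stack. */
unsigned char *ckptStackRestore(const ckpt_stack_snap_t *snap,
                                unsigned char *base, size_t stackSize)
{
    if (snap->used > stackSize)
        return NULL;
    unsigned char *sp = base - snap->used;
    memcpy(sp, snap->data, snap->used);
    return sp;
}
```

Only the bytes between the stack pointer and the stack base are copied, so checkpoint size tracks actual stack depth rather than the configured stack size.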


3. Global and Dynamic Memory

RTP ownership simplifies checkpointing.

Global Variables

Explicitly registered:

addGlobalVar(&systemState, sizeof(systemState));
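
Behind a registration call like this, a registry can be as simple as a fixed table of address/size pairs that is walked at checkpoint and restore time. The sketch below is one possible shape, not the original implementation: the table size, the `int` return type, and the `ckptSaveGlobals`/`ckptRestoreGlobals` helpers are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define CKPT_MAX_REGIONS 32  /* hypothetical table size */

typedef struct { void *addr; size_t size; } ckpt_region_t;

static ckpt_region_t regions[CKPT_MAX_REGIONS];
static size_t regionCount;

/* Register one region. Returns 0 on success, -1 on bad args or full table. */
int ckptRegisterGlobal(void *addr, size_t size)
{
    if (addr == NULL || size == 0 || regionCount == CKPT_MAX_REGIONS)
        return -1;
    regions[regionCount].addr = addr;
    regions[regionCount].size = size;
    regionCount++;
    return 0;
}

/* Copy all registered regions into buf, back to back.
 * Returns bytes written, or 0 if buf is too small. */
size_t ckptSaveGlobals(unsigned char *buf, size_t cap)
{
    size_t off = 0;
    for (size_t i = 0; i < regionCount; i++) {
        if (regions[i].size > cap - off)
            return 0;
        memcpy(buf + off, regions[i].addr, regions[i].size);
        off += regions[i].size;
    }
    return off;
}

/* Copy a saved image back in registration order.
 * Returns bytes consumed, or 0 if len does not match the registry. */
size_t ckptRestoreGlobals(const unsigned char *buf, size_t len)
{
    size_t need = 0;
    for (size_t i = 0; i < regionCount; i++)
        need += regions[i].size;
    if (need != len)
        return 0;  /* layout mismatch: refuse a partial restore */

    size_t off = 0;
    for (size_t i = 0; i < regionCount; i++) {
        memcpy(regions[i].addr, buf + off, regions[i].size);
        off += regions[i].size;
    }
    return off;
}
```

Because restore depends on registration order, the restarted RTP has to register the same regions in the same sequence before restoring.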

Dynamic Memory

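Wrapped allocator (carried over from the 2013 design, with the size header widened so it stays correct on 64-bit RTPs):

#include <stddef.h>
#include <stdlib.h>
#define CKPT_HDR sizeof(max_align_t)   /* header size that keeps the payload aligned */

void* myMalloc(size_t size) {
    char* p = malloc(size + CKPT_HDR);
    if (p == NULL) return NULL;
    *(size_t*)p = size;                /* block size, read back at checkpoint time */
    return p + CKPT_HDR;
}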

On restore, memory is rebuilt identically inside the RTP.
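
Rebuilding the heap requires knowing every live block and its size. One way to get that, sketched below, is to chain the allocation headers into a list the checkpointer can walk. The names `ckptMalloc`, `ckptFree`, and `ckptLiveBytes` are hypothetical, and the sketch covers only enumeration, not the address-matching needed to keep saved pointers valid after restore.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of every tracked block. The union with
 * max_align_t keeps the payload suitably aligned for any type. */
typedef union ckpt_hdr {
    struct {
        union ckpt_hdr *next;
        union ckpt_hdr *prev;
        size_t          size;
    } h;
    max_align_t align;
} ckpt_hdr_t;

static ckpt_hdr_t *liveBlocks;  /* head of the live-allocation list */

void *ckptMalloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(ckpt_hdr_t))
        return NULL;            /* header + payload would overflow */
    ckpt_hdr_t *hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL)
        return NULL;
    hdr->h.size = size;
    hdr->h.prev = NULL;
    hdr->h.next = liveBlocks;
    if (liveBlocks != NULL)
        liveBlocks->h.prev = hdr;
    liveBlocks = hdr;
    return hdr + 1;             /* payload starts right after the header */
}

void ckptFree(void *p)
{
    if (p == NULL)
        return;
    ckpt_hdr_t *hdr = (ckpt_hdr_t *)p - 1;
    if (hdr->h.prev != NULL)
        hdr->h.prev->h.next = hdr->h.next;
    else
        liveBlocks = hdr->h.next;
    if (hdr->h.next != NULL)
        hdr->h.next->h.prev = hdr->h.prev;
    free(hdr);
}

/* Total payload bytes a checkpoint would need to capture right now. */
size_t ckptLiveBytes(void)
{
    size_t total = 0;
    for (const ckpt_hdr_t *it = liveBlocks; it != NULL; it = it->h.next)
        total += it->h.size;
    return total;
}
```

The same list walk that computes `ckptLiveBytes()` is where a checkpointer would copy each payload out.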


4. Kernel Objects (Semaphores, Queues)

Object pools are mandatory in RTP systems.

Pre-allocation at RTP startup

SEM_ID semPool[MAX_SEM];
for (int i = 0; i < MAX_SEM; i++)
    semPool[i] = semBCreate(SEM_Q_FIFO, SEM_EMPTY);

Checkpoint stores:

  • Logical state (empty/full)
  • Queue depth
  • Ownership relationships

Restore replays state, not creation.
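
"Replays state, not creation" can be sketched as follows: the pool already exists and every semaphore starts empty, so restore only needs to give the ones that were full at checkpoint time. The sketch is host-runnable and uses a callback in place of the real give operation; on VxWorks that callback would presumably wrap `semGive(semPool[i])`. `ckptSemReplay`, the state enum, and the `hostGive` stand-in are all hypothetical.

```c
#include <stddef.h>

#define MAX_SEM 8  /* hypothetical pool size */

typedef enum { CKPT_SEM_EMPTY = 0, CKPT_SEM_FULL = 1 } ckpt_sem_state_t;

/* Gives pool semaphore `index`; returns 0 on success. */
typedef int (*ckpt_sem_give_fn)(size_t index);

/* Replay saved logical state onto a freshly created, all-empty pool.
 * Returns the number of semaphores given, or -1 on the first failure. */
int ckptSemReplay(const ckpt_sem_state_t saved[MAX_SEM], ckpt_sem_give_fn give)
{
    int given = 0;
    for (size_t i = 0; i < MAX_SEM; i++) {
        if (saved[i] != CKPT_SEM_FULL)
            continue;           /* pool default is already empty */
        if (give(i) != 0)
            return -1;
        given++;
    }
    return given;
}

/* Host stand-in for the pool so the sketch runs off-target. */
static int hostSem[MAX_SEM];

static int hostGive(size_t index)
{
    hostSem[index] = 1;
    return 0;
}
```

Calling `ckptSemReplay(saved, hostGive)` on a host build exercises the replay logic without any kernel objects.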


5. Files and Devices

RTPs isolate file descriptor tables.

Checkpoint records:

  • Path
  • Flags
  • Offset

Restore logic:

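/* Mask creation and truncation flags so replay cannot destroy file contents. */
fd = open(path, flags & ~(O_CREAT | O_TRUNC | O_EXCL));
if (fd >= 0)
    lseek(fd, offset, SEEK_SET);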

Devices remain kernel-resident; RTP only replays configuration.
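
Putting the record and the replay together, a minimal sketch using only POSIX calls might look like this. `ckpt_fd_rec_t`, `ckptFdRecord`, and `ckptFdReplay` are hypothetical names; the caller passes in the path and flags because they cannot be portably recovered from a descriptor.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* One checkpoint record per open file. */
typedef struct {
    char  path[256];
    int   flags;
    off_t offset;
} ckpt_fd_rec_t;

/* Capture path, flags, and current offset. Returns 0 on success, -1 on error. */
int ckptFdRecord(ckpt_fd_rec_t *rec, int fd, const char *path, int flags)
{
    off_t off = lseek(fd, 0, SEEK_CUR);   /* query position without moving it */
    if (off == (off_t)-1 || strlen(path) >= sizeof rec->path)
        return -1;
    strcpy(rec->path, path);
    rec->flags = flags;
    rec->offset = off;
    return 0;
}

/* Reopen and reposition. Returns the new descriptor, or -1 on error. */
int ckptFdReplay(const ckpt_fd_rec_t *rec)
{
    /* Strip flags that would create or truncate the file on replay. */
    int fd = open(rec->path, rec->flags & ~(O_CREAT | O_TRUNC | O_EXCL));
    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The replayed descriptor number may differ from the original, so application code would need to look descriptors up through the middleware rather than caching raw values.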


πŸ› οΈ Middleware Design (Concrete Layout)
#

A minimal, auditable middleware layout:

checkpoint/
├── ckpt_core.c        # checkpoint FSM
├── ckpt_mem.c         # memory registry
├── ckpt_task.c        # task snapshot/restore
├── ckpt_ipc.c         # sem/msgQ pools
├── ckpt_file.c        # fd replay
├── ckpt_storage.c     # Flash/NVRAM backend
├── ckpt_api.h
└── ckpt_config.h

Key APIs

ckptInit();
ckptRegisterTask(taskId);
ckptRegisterGlobal(void* addr, size_t size);
ckptCheckpointNow();
ckptRestore();

No kernel hooks. No private symbols. Certification-friendly.


⏱️ Checkpoint Timing and Health Monitoring

Checkpointing is cooperative.

Safe points:

  • Control-loop boundaries
  • After IPC receive
  • State machine transitions

In VxWorks 7:

  • Health Monitor detects anomaly
  • Middleware decides rollback vs restart
  • Deterministic recovery path
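
The "rollback vs restart" decision can be written as a small state machine. The policy below is an assumption made for illustration, not the article's specified behavior: roll back on a task fault while a valid checkpoint exists and a retry budget remains, otherwise escalate to an RTP restart. All type and function names are hypothetical.

```c
typedef enum { CKPT_RUNNING, CKPT_ROLLBACK, CKPT_RESTART } ckpt_fsm_state_t;

typedef enum {
    CKPT_EV_TASK_FAULT,        /* anomaly confined to one task */
    CKPT_EV_RTP_FAULT,         /* whole RTP is suspect */
    CKPT_EV_RECOVERED,         /* rollback or restart completed */
    CKPT_EV_CHECKPOINT_TAKEN   /* fresh checkpoint committed at a safe point */
} ckpt_event_t;

typedef struct {
    ckpt_fsm_state_t state;
    unsigned rollbacks;        /* rollbacks since the last fresh checkpoint */
    unsigned maxRollbacks;     /* hypothetical retry budget */
    int      checkpointValid;  /* nonzero if a usable checkpoint exists */
} ckpt_fsm_t;

/* Advance the FSM by one event and return the resulting state. */
ckpt_fsm_state_t ckptFsmStep(ckpt_fsm_t *f, ckpt_event_t ev)
{
    switch (ev) {
    case CKPT_EV_TASK_FAULT:
        if (f->checkpointValid && f->rollbacks < f->maxRollbacks) {
            f->rollbacks++;
            f->state = CKPT_ROLLBACK;
        } else {
            f->state = CKPT_RESTART;  /* budget exhausted or nothing to roll back to */
        }
        break;
    case CKPT_EV_RTP_FAULT:
        f->state = CKPT_RESTART;
        break;
    case CKPT_EV_RECOVERED:
        f->state = CKPT_RUNNING;      /* budget is not reset here */
        break;
    case CKPT_EV_CHECKPOINT_TAKEN:
        f->checkpointValid = 1;
        f->rollbacks = 0;             /* forward progress: reset the budget */
        break;
    }
    return f->state;
}
```

Resetting the budget only when a new checkpoint is committed means a fault that recurs from the same checkpoint eventually escalates instead of looping, which keeps the recovery path bounded.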

📜 Certification Alignment

DO-178C (Avionics)

| DO-178C Objective           | Mapping                      |
|-----------------------------|------------------------------|
| Determinism                 | Explicit checkpoint points   |
| Error containment           | RTP isolation                |
| No unintended functionality | No kernel modification       |
| Verification                | Simics + replayable recovery |

Can support a Level A or Level B argument when combined with partitioning.


IEC 61508 (Industrial Safety)

| IEC 61508 Concept   | Mapping                     |
|---------------------|-----------------------------|
| Fault detection     | Health monitor + exceptions |
| Fault reaction      | Task rollback               |
| Safe state          | Checkpoint-defined          |
| Diagnostic coverage | Explicit state capture      |

The middleware qualifies as an application-level safety mechanism.


🆚 RTP Checkpointing vs ARINC 653 Partition Restart

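| Aspect          | ARINC 653 | RTP + Checkpoint       |
|-----------------|-----------|------------------------|
| Recovery unit   | Partition | Task                   |
| Restart type    | Cold      | Stateful               |
| State retention | None      | Full                   |
| Flexibility     | Low       | High                   |
| Certification   | Strong    | Strong (with argument) |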

Key Insight: ARINC 653 guarantees isolation. RTP checkpointing guarantees continuity.

They are complementary, not competing.


🧪 Validation and Tooling

  • Original PowerPC prototype: millisecond-level recovery

  • VxWorks 7 adds:

    • MMU safety
    • RTP restart hooks
    • Simics full-system checkpoints

Simics checkpoints validate runtime checkpoint correctness under controlled fault injection.


⚠️ Known Constraints (Engineering Reality)

  • No transparent socket rollback
  • No ISR rollback
  • Cooperative checkpointing required
  • Runtime overhead ~10–20%

All constraints are explicit, documentable, and certifiable.


🏁 Final Takeaway

VxWorks 7 gave us isolation. The legacy VxWorks design gave us memory of the past.

Combined, they enable something rare in embedded systems:

A system that fails, heals, and continues, without forgetting who it was.

This is not a workaround. This is stateful resilience by design.

If your system must survive faults without losing its mission, this architecture is no longer optional; it's inevitable.
