Methods and Tools of Fault Injection

2025-01-06 17:38:31 digiproto 4

Reliability is a key attribute of software quality, and reliability requirements are especially strict for Safety Critical Systems (SCS), where failures can cause significant harm to life, property, or the environment. For example, in the automotive industry, ISO 26262 requires that ASIL D systems have a failure rate of less than 10 FIT (Failures in Time). One FIT is one failure per billion device-hours of operation, so 10 FIT means fewer than 10 failures per billion device-hours, for example across 1,000 units each operating for one million hours.
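
As a quick check of the arithmetic, the sketch below takes 1 FIT to mean one failure per 10^9 device-hours and converts a FIT budget into a per-hour rate and an expected failure count for a fleet. The function names and the fleet figures are illustrative assumptions, not values taken from ISO 26262.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_failures_per_hour(fit: float) -> float:
    """Convert a FIT value into a failure rate per device-hour."""
    return fit / HOURS_PER_FIT_UNIT


def expected_failures(fit: float, units: int, hours_per_unit: float) -> float:
    """Expected number of failures for a fleet, assuming a constant rate."""
    return fit * units * hours_per_unit / HOURS_PER_FIT_UNIT


rate = fit_to_failures_per_hour(10)
fleet = expected_failures(10, units=1000, hours_per_unit=1_000_000)
print(rate)   # failure rate per device-hour at 10 FIT
print(fleet)  # expected failures for 1,000 units x 1,000,000 hours
```

With these numbers, 1,000 units running one million hours each accumulate exactly one billion device-hours, so a 10 FIT budget corresponds to 10 expected failures.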


Among the various methods for testing software reliability, one common approach is fault injection testing. This involves selecting a fault injection tool based on a chosen fault model, artificially injecting faults into the system, and collecting the system’s response data for reliability analysis.


Academically, fault injection technology can be divided into three main categories based on the method: hardware-based fault injection, simulation-based fault injection, and software-based fault injection. In practical applications, fault injection methods are generally divided into simulation-based and prototype-based fault injection, depending on the application scenario.


This article will detail the methods and tools of various fault injection technologies from both academic and practical application perspectives.


1. Academic Field

In academic research, fault injection is classified into hardware-based, simulation-based, and software-based categories. Because simulation-based fault injection, with its own sub-categories, is covered in the practical application section, this section discusses only the hardware-based and software-based methods.

1.png

▲Fault Injection System


A fault injection system typically consists of a target system and a fault input system. Physically, the controller is a program that can run on the target system itself or on a separate computer. The fault injector is custom hardware or software that supports different fault types, locations, and times, adapted to the relevant hardware semantics or software structure; the content it injects comes from a fault library. In the diagram, the fault library is a separate component, which gives it greater flexibility and portability.


In the initial choice between software and hardware fault injection, the deciding factor is mainly the type of fault. For stuck-at faults (a point in the circuit forced to remain at a constant value), hardware injection is preferable; for data corruption faults, software injection is sufficient. Other faults, such as bit flips in storage units, can be injected by either method, and cost, accuracy, and repeatability determine the final choice.
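
To make the difference between a bit flip and a stuck-at fault concrete, here is a minimal sketch on a single 8-bit storage cell. The `Cell` class is a hypothetical model written for this article: a bit flip corrupts the stored value once and disappears on the next write, whereas a stuck-at fault keeps forcing the bit on every write.

```python
class Cell:
    """Toy storage cell that supports transient and permanent faults."""
    def __init__(self, width: int = 8) -> None:
        self.mask = (1 << width) - 1
        self.value = 0
        self.stuck: dict[int, int] = {}  # bit position -> forced value

    def _apply_stuck(self, value: int) -> int:
        for bit, forced in self.stuck.items():
            if forced:
                value |= 1 << bit
            else:
                value &= ~(1 << bit)
        return value

    def write(self, value: int) -> None:
        self.value = self._apply_stuck(value & self.mask)

    def read(self) -> int:
        return self.value

    def flip(self, bit: int) -> None:
        """Transient fault: invert one bit of the current content."""
        self.value ^= 1 << bit

    def stick(self, bit: int, forced: int) -> None:
        """Permanent fault: the bit reads as `forced` from now on."""
        self.stuck[bit] = forced
        self.value = self._apply_stuck(self.value)


cell = Cell()
cell.write(0b1010)
cell.flip(0)
after_flip = cell.read()     # bit 0 inverted
cell.write(0b1010)
after_rewrite = cell.read()  # the next write removes the transient fault

cell.stick(1, 0)
after_stick = cell.read()    # bit 1 forced to 0
cell.write(0b1111)
after_write = cell.read()    # bit 1 is still 0 after a fresh write
print(bin(after_flip), bin(after_rewrite), bin(after_stick), bin(after_write))
```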


Fault types suitable for hardware fault injection:

· Open or short circuit faults

· Bridge faults

· Stuck-at faults (where a signal in a circuit is uncontrollable and fixed at a single value)

· Stray current faults (currents flowing due to leakage outside the designed circuit)

· Surge faults (where voltage temporarily exceeds normal operating levels)


Fault types suitable for software fault injection:

· Storage data corruption (e.g., in registers, memory, or disks)

· Communication data loss (e.g., in buses and communication networks)

· Software defects (machine-level or higher-level software defects)


1.1 Hardware Fault Injection

Hardware fault injection requires additional hardware to introduce faults into the target system’s hardware. Based on the type and location of the fault injection, it can be divided into contact and non-contact categories.


Contact Fault Injection: This method makes direct contact with the pins of the target system's hardware. Also known as "pin-level injection," it is the most common form of hardware fault injection. Active probes or socket insertion are used to change the current and voltage at the pins, with minimal impact on the rest of the target system.


Non-Contact Fault Injection: This method has no direct physical contact with the target system. Faults are introduced by an external source, such as heavy-ion radiation passing through the depletion region of the target device, or an electromagnetic field. Although this mimics natural physical phenomena, the exact time and location of the injected fault are difficult to control, because heavy-ion emission and electromagnetic fields cannot be precisely controlled.


Common hardware fault injection tools include:


Messaline: Developed by the Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) of the French National Center for Scientific Research (CNRS), Messaline has been successfully applied to a centralized interlocking system for railway control and to a distributed system in the "Esprit Delta-4" project.

2.png

▲Messaline System Architecture


FIST (Fault Injection System for Study of Transient Fault Effect): Developed by Chalmers University of Technology in Sweden, FIST is used to study transient fault effects and supports both contact and non-contact methods to simulate transient faults within target systems.

3.png

▲FIST Environment Setup


MARS (Maintainable Real-Time System): Developed by Vienna University of Technology, MARS is a fault-tolerant distributed system architecture. Its fault injection experiments use heavy-ion radiation, as FIST does, and additionally use electromagnetic fields for non-contact fault injection.


1.2 Software Fault Injection

Software fault injection has gained considerable attention in recent years due to its ability to test parts of the system that are inaccessible to hardware fault injection, such as applications and operating systems. While software fault injection is more flexible, it comes with risks, including limited access to certain locations, increased system workload, and the potential alteration of the original software structure.


Software fault injection can be divided into compile-time and runtime fault injection based on when the fault is introduced. Some commonly used tools include:
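
The distinction between the two can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: the "compile-time" fault is modelled by rewriting the program's syntax tree before it is compiled, and the "runtime" fault by corrupting data while the unmodified program runs. The `checksum` workload and the helper names are hypothetical.

```python
import ast

SOURCE = """
def checksum(values):
    total = 0
    for v in values:
        total = total + v
    return total
"""


class AddToSub(ast.NodeTransformer):
    """Compile-time fault: replace every '+' with '-' in the program."""
    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def build(source: str, mutate: bool = False):
    """Compile the workload, optionally injecting the fault beforehand."""
    tree = ast.parse(source)
    if mutate:
        tree = ast.fix_missing_locations(AddToSub().visit(tree))
    namespace: dict = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["checksum"]


def corrupt_at_runtime(values, index: int, bit: int):
    """Runtime fault: flip one bit of one input while it is being read."""
    for i, v in enumerate(values):
        yield v ^ (1 << bit) if i == index else v


data = [1, 2, 3]
golden = build(SOURCE)
faulty_build = build(SOURCE, mutate=True)

reference = golden(data)
compile_time_result = faulty_build(data)
runtime_result = golden(corrupt_at_runtime(data, index=1, bit=2))
print(reference, compile_time_result, runtime_result)
```

In the compile-time case the fault is part of the program image and is present in every run; in the runtime case the program is unchanged and a trigger decides when the corruption happens.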


Ferrari (Fault and Error Automatic Real-Time Injection): Developed by the University of Texas at Austin, Ferrari injects faults into CPUs, memory, and buses. It is composed of an initializer and activator, user information, fault injectors, and data collection analyzers.


Ftape (Fault Tolerance and Performance Evaluator): Developed by the University of Illinois, Ftape injects faults as bit flips in the CPU's user-accessible registers, in memory locations, and in the disk subsystem.

4.png

▲Ferrari Environment Setup


Doctor (Integrated Software Fault Injection Environment): Developed by the University of Michigan, Doctor triggers fault injection using timeout, trap, and code modification methods, allowing the injection of faults in CPU, memory, and network communications.


Xception: Developed by the University of Coimbra, Xception injects more realistic faults through advanced debugging and performance monitoring functions in modern processors. Faults are triggered based on specific address accesses, offering better repeatability. Faults can be triggered by events such as:

1. Fetching an opcode or loading operands from a specified address

2. Storing operands to a specified address

3. A specific duration after startup

4. Combinations of the above fault triggers
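
The trigger types listed above can be sketched on a toy memory model. This is not Xception's interface: the `Access` record, the trigger helpers, and the `Memory` class are hypothetical, and "duration after startup" is approximated by a count of memory accesses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Access:
    kind: str     # "load" or "store"
    address: int
    tick: int     # number of accesses since startup


Trigger = Callable[[Access], bool]


def on_load(address: int) -> Trigger:
    return lambda a: a.kind == "load" and a.address == address


def on_store(address: int) -> Trigger:
    return lambda a: a.kind == "store" and a.address == address


def after_ticks(n: int) -> Trigger:
    return lambda a: a.tick >= n


def all_of(*triggers: Trigger) -> Trigger:
    """Combination of triggers: fire only when every one matches."""
    return lambda a: all(t(a) for t in triggers)


class Memory:
    """Toy memory that flips one bit, once, when the trigger fires."""
    def __init__(self, size: int, trigger: Trigger, bit: int) -> None:
        self.cells = [0] * size
        self.trigger = trigger
        self.bit = bit
        self.tick = 0
        self.injected_at: int | None = None

    def _maybe_inject(self, kind: str, address: int) -> None:
        self.tick += 1
        access = Access(kind, address, self.tick)
        if self.injected_at is None and self.trigger(access):
            self.cells[address] ^= 1 << self.bit
            self.injected_at = self.tick

    def load(self, address: int) -> int:
        self._maybe_inject("load", address)
        return self.cells[address]

    def store(self, address: int, value: int) -> None:
        self.cells[address] = value
        self._maybe_inject("store", address)


mem = Memory(4, trigger=all_of(on_load(2), after_ticks(3)), bit=0)
mem.store(2, 8)       # tick 1: a store, so the load trigger does not match
first = mem.load(2)   # tick 2: right address, but too early
second = mem.load(1)  # tick 3: late enough, but wrong address
third = mem.load(2)   # tick 4: both conditions hold, bit 0 is flipped
print(first, second, third, mem.injected_at)
```

Because the trigger depends on the access sequence rather than on wall-clock time, rerunning the same workload injects the fault at the same point, which is the repeatability property mentioned above.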


1.3 Comparison of Software and Hardware Fault Injection

5.png


The comparison between hardware and software fault injection shows that the main differences lie in the accessible fault injection points, cost, and level of intrusiveness. Hardware methods can inject faults into chip pins and internal components (such as combinational circuits and registers) that software fault injection cannot reach, which makes them useful for evaluating low-level error detection and masking mechanisms. Software fault injection directly modifies software-visible state (e.g., memory and registers), making it more suitable for testing higher-level mechanisms, but it is more intrusive when it runs directly on the target system.


2. Practical Application

In practical applications, fault injection methods can be divided into simulation-based and prototype-based fault injection based on the application scenario.


Prototype-based Fault Injection: This involves injecting faults into an actual software or hardware prototype, and is mainly suited to research settings. For large and complex application scenarios, such as those in the automotive, aviation, and aerospace industries, producing the hardware and software is expensive, so simulation-based fault injection, which is more cost-effective and efficient, is more popular.


Simulation-based Fault Injection: This method involves creating a "digital twin" model of the target system and introducing faults into the model to observe the results. This approach does not require specific hardware devices and does not damage the target system, making it useful for evaluating fault tolerance mechanisms and system reliability. However, developing simulation models for large and complex devices is a massive task, which many companies find challenging.


SkyEye, a fully digital real-time simulation software, effectively addresses the disadvantages of simulation-based fault injection. As a hardware-behavior-level simulation platform based on visual modeling, SkyEye makes it easy to construct complex models and quickly implement a digital twin model.


Using the fully digital real-time simulation model built with SkyEye, users can perform fault injections in the following modes:


1. Physical Layer Fault Mode includes open-circuit control, short-circuit control, signal crosstalk, noise signals, series and parallel impedance control, etc., to simulate common line faults on communication buses.


In SkyEye, bus devices are designed as independent modules, which are created and connected to the memory bus in the virtual target hardware script. For the open-circuit fault mode, a bus fault handling module inside the bus device module establishes a communication interface with external data stimulation software, through which open-circuit commands are sent to the bus device. After receiving such a command, the device stops processing all address read/write requests from the memory bus.


The design of SkyEye's open-circuit fault simulation is as follows:

6.png


Through the API that SkyEye exposes to Python scripts, the memory monitoring interface can be invoked to monitor memory. When the monitored memory is accessed, the registered callback interfaces are executed to perform the required functions.


Fault simulation can be achieved by registering fault simulation callback interfaces through this API. When a memory address is set to a disconnection fault, any processor access to this address triggers fault information output via the callback interface. Users can also pause the project execution by invoking SkyEye's runtime control interface.
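
The callback mechanism can be sketched as follows. The names here are hypothetical and are not SkyEye's actual Python API; the sketch only shows the pattern of registering a callback on an address, reporting the fault when the address is accessed, and pausing execution.

```python
class MonitoredMemory:
    """Toy memory with per-address access callbacks and a run flag."""
    def __init__(self, size: int) -> None:
        self.cells = [0] * size
        self.watchers: dict[int, list] = {}
        self.running = True

    def watch(self, address: int, callback) -> None:
        """Register a callback that runs on every access to `address`."""
        self.watchers.setdefault(address, []).append(callback)

    def stop(self) -> None:
        """Stand-in for a runtime control call that pauses the simulation."""
        self.running = False

    def _notify(self, kind: str, address: int) -> None:
        for callback in self.watchers.get(address, []):
            callback(self, kind, address)

    def read(self, address: int) -> int:
        self._notify("read", address)
        return self.cells[address]

    def write(self, address: int, value: int) -> None:
        self._notify("write", address)
        self.cells[address] = value


log: list[str] = []


def disconnection_fault(mem: MonitoredMemory, kind: str, address: int) -> None:
    """Fault callback: report the access and pause the run."""
    log.append(f"fault: {kind} of disconnected address {address:#x}")
    mem.stop()


mem = MonitoredMemory(16)
mem.watch(0x8, disconnection_fault)

mem.write(0x4, 1)  # unmonitored address: nothing happens
still_running = mem.running
mem.read(0x8)      # monitored address: callback reports and pauses
print(log, mem.running)
```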


2. Electrical Layer Fault Mode includes output amplitude adjustment, duty cycle adjustment, signal delay adjustment, rising and falling edge tuning, slope adjustment, and glitch simulation.


In real hardware, electrical layer fault injection is achieved via a bus signal generator composed of high-speed DAC chips, SRAM, and operational amplifiers. The adjustment of output voltage amplitude is realized through DACs and operational amplifiers, duty cycle control is achieved via an FPGA controlling the DAC, and signal delay is implemented by buffering sampled data.
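
These adjustments can be pictured on a sampled waveform. The sketch below is a simplified software model, not a description of the hardware generator: the signal is a list of samples, and the function names and sample values are assumptions made for illustration.

```python
def square_wave(period: int, duty: float, cycles: int,
                high: float = 1.0, low: float = 0.0) -> list[float]:
    """Sampled square wave; `duty` is the fraction of each period spent high."""
    on = round(period * duty)
    return [high if (i % period) < on else low
            for i in range(period * cycles)]


def scale_amplitude(samples: list[float], gain: float) -> list[float]:
    """Output amplitude adjustment: scale every sample."""
    return [s * gain for s in samples]


def delay(samples: list[float], n: int, idle: float = 0.0) -> list[float]:
    """Signal delay: buffer the samples and emit them n slots later."""
    return ([idle] * n + list(samples))[:len(samples)]


def add_glitch(samples: list[float], index: int, width: int,
               level: float) -> list[float]:
    """Glitch simulation: force a short run of samples to `level`."""
    out = list(samples)
    for i in range(index, min(index + width, len(out))):
        out[i] = level
    return out


nominal = square_wave(period=4, duty=0.5, cycles=2)
narrow = square_wave(period=4, duty=0.25, cycles=2)  # duty cycle fault
weak = scale_amplitude(nominal, 0.5)                 # amplitude fault
late = delay(nominal, 1)                             # delay fault
glitched = add_glitch(nominal, index=2, width=1, level=1.0)
for name, wave in [("nominal", nominal), ("narrow", narrow), ("weak", weak),
                   ("late", late), ("glitched", glitched)]:
    print(f"{name:9s}", wave)
```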


In the SkyEye virtual simulation environment, users can adjust the output voltage amplitude, level values, edge timings, duty cycles, and so on by configuring bus registers. Electrical layer fault injection encompasses register fault injection and memory fault injection.

7.png

▲Register Fault Injection


8.png

▲Memory Fault Injection


3. Protocol Layer Fault Mode includes command word, data word, and status word parity errors, encoding errors, and data substitution.


Protocol layer fault injection is implemented based on different bus protocols. By analyzing the data transmitted over the bus, the type of transmission information and corresponding fault injection processing can be determined. Similar to physical layer fault modes, protocol layer fault injection can also utilize data stimulation software provided by SkyEye to inject bus communication protocol data.
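
As an illustration of two of these fault modes, the sketch below injects a parity error and a data substitution into a bus word. The source does not name a specific protocol, so the frame layout is an assumption: a 16-bit word followed by a single odd-parity bit. All function names are hypothetical.

```python
WORD_MASK = 0xFFFF


def odd_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits odd."""
    ones = bin(word & WORD_MASK).count("1")
    return 0 if ones % 2 == 1 else 1


def encode(word: int) -> int:
    """Build a 17-bit frame: 16 data bits followed by the parity bit."""
    return ((word & WORD_MASK) << 1) | odd_parity_bit(word)


def parity_ok(frame: int) -> bool:
    """Receiver-side check: a valid frame has an odd number of 1 bits."""
    return bin(frame & 0x1FFFF).count("1") % 2 == 1


def payload(frame: int) -> int:
    return (frame >> 1) & WORD_MASK


def inject_parity_error(frame: int) -> int:
    """Parity fault: invert the parity bit, leave the data untouched."""
    return frame ^ 1


def substitute_data(frame: int, new_word: int) -> int:
    """Data substitution: replace the payload and recompute the parity."""
    return encode(new_word)


good = encode(0x1234)
bad_parity = inject_parity_error(good)
substituted = substitute_data(good, 0x00FF)

print(hex(good), parity_ok(good))
print(hex(bad_parity), parity_ok(bad_parity))
print(hex(substituted), parity_ok(substituted), hex(payload(substituted)))
```

The two faults exercise different defences: the parity error should be caught by the receiver's parity check, while the substituted word passes that check and can only be detected by higher-level plausibility checks.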



