
Introduction to Embedded Computer Systems 4 System Failures and their Effects

Any embedded computer system, no matter how well designed, will go wrong at some time during its life, due to a: hardware fault; software fault; transient fault; permanent fault.

Real-time system failures will happen. The designer must be concerned with the consequences of such faults and failures, with how the system copes with them, and with why the problems arose in the first place. All real-time software must be designed in a professional manner, to handle both foreseen and unforeseen program malfunctions (exceptions).

It can be useful to think of real-time embedded computer systems as operating within domains of behaviour:

The Operational Domain (designed for normal operation): the totality of points of the state space which the system might visit in the course of its normal operation, where it must display the attributes specified by the requirements.

The Domain of Tolerable Stress (designed for fault-tolerant operation): the totality of points of the state space in which the system must survive without damage, and from which it must be able to recover its normal behaviour on return to the operational domain.

The Domain of Excess Stress: the area outside the operational and tolerable stress domains, where the system's behaviour and safety cannot be guaranteed. The system must therefore be externally protected against encountering excess stress.
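The three domains can be illustrated with a single monitored input. The following is a minimal sketch only: the temperature limits and the names `Domain`, `OPERATIONAL_RANGE`, `TOLERABLE_RANGE` and `classify` are assumptions made for illustration and do not come from these notes. A real system would classify a multi-dimensional state, not one reading.

```python
from enum import Enum


class Domain(Enum):
    OPERATIONAL = "operational"
    TOLERABLE_STRESS = "tolerable stress"
    EXCESS_STRESS = "excess stress"


# Hypothetical limits for a temperature input, in degrees Celsius.
OPERATIONAL_RANGE = (0.0, 85.0)     # normal operation, as per requirements
TOLERABLE_RANGE = (-20.0, 105.0)    # must survive here and be able to recover


def classify(temperature_c: float) -> Domain:
    """Return the domain of behaviour that a single reading falls into."""
    low, high = OPERATIONAL_RANGE
    if low <= temperature_c <= high:
        return Domain.OPERATIONAL
    low, high = TOLERABLE_RANGE
    if low <= temperature_c <= high:
        return Domain.TOLERABLE_STRESS
    # Everything else is the rest of the state space.
    return Domain.EXCESS_STRESS
```

For example, `classify(25.0)` falls in the operational domain, `classify(95.0)` in the domain of tolerable stress, and `classify(150.0)` in the domain of excess stress, where only external protection can help.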

Tutor Version


intro2ecs_04_tv.docx


Domains of Behaviour

[Figure: Domains of Behaviour. Labels shown: Operational Domain (normal operations as specified in the requirements); Tolerable Stress Domain (from which safe recovery is possible); Excess Stress Domain (the rest of the state space); Service Region; Non-Service Region. Annotation: extra operations could be incorporated into the specifications.]



Testability

Part of the designer's function is to distinguish between the domains of behaviour, and to determine the causes of transition between domains. Test cases have to be selected for: the average working situations; boundary situations; stress situations which take the system 'out of bounds'. Software must be designed for ease of testing (i.e. its testability must be high).

Software Packages

If software packages are 'bought in' for a real-time system, their quality must be carefully assessed. E.g.: users of a well-known PC operating system often experience unpredictable behaviour, including total hang-up. Could this piece of software be trusted for real-time embedded computer applications? Controlling an aircraft?

Coping with System Faults

Three options are open to the designer of a real-time system:

1. Where no recovery action is possible, the system is put into a 'fail-safe' condition.
2. The system continues to operate, but with reduced service. This may be achieved, say, by relaxing response times or by servicing only the "good" elements of the system (those that are operating correctly). Such systems are said to offer "graceful degradation".
3. Full and safe system performance is maintained in the presence of faults (full fault tolerance).
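The choice between the three options can be sketched as a small decision function. This is an illustrative sketch, not a prescribed design: the names `Mode` and `select_mode`, the idea of counting spare units, and the `min_channels` threshold are all assumptions introduced here.

```python
from enum import Enum
from typing import List, Tuple


class Mode(Enum):
    FULL_SERVICE = "full service"   # option 3: faults are masked
    DEGRADED = "degraded"           # option 2: graceful degradation
    FAIL_SAFE = "fail-safe"         # option 1: no recovery possible


def select_mode(channel_ok: List[bool], spares: int,
                min_channels: int = 1) -> Tuple[Mode, List[int]]:
    """Pick an operating mode from the health of each channel.

    channel_ok[i] is True when channel i is operating correctly.
    spares is the number of (hypothetical) standby units available.
    Returns the mode and the indices of the original "good" channels.
    """
    good = [i for i, ok in enumerate(channel_ok) if ok]
    failed = len(channel_ok) - len(good)
    if failed <= spares:
        # Every failed channel can be replaced, so service is unaffected.
        return Mode.FULL_SERVICE, good
    if len(good) >= min_channels:
        # Keep running, but service only the elements that still work.
        return Mode.DEGRADED, good
    return Mode.FAIL_SAFE, []
```

With three channels and no spares, `select_mode([True, False, True], spares=0)` returns the degraded mode serving channels 0 and 2. The same fault with one spare available is masked and full service continues.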



Interfacing

The range of devices which interface to embedded processors is very large. Signals may be analogue, requiring conversion to digital before being supplied to the processor, or digital, in which case they can be used without conversion. Transducers may be used which encode physical quantities in the form of voltage, current or frequency. In all but the smallest of real-time systems, the size of the hardware and its cost are dominated by the interface electronics.

A popular system design strategy, which tries to cope with failures, involves replicating the processor within the system.
Example: the SWIFT banking system (master/slave processors).

When the processor itself is the major item in the system, the back-up processor method of coping with failures is both feasible and sensible. However, using this approach with Input/Output (I/O) dominated systems introduces much complexity and makes much less sense.
Which transducer do you believe if one has gone out of calibration? Do you need a third? Majority voting?
[Figure: a main processor connected through two interfaces to two transducers.]
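The question of a third transducer can be made concrete. With only two readings a disagreement can be detected but not resolved; with three, a two-out-of-three vote can pick a value and point at the suspect. The sketch below is one possible voter, using the median of three readings and an agreement tolerance. The function name `vote_2oo3` and the tolerance value are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def vote_2oo3(a: float, b: float, c: float,
              tolerance: float) -> Tuple[Optional[float], List[int]]:
    """Two-out-of-three vote over redundant transducer readings.

    Returns (voted_value, suspect_indices). The voted value is the
    median reading, accepted only if at least one other reading lies
    within `tolerance` of it. If all three disagree there is no
    majority and the result is (None, []).
    """
    readings = [a, b, c]
    median = sorted(readings)[1]
    agreeing = [i for i, r in enumerate(readings)
                if abs(r - median) <= tolerance]
    if len(agreeing) < 2:
        return None, []          # no majority: cannot trust any reading
    suspects = [i for i in range(3) if i not in agreeing]
    return median, suspects


print(vote_2oo3(20.1, 20.0, 35.0, tolerance=0.5))   # (20.1, [2])
print(vote_2oo3(10.0, 20.0, 30.0, tolerance=0.5))   # (None, [])
```

The median is used because a single wildly wrong reading cannot drag it away from the other two, whereas it would distort an average. Each extra transducer also needs its own interface, which is why replication is costly in I/O-dominated systems.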

Conventional exception handling schemes are usually concerned with detecting internal (program) problems: stack overflow; array bound violations; arithmetic overflows.


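Program Exceptions

Stack Overflow
An attempt to place data outside of the stack boundary.
[Figure: the stack occupying a region of RAM, with a write (marked X) landing beyond the stack boundary.]

Array Bound Violation
An attempt to place data outside of the array boundary.
[Figure: an array occupying a region of RAM, with a write landing beyond the array boundary.]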
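Both exceptions amount to a write outside a fixed region of memory, and both can be detected by checking the boundary before the write. The sketch below models this in Python for illustration; in a compiled embedded language the check would be done by the compiler, the run-time system or the hardware. The names `StackOverflow`, `BoundedStack` and `checked_store` are assumptions, not part of any library.

```python
class StackOverflow(Exception):
    """Raised when a push would go past the stack boundary."""


class BoundedStack:
    """A stack with a fixed capacity, as on a small embedded target."""

    def __init__(self, capacity: int) -> None:
        self._cells = [0] * capacity   # fixed block standing in for RAM
        self._top = 0                  # index of the next free cell

    def push(self, value: int) -> None:
        if self._top >= len(self._cells):
            raise StackOverflow("push beyond stack boundary")
        self._cells[self._top] = value
        self._top += 1

    def pop(self) -> int:
        if self._top == 0:
            raise IndexError("pop from empty stack")
        self._top -= 1
        return self._cells[self._top]


def checked_store(array: list, index: int, value: int) -> None:
    """Store into a fixed-size array, rejecting any out-of-bounds index."""
    # Negative indices are rejected explicitly, because Python would
    # otherwise wrap them around to the end of the list.
    if not 0 <= index < len(array):
        raise IndexError("array bound violation at index %d" % index)
    array[index] = value
```

Raising an exception at the boundary turns silent memory corruption into a fault the program can detect and handle.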

Arithmetic Overflow
Using two's complement arithmetic, 1 byte can store integers in the range -128 to +127. (The MSB is used to represent the sign.) A number outside this range will overflow, leading to incorrect results.

Decimal    Binary
-128       1000 0000
-1         1111 1111
0          0000 0000
1          0000 0001
127        0111 1111
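The effect of overflow can be shown by simulating 8-bit two's complement arithmetic. Python integers do not overflow by themselves, so the sketch below wraps results to one byte explicitly. The helper names `to_int8`, `add_int8` and `bits` are assumptions made for this illustration.

```python
def to_int8(value: int) -> int:
    """Wrap an integer into the 8-bit two's complement range -128..127."""
    value &= 0xFF                      # keep only the low 8 bits
    return value - 256 if value >= 128 else value


def add_int8(a: int, b: int):
    """Add two 8-bit values; return (stored_result, overflowed)."""
    exact = a + b
    stored = to_int8(exact)
    return stored, stored != exact


def bits(value: int) -> str:
    """Return the 8-bit pattern that stores `value`."""
    return format(value & 0xFF, "08b")


print(add_int8(100, 27))    # (127, False)
print(add_int8(127, 1))     # (-128, True)
print(add_int8(-128, -1))   # (127, True)
print(bits(-1), bits(-128)) # 11111111 10000000
```

Adding 1 to 127 gives the bit pattern 1000 0000, which is read back as -128. The result is wrong but looks like valid data unless the overflow is detected.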



A new range of problems arises with real-time systems: sensor failure; illegal actions by the operator; electrical interference.

Detecting such faults is one thing; deciding what to do subsequently can be an even more difficult problem. Exception handling strategies need very careful consideration in order to avoid system or environmental damage, or injury to personnel, when faults occur.

Consider the problems involved in designing software to control a nuclear reactor.
"Fail safe"; "Manual override"; "Duplication of software"; "Check and double check readings"; "Fast response"; "Loss of power back-up systems".

