High reliability fault tolerant digital systems in nanometric technologies: characterization and design methodologies
Partners
- Università degli Studi di ROMA "Tor Vergata" (coordinator)
- Politecnico di MILANO
- Università di BOLOGNA
- Università di PADOVA
- Politecnico di Torino
Abstract
The latest generation of technologies, known as Deep Submicron (DSM),
with a core voltage of 1.2 V, up to 11 metallization levels and a gate
length below 80 nanometers, has led to the development of very complex
electronic devices on a single chip, the "System on Programmable Chip"
(SoPC). Each SoPC contains an FPGA, one or more microprocessors
realized directly on silicon or implemented on an FPGA using an HDL
macro, some logic circuits, some wired logic and, possibly, some
arithmetic processors, all connected together. These devices offer
very high computing power and great flexibility, since both the SoPC
blocks and the interconnecting lines can be reprogrammed. Moreover,
reconfigurable architectures with a coarser grain than FPGAs have
recently been proposed; they overcome some limits of FPGAs, such as
power consumption and the delays introduced by interconnections,
especially on external buses. At the same time, these extremely advanced
characteristics have the drawback of reducing the reliability of the
resulting circuit, because of: a) permanent faults caused by the ageing
of the materials used to realize the chip, the breaking of signal
lines due to electromigration, or transistor gate rupture;
b) transient faults, known as Single Event Effects (SEE). These
effects are mainly caused by ionizing radiation of various origins,
and their frequency will grow rapidly in new-generation devices and
systems. They have been widely studied for
space applications and aircraft electronics; in DSM circuits these
effects will also be present at sea level, together with other effects
such as Multiple Bit Upsets (MBU), crosstalk between adjacent lines
and noise on the core voltages. To counter the effects of these
faults, fault avoidance or fault tolerance techniques can be used.
The former operates directly on the
manufacturing process, for instance through the Silicon on Insulator
technological process, but it has high costs and often cannot be used
because it reduces the performance of the resulting system, since
better reliability often requires the use of older technologies. The
fault tolerance approach requires a careful analysis of the type of
fault and of the resource affected, with the goal of preserving the
correct functionality of the whole SoPC without reducing system
performance, since fault tolerance depends on the proposed
architecture. The
research program addresses the problem as a whole. First, the
architectures of SoPCs will be studied in order to define the platform
that will be used to develop the techniques for increasing the
reliability of the SoPC, taking possible faults into account.
Afterwards, the fault models and their simulation on SoPCs will be
studied exhaustively, also using the results of experimental tests
under radiation: the program foresees tests with neutrons and heavy
ions in international laboratories where typical environments can be
reproduced, in order to determine rules for future qualification. At
the same time
error detection and correction (EDAC) techniques for the macro blocks
within the SoPC will be studied: fault injection techniques will be
considered and also evaluated by means of the radiation tests. The
EDAC methods, studied and developed for each macro block, will be
integrated in a single framework with the goal of increasing the
fault tolerance
of the systems. Alternative architectures, possibly reconfigurable,
will be proposed, with the goal of lowering the cost of the required
redundancy without reducing system performance. This goal can be
pursued by using, as far as possible, Commercial Off The Shelf (COTS)
devices and by avoiding multiplication techniques. The final result
of the project will be a system, a SoPC-based case study chosen during
the project, able to demonstrate that the developed methodologies
increase the overall reliability of systems implemented using SoPCs
without reducing performance.
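
As a purely illustrative aside, the interplay between EDAC and fault
injection mentioned above can be sketched in a few lines of Python. The
abstract does not say which codes the project will use; the Hamming(7,4)
code and the function names below (hamming74_encode, hamming74_decode,
inject_upset) are assumptions chosen only because they make a compact
example of correcting a single bit upset.

```python
# Illustrative sketch only: Hamming(7,4) is an assumed example code,
# not a technique named in the project abstract.

def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword.

    Bit i of the result holds codeword position i + 1 (positions 1..7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def inject_upset(word: int, bit_index: int) -> int:
    """Model a single bit upset by flipping one bit (0-based index)."""
    return word ^ (1 << bit_index)


def hamming74_decode(word: int) -> tuple:
    """Return (corrected nibble, syndrome).

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based codeword position that was flipped and has been corrected.
    """
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single flipped bit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    codeword = hamming74_encode(0b1011)
    corrupted = inject_upset(codeword, 4)  # flip codeword position 5
    print(hamming74_decode(corrupted))
```

A code of this kind corrects only one flipped bit per codeword; a
Multiple Bit Upset within the same word would be miscorrected, which is
one reason the choice of EDAC scheme has to follow from the fault model.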