High reliability fault tolerant digital systems in nanometric technologies: characterization and design methodologies



The latest generation of Deep Submicron (DSM) technologies, with a core voltage of 1.2 V, up to 11 metallization levels and gate lengths below 80 nanometers, has led to the development of very complex electronic devices on a single chip, the "System on Programmable Chip" (SoPC). Each SoPC contains an FPGA, one or more microprocessors either realized directly in silicon or implemented on the FPGA using an HDL macro, some logic circuits, some wired logic and, possibly, some arithmetic processors, all connected together. These devices offer very high computing power and great flexibility, since both the SoPC blocks and the interconnecting lines can be reprogrammed. Moreover, reconfigurable architectures with a coarser grain than FPGAs have recently been proposed; they overcome some limits of FPGAs, such as the power consumption and the delays caused by interconnections, especially on external buses. At the same time, these extremely advanced characteristics have a drawback: the reliability of the resulting circuit decreases because of: a) permanent faults, caused by the ageing of the materials used to realize the chip, the breaking of signal lines due to electromigration, or transistor gate rupture; b) transient faults, known as Single Event Effects (SEE). Transient faults are mainly caused by ionizing radiation of various origins, and their frequency will grow rapidly in new-generation devices and systems. They have been widely studied for space applications and aircraft electronics; in DSM circuits these effects will also be present at sea level, together with other effects such as Multiple Bit Upsets (MBU), crosstalk between adjacent lines and noise on the core voltages. To cope with these faults, fault avoidance or fault tolerance techniques can be used.
Fault avoidance operates directly on the manufacturing process, for instance through the Silicon on Insulator technological process, but it has high costs and often cannot be used because it reduces the performance of the resulting system, since better reliability often requires the use of older technologies. The fault tolerance approach requires a careful analysis of the type of fault and of the resource affected, with the aim of guaranteeing the correct functionality of the whole SoPC without reducing system performance, since fault tolerance depends on the proposed architecture. The research program addresses the problem as a whole. First, the architectures of SoPCs will be studied in order to define the platform that will be used to develop the techniques for increasing the reliability of the SoPC, taking possible faults into account. Afterwards, the fault models and their simulation on SoPCs will be studied exhaustively, also using the results of experimental tests under radiation: the program foresees tests under neutrons and heavy ions in international laboratories where representative environments can be created, in order to determine rules for future qualification. At the same time, error detection and correction (EDAC) techniques for macro blocks within the SoPC will be studied; fault injection techniques will also be considered and evaluated against the radiation tests. The EDAC methods, studied and developed for each macro block, will be integrated in a single framework with the goal of increasing the fault tolerance of the systems. Alternative architectures, possibly reconfigurable, will be proposed with the goal of reducing the cost of the required redundancy without reducing system performance. This goal can be pursued by using, as far as possible, Commercial Off The Shelf (COTS) devices and by avoiding multiplication techniques.
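As a purely illustrative sketch of how an EDAC scheme and a software fault injection campaign fit together, the following Python fragment protects a 4-bit word with a Hamming(7,4) single-error-correcting code and injects a single bit flip, the simplest model of a Single Event Upset. The choice of Hamming(7,4) and the function names `hamming74_encode`, `hamming74_decode` and `inject_bit_flip` are assumptions made for this example; the abstract does not state which codes or injection tools the project will use.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data_bits, syndrome).

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the bit that the decoder flipped back.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit indicated by the syndrome
    return [c[2], c[4], c[5], c[6]], syndrome


def inject_bit_flip(code, position):
    """Hypothetical fault injector: flip one bit (0-based position)."""
    faulty = list(code)
    faulty[position] ^= 1
    return faulty


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    upset = inject_bit_flip(stored, 4)  # single upset on position 5
    recovered, syndrome = hamming74_decode(upset)
    print("recovered:", recovered, "syndrome:", syndrome)
```

Note that this code corrects only one flipped bit per codeword: two flips in the same word produce a non-zero syndrome that points at the wrong bit, so the decoder miscorrects. This is one reason why Multiple Bit Upsets call for stronger codes or for interleaving.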
The final result of the project will be a system based on an SoPC, chosen as a case study during the project, which demonstrates that the developed methodologies increase the overall reliability of systems implemented using SoPCs without reducing performance.