Friday, September 2, 2016

A space error: $370,000,000 for an integer overflow


Launch. 37 seconds of flight. KaBOOM! Ten years and 7 billion dollars turn to dust.
On June 4, 1996, four 2,600 lb satellites of the Cluster scientific program (a study of the interaction between solar radiation and Earth's magnetic field) and the heavy-lift launch vehicle Ariane 5 turned into "confetti".
The programmers were to blame for everything.
The previous model, Ariane 4, had been launched successfully more than 100 times. What could go wrong?
Apparently, to conquer space, one should know the Ada language well.

Dossier


Ariane 5 is a European expendable heavy-lift launch vehicle that is part of the Ariane rocket family. It is used to deliver payloads into geostationary transfer orbit (GTO) or low Earth orbit (LEO), and it can launch two or three satellites and up to eight microsatellites at a time.
The project history
It was developed in 1984-1995 by the European Space Agency (ESA); the main developer was the French Centre National d'Etudes Spatiales (CNES). Ten European countries took part in the program, and the project cost 7 billion US dollars (France contributed 46.2%).
About a thousand industrial firms took part in building the rocket. The prime contractor is the European company Airbus Defence and Space (a unit of Airbus Group, Paris). Ariane 5 was marketed by the French company Arianespace (Evry), with which ESA signed an agreement on November 25, 1997.
Vehicle description
Ariane 5 is a two-stage heavy-lift launch vehicle. Length: 52-53 m; maximum diameter: 5.4 m; launch mass: 775-780 tonnes (depending on the configuration).
The first stage is equipped with the Vulcain 2 liquid rocket engine ("Volcano 2"; the first three versions of the rocket used the original Vulcain), and the second stage with either HM7B (for Ariane 5 ECA) or Aestus (for Ariane 5 ES). The Vulcain 2 and HM7B engines burn hydrogen and oxygen and are manufactured by the French company Snecma (part of the Safran group, Paris).
Aestus uses storable propellants: monomethylhydrazine (MMH) fuel with a nitrogen tetroxide oxidizer. The engine was developed by the German company Daimler Chrysler Aerospace AG (DASA, Munich).
In addition, two solid rocket boosters are attached to the sides (manufacturer: Europropulsion, Suresnes, France, a joint venture between Safran Group and the Italian company Avio); they deliver more than 90% of the thrust at liftoff. In the Ariane 5 ES version, the second stage may not be used when delivering payloads to low Earth orbit.
The day after the catastrophe, the Director General of the European Space Agency (ESA) and the Chairman of the French National Centre for Space Research (CNES) ordered the formation of an independent Commission to investigate the circumstances and causes of the accident. It included well-known experts and scholars from all the interested European countries.
The Commission began its work on June 13, 1996, and on July 19 it released its exhaustive report (PDF), which immediately became available online.
The Commission had telemetry data, trajectory data, as well as recorded optical observations of the course of the flight.
The explosion occurred at an altitude of approximately 4 km, and the debris was scattered over an area of about 12 square km of savanna and surrounding swamps. The Commission studied the testimony of numerous specialists and examined the production and operational documentation.

Technical details of the accident


The position and orientation of the launcher in space were measured by the Inertial Reference System (IRS). Its built-in computer calculates angles and velocities from the information provided by the onboard Inertial Platform, which is equipped with laser gyroscopes and accelerometers. The IRS data were passed over a dedicated bus to the onboard computer, which supplied the information needed to execute the flight program and directly controlled, through hydraulic and servo mechanisms, the solid boosters and the cryogenic engines.
Redundant equipment was used to ensure the reliability of the Flight Control System. Two IRS units (one active, the other its hot standby) with identical hardware and software operated in parallel. As soon as the onboard computer detected that the active IRS had left its normal mode, it immediately switched to the other one. There were also two onboard computers.
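As a rough illustration of the hot-standby arrangement described above, the following Python sketch models two units that run identical code. The names `InertialUnit` and `read_with_failover`, and the 16-bit range check inside `step`, are assumptions made for this example; they are not taken from the flight software.

```python
class InertialUnit:
    """Hypothetical model of one inertial reference unit."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def step(self, reading: float):
        """Run one cycle; return an integer value, or None if the unit is down."""
        if not self.alive:
            return None
        # Assumed failure mode: a value outside the 16-bit range halts the unit.
        if not -32768 <= reading <= 32767:
            self.alive = False
            return None
        return int(reading)


def read_with_failover(active, standby, reading: float):
    """Return (unit_name, value) from the first unit that still works, else None."""
    # Both units run in parallel on the same input, so both are stepped.
    results = [(unit.name, unit.step(reading)) for unit in (active, standby)]
    for name, value in results:
        if value is not None:
            return name, value
    return None
```

With an independent fault in one unit, the switch-over works as intended. An input that the shared software cannot handle stops both units in the same way, so there is nothing left to switch to.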

Significant phases of the flight

Seven minutes before the scheduled launch, a violation of the "visibility criterion" was detected, so the launch was postponed by an hour.
At LT (Launch Time) = 9 h 33 min 59 s local time, the "launch window" was "caught" again. The vehicle finally lifted off and flew normally until LT+37 seconds.
Over the next few seconds the rocket deviated dramatically from its intended trajectory, which ended in an explosion.
At LT+39 seconds, the high aerodynamic load caused by an angle of attack of more than 20 degrees tore the boosters away from the main stage, which triggered the rocket's self-destruct system.
The angle of attack changed because the nozzles of the solid boosters were deflected incorrectly. The deflection was commanded by the onboard computer on the basis of information from the active Navigation System (IRS 2).
Some of this information was incorrect in principle: what has been interpreted as flight details was actually diagnostic information from the IRS 2 firmware.
The built-in computer of IRS 2 passed incorrect data because it had diagnosed a contingency after "catching" an exception thrown by one of the software modules.
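How diagnostic output can be mistaken for flight data is easy to show in miniature: bytes on a bus carry no meaning of their own, only the meaning the receiver assigns to them. The sketch below is purely illustrative. The message layout, the function names, and the units (hundredths of a degree) are hypothetical and do not describe the real IRS bus format.

```python
import struct


def encode_diagnostic(error_code: int, context: int) -> bytes:
    """Pack a hypothetical diagnostic record as two unsigned 16-bit words."""
    return struct.pack(">HH", error_code, context)


def decode_as_attitude(payload: bytes):
    """Read the same four bytes as two signed 16-bit angles.

    Hypothetical units: hundredths of a degree.
    """
    pitch_raw, yaw_raw = struct.unpack(">hh", payload)
    return pitch_raw / 100.0, yaw_raw / 100.0
```

A receiver that expects attitude data will decode any four bytes into two plausible-looking angles, whatever the sender meant by them.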
At the same time, the onboard computer could not switch to the backup system, IRS 1, because it had already ceased to function during the previous cycle (which took 72 milliseconds), for the same reason as IRS 2.
The exception thrown by the IRS program resulted from the conversion of data from a 64-bit floating-point format to a 16-bit signed integer, which led to an "Operand Error".
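The conversion described above can be sketched in Python. This is a minimal model and not the original Ada code: the exception class `OperandError` and the function `to_int16_checked` are names invented here, and truncation toward zero is an assumed rounding mode.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Hypothetical stand-in for the error raised by a failed conversion."""


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising if it does not fit."""
    if not math.isfinite(value):
        raise OperandError(f"{value!r} is not a finite number")
    truncated = int(value)  # assumed rounding mode: truncate toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated
```

A 64-bit float can hold values far beyond the 65,536 distinct values of a 16-bit integer, so the conversion is only safe when something else guarantees the range of the input.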
The error occurred in a component that is meant only for performing the "adjustment" (alignment) of the Inertial Platform. This software module generates meaningful results only until LT+7 seconds, the moment the rocket leaves the launch pad. After liftoff, the module's results no longer served any purpose.
According to the established requirements, the "adjustment function" had to remain active for 50 seconds after the "flight mode" was initiated on the Navigation System bus (at LT-3 seconds).
The "Operand Error" occurred because of an unexpectedly large magnitude of BH (Horizontal Bias — a horizontal skew), evaluated by the internal function based on the value of "horizontal speed" measured by the Platform sensors.
The BH magnitude served as an indicator of the precision of the Platform positioning. It turned out to be much greater than expected, because the trajectory of the Ariane 5 at the early stage was significantly different from the flight path of the Ariane 4 (where this software module was previously used), which led to a much higher "horizontal velocity".
The final action, which had fatal consequences, was the shutdown of the processor. As a result, the whole Navigation System ceased to function, and it was technically impossible to restart it.
The investigators were able to reproduce this chain of events using computer simulation. Combined with other research materials and experiments, this allowed them to conclude that the causes and circumstances of the accident had been fully identified.

The causes and origins of the accident

The original requirement to continue the adjustment after liftoff had been introduced more than 10 years before the fateful events, when the early Ariane models were designed.
A launch could be cancelled just a few seconds before liftoff, for example in the interval between LT-9 seconds, when the IRS started the "flight mode", and LT-5 seconds, when a command was issued to perform several operations with the rocket equipment.
If the launch was cancelled unexpectedly, it was necessary to return to the countdown quickly without repeating all the setup operations from the beginning, including the alignment of the Inertial Platform (an operation that takes 45 minutes, during which the "launch window" would be lost).
It was determined that, if the launch was cancelled, 50 seconds after LT-9 would be enough for the ground equipment to regain full control over the Inertial Platform without data loss: the Platform could stop the movement it had started, and the corresponding software module would record all the information about its state, which would help to return it to the original position (provided the rocket was still on the launch pad). Once, in 1989, during launch number 33 of the Ariane 4 rocket, this feature was used successfully.
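The timing can be checked with simple arithmetic, using only figures given in this article: a 50-second active period, a flight mode that starts at LT-9 seconds (as described here) or LT-3 seconds (as given in the technical details), and a failure at LT+37 seconds. The function names below are made up for the example, and treating the end of the window as exclusive is an assumption.

```python
ALIGNMENT_DURATION_S = 50.0  # active period of the adjustment function, from the text
FAILURE_TIME_S = 37.0        # LT+37 s, when the flight went wrong, from the text


def alignment_window_end(flight_mode_start_s: float) -> float:
    """Time (seconds relative to LT) at which the adjustment function stops."""
    return flight_mode_start_s + ALIGNMENT_DURATION_S


def alignment_active(t_s: float, flight_mode_start_s: float) -> bool:
    """True if the adjustment function is still running at time t_s."""
    # Assumption: the window is closed at the start and open at the end.
    return flight_mode_start_s <= t_s < alignment_window_end(flight_mode_start_s)
```

Starting at LT-9 seconds the window ends at LT+41 seconds, and starting at LT-3 seconds it ends at LT+47 seconds. In both cases the function was still running at LT+37 seconds, long after liftoff.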
However, unlike the previous model, Ariane 5 had a fundamentally different sequence of pre-flight actions, so different that running the fateful software module after launch time made no sense at all. Nevertheless, the module was reused without any modifications.
Ada language
The investigation revealed that this software module contained seven variables involved in type conversion operations. It turned out that the developers had analyzed every operation capable of throwing an exception for vulnerability.
They made a conscious decision to add adequate protection to four variables and to leave three of them, including BH, unprotected. The basis for this decision was the certainty that overflow was not possible in these variables at all.
This confidence was supported by calculations showing that the expected range of the physical parameters from which these variables were derived could never lead to an undesirable situation. And that was true, but only for the trajectory calculated for Ariane 4.
The new-generation Ariane 5 rocket launched on an entirely different trajectory, for which no such calculations were carried out. Meanwhile, it turned out that the "horizontal velocity" (together with the initial acceleration) was more than five times greater than the value estimated for Ariane 4.
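To see why a fivefold increase matters for a 16-bit variable, consider the following back-of-the-envelope sketch. It assumes, purely for illustration, that the converted value grows in proportion to the horizontal velocity; the peak value used here is hypothetical, since the article gives no actual magnitudes.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def fits_int16(value: float) -> bool:
    """True if the value can be stored in a 16-bit signed integer."""
    return INT16_MIN <= value <= INT16_MAX


def overflow_threshold(scale: float) -> float:
    """Largest old-envelope magnitude that still fits after scaling by `scale`."""
    return INT16_MAX / scale


# Hypothetical peak value on the old trajectory (not a figure from the report).
HYPOTHETICAL_OLD_PEAK = 8000.0
```

Under this assumption, any value that exceeded roughly 6,553 on the old trajectory no longer fits once it is five times larger. The hypothetical peak of 8,000 fits comfortably by itself but becomes 40,000 after scaling, well past the limit of 32,767.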
Not all seven variables (including BH) were protected because the maximum workload of the IRS computer had been set at 80%. The developers had to look for ways to cut unnecessary computational overhead, so they weakened the protection in the fragment where, in theory, a failure could not happen. When it did happen, the exception-handling mechanism was activated, and it turned out to be completely inadequate.
This mechanism consists of three main steps.
  • Information about the contingency is transmitted via the bus to the onboard computer (OBC).
  • In parallel, it is written, together with the whole context, to the reprogrammable EEPROM memory (during the investigation it was possible to recover it and read the contents).
  • The IRS processor is shut down.
The last action was the fatal one: it led to the accident even though the situation was otherwise quite normal (although an exception had been raised because of the unprotected overflow).
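The three steps can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not a reconstruction of the real handler: the class `IrsProcessor`, its attributes, and the message format are all invented for the example, with plain lists standing in for the bus and the EEPROM.

```python
class IrsProcessor:
    """Hypothetical model of an IRS processor with a halt-on-error policy."""

    def __init__(self):
        self.running = True
        self.bus_messages = []  # stands in for the bus to the onboard computer
        self.eeprom = []        # stands in for the EEPROM error log

    def handle_exception(self, error: Exception, context: dict) -> None:
        # Step 1: report the contingency on the bus.
        self.bus_messages.append(("DIAGNOSTIC", str(error)))
        # Step 2: store the error together with its context.
        self.eeprom.append({"error": str(error), "context": dict(context)})
        # Step 3: stop the processor.
        self.running = False

    def convert_bh(self, bh: float):
        """Convert BH to a 16-bit integer; return None once the unit is halted."""
        if not self.running:
            return None
        try:
            if not -32768 <= bh <= 32767:
                raise OverflowError("BH does not fit in 16 bits")
            return int(bh)
        except OverflowError as exc:
            self.handle_exception(exc, {"bh": bh})
            return None
```

In this model a single out-of-range value permanently disables the unit, even for later inputs that would have converted without trouble. The first two steps only record the problem; the third is the one that removes the navigation data from the flight.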

Conclusion

The defect on the Ariane 5 was the result of several factors. There were many stages during development and testing at which the defect could have been detected.
  • The software module was reused in a new environment where the operating conditions differed significantly from the module's requirements. These requirements were not revised.
  • The system detected and identified the error. Unfortunately, the specification of the error-handling mechanism was inappropriate and caused the final destruction.
  • The erroneous module was never properly tested in the new environment, neither at the hardware level nor at the system integration level. Therefore, the flaws in the design and implementation were not detected.
From the report of the commission:
The main concern during the development of Ariane 5 was reducing random failures. The exception that was thrown was not a random failure but a design error. The exception was detected but handled incorrectly because of the view that a program should be considered correct until it is shown to be faulty. The Commission holds the opposite view: software should be considered faulty until the best current practical methods demonstrate that it is correct.

Happy ending

Despite this failure, four more satellites, Cluster II, were built and put into orbit on Soyuz-U/Fregat rockets in 2000.
The accident drew the attention of the public, politicians, and heads of organizations to the high risks of using complex computing systems, which increased investment in research aimed at improving the reliability of life-critical systems. The subsequent automated analysis of the Ariane code (written in Ada) was the first case in which static analysis based on abstract interpretation was applied to a large project.

Sources

This article was originally published (in Russian) on the website habrahabr.ru. It was translated and published on our blog with the author's permission.
By Aleksey Statsenko

