Showing posts with label Arianespace. Show all posts

Monday, November 21, 2016

About the danger of programming errors

What is an error? According to Wikipedia, it is an unintentional deviation from right actions, deeds and thoughts; the difference between the expected or measured value and the real one. We make errors every day. Some inconvenience only us; others have more serious consequences. This article presents cases of programming errors that could have been avoided if the code had been analyzed more thoroughly.

About the human factor

The human brain is a field that has not yet been fully explored. Many books and articles have been written about its capabilities, but the majority of scientists agree on one thing: we aren't using 100% of our abilities. A human being isn't just logic, erudition and intelligence, but also feelings, emotions and upbringing. Even the most highly qualified specialist with an IQ above 140 (the average is about 100) can get tired, get upset or just be inattentive. The result of such a combination of circumstances can be a mistake.
Programmers are pedantic, thorough and definitely very smart people. But they still make mistakes when writing code. Many of these errors are detected thanks to compiler warnings (-Wall), asserts, tests, meticulous code review, IDE warnings, building the project with different compilers for different operating systems, running it on different hardware, and so on. But even with all these measures, errors often go unnoticed.
A person who has nothing to do with programming may think: there is nothing critical about a program error! When a surgeon makes a mistake during an operation, that is dangerous, but an incorrectly placed symbol is nothing to worry about. That person would be badly mistaken. I'll provide some examples here so that you can appreciate the importance of flawless code.

About money

Four satellites, 2,600 lb, of the Cluster scientific program (a study of the interaction between solar radiation and the Earth's magnetic field) and the European heavy-lift launch vehicle Ariane 5, used to deliver payloads into geostationary transfer orbit (GTO), turned into "confetti" on June 4, 1996. The accident attracted the attention of the public, politicians and the heads of the responsible organizations.

Conclusion of the commission:
The investigation showed that one of the key causes of the accident was a software module that Ariane 5 inherited from the previous models. Ariane 5, in contrast to the previous model, had a fundamentally different scenario of pre-flight actions, so different that running the fateful software module after launch made no sense at all. The module was not modified for the Ariane 5, so the analysis of all operations that the developers had carried out did not protect the launch vehicle from the crash.
Later on, other issues were found that could also have been avoided by a more thorough analysis of the launcher software.
The price of such carelessness: $370,000,000. Consequences: increased investment in research aimed at improving the reliability of systems with special safety requirements. The subsequent automated analysis of the Ariane code (written in Ada) was the first case in which static analysis based on abstract interpretation was applied to a large project.

About the human toll

The Therac-25 was a radiation therapy machine, a medical accelerator. The Canadian government organization Atomic Energy of Canada Limited released three versions: Therac-6, Therac-20 and Therac-25. The Therac-6 and Therac-20 were produced in conjunction with the French company CGR.
The software in the Therac-20 was based on the code of the Therac-6. All three machines had a PDP-11 computer installed. The previous models didn't require it, as they were designed as stand-alone devices. The radiotherapy technician set up various options manually, including the position of the rotating disk that configures the operating mode of the machine.

The hardware locking mechanisms of the Therac-6 and Therac-20 did not allow the operator to do something dangerous, say, choose a high-power electron beam without the X-ray target in place.
In the Therac-25 the hardware protection was removed, and the safety functions were all given to software. Gradual but inconsistent improvements to the software led to fatal mistakes. From June 1985 to January 1987 this machine caused six radiation overdoses; some patients received doses of several thousand rads (a typical therapeutic radiation dose is up to 200 rads; 1000 rads is a lethal dose). At least two patients died directly from the overdoses.
At least four errors that could lead to overexposure to radiation were found in the Therac-25 software.
During the investigation it became clear that the software had been tested with a minimal number of tests on the simulator, while most of the time the system was tested as a whole. Thus, module testing was disregarded, and only integration testing was done.
I think that now you will probably agree that the price of an error is sometimes intolerably high.

When in doubt, trust the program.

A programmer can improve their coding skills and become a real professional. But even then, errors cannot be ruled out. The examples above show that "trusting to luck" is dangerous, which is why programmers act as cautiously as possible and use a large number of methods and tools that help control code quality. One such tool is static analysis. Static analyzers help detect many errors in the source code of programs written in various programming languages. They analyze the code and generate a report that helps a programmer find and eliminate the errors.
The best way to show the benefits of such a product is to demonstrate its abilities by checking open-source projects. For example, more than 10000 bugs have already been detected with the help of the PVS-Studio static analyzer. You may find them all here: http://www.viva64.com/en/examples/.
Yes, you can program without any help from analyzers. You can check the code yourself or ask your colleagues to recheck it. But do not forget that a programmer is first and foremost a human being. Using a static code analyzer to check a project isn't a sign of unprofessionalism. On the contrary, it shows a desire to bring the results of our work as close to the ideal as possible. If an error is detected during development, only you will know that it was there; otherwise your blunder may end up in an article called "The dumbest bugs of the decade".

You may find the full versions of the articles, the abstracts from which were used to write this one, here:

Friday, September 2, 2016

A space error: $370,000,000 for an integer overflow


Start. 37 seconds of flight. KaBOOM! 10 years and 7 billion dollars are turning into dust.
Four satellites, 2,600 lb, of the Cluster scientific program (a study of the interaction between solar radiation and the Earth's magnetic field) and the heavy-lift launch vehicle Ariane 5 turned into "confetti" on June 4, 1996.
The programmers were to blame for everything.
The previous model, the Ariane 4 rocket, had been launched successfully more than 100 times. What could go wrong?
Apparently, to conquer space, one should know the Ada language well.

Dossier


Ariane 5 is a European expendable heavy-lift launch vehicle that is part of the Ariane rocket family. It is used to deliver payloads into geostationary transfer orbit (GTO) or low Earth orbit (LEO), and can launch two or three satellites and up to eight microsatellites at a time.
The project history
It was created in 1984-1995 by the European Space Agency (ESA); the main developer was the French Centre National d'Etudes Spatiales (CNES). Ten European countries participated in the program, and the project cost 7 billion US dollars (France contributed 46.2%).
About a thousand industrial firms took part in the creation of the rocket. The prime contractor is a European company, Airbus Defence and Space (a unit of Airbus Group, Paris). The marketing for Ariane 5 was done by a French company, Arianespace (Evry), with which ESA signed an agreement on November 25, 1997.
Vehicle description
Ariane 5 is a two-stage heavy class booster rocket. Length — 52-53 m, maximum diameter — 5.4 m, starting weight: 775-780 tonnes (depending on the configuration).
The first stage is equipped with the Vulcain 2 liquid rocket engine ("Volcano-2"; the first three versions of the rocket used the original Vulcain), and the second with either the HM7B (for the Ariane 5 ECA version) or the Aestus (for the Ariane 5 ES). The Vulcain 2 and HM7B engines run on hydrogen and oxygen and are manufactured by the French company Snecma (a part of the Safran group, Paris).
Aestus uses storable propellants: MMH fuel with a nitrogen tetroxide oxidizer. The engine was developed by the German company Daimler Chrysler Aerospace AG (DASA, Munich).
In addition, two solid rocket boosters are attached to the sides (manufacturer: Europropulsion, Suresnes, France, a joint venture between Safran Group and the Italian company Avio); they deliver more than 90% of the thrust during the first phases of the launch. In the Ariane 5 ES version, the second stage may be omitted when delivering payloads to low Earth orbit.
The day after the catastrophe, the Director General of the European Space Agency (ESA) and the Chairman of the French National Centre for Space Research (CNES) issued a decree forming an independent Commission to investigate the circumstances and causes of the accident. It included well-known experts and scientists from all interested European countries.
The Commission began its work on June 13, 1996, and on July 19 it released an exhaustive report (PDF), which immediately became available on the net.
The Commission had telemetry data, trajectory data, as well as recorded optical observations of the course of the flight.
The explosion occurred at an altitude of approximately 4 km, and the debris was scattered over an area of about 12 square km in the savanna and the surrounding swamps. The Commission studied the testimonies of numerous specialists and examined the production and operational documentation.

Technical details of the accident


The position and orientation of the launcher in space were measured by an Inertial Reference System (IRS). Part of it is a built-in computer that calculates angles and velocities from the information provided by the onboard Inertial Platform, which is equipped with laser gyroscopes and accelerometers. The data from the IRS were passed over a dedicated bus to the onboard computer, which supplied the information needed to execute the flight program and directly controlled, through hydraulic and servo mechanisms, the solid boosters and the cryogenic engines.
Duplication of the equipment was used to ensure the reliability of the Flight Control Systems. Two IRS units (one active, the other its hot standby) with identical hardware and software therefore operated in parallel. As soon as the onboard computer detected that the "active" IRS had left its normal mode, it immediately switched to the other one. There were also two on-board computers.
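The switchover logic described above can be sketched roughly as follows. This is only a minimal Python illustration, not the actual flight software (which was written in Ada); `UnitFailed`, `select_reading` and the two callables standing in for the IRS units are hypothetical names introduced for the example.

```python
from typing import Callable, Optional, Tuple


class UnitFailed(Exception):
    """Hypothetical signal that a reference unit has left its normal mode."""


def select_reading(
    active: Callable[[], float], standby: Callable[[], float]
) -> Tuple[Optional[str], Optional[float]]:
    """Prefer the active unit; fall back to the hot standby if it fails."""
    for name, unit in (("active", active), ("standby", standby)):
        try:
            return name, unit()
        except UnitFailed:
            continue  # this unit is out of service, try the next one
    return None, None  # both units failed: no valid reference data left


def healthy() -> float:
    return 1.5  # arbitrary illustrative value


def failed() -> float:
    raise UnitFailed("unit shut down")
```

Duplication of this kind guards against independent hardware faults. Because both units run identical software, the same input can trigger the same failure in each of them, and then `select_reading(failed, failed)` has nothing left to fall back on.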

Significant phases in the development of events

Seven minutes before the scheduled launch, a violation of the "visibility criterion" was detected, so the launch was postponed by an hour.
LT (Launch Time) = 9:33:59 local time. The "launch window" was "caught" again, the vehicle finally lifted off, and the flight proceeded normally until LT+37 seconds.
Over the next few seconds, the rocket deviated dramatically from its planned trajectory, and the flight ended in an explosion.
At LT+39 seconds, because of the high aerodynamic load caused by an "angle of attack" exceeding 20 degrees, the boosters separated from the main stage, which triggered the rocket's Autodestruct System.
The angle of attack changed because of a malfunction in the nozzle deflection of the solid boosters, commanded by the on-board computer on the basis of information from the active Navigation System (IRS 2).
Some of this information was incorrect in principle: what was interpreted as flight data was actually diagnostic information from the IRS 2 firmware.
The built-in computer of IRS 2 passed incorrect data because it had diagnosed a contingency, having "caught" an exception thrown by one of the software modules.
At the same time, the on-board computer could not switch to the backup system, IRS 1, because it had already ceased to function during the previous cycle (which took 72 milliseconds), for the same reason as IRS 2.
The exception "thrown" by the IRS program resulted from the conversion of data from a 64-bit floating-point format to a 16-bit signed integer, which led to an "Operand Error".
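To make the failure mode concrete, here is a minimal Python sketch of a range-checked conversion from a floating-point value to a 16-bit signed integer. It is an analogue for illustration only: the original code was written in Ada, and the name `OperandError`, the function `to_int16_checked` and the choice to truncate toward zero are assumptions made for this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(OverflowError):
    """Hypothetical stand-in for the range-check failure described above."""


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising if it does not fit."""
    truncated = int(value)  # assumption: truncate toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated
```

A value such as `1234.9` converts without trouble, while anything whose integer part lies outside -32768..32767 (for example `40000.0`, an arbitrary number chosen here) raises the exception. What happens next depends entirely on who handles it.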
The error occurred in a component that is meant only for performing the "adjustment" of the Inertial Platform. This software module generates significant results only until the moment LT+7 seconds of the detachment from the launch pad. Once the rocket had lifted off, its results no longer served any purpose.
According to the established requirements, the "adjustment function" had to remain active for 50 seconds after "flight mode" was initiated on the Navigation System bus (at LT-3 seconds).
The "Operand Error" occurred because of an unexpectedly large value of BH (Horizontal Bias, a horizontal skew), calculated by an internal function from the "horizontal velocity" measured by the Platform sensors.
The BH value served as an indicator of the precision of the Platform positioning. It turned out to be much greater than expected because the trajectory of the Ariane 5 at the early stage was significantly different from the flight path of the Ariane 4 (where this software module was previously used), which led to a much higher "horizontal velocity".
The final action, the one with fatal consequences, was the shutdown of the processor. With that, the whole Navigation System ceased to function, and it was technically impossible to restart it.
The researchers were able to reproduce this chain of events using computer modeling. Combined with other research materials and experiments, this allowed them to conclude that the causes and circumstances of the accident had been fully identified.

The causes and origins of the accident

The initial requirement to continue the adjustment after liftoff had been introduced more than 10 years before the fateful events, when the early Ariane models were designed.
A launch could be cancelled just a few seconds before liftoff, for example in the interval between LT-9 seconds, when the IRS started "flight mode", and LT-5 seconds, when a command was issued to perform several operations on the rocket equipment.
In the case of an unexpected cancellation of the takeoff, it was necessary to return quickly to the countdown mode without repeating all the setup operations from the beginning, including the alignment of the Inertial Platform (an operation requiring 45 min., during which the "launch window" would be lost).
It was determined that, if the launch was cancelled, 50 seconds after LT-9 would be enough for the ground equipment to regain full control over the Inertial Platform without data loss: the Platform could stop the movement that had been initiated, and the corresponding software module would record all the information about its state, which would help to return to the original position (provided the rocket was still on the launch pad). Once, in 1989, during launch number 33 of the Ariane 4 rocket, this feature was used successfully.
However, the Ariane 5, in contrast to the previous model, had a fundamentally different scenario of pre-flight actions, so different that running the fateful software module after launch made no sense at all. Nevertheless, the module was reused without any modifications.
Ada language
The investigation revealed that this software module contained seven variables involved in type conversion operations. It turned out that the developers had analyzed every operation capable of throwing an exception for vulnerability.
It was their conscious decision to add adequate protection to four variables and leave three of them, including BH, unprotected. The grounds for this decision were the certainty that overflow was simply not possible in these variables.
This confidence was backed by calculations showing that the expected range of the physical parameters from which these variables were derived could never lead to an undesirable situation. And that was true, but only for the trajectory calculated for Ariane 4.
The new-generation Ariane 5 rocket launched on an entirely different trajectory, for which no such calculations were carried out. Meanwhile, it turned out that the "horizontal velocity" (together with the initial acceleration) exceeded the value estimated for Ariane 4 by more than a factor of five.
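The range argument can be illustrated with a short Python sketch. The numbers below are hypothetical and chosen only to show how an assumption that holds for one trajectory can silently break for another; the source does not state the real values, nor what form the "protection" of the four variables took. Clamping to the representable range, shown here as `to_int16_saturating`, is one common option and is an assumption of this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def fits_int16(max_abs_value: float) -> bool:
    """True if every value in [-max_abs_value, max_abs_value] fits in int16."""
    return max_abs_value <= INT16_MAX


def to_int16_saturating(value: float) -> int:
    """One possible 'protected' conversion: clamp to the int16 range."""
    return int(max(INT16_MIN, min(INT16_MAX, value)))


# Hypothetical numbers, chosen only to illustrate the argument.
ARIANE4_ASSUMED_MAX = 8000.0
ARIANE5_OBSERVED_MAX = 5 * ARIANE4_ASSUMED_MAX  # "more than five times" larger
```

With these illustrative figures, `fits_int16(ARIANE4_ASSUMED_MAX)` holds, so leaving the conversion unprotected looks safe, while `fits_int16(ARIANE5_OBSERVED_MAX)` does not. The proof of safety was valid only for the inputs it was derived from.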
Not all seven variables (including BH) were protected, because a maximum workload of 80% had been specified for the IRS computer. The developers had to look for ways to cut unnecessary computational overhead, and they weakened the protection in the fragment where, in theory, an accident could not happen. When it happened anyway, the exception-handling mechanism was activated, and it turned out to be completely inadequate.
This mechanism involved three main steps:
  • The information about the contingency had to be transmitted over the bus to the onboard computer (OBC).
  • In parallel, it was written, together with the whole context, to the reprogrammable EEPROM memory (during the investigation it was possible to restore it and read the contents).
  • The IRS processor had to be shut down.
The last action was the fatal one; it led to the accident even though the situation was quite normal (despite the exception generated by the unprotected overflow).
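The three steps above, and the difference a different last step would make, can be sketched in Python. This is a hypothetical model for illustration only: `ReferenceUnit`, both handler functions and the "keep supplying the last good value" policy are assumptions of this example, not a description of the real IRS software.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ReferenceUnit:
    """Hypothetical model of a reference unit's state."""
    last_good: float = 0.0
    running: bool = True
    log: List[Tuple[str, str]] = field(default_factory=list)


def handle_with_shutdown(unit: ReferenceUnit, error: Exception) -> Optional[float]:
    """The three steps described above: report, record, stop."""
    unit.log.append(("bus", str(error)))     # 1. notify the onboard computer
    unit.log.append(("eeprom", str(error)))  # 2. store the failure context
    unit.running = False                     # 3. shut the processor down
    return None                              # no more flight data


def handle_degraded(unit: ReferenceUnit, error: Exception) -> Optional[float]:
    """Alternative policy: report and record, but keep supplying an estimate."""
    unit.log.append(("bus", str(error)))
    unit.log.append(("eeprom", str(error)))
    return unit.last_good  # best-effort value; the unit stays running
```

Both handlers report and record the problem in the same way. Only the first one turns a single out-of-range value in a module that no longer mattered into the loss of all navigation data.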

Conclusion

The defect on the Ariane 5 was the result of several factors. There were many stages during development and testing when it could have been detected.
  • The software module was reused in a new environment where the operating conditions were significantly different from the requirements it had been written for. These requirements were not revised.
  • The system detected and identified the error. Unfortunately, the specification of the error-handling mechanism was inappropriate and caused the final destruction.
  • The erroneous module was never properly tested in the new environment, neither at the hardware level nor at the system integration level. Therefore, the flaws in the design and implementation were not detected.
From the report of the commission:
The main task during the development of Ariane 5 was to reduce the risk of random failures. The exception thrown was not a random failure but a design error. The exception was detected but handled incorrectly, because of the view that a program should be considered correct until it is shown to be at fault. The Commission holds the opposite view: software should be considered erroneous until the best current practical methods demonstrate its correctness.

Happy ending

Despite this failure, four more satellites, Cluster II, were built and put into orbit on Soyuz-U/Fregat rockets in 2000.
This accident drew the attention of the public, politicians, and the heads of organizations to the high risks connected with the use of complex computational systems, which increased investment in research aimed at improving the reliability of life-critical systems. The subsequent automated analysis of the Ariane code (written in Ada) was the first case in which static analysis based on abstract interpretation was applied to a large project.

Sources

This article was originally published (in Russian) on the website habrahabr.ru. It was translated and published on our blog with the author's permission.
By Aleksey Statsenko