Monday, October 31, 2016

R-17 VS Patriot: a Rounding Issue

This is another article in our series on the importance of high-quality code in computer systems whose failure can cause huge losses or casualties. This time we will talk about the reliability of embedded software in military equipment.
Picture 1
February 11, 1991, the Israeli forces inform the Patriot Project Office about a defect found in the Patriot surface-to-air missile defense system. They had discovered that running the system for 8 consecutive hours resulted in a 20% loss of targeting precision, and estimated that after 20 hours of continuous operation the inaccuracy would grow so large that the Patriot would no longer be able to lock on to, track, and intercept ballistic missiles. The U.S. commanders underestimated the importance of the discovery, presuming that the system would never run for more than 8 hours at a stretch, as it had been designed as a mobile system for short-term defense operations.
February 16, a bug fix is issued, but applying it to every unit requires some time due to the ongoing war.
February 21, the commanders issue a directive that the system should not run "for a very long time". The directive does not specify exactly how long "a very long time" is.
February 25, an R-17 ballistic missile (also known as Scud) strikes a U.S. Army barracks in Dhahran, Saudi Arabia, killing 28 soldiers and injuring 96. The Patriot battery fails to intercept the missile because of a software error.
February 26, the bug fix is delivered to Dhahran.
Picture 2
Picture 6
R-17 (NATO reporting name SS-1C Scud-B; exported under the name R-300) is a Soviet single-stage ballistic missile propelled by storable liquid fuel.
Picture 7
Picture 12
Officers examining an R-17 missile shot down by a Patriot MIM-104 SAM system in the desert during Operation Desert Storm
The MIM-104 Patriot is a U.S. surface-to-air missile (SAM) defense system used by the USA and several allied nations.
Picture 13
Picture 17
Picture 19
A detailed view of an AN/MPQ-53 radar set. The circular pattern on the front of the vertical component is the system's main phased array, consisting of over 5,000 individual elements, each about 39 millimeters (1.535 in) in diameter.
Picture 32
PAC-3 missile launcher; note the four missiles in each canister
An investigation discovered a bug in the Patriot's tracking software that caused the system's internal clock to drift gradually from the real time.
The time was kept as an integer count of tenths of a second since power-up. To calculate a target's location, this count had to be converted to a real number of seconds by multiplying it by 1/10, and the software did this in a 24-bit fixed-point register. Since 1/10 has no exact binary representation, a small portion of the value was lost on every 0.1-second tick [source].
1/10 is 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100... and never terminates. That is why this value, stored in a 24-bit register, was cut off (truncated) at 0.00011001100110011001100, leaving an error of 0.0000000000000000000000011001100... in binary, or about 0.000000095 in decimal. Over 100 hours of continuous operation, this error builds up to 0.000000095 × 100 × 60 × 60 × 10 ≈ 0.34 seconds.
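These figures are easy to check with exact fractions. Below is a minimal Python sketch, not the Patriot's actual code (which was written in assembly). The function names are ours, and the choice of 23 kept fractional bits is an assumption taken from the number of digits in the cut-off value quoted above.

```python
from fractions import Fraction

# Assumption: 23 digits are kept after the binary point, which matches
# the cut-off value 0.00011001100110011001100 quoted in the article.
FRACTION_BITS = 23


def binary_digits(value: Fraction, bits: int) -> str:
    """Return the first `bits` binary digits of a value in [0, 1)."""
    digits = []
    for _ in range(bits):
        value *= 2
        bit = 1 if value >= 1 else 0
        digits.append(str(bit))
        value -= bit
    return "".join(digits)


def chop(value: Fraction, bits: int) -> Fraction:
    """Truncate (not round) a non-negative value to `bits` binary fraction digits."""
    scale = 2 ** bits
    return Fraction(value.numerator * scale // value.denominator, scale)


def error_per_tick(bits: int = FRACTION_BITS) -> Fraction:
    """Exact difference between 1/10 and its chopped fixed-point value."""
    tenth = Fraction(1, 10)
    return tenth - chop(tenth, bits)


def clock_drift_seconds(hours: int, bits: int = FRACTION_BITS) -> Fraction:
    """Accumulated clock error after `hours` of uptime at 10 ticks per second."""
    ticks = hours * 60 * 60 * 10
    return error_per_tick(bits) * ticks


if __name__ == "__main__":
    print("1/10 in binary: 0." + binary_digits(Fraction(1, 10), FRACTION_BITS) + "...")
    print("error per tick:", float(error_per_tick()))
    print("drift after 100 hours:", float(clock_drift_seconds(100)), "s")
```

Using Fraction instead of float keeps the check itself free of the very rounding effect it is meant to expose.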
An R-17 travels at 1676 m/s, so in 0.34 seconds it covers over half a kilometer, which is more than enough for the missile to end up outside the area in which the Patriot is looking for it. Ironically, this time-calculation bug had been fixed only in some parts of the software, not in all of them.
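To get a feel for what such a clock error means in terms of distance, here is a small Python sketch. It uses only the figures quoted in this article (1676 m/s and an error of about 0.000000095 s per 0.1-second tick). The function names are ours, and the model (distance = speed × clock error) is a deliberate simplification, not the Patriot's real tracking math.

```python
SCUD_SPEED_M_S = 1676.0          # R-17 velocity quoted in the article
ERROR_PER_TICK_S = 0.000000095   # approximate error per 0.1 s tick quoted in the article
TICKS_PER_HOUR = 60 * 60 * 10    # the clock ticks ten times per second


def clock_error_seconds(uptime_hours: float) -> float:
    """Accumulated clock error after the given uptime."""
    return ERROR_PER_TICK_S * TICKS_PER_HOUR * uptime_hours


def position_error_meters(uptime_hours: float,
                          speed_m_s: float = SCUD_SPEED_M_S) -> float:
    """Distance the target covers during the accumulated clock error."""
    return speed_m_s * clock_error_seconds(uptime_hours)


if __name__ == "__main__":
    for hours in (8, 20, 100):
        print(f"{hours:>3} h uptime: clock off by {clock_error_seconds(hours):.4f} s, "
              f"target displaced by about {position_error_meters(hours):.0f} m")
```

The error grows linearly with uptime, which is why restarting the system regularly would have kept it small.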
The software had been written in assembly language 15-20 years earlier and had been modified a number of times by different teams of programmers over the following years.
The slides shown below are taken from the report on the Patriot system's failure:
Picture 33
Picture 35
Golden rules for programmers:
  • Choose adequate sizes for your variables. Always check twice how many bits each of them requires for storing values (long, int, double, float, etc.) in a given language and a given operating system.
  • Use integer numbers instead of floating-point ones wherever possible. Measure money in cents, not in dollars. If you can't do without float, use double-precision format.
  • Never use floating-point numbers as loop counters.
  • Avoid mixing types (signed -- unsigned; integer -- floating-point; single precision -- double precision). Be careful with type casts.
  • Check for possible overflows and division-by-zero operations.
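A few of these rules are easy to demonstrate. The following Python sketch is only an illustration with hypothetical helper names of our own; similar effects show up with float and double in C, C++, or C#.

```python
def bits_needed(max_value: int) -> int:
    """Rule 1: how many bits an unsigned counter needs to hold max_value."""
    return max(1, max_value.bit_length())


def float_counter_reaches(target: float, step: float, max_steps: int) -> bool:
    """Rule 3: does a floating-point counter ever land exactly on target?"""
    t = 0.0
    for _ in range(max_steps):
        if t == target:
            return True
        t += step  # every addition may introduce a tiny representation error
    return False


def int_counter_reaches(target_ticks: int, max_steps: int) -> bool:
    """Same loop with an integer tick counter: no representation error."""
    ticks = 0
    for _ in range(max_steps):
        if ticks == target_ticks:
            return True
        ticks += 1
    return False


if __name__ == "__main__":
    # Rule 1: tenths of a second in 100 hours of uptime.
    ticks_in_100_hours = 100 * 60 * 60 * 10
    print("bits for a 100-hour tick counter:", bits_needed(ticks_in_100_hours))

    # Rule 2: money in dollars (float) versus money in cents (int).
    print("0.10 + 0.20 == 0.30 ->", 0.10 + 0.20 == 0.30)
    print("10 + 20 == 30 (cents) ->", 10 + 20 == 30)

    # Rule 3: a float counter stepping by 0.1 versus an integer tick counter.
    print("float counter hits 1.0:", float_counter_reaches(1.0, 0.1, 100))
    print("int counter hits 10 ticks:", int_counter_reaches(10, 100))
```

A loop written as "repeat until t equals 1.0" with a float step of 0.1 would never stop on its own, while the integer version stops after exactly ten ticks.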

More on Patriot

Conclusion

Our goal is to attract the community's attention to the issues of software reliability. The times when computer programs were all about some strange, obscure scientific calculations in Fortran or video games are long over. Now they surround us and permeate every area of our activity.
In earlier times, critical software bugs affected narrow, specific areas, for example the civil (Ariane 5) and military rocket industries. Nowadays, you may encounter them not only when working on your computer, but also when driving a car (Toyota) or undergoing medical treatment (Therac-25). We are among those who support programmers in their fight against bugs. PVS-Studio, the static code analyzer developed by our team, helps detect many errors in C, C++, and C# programs as early as the coding stage. Taking this opportunity, I'd also like to remind you that since October 25, 2016, a Linux version of PVS-Studio has been available in addition to the existing Windows version.

This article was originally published (in Russian) on habrahabr.ru. The original and translated versions were posted on our blog with the permission of the author.
