InnoCentive CEO Dwayne Spradlin wrote in this article about how many companies, in a rush to develop something new and innovative, try to solve the problem they want to solve rather than the problem the customer wants solved. He quotes Albert Einstein, who once said:
If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it.
Einstein’s quote resonates with reliability engineers, because we are not necessarily chasing innovation when trying to make something more reliable. We are focused on solving the right ‘failure’ problems. A lot of time and effort can be wasted on solving ‘problems’ that are easier, quicker and less expensive to fix, while the underlying issue of poor reliability goes unaddressed. On the contrary – introducing unnecessary fixes may make the system less reliable, because it is now more complex.
This is the fifth article in a series that deals with the challenges of reliability in emerging technologies. It all started after my recent involvement in helping improve the reliability of small satellites, talking with several key industry stakeholders, and seeing the same issues raise their heads time and again. The earlier articles looked at the archaic approach of managing risk, safety and reliability through compliance. I say archaic, but compliance is unfortunately still wreaking havoc today. The series then moved toward a ‘performance-based mission assurance framework’ that focuses on what the system actually does, not what the design team does. And the previous article looked at a case study analyzing small satellite reliability.
We know that around 35 per cent of small satellites fail to complete their mission. Around 20 per cent of small satellites are Dead on Arrival (DOA). That is, they never work after they are deployed. We call these failures ‘Region 0’ failures. Around 15 per cent of small satellites fail due to infant mortality. We call these failures ‘Region I’ failures. To date, there is no evidence that small satellites have a constant failure rate or ever start to wear out. And this is key.
This article goes on to discuss what these conclusions mean. This is broadly analogous to Einstein taking the 59 minutes to work out what he must do to save the world. The problem is that in the satellite (and many other) industries, those 59 minutes are wasted as we get on with solving what is typically the wrong problem. There is a rush to consensus-based, compliance-driven frameworks that forget to look at what is really going on. The solution may be glorious and inspiring. But it may not solve the problem you need solved.
So let’s talk about that.
What does a reliability analysis mean for improving reliability?
We know from the previous articles in this series that small satellite failures are ‘Region 0 and I’ only. The good news is that ‘we’ (the reliability engineering community and experienced satellite engineers) pretty much know what causes these failures. And they are listed below.
- An absence of functional testing. Small satellites built by universities and satellites developed under compressed timeframes are more likely to fail early, including DOA. Why might this be the case? Universities don’t have manufacturing expertise and are often building satellites for educational purposes. So universities and companies with compressed developmental timeframes are less likely to conduct functional testing. It is also anecdotally known that as this sort of testing occurs at the end of a developmental cycle after all the components have been designed and manufactured, it is often the first thing cut in response to budget and schedule pressure. And even if functional testing is not cut, test results are only useful if there is enough time to redesign the satellite should they identify a problem. So even if functional testing takes place, we need to make sure there is scope for satellite design to change. Functional tests need to stop being the ‘icing on the top of the design process cake’ and start being a fundamental ingredient.
- Manufacturing issues, which introduce defects that ‘are not supposed to be there.’ The failure mechanisms such defects cause tend to occur early in the mission – and often.
- Fault tolerance (or resilience) is a term most often used in software development domains, but in small satellites it includes hardware-based logic as well. It technically includes things like redundancy, which small satellites tend not to have. Things like asynchronous timing can be mitigated by designing satellites that routinely re-zero relevant clocks (see the sketch after this list).
- Commercial-off-the-shelf (COTS) parts which historically provide challenges to manufacturers more broadly who want some sort of assurance or way to assess whether the supplier has provided high quality components. COTS parts are often problematic to analyze and control, particularly when third party suppliers don’t see customers as big enough to warrant change. There are plenty of examples of ‘good’ COTS components. And plenty of examples of ‘bad’ ones as well.
- Fundamental design flaws that include (but are not limited to) a weak understanding of thermal dissipation which often is only truly examined on deployment. We know that errors in thermal dissipation calculations will see satellites fail early.
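Back to the fault tolerance point above: to make the clock re-zeroing idea a little more concrete, here is a minimal sketch of the sort of back-of-envelope check a designer might do. The drift rate, resync cadence and skew tolerance are illustrative assumptions only, not values from any flight design.

```python
# A minimal sketch of periodic clock re-zeroing to bound asynchronous-timing
# drift between subsystems. All numbers below are illustrative assumptions.

DRIFT_SECONDS_PER_HOUR = 0.05   # assumed worst-case oscillator drift
RESYNC_INTERVAL_HOURS = 6       # assumed housekeeping re-zero cadence
MAX_TOLERABLE_SKEW_S = 0.5      # assumed skew before command sequencing misbehaves

def worst_case_skew(resync_interval_hours: float) -> float:
    """Worst-case clock skew accumulated between re-zero events."""
    return DRIFT_SECONDS_PER_HOUR * resync_interval_hours

if __name__ == "__main__":
    skew = worst_case_skew(RESYNC_INTERVAL_HOURS)
    print(f"Worst-case skew between resyncs: {skew:.2f} s")
    print("OK" if skew < MAX_TOLERABLE_SKEW_S else "Re-zero more often or improve the oscillator")
```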
So we should see reliability and risk frameworks address these issues first … right?
Satellite Design for Reliability is Rudimentary at Best
Given we know DOA failures and infant mortality drive virtually all satellite failures, do we see this influence satellite design for reliability activities? No.
Anecdotal data suggests that geosynchronous earth orbit (GEO) satellites often have their missions extended as they tend to last longer than other satellites. Perhaps there is something to be learned here. GEO satellites are commonly used for communications and navigation due to their fixed positions. That is, they have the same mission performance requirements as other and previous GEO satellites. This may in turn allow design knowledge to iteratively increase and evolve (in a good way). None of the literature to date outlines what GEO satellite manufacturers do differently to make their satellites seemingly more reliable. And design approaches are understandably not publicized. But doing the same (or similar) thing over again invariably allows continual improvement, and perhaps this is the key to their better reliability.
When satellites more broadly are compared to ‘high reliability’ industries, the results are not good. And even those satellites with longer typical operational lives will fail due to infant mortality (albeit over an extended timeframe).
The small satellite industry does not recognize the reality of DOA and infant mortality failures in a meaningful way. Sure, plenty of organizations have sponsored research showing that satellites predominantly fail due to DOA or infant mortality failures. But when it comes to telling their design teams what to do, they may as well be talking about a Formula 1 race car.
NASA’s Geostationary Operational Environmental Satellite (GOES) had its entire reliability budget based on an assumption that the Weibull shape parameter (used to describe time to failure) was 1.6. The meaning behind this parameter was described in the previous article. If the Weibull shape parameter exceeds 1, then it is describing wear-out failure. So NASA arbitrarily assumed its GOES would only experience wear-out failures. If this were true … it would be the first satellite to exhibit this tendency! And it contradicts many of the studies NASA has sponsored that investigate satellite reliability (… again, read my previous article).
How NASA used this assumed parameter was at best amateurish. Once a mean mission duration was ascertained, this shape parameter was used to calculate other parameters, and allocate reliability goals to subsystems. A curiously similar shape parameter of 1.7 has been cited as a:
value commonly used for satellite systems.
Some people commonly assume bigfoot exists. And children commonly assume the tooth fairy to be a thing. At least we all know Santa Claus is the real deal.
These assumptions blatantly contradict what we have learned about satellite reliability (… again, read the previous article if you need to be refreshed on this). Satellites don’t wear out. And yet these assumptions have governed some of the most expensive satellite projects.
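To illustrate just how much the assumed shape parameter matters, here is a minimal sketch comparing the assumed wear-out shape of 1.6 with a wear-in shape of 0.4. The 0.4 value and the five-year mean life are illustrative assumptions (a shape below 1 is simply what infant mortality looks like), not figures from any specific study. Both distributions share the same mean life; their early-life failure probabilities are wildly different.

```python
# A minimal sketch of why the assumed Weibull shape parameter matters.
# The 5-year mean life and the beta = 0.4 'infant mortality' value are
# illustrative assumptions; beta = 1.6 is the wear-out assumption discussed above.
from math import gamma, exp

def weibull_eta(mean_life, beta):
    """Scale parameter eta that gives the requested mean life for shape beta."""
    return mean_life / gamma(1.0 + 1.0 / beta)

def prob_failure_by(t, beta, eta):
    """Weibull cumulative probability of failure by time t."""
    return 1.0 - exp(-((t / eta) ** beta))

mean_life_days = 5 * 365          # assumed design mean life
early_window_days = 30            # 'infant mortality' window of interest

for beta in (1.6, 0.4):           # assumed wear-out vs wear-in shapes
    eta = weibull_eta(mean_life_days, beta)
    p = prob_failure_by(early_window_days, beta, eta)
    print(f"beta={beta}: P(failure within {early_window_days} days) = {p:.1%}")
```

With the same mean life, the wear-out assumption predicts roughly a 0.1 per cent chance of failing in the first month; the wear-in shape predicts more than a quarter of units failing in the same window. Assuming the wrong shape does not just tweak the numbers – it makes the dominant failure region invisible.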
How can we say that satellites don’t wear out … when we know that at least some of their parts wear out?
It is all about perspective and what matters to the engineer. When we look at individual components and subsystems of any system, we can clearly find things that wear out. Batteries wear out as charge and discharge cycles accumulate. Solar panel gimbals wear out. And because designers focus on their designated element of the system – they tend to do something about the things that are ‘clear to them.’
But when the focus is on the small satellite system (not the components), there is no wear-out. Or more correctly, wear-out failure is completely dominated by DOA and infant mortality failure. The wear-out of individual components is so important to individual designers and traditional compliance assurance frameworks that the systemic effects are missed entirely. It is incorrectly assumed that focusing on the wear-out of components makes reliable satellites. By extension, it is incorrectly assumed that wear-out failure is the only failure we need to concern ourselves with. We’ll show why seemingly intelligent human beings do this time and time again.
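Here is a minimal sketch of why component wear-out can be real and yet irrelevant at system level. Assume (purely for illustration) that 15 per cent of builds carry a latent manufacturing defect with a short characteristic life, while the remaining builds only ever wear out slowly.

```python
# A minimal sketch of why component wear-out can vanish at system level.
# The defect fraction and characteristic lives are illustrative assumptions only.
from math import exp

P_DEFECT = 0.15                          # assumed fraction of builds with a latent defect
ETA_DEFECT, BETA_DEFECT = 60.0, 0.5      # assumed early-failure behaviour (days)
ETA_WEAR, BETA_WEAR = 3000.0, 3.0        # assumed component wear-out behaviour (days)

def weibull_cdf(t, eta, beta):
    """Probability of failure by time t for a Weibull distribution."""
    return 1.0 - exp(-((t / eta) ** beta))

for t in (30, 180, 730):  # one month, six months, two years on orbit
    p_early = P_DEFECT * weibull_cdf(t, ETA_DEFECT, BETA_DEFECT)
    p_wear = (1 - P_DEFECT) * weibull_cdf(t, ETA_WEAR, BETA_WEAR)
    print(f"day {t:4d}: early-life failures {p_early:.1%} vs wear-out failures {p_wear:.2%}")
```

Even two years into the mission, the (assumed) defective subpopulation accounts for essentially all observed failures. The components really do wear out – and it really does not matter.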
Other small satellite analyses make equally arbitrary and wrong assumptions. There is a tendency to assume that COTS components have a constant hazard rate. This is mathematically easy, but scientifically lazy. These analyses also forget about common cause failures. For example, if we use two capacitors for redundancy but purchase them from the same supplier, there is a heightened chance that the manufacturing defects in one are identical to the manufacturing defects in the other. When conditions that expose the defect in the first component cause it to fail, the second, redundant component is also likely to fail. Redundancy is lost.
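A minimal sketch of that arithmetic, using the simple beta-factor common cause model. The single-part failure probability and the beta value are illustrative assumptions, not data for any real capacitor.

```python
# A minimal sketch of how common cause failure erodes redundancy, using the
# simple beta-factor model. All numbers are illustrative assumptions.

P_FAIL = 0.02        # assumed probability a single capacitor fails during the mission
BETA_CCF = 0.10      # assumed fraction of failures that are common cause (shared defect)

# Truly independent redundancy: both parts must fail for the function to be lost.
p_loss_independent = P_FAIL ** 2

# Beta-factor model: a share of the failure probability takes out both parts at once.
p_independent_part = (1 - BETA_CCF) * P_FAIL
p_loss_with_ccf = BETA_CCF * P_FAIL + p_independent_part ** 2

print(f"Loss of function, truly independent pair: {p_loss_independent:.2%}")
print(f"Loss of function, same-supplier pair (beta-factor): {p_loss_with_ccf:.2%}")
```

Under these assumed numbers the ‘redundant’ pair is roughly six times more likely to lose its function than the independent calculation suggests – which is the point of the capacitor example above.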
Common cause failure has repeatedly defeated seemingly robust design processes. United Airlines Flight 232 crash-landed after an engine fan disk fractured, severing not one but all three hydraulic lines running through the tail of the aircraft. A contributing reason was that all hydraulic control lines had to run very close to each other due to the design of the aircraft. A triple redundant system failed due to a common cause. It was never truly ‘triple redundant.’
NASA broadly encouraged its satellite designers to embrace a ‘faster, better, cheaper’ ethos from the 1990s. The problem was that as schedules and budgets decreased, more satellites failed. A ‘complexity’ index correlating the ‘scope of design,’ cost and schedule for NASA’s systems was created in 2003. And of course, those satellites with the highest complexity but ‘beneath’ the trendline for cost and schedule tended to fail more often.
So why do we focus on things we want to – not things we need to?
There is no ‘scientific’ analysis that can answer this question unequivocally. This is a psychological question that invites theories – not proofs. The human mind is too difficult to model deterministically. But I think it has something to do with this:
The Maginot Mentality.
Let me explain. The ‘Maginot Line’ was a line of heavily fortified obstacles and weapon installations built on the eastern side of France throughout the 1930s to deter a German invasion. It was the idea of then French Minister of War André Maginot. The Maginot Line stopped in the north, where the French assumed that the swampy lowlands and forests would prevent a German invasion. And of course, there was Belgium even further to the north, into which the French clearly did not have the authority to extend the Maginot Line.
So on the 10th of May 1940, the German forces promptly invaded through Belgium, the Ardennes Forest and other lowlands using Panzer tanks that had little problem maneuvering through this supposedly inhospitable terrain. And the Luftwaffe flew over the Maginot Line. Paris fell just over one month later.
Opinion is divided on whether the Maginot Line of itself was a failure or a success. Those who argue that its only mission was to deny southern invasion routes to the German Army describe it as a success. Those who argue that its mission was to assist in preventing invasion regardless of direction characterize it as a failure. I suggest to you that:
the mission always reflects the highest level of endstate. By this measure, the Maginot Line was a failure.
Who cares if it stopped invasions from the south if you were invaded nonetheless? The Maginot Line was extraordinarily expensive and diverted resources away from other initiatives.
The reason the French failed to defend their country was a lack of strategy, doctrine, command and control – along with the slow nature of their response and many other failings that emanated directly from the decision making of the French military commanders. None of these things were seriously focused on while all defensive effort went into the Maginot Line. They didn’t think they needed to. Or perhaps they built the Maginot Line because they didn’t want to really think about what they were doing (… any head nodding?)
Those who call the Maginot Line a success are like those designers who only focus on the wear-out failure of individual satellite components when the system is collectively telling you to do something else. It appears as if all the sub-system and component level testing aimed at component degradation has eliminated wear-out as a driver of satellite failure. So things don’t wear out, in the same way that the Germans did not invade France from the southern approaches. But this misses the reality that satellite failure is now dominated by other types of failure that remain untouched.
And what of the ‘Maginot Mentality?’ This refers primarily to the false sense of security that French officials and the wider public felt while the Maginot Line was being constructed. The media exaggerated its descriptions, and the sheer amount of resources being dedicated to it became a proxy for a sense of impenetrability. The Maginot Mentality should conceptually extend to describe the scenario where decision makers execute ‘easy’ things well (albeit with large amounts of resources expended) but do not focus on ‘harder’ factors such as strategy (that is – critical thinking).
The Maginot Mentality is alive and well in many industries. The human tendency to ‘repeat what we know’ and confuse effort for achievement is clear to see.
An almost siren-like attraction to modelling things that wear out in satellite reliability analysis cannot be explained by anything other than the ‘sense of security’ we get when we do ‘something we are comfortable with.’ The inexplicable assumption of a Weibull shape parameter of 1.6 to 1.7 has no other explanation. And even in studies that identify predominantly wear-in or infant mortality trends in satellites, authors almost exclusively hypothesize that degradative or Region III failure mechanisms are likely to blame. One suggested that fatigue (a well-known wear-out failure mechanism) was contributing to infant mortality – which is not possible. And even when manufacturing is discussed in the context of ‘workmanship’ faults, the conclusion to effectively ignore relevant standards is based on a baffling linkage with wear-out failure mechanisms that manifest themselves over time. For example:
… full application of these space flight electronic hardware workmanship standards may not be entirely necessary for short duration missions.
This does not make sense! The only things that should be focused on for short duration missions are the failures that are likely to occur initially – such as workmanship and manufacturing related failure mechanisms, which relate to virtually every satellite failure observed to date.
If you do what you have always done, you get what you have always got
An influential set of design rules for highly reliable spacecraft was put forward in the early 1990s. These rules still guide many satellite design decisions today, regardless of the context. And the early 1990s may as well be ‘eons’ ago when it comes to relevance for satellite technology.
These rules were put forward by NASA’s Jet Propulsion Laboratory with many other industry experts. But this group based their experience on the Voyager probes, and the rules were compiled with a specific activity in mind: the Cassini-Huygens project. Both Voyager probes are still functioning more than 40 years after launch. The Cassini spacecraft was launched in 1997 and finally crashed into Saturn’s atmosphere this year. Their focus was on spacecraft designed to last decades, not years (or even months).
The standards that have followed have done very little to advance meaningful quality assurance and RAM improvement. International Standards Organization (ISO) Standard 1770: Space Systems – Cube Satellites (CubeSats) makes simple suggestions that things like random vibration, thermal/vacuum bakeout and shock testing, along with visual inspection, be performed. There are no performance levels that need to be met, with the standard instead referencing customer sponsored verification (which is of little utility).
Let’s get to some specific examples
So let’s look at all the satellite design rules that focus on wear-out only. One rule involves ensuring that semiconductor junction temperatures are kept below 60 degrees Celsius, as it is known that temperature accelerates electronic degradation failure mechanisms (such as dendritic growth or diffusion). Another deals with accelerated life testing (ALT), introduced in response to the Voyager 2 azimuth drive scan actuator seizing prematurely due to degradation in its lubrication. This failure was replicated on Earth, highlighting the relevance of ALT (noting that the project was able to reconfigure Voyager 2 to continue flying). One problem with ALT is that you need to know all relevant failure mechanisms – a generally insurmountable challenge for spacecraft that are developmental and/or more complicated than Voyager 2. And ALT only applies to wear-out failure mechanisms based on degradation or the accumulation of damage. And again, Voyager 2 is still going strong.
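For reference, the temperature dependence that both the junction temperature rule and ALT lean on is usually expressed through the Arrhenius model. A minimal sketch, assuming an illustrative activation energy of 0.7 eV (the temperatures and activation energy are assumptions for the example, not values from the rules themselves):

```python
# A minimal sketch of the Arrhenius acceleration model that underpins
# temperature-accelerated life testing (ALT) of electronics. The 0.7 eV
# activation energy and the temperatures are illustrative assumptions.
from math import exp

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between use and stress junction temperatures."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# How much faster does an assumed diffusion-driven mechanism progress in a
# 100 C test versus the 60 C junction temperature limit mentioned above?
print(f"Acceleration factor: {arrhenius_af(0.7, 60.0, 100.0):.1f}x")
```

Note that the whole calculation presumes a degradation mechanism that accumulates with time and temperature – which is exactly why ALT tells you nothing about DOA or infant mortality failures.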
Other rules deal with adhesive joints and thin membranes that degrade due to ultraviolet light exposure, thermal cycling or corrosion from oxidizing propellant. Another, broader rule applies to thermal and mechanical cycling, where a 100 per cent design margin is suggested to allow for any related failure mechanism (such as fatigue cracking). The force and torque margins rule is also aimed specifically at accommodating degradation or changes in friction characteristics over time. As is the power-on vibration testing rule.
Some of the rules deal with Region I or wear-in failure mechanisms – albeit in incomplete ways. One requires Class ‘S’ (for space) electronic components to be used. But there is a problem with the underlying logic. The rules workshop discussed above concluded that:
low failure rates are guaranteed by virtue of required vendor test experience on parts fabricated on production lines with certified control processes.
Do you see the problem? You can never make such a guarantee. And if vendors were so experienced in the early 1990s that using their components ‘guaranteed’ low failure rates, why are we seeing anything in the world today fail?
There are hundreds of historical examples of manufacturers lazily assuming that vendors are ‘good,’ placing the responsibility for reliability on them without any sort of assurance framework. And the results can be disastrous.
Contemporary small satellite manufacturers are continually faced with having to use components that are not Class ‘S’ anyway. With feature size decreasing and the commercial focus on short life consumer electronics, these sorts of ‘ideal’ components are simply not available for space applications.
Another rule deals with parts burn-in. But this requires vendors who are willing to provide historical or test data. These vendors are few and far between. But at least it deals with infant mortality, even if it’s a superficial thing to say.
Qualification and Certification versus Critical Thinking and an ‘Assurance Culture’
By now, it should be apparent that producing a checklist of things that must be done versus things a system must achieve introduces a culture of compliance. And before we infer that this is a criticism of designer or manufacturer motivations, a culture of compliance is often imposed by the customer – whether this is realized or not.
Of the spacecraft design rules above, many promote qualification testing. That is, testing to pass, but not to learn. The design and test temperature levels rule prescribes temperature limits for testing based on things like internal heat generation. While thermal analysis is a good thing, the concept of qualification stops critical thought directed toward understanding how the satellite can fail. If the aim of qualification testing is to not fail, then passing the test means we learn nothing. And this rule references acceptance testing – which, to the extent it involves a quantitative assessment of reliability for the entire system, you cannot do even if you assume Earthbound but representative conditions can be created.
Acceptance testing involves the actual system being deployed in as representative an environment as feasible. This is not possible for small satellites, for both physical and financial reasons. This does not mean that qualification or system level integration testing cannot occur (they can). But as the customer has already paid for the satellite’s development, they are essentially bound to proceed with its use at the end of the design, meaning the decision to ‘accept’ was made when the contract was signed before design commenced. Further, as acceptance testing is necessarily quantitative, you will typically need multiple satellites to obtain any degree of statistical confidence. This (beyond the physical impossibility of successfully mimicking launch, deployment and outer space on Earth) precludes traditional acceptance testing from being applied to small satellites.
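To see why multiple satellites are needed, here is a minimal sketch of the standard zero-failure (success run) demonstration calculation. The reliability and confidence targets are illustrative assumptions:

```python
# A minimal sketch of why quantitative acceptance testing needs multiple units.
# Uses the standard zero-failure (success run) demonstration formula; the
# reliability and confidence targets below are illustrative assumptions.
from math import ceil, log

def units_required(reliability, confidence):
    """Units that must complete a representative test with zero failures
    to demonstrate the given reliability at the given confidence."""
    return ceil(log(1.0 - confidence) / log(reliability))

for r, c in ((0.90, 0.80), (0.90, 0.90), (0.95, 0.90)):
    print(f"Demonstrate R={r:.2f} at {c:.0%} confidence: {units_required(r, c)} satellites")
```

Even modest targets demand tens of failure-free units – a non-starter for a program building one or two small satellites.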
While there are ‘acceptance’ tests per se in terms of things the satellite must do before the customer ‘owns’ the system, these are not true acceptance tests.
The focus on qualification and compliance is reinforced by the worst-case analysis rule which defines ‘worst case’ as temperatures 10 degrees Celsius beyond those determined by the design and test temperature levels. This adds an arbitrary margin to what could already be arbitrary temperature limits. So how does a satellite designer ever find out which part of the design is the most susceptible to temperature related failure?
Resilience
Moving beyond focusing on things that wear out and a basic ‘compliance’ approach to assurance, what is left? Two of the more useful rules revolve around graceful degradation and adaptive mission strategies. If a sensor degrades in a known way, we can account for the faulty signal it produces from the comfort of an Earthbound operations room. This and other things allow for ongoing strategy decisions. The Voyager probes were launched as a pair to allow one to achieve the mission objectives of the other in the event of a failure. Voyager 2 was also able to have its memory reset to accommodate a ‘flipped bit.’
But this focus on resilience is not nearly enough for small satellites. Apart from supporting a ‘constellation-wide’ view of mission assurance, it does little to provide guidance for designing a resilient system (more on that later).
So what is left?
The electromagnetic interference rule is more about function than failure. And the only other rule that is left is also the only rule that touches on Region I or ‘wear-in’ failures beyond hoping that vendors burn in components. The electronic hardware cleaning rule has obvious merit, but with this being the only thing that starts to address manufacturing defects (let alone design flaws), it is unsurprising that we see the reliability characteristics that we currently do.
And what is missing?
A lot.
For example, system level integration testing is perhaps the most crucial form of testing that one can undertake for small satellites once we realize that DOA failures are the most likely. Having a satellite that cannot transmit signals because one module cannot talk to another is much worse than a thermal dissipation failure – in the latter case, the satellite would at least work for a little while. And other very important things, like managing electrostatic discharge (ESD) during manufacture, are not touched upon. Today, managing ESD is a key part of quality manufacturing regimes.
Some promising research in terms of managing functional level testing has been undertaken, leveraging principles of reliability growth – but in a different context to the one described above. The underlying reliability growth model in these analyses was applied to functional testing of a small satellite component. The model was suggested primarily as a decision-making tool: it would help testers decide at what point ongoing testing was no longer worthwhile due to the time it would take to uncover further failures.
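As a rough illustration of how such a stopping rule works, here is a minimal sketch using a power-law (Duane/Crow-AMSAA style) growth curve. The parameter values and the accumulated test time are illustrative assumptions, not results from the research referred to above.

```python
# A minimal sketch of a power-law (Duane/Crow-AMSAA style) reliability growth
# calculation used as a stopping rule for functional testing. The lambda and
# beta values and the accumulated test time are illustrative assumptions.

LAMBDA = 0.8      # assumed scale of the power-law cumulative failure model
BETA = 0.55       # assumed growth exponent (< 1 means failures are getting rarer)
T_SO_FAR = 400.0  # assumed accumulated functional test hours

def expected_failures(t_hours):
    """Expected cumulative failures by time t under N(t) = lambda * t^beta."""
    return LAMBDA * t_hours ** BETA

def hours_to_next_failure(t_now):
    """Extra test hours expected before one more failure is uncovered."""
    n_now = expected_failures(t_now)
    t_next = ((n_now + 1.0) / LAMBDA) ** (1.0 / BETA)
    return t_next - t_now

extra = hours_to_next_failure(T_SO_FAR)
print(f"Roughly {extra:.0f} more test hours expected to expose the next failure")
```

When the expected hours per additional failure exceed the schedule (or budget) available, the model says stop – which is all it should be asked to say.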
These models can’t ever be relied upon to predict actual reliability. The underlying principles upon which the reliability growth model is based have not been examined in detail – and are ‘generic at best.’ It would be difficult to correlate controlled series of inputs with ‘operational conditions’ in a meaningful way.
Other basic things that can be done include derating. Derating involves using higher rated electrical components, which will tend to have a more robust design. Even contemporary satellite designers still assume that there is little benefit in challenging manufacturer stated operating margins. Experienced reliability practitioners can anecdotally attest to the value of derating. It is perhaps because the benefits of derating, while significant, remain anecdotal that they are not taken more seriously in a formalized way. It should also be remembered that NASA’s derating guidelines (based on its spaceborne experience) are still successfully employed by many organizations today. Vendor electronic component testing has routinely failed to provide the level of assurance one would ordinarily expect, due to many things, including optimistic assumptions about the degradative nature of component failure. A lack of derating eliminates an electronic component’s ability to withstand voltage spikes and other phenomena that may emanate from electromagnetic interference – a significant issue in space.
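As a trivial illustration of what a formalized derating check looks like, here is a minimal sketch. The 50 per cent guideline and the parts are hypothetical – real derating tables (such as NASA’s) specify limits per part type and application.

```python
# A minimal sketch of a voltage derating check. The 50 per cent guideline and
# the part values below are illustrative assumptions, not figures from any
# published derating table.

DERATING_FACTOR = 0.50   # assumed guideline: apply no more than 50% of rated voltage

parts = [
    {"ref": "C12", "rated_v": 16.0, "applied_v": 12.0},   # hypothetical parts list
    {"ref": "C13", "rated_v": 25.0, "applied_v": 12.0},
]

for part in parts:
    ratio = part["applied_v"] / part["rated_v"]
    verdict = "OK" if ratio <= DERATING_FACTOR else "exceeds derating guideline"
    print(f"{part['ref']}: stress ratio {ratio:.0%} -> {verdict}")
```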
Which brings us to the problem of using COTS electronic components. NASA’s EEE-INST-002: Instructions for EEE Parts Selection, Screening, Qualification, and Derating is often used for space applications. But it is often described as impractical for short schedule, low budget small satellite design processes. Yet screening of COTS electronic components is vital, as we know that manufacturing defects contribute to infant mortality.
And the key takeaway is …
The satellite industry (like many others) is talking about the wrong things when it comes to reliability. We have gone through rule after rule in this article that does nothing to address the failures we are observing today. That is, nothing is seriously being done to address ‘Region 0’ (or DOA) and ‘Region I’ (or infant mortality) failures. The Maginot Mentality is alive and well. We are still building our defenses along the southern approach while we are being invaded from the north. For whatever reason, precious few realize that Paris has already fallen. Perhaps we feel good about doing something, even if it doesn’t matter.
And for an emerging technology such as small satellites, the ramifications of focusing on the things others want you to focus on, instead of the things you need to, are disastrous. Small satellites have very short operating lives, which means they are even more prone to ‘Region 0 and I’ failures. So the burden of focusing on wear-out is suffocating, potentially destroying business models.
But what do we need to do about it? You could say right now, the satellite industry needs to update its rules to focus on ‘Region 0 and I’ failures. And this would have an immediate benefit. But there remains a problem. It will be only a matter of time until a focus on ‘Region 0 and I’ failures has reduced them to the extent that they are not key reliability drivers. Wear-out failure mechanisms will come to dominate again. And then we need to switch back.
So the answer is something different and higher level. It is an approach, not a static set of rules. It is something that not only allows change, it makes it happen. And that is the topic of the next article.