What Should We Do When Something Fails?

Guest post by Fred Schenkelberg, Reliability Expert for FMS Reliability

A natural question to ask when something fails is “Why did it fail?”

The answer is not always obvious or easy to sort out. Some failures result from design errors, others are related to supply chain and assembly issues, and yet others occur because of seemingly random events (accidents, lightning strikes, etc.). As a reliability engineer, my concern is not simply accounting for end-of-life wear out; it is about meeting the operation’s reliability expectations. By considering the full range of possible sources, from design through failure analysis, I can identify and attend to the root causes that matter.

Consider a circuit board that has a small burn mark where a component exploded off the board. The customer failed to spot the missing part but noticed that certain features were no longer available. The box went dark and no longer powered up. It was dead, so the customer returned it. That is the failure mode – the loss of a feature or function. This is what the customer notices.

The engineer then has to investigate the root cause and identify the failure mechanism.

Failure Mechanisms and Root Cause
Failure mechanisms are the material or software code faults that lead to failure. They include thin insulation leading to dielectric breakdown, contamination leading to corrosion, or faulty code leading to an over-voltage command. Becoming aware of a product failure and starting to determine why it failed is an exploratory process.

Clues about when the failure occurred can help frame the initial investigation.

To answer the “Why did it fail?” question in a useful manner we need to determine the sequence of events that led to the failure. Root cause analysis is a process to determine this chain of events. The cause may be faulty material or assembly, damage, or design error. It may also include poor decisions and human error. Generally, we look for the physical or chemical reason for the failure. However, we should also explore the design, assembly, supply chain, and customer-related processes to ascertain where an error or weakness in the process could have contributed to the failure.

The point of seeking out root causes and determining failure mechanisms is to fix or mitigate the weak elements of the product whose failure would cause the product itself to fail.

Types of Failures and Timing
Products fail for many reasons via many mechanisms. Most products have literally hundreds of ways in which they can fail. It is really a race between different mechanisms all vying to cause the failure. Eventually, everything will fail.
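The "race" between mechanisms can be sketched as a small simulation. This is only an illustration under stated assumptions: each mechanism is modeled with an independent Weibull time-to-failure, and the mechanism names and parameter values below are made up for the example, not taken from any real product.

```python
import random

# Hypothetical mechanisms: name -> (scale in hours, shape).
# All numbers are illustrative assumptions, not measured data.
MECHANISMS = {
    "dielectric_breakdown": (8000.0, 0.7),   # shape < 1: tends to strike early
    "random_overstress": (20000.0, 1.0),     # shape = 1: constant rate
    "corrosion": (12000.0, 3.0),             # shape > 1: wear-out behavior
}

def simulate_first_failure(mechanisms, rng):
    """Draw one failure time per mechanism; the earliest one wins the race."""
    draws = {
        name: rng.weibullvariate(scale, shape)
        for name, (scale, shape) in mechanisms.items()
    }
    winner = min(draws, key=draws.get)
    return winner, draws[winner]

def tally_winners(mechanisms, n_units, seed=0):
    """Count how often each mechanism causes the failure across n_units."""
    rng = random.Random(seed)
    counts = {name: 0 for name in mechanisms}
    for _ in range(n_units):
        winner, _time = simulate_first_failure(mechanisms, rng)
        counts[winner] += 1
    return counts

if __name__ == "__main__":
    print(tally_winners(MECHANISMS, n_units=1000))
```

Each simulated unit fails exactly once, from whichever mechanism happens to arrive first, so the tally shows which mechanisms dominate under the assumed parameters.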

One of the first steps in sorting out the specific cause is determining when the product failed. How old was the product when it failed? Early-life failures (e.g., when a product has just been bought and installed) tend to cause more customer anguish than the failure of a product that has provided a long life of useful service. In general, we talk about three periods of failure:
• early life failures
• random failures
• wear-out failures

The three periods are often depicted with a curve shaped like a bathtub. The bathtub curve is the aggregate of many potential failures. Some tend to occur early, whereas some occur later. Each individual product has many possible ways in which it can fail, and the most likely failure mechanisms may change over time as the product use and conditions change. Keep in mind that the curve is a fiction: it explains a hypothetical profile of the possibility of failure over time for a single item.
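One common way to picture this aggregate is to add up the hazard (instantaneous failure) rates of three Weibull distributions: a shape parameter below 1 gives a falling rate, a shape of exactly 1 gives a constant rate, and a shape above 1 gives a rising rate. The sketch below assumes that model; the function names and every parameter value are hypothetical, chosen only to produce a bathtub-like shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    """Sum of three illustrative hazard contributions at age t (hours, t > 0)."""
    early_life = weibull_hazard(t, shape=0.5, scale=2000.0)  # decreasing with age
    random_events = weibull_hazard(t, shape=1.0, scale=1000.0)  # constant
    wear_out = weibull_hazard(t, shape=4.0, scale=1500.0)  # increasing with age
    return early_life + random_events + wear_out

if __name__ == "__main__":
    # Print the combined rate at a few ages to see the high-low-high pattern.
    for age in (10, 100, 500, 1500, 3000):
        print(age, bathtub_hazard(age))
```

With these assumed values the combined rate starts high, drops through the middle of life, and climbs again as the wear-out term takes over. Real products rarely follow such a tidy profile, which is consistent with treating the curve as an explanatory device rather than a prediction.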

Each period of failure also suggests a set of possible causes. Although this set is not always accurate, it provides a good starting place when looking for the root cause.

