Lead Author: Alexandra Loumidis Co-author(s): Todd Paulos, todd.paulos@jpl.nasa.gov
Andrew Ho, andrew.h.ho@jpl.nasa.gov
Douglas Sheldon, douglas.j.sheldon@jpl.nasa.gov
Markov Modeling of Redundant System-on-Chip (SoC) Systems
The increasing cost and decreasing availability of space-rated
custom System on Chip (SoC) components have led to interest in
using commercial components from terrestrial industries in space
environments. Along with this interest comes a need to
understand how the reliability of the chips, including common
cause upsets, can affect the probability of mission success and
the level of mission risk.
This project modeled the failure and recovery of a system
consisting of two Qualcomm Snapdragon processors with five
upset types each. Four Markov models were created, modeling both
recoverable and non-recoverable systems. Models 1 through 3
assume the system is recoverable, while Model 4 accounts for
non-recoverable upsets. Model 1 assumes the rate of
recovering two upset items is the same as the rate of recovering one
upset item. Model 2 assumes that items recover one at a time at two
different recovery rates. Model 3 assumes that the boot-up time of a
second processor is greater than the recovery time for a single
processor.
MATLAB scripts were produced to plot the availability of each model
over time. The three recoverable models achieved availability
greater than 0.970 after 10^6 seconds, while the non-recoverable
model achieved an availability of 0.344 over the same period.
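To illustrate the kind of model the abstract describes, the sketch below builds a three-state continuous-time Markov chain under a Model-1-style assumption (a single shared recovery rate) and evaluates availability via a matrix exponential. The upset and recovery rates are placeholders, not the rates used in the study, and the script is written in Python rather than the MATLAB used by the authors.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative rates only -- not the rates used in the study.
lam = 1e-5   # per-processor upset rate (1/s)
mu = 1e-3    # recovery rate (1/s)

# States: 0 = both processors up, 1 = one upset, 2 = both upset.
# Model-1-style assumption: recovery runs at the same rate mu whether
# one or two items are upset (state 2 recovers directly to state 0).
Q = np.array([
    [-2 * lam,      2 * lam,  0.0],
    [      mu, -(mu + lam),   lam],
    [      mu,          0.0,  -mu],
])

def availability(t):
    """P(at least one processor up) at time t, starting from both up."""
    p = np.array([1.0, 0.0, 0.0]) @ expm(Q * t)
    return p[0] + p[1]

for t in (1e3, 1e4, 1e6):
    print(f"t = {t:.0e} s  A(t) = {availability(t):.4f}")
```

With these placeholder rates the chain settles near its steady-state availability well before 10^6 seconds; the actual curves in the paper depend on the Snapdragon upset and recovery data.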
This research was carried out at the Jet Propulsion Laboratory,
California Institute of Technology, under a contract with the
National Aeronautics and Space Administration.
Keywords: Markov, redundant, Snapdragon.
Bio: Alexandra has a degree in Engineering from Harvey Mudd College. Last year, she worked as a summer intern at the NASA Jet Propulsion Laboratory in the system reliability group, working on Markov modeling of redundant systems and on creating a software conversion tool for the probabilistic risk assessment software SAPHIRE.
Country: USA Company: Harvey Mudd College Job Title: Student Researcher
Paper 2 TP76
Lead Author: Todd Paulos Co-author(s): Andrew Ho, andrew.h.ho@gmail.com
Curtis Smith curtis.smith@inl.gov
Reliability Modeling of Complex Components Using Simulation
This paper is a continuation of papers presented at the 15th and 16th Probabilistic Safety Assessment and Management Conferences, which discussed modeling the failure modes of complex components and the effects of censoring bias. The first paper demonstrated how the typical method of treating failure modes as exponential in nature yields optimistic results when predicting how improvements to subcomponents will perform in the real world. Rather than relying on traditional analytical methods, it is more accurate to model the failure modes as a race in time; unfortunately, this does not give a closed-form solution. A simulation solution was presented that demonstrated the optimism of the classical techniques. The second paper demonstrated the effect of censoring bias when dealing with large amounts of success-only testing at the failure-mode level. Again, the censoring bias also contributed to optimistic results.
In our quest for closed-form solutions and simplicity, the world of reliability engineering relies on everything behaving like an exponential. This makes the solution closed-form in most cases, and easier to solve, but unfortunately leads to incorrect results when making real-world decisions on improvements at the failure-mode level. An excellent real-world argument against using exponential distributions in this context is the common automobile. No one expects a new car to have the same failure intensity as an older car; this is one reason people purchase newer cars: they want a car with fewer problems. Similar examples can be found in every industry, yet it is still common to treat failure modes as exponential variables for ease of finding a solution.
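The race-in-time idea can be shown with a toy Monte Carlo; the distributions and parameters below are invented for illustration and are not the data from the earlier papers. Two failure modes compete, one wear-out (Weibull) and one memoryless (exponential), and the system fails at whichever occurs first; the "everything is exponential" shortcut replaces each mode with an exponential of the same mean and sums the rates.

```python
import math
import random

random.seed(42)

N = 100_000

# Invented parameters for illustration only.
WEIBULL_SCALE, WEIBULL_SHAPE = 1000.0, 2.0   # wear-out mode
EXP_MEAN = 2000.0                            # random-shock mode

def draw_system_life():
    """One trial of the race in time: the first mode to occur fails the system."""
    t_wear = random.weibullvariate(WEIBULL_SCALE, WEIBULL_SHAPE)
    t_shock = random.expovariate(1.0 / EXP_MEAN)
    return min(t_wear, t_shock)

sim_mttf = sum(draw_system_life() for _ in range(N)) / N

# The exponential shortcut: replace each mode with an exponential of
# the same mean, then sum the rates.
weibull_mean = WEIBULL_SCALE * math.gamma(1.0 + 1.0 / WEIBULL_SHAPE)
exp_mttf = 1.0 / (1.0 / weibull_mean + 1.0 / EXP_MEAN)

print(f"simulated MTTF:       {sim_mttf:7.1f}")
print(f"exponential shortcut: {exp_mttf:7.1f}")
```

The two estimates disagree noticeably even in this small example, which is the point the papers make in arguing for simulation over the exponential shortcut.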
In this paper, a simple system with complex components having different failure modes will be analyzed using a standard fault tree and data assessment approach and then compared to the simulation of complex assemblies discussed in the previous two papers. A discussion of the results will follow.
Bio: Dr. Todd Paulos has been involved with the IAPSAM organization since its inception and is a current Board Member and Treasurer for IAPSAM. He is the Technical Program Chair of this conference, was the General Chair of PSAM 12, and has served on many Technical Program Committees for PSAM conferences over the years. Dr. Paulos has almost 30 years of experience in the aviation and space industries and is currently the PRA Subject Matter Expert at NASA’s Jet Propulsion Laboratory, overseeing the PRA efforts on all JPL programs. He is one of the prime authors of NASA’s PRA Procedures Guide and of NASA’s soon-to-be-released updated nuclear launch guidelines.
Dr. Paulos has a degree in Engineering from Harvey Mudd College, and a Master’s and Doctorate in Mechanical Engineering from UCLA.
Country: USA Company: NASA Jet Propulsion Laboratory Job Title: System Reliability Engineer
Paper 3 CO73
Lead Author: Courtney Otani Co-author(s): Mihai Diaconeasa, madiacon@ncsu.edu;
Steven Prescott, steven.prescott@inl.gov;
Arjun Earthperson, aarjun@ncsu.edu;
Robby Christian, Robby.Christian@inl.gov
Probabilistic Methods for Cyclical and Coupled Systems with Changing Failure Rates
Advancements in the design of nuclear systems with automated control features have led to increasingly complex coupled systems and dynamic failure scenarios. This is especially true for microreactor designs, where components are not expected to be replaced during the reactor's lifetime, so the life of the system must be evaluated in addition to its safety. Modeling these sequences of time-dependent events requires addressing cyclical processes and changing failure rates in ways that represent the true dynamics of the system, in contrast to a single sampling of a component's time to failure. This research presents two distinct analytical methods, applicable to several failure distributions, that evaluate a final time to failure for scenarios where the time to failure must be sampled multiple times. The first method applies when a component's failure rate increases due to an outside event occurring after the initial sampling but before the initially sampled time to failure. The second method applies when evaluating multiple identical components, or a component that has been replaced with a new identical version before the second sampling. The two methods were implemented in a few representative case studies developed in the dynamic probabilistic risk assessment (PRA) tool Event Modeling Risk Assessment using Linked Diagrams (EMRALD). Overall, this paper provides guidelines on how these approaches could be considered for more realistic and accurate dynamic PRA of complex systems.
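As a hypothetical sketch of the first kind of method (the abstract's actual formulations may differ), the snippet below redraws a component's time to failure from a harsher Weibull distribution after a mid-life stress event, conditioned on survival up to that event, using inverse-transform sampling on the conditional survival function. All distributions and parameters are illustrative.

```python
import math
import random

random.seed(7)

def weibull_cond(scale, shape, t_elapsed):
    """Inverse-transform sample of a Weibull failure time conditioned
    on survival up to t_elapsed: solve S(t)/S(t_elapsed) = u for t."""
    u = random.random()
    return scale * ((t_elapsed / scale) ** shape - math.log(u)) ** (1.0 / shape)

def time_to_failure(scale_a, shape_a, scale_b, shape_b, t_event):
    """Hypothetical 'method 1'-style resampling: draw an initial failure
    time; if an outside event at t_event degrades the component before
    it fails, redraw from the harsher distribution (scale_b, shape_b),
    conditioned on having survived to t_event."""
    t_initial = random.weibullvariate(scale_a, shape_a)
    if t_initial <= t_event:
        return t_initial             # failed before the stress event
    return weibull_cond(scale_b, shape_b, t_event)

N = 50_000
lives = [time_to_failure(1000.0, 2.0, 400.0, 2.0, 300.0) for _ in range(N)]
print(f"mean life with mid-life degradation: {sum(lives) / N:.1f}")
```

The conditional redraw guarantees the resampled failure time never precedes the stress event, which is what distinguishes this from naively sampling the harsher distribution from time zero.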
Bio: Courtney Otani (she/her) earned a Master of Science in mechanical engineering from the University of Washington and a Bachelor of Science in mechanical engineering from the University of Portland. While at the University of Washington, she was awarded a Clean Energy Institute graduate fellowship in 2019 to study the mixing properties of organic solvents and supercritical carbon dioxide for the purpose of metal organic framework synthesis. Currently she is a risk and reliability engineer/scientist for the Reliability, Risk, and Resilience Sciences department at Idaho National Laboratory. She also works on projects spearheaded in the Hydrogen and Thermal Systems department and the Infrastructure Security department. Her current work includes probabilistic risk assessment for novel nuclear power systems and thermal hydraulic system modeling.
Country: --- Company: Idaho National Laboratory Job Title: PRA Engineer/Scientist
Paper 4 FF206
Lead Author: Fernando Ferrante Co-author(s): Ken Kiper, kiperkl@westinghouse.com
Carroll Trull, trullca@westinghouse.com
Matt Degonish, degonimm@westinghouse.com
Development of Good Practices in the Implementation of Common Cause Failure in PRA Models
Modeling Common Cause Failure (CCF) quantitatively in Probabilistic Risk Assessment (PRA) models using parametric approaches has become significantly complex and challenging for risk-informed decision-making (RIDM) purposes, as the state-of-practice now includes an extensive consideration of CCF modeling. Different approaches are needed for modeling CCF in PRA models, depending on the specialized topic (e.g., support system initiating event, inter-system, functional dependency).
How to model dependencies appropriately is a critical aspect, as different topics may be better addressed via different solutions (e.g., a CCF basic event derived via parametric modeling versus direct inclusion of dependencies in the PRA logic structure). For example, the distinction between “inter-system” and “intra-system” CCF is artificial and potentially misleading, as different types of dependencies can be misinterpreted under each term. A better approach to distinguishing how dependencies and CCF need to be handled, regardless of such artificial definitions, would better serve the PRA community. In addition, risk-informed applications that impose a specific condition on the baseline PRA model (such as the consideration of changes in CCF basic events due to a failure or degradation of a component in an individual CCF group) can raise the impact of CCF in RIDM (including that of lower contributors in baseline PRA models). The current approach of using CCF parameters reflected in baseline CCF modeling may not be completely appropriate for such applications, to the extent that better approaches may be needed.
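For readers less familiar with the parametric modeling mentioned above, one common state-of-practice approach is the alpha-factor model; the sketch below computes per-basic-event CCF probabilities for a redundancy-3 group under the non-staggered-testing formula, with made-up alpha factors. This is generic background on the standard method, not the survey data or proposals of this paper.

```python
from math import comb

def alpha_factor_probs(alphas, q_t):
    """Per-event CCF probabilities Q_k for a common cause group of size
    m = len(alphas) under the alpha-factor model (non-staggered testing):
        Q_k = k / C(m-1, k-1) * (alpha_k / alpha_t) * Q_t,
    where alpha_t = sum over k of k * alpha_k."""
    m = len(alphas)
    alpha_t = sum(k * a for k, a in enumerate(alphas, start=1))
    return [k / comb(m - 1, k - 1) * (alphas[k - 1] / alpha_t) * q_t
            for k in range(1, m + 1)]

# Made-up alpha factors for a group of three redundant components.
alphas = [0.95, 0.04, 0.01]   # alpha_1..alpha_3, summing to 1
q_t = 1e-3                    # total failure probability per component
for k, q in enumerate(alpha_factor_probs(alphas, q_t), start=1):
    print(f"Q_{k} = {q:.2e}")
```

Conditioning such a group on an already-failed member, as the risk-informed applications above require, changes these probabilities in ways the baseline parameters do not directly capture, which is part of the motivation for the good practices discussed here.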
An investigation of CCF data gathering, development of CCF input parameters, estimation of CCF probabilities, and their inclusion in PRA modeling was performed, which included a survey of a small subset of modern PRA models. The impacts and insights from CCF in base PRA models are derived from the surveys as well as from a review of the detailed PRA models themselves. This is done to canvass the state-of-practice in this area, since multiple decades have now passed since the original development of the CCF tools and methods (as will be discussed in later sections).
The overall intent is to provide a better context for understanding CCF within RIDM applications as currently implemented, not just as a separate, complex technical issue. To this end, a potential path forward is proposed: the development of a suggested good-practice framework that is anchored in current, feasible approaches while taking into account the existing landscape of PRA and RIDM. Potential solutions, which can be explored further, are suggested for areas where CCF can challenge RIDM implementation. The good practices use existing CCF data and standard CCF methods but address technical areas where the state-of-practice can benefit from additional considerations.
Bio: Fernando Ferrante is a Principal Project Manager at the Electric Power Research Institute (EPRI) in the Risk and Safety Management (RSM) group. He joined EPRI in 2017 as a Principal Technical Leader in RSM and was promoted to Principal Project Manager in March 2021, gaining responsibility for direct oversight of RSM staff involved in human reliability, fire risk assessment, and external flooding PRA, along with RIDM framework activities. Dr. Ferrante previously held positions as a risk analyst at the U.S. Nuclear Regulatory Commission and as a senior engineer at the Defense Nuclear Facilities Safety Board. He holds a Bachelor of Science degree in Mechanical Engineering from University College London, in the United Kingdom, and a Doctor of Philosophy degree in Civil Engineering from Johns Hopkins University.
Country: --- Company: Electric Power Research Institute Job Title: Program Manager