Controls Assessments
The level of risk
will depend on the adequacy and effectiveness of existing controls. When
conducting a controls assessment, questions to be addressed include:
What are the existing controls for a particular risk?
Are those controls capable of adequately treating the risk so that it is controlled to a level that is tolerable?
Are the controls operating in the manner intended, and can they be demonstrated to be effective when required (in practice)?
These questions can
only be answered with confidence if there are proper documentation and
assurance processes in place. The level of effectiveness for a particular
control, or suite of related controls, may be expressed qualitatively,
semi-quantitatively or quantitatively. In most cases, a high level of accuracy
is not warranted. However, it may be valuable to express and record a measure
of risk control effectiveness so that judgments can be made on whether effort
is best expended in improving a control or providing a different risk
treatment. The following methods are given in ISO 31010:2011 under control
assessment tools, although some authors have categorized them in different
ways in different scenarios. However, we consider the following methods to be
controls assessment techniques or tools that can be used in risk assessments.
18. Layer of Protection Analysis (LOPA)
Layer of Protection Analysis is a risk
management technique commonly used in the chemical process industry that can
provide a more detailed, semi-quantitative assessment of the risks and layers
of protection associated with hazard scenarios. LOPA allows the safety review
team an opportunity to discover weaknesses and strengths in the safety systems
used to protect employees, the plant, and the public. LOPA is a means to
identify the scenarios that present the most significant risk and determine if
the consequences could be reduced by the application of inherently safer design
principles. LOPA can also be used to identify the need for safety instrumented
systems (SIS) or other protection layers to improve process safety.
A variety of approaches are employed which
could use order-of-magnitude, half order-of-magnitude, and decimal math.
Conceptually, LOPA is used to understand how a process deviation can lead to a
hazardous consequence if not interrupted by the successful operation of a
safeguard called an independent protection layer (IPL). An IPL is a safeguard
that can prevent a scenario from propagating to a consequence of concern
without being adversely affected by either the initiating event or by the
action (or inaction) of any other protection layer in the same scenario.
The basic steps are:
A cause-consequence pair is selected and the layers of protection which prevent the cause leading to the undesired consequence are identified.
An order of magnitude calculation is then carried out to determine whether the protection is adequate to reduce risk to a tolerable level.
LOPA is a less resource-intensive process
than a fault tree analysis or a fully quantitative form of risk assessment, but
it is more rigorous than qualitative subjective judgments alone. It focuses
efforts on the most critical layers of protection, identifying operations,
systems and processes for which there are insufficient safeguards and where
failure will have serious consequences. However, this technique looks at one
cause-consequence pair and one scenario at a time and, therefore, does not
apply to complex scenarios where there are many cause-consequence pairs or
where a variety of consequences affect different stakeholders.
Normally multiple Protection Layers (PLs) are
provided in the process industry, in which each protection layer consists of a
grouping of equipment and/or administrative controls that function in concert
with the other layers. Protection layers that perform their function with a
high degree of reliability may qualify as Independent Protection Layers (IPL).
The criteria to qualify a Protection Layer (PL) as an IPL are:
The protection provided reduces the identified risk by a large amount, that is, a minimum of a 10-fold reduction.
The protective function is provided with a high degree of availability (90% or greater).
It has the following important
characteristics:
Specificity
An IPL is designed solely to prevent or to
mitigate the consequences of one potentially hazardous event (e.g., a runaway
reaction, release of toxic material, a loss of containment, or a fire).
Multiple causes may lead to the same hazardous event; and, therefore, multiple
event scenarios may initiate action of one IPL.
Independence
An IPL is independent of the other protection
layers associated with the identified danger.
Dependability
It can be counted on to do what it was
designed to do. Both random and systematic failure modes are addressed in the
design.
Auditability
It is designed to facilitate regular
validation of the protective functions. Proof testing and maintenance of the
safety system is necessary. Only those protection layers that meet the tests of
availability, specificity, independence, dependability, and auditability are
classified as Independent Protection Layers.
19. Decision Tree
Decision trees are a simple but powerful form of multiple-variable analysis.
They provide unique capabilities to supplement, complement, and substitute for
traditional statistical forms of analysis (such as multiple linear
regression), for a variety of data mining tools and techniques (such as neural
networks), and for recently developed multidimensional forms of reporting and
analysis found in the field of business intelligence. A decision tree is a
graph that uses a branching method to illustrate every possible outcome of a
decision. A decision tree builds classification or regression models in the
form of a tree structure, breaking a data set down into smaller and smaller
subsets while an associated decision tree is incrementally developed. The
final result is a tree with decision nodes and leaf nodes: a decision node
(e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), a
leaf node (e.g., Play) represents a classification or decision, and the
topmost decision node, which corresponds to the best predictor, is called the
root node.
Decision trees can
handle both categorical and numerical data. Decision trees can be drawn by hand
or created with a graphics program or specialized software. Informally,
decision trees are useful for focusing discussion when a group must make a
decision. Programmatically, they can be used to assign monetary/time or other
values to possible outcomes so that decisions can be automated. Decision
tree software is used in data mining to simplify complex
strategic challenges and evaluate the cost-effectiveness of research and
business decisions. Uncertain outcomes (chance nodes) in a decision tree are
usually represented by circles.
Drawing a Decision Tree
You start a Decision Tree with a decision that you need to make, and draw a small square to represent this towards the left of a large piece of paper.
From this box draw out lines towards the right for each possible solution, and write that solution along the line. Keep the lines as far apart as possible so that you can expand your thoughts.
At the end of each line, consider the results. If the result of taking that decision is uncertain, draw a small circle.
If the result is another decision that you need to make, draw another square. Squares represent decisions, and circles represent uncertain outcomes.
Write the decision or factor above the square or circle. If you have completed the solution at the end of the line, just leave it blank.
Starting from the new decision squares on your diagram, draw out lines representing the options that you could select. From the circles draw lines representing possible outcomes.
Again make a brief note on the line saying what it means.
Keep on doing this until you have drawn out as many of the possible outcomes and decisions as you can see leading on from the original decisions.
20. Human Reliability Analysis
Human
reliability assessment (HRA) is the common name for an assortment of methods
and models that are used to predict the occurrence of ‘human errors’. While the
origin of HRA is in Probabilistic Safety Assessment (PSA), HRA is increasingly
being used on its own both as a way to assess the risks from ‘human error’ and
as a way to reduce system vulnerability. The three principal functions of HRA
are:
Identifying what errors can occur (Human Error Identification),
Deciding how likely the errors are to occur (Human Error Quantification), and,
Enhancing human reliability by reducing the given error likelihood (Human Error Reduction), if appropriate.
Thus
anticipating failures of joint human-machine systems requires an underlying
model. This should not be a model of human information processing in disguise,
but a model of how human performance is determined by – hence reflects – the
context or circumstances, i.e., a model of joint system performance rather than
of individual human actions.
Human reliability
assessment deals with the impact of humans on system performance and can be
used to evaluate human error influences on the system. At the risk of stating
the obvious, human reliability is very important due to the contributions of
humans to the resilience of systems and to possible adverse consequences of
human errors or oversights, especially when the human is a crucial part of
today’s large socio-technical systems. A variety of methods exist for human
reliability analysis. These break down into two basic classes of assessment
methods, that are probabilistic risk assessment (PRA) and those based on a
cognitive theory of control. In 2009, the Health and Safety Laboratory
compiled a report for the Health and Safety Executive (HSE) outlining HRA
methods for review. They identified 35 tools that constituted true HRA
techniques and that could be used effectively in the context of health and
safety management.
Conducting an HRA requires people with expertise in such methods. The
following steps are used to prepare a risk model based on HRA:
1. Understand the actions being investigated
2. Use a structured approach to investigate and
represent the actions (task analysis)
3. Consider the level of detail needed (compare
with description detail of available data)
4. Understand the failure criteria
5. Select an appropriate technique(s)
6. Represent the identified errors in a risk model
Application of HRA
i). Eliminate Error Occurrence
This is the first
preference, where design features known to be a source of human error are
eliminated (e.g., lack of feedback, lack of differentiation, inconsistent or
unnatural mappings). Design choices available for error elimination include:
Replacement of error-inducing design features (e.g., physical device separation, physical guards, application of validity and range checks)
Restructuring of the task so that error-prone behaviour is no longer performed (e.g., by information filtering, only the information needed for the task is provided).
Automation to change the role of human involvement in support of task performance.
ii). Reduce Error
Occurrence
Consider this
approach, if complete error elimination is not possible or feasible through
design choices. Design features which can reduce error occurrence include:
Identification (e.g., device labeling)
Constraints (i.e., build in constraints to limit operation to acceptable ranges)
Coding (i.e., aid in choice differentiation and selection)
Consistent feedback (i.e., convey device and system state directly in the interface)
Predictability (i.e., design system responses so that operators can associate specific control actions with system response)
iii). Eliminate
Error Consequence
The third approach
is to eliminate error consequences, where there are three categories of design
features that reflect the components of the consequence prevention strategy:
A. Error detection design features (to promote detection prior to consequence occurrence):
Feedback (i.e., status information in relation to operational goals and potential side-effects of an action)
Alert of Unacceptable Device States (e.g., visual/auditory feedback of off-normal or unacceptable device states)
Confirmation (i.e., support of self-checking and independent verification practices)
Prediction (i.e., providing information on the outcome of an action prior to its implementation, or with sufficient time for correction)
B. Error recovery design features (to enable recovery prior to consequence occurrence):
Undo (e.g., facilities for reversing recent control actions to promote error recovery)
Guidance (i.e., alternative forms of guidance for cases where reversing a recent control action is not the preferred action)
C. Consequence prevention design features:
Interlocks
Margins and Delays (i.e., these features can provide more time before an unacceptable consequence is realized, thus increasing the chances of error detection and recovery prior to consequence occurrence)
Fail-Safe Features
iv). Reduce Error
Consequence
If errors and
consequences cannot be completely eliminated, consider measures that enable
consequence reduction. This may be achieved through application of additional
design features that allow operators or automation to recognize the occurrence
of an error consequence, and to take action to mitigate the consequences.
Additional design features include:
Margins (i.e., apply larger design margins to allow some consequences to be accommodated by normal system function and capacities).
Engineered Mitigating Systems (e.g., automatic special safety system actions, such as CANDU Automatic Stepback and Setback).
Human Intervention (i.e., the operations team can readily adapt to both predefined and undefined operating situations).
Response Teams (i.e., the organizational structure is prepared and coordinated to deal with the predefined consequences).
Consequence Prediction (e.g., aids can assist operations staff in predicting the extent of consequences of operating actions and assist in selection and execution of mitigating actions).
Backup Replacement Function (i.e., provision of equipment and/or human intervention to mitigate consequences).
21. Bow Tie Analysis
Bow-tie analysis is
a simple diagrammatic way to display the pathways of a risk showing a range of
possible causes and consequences. It is used in situations when a complex fault
tree analysis is not justified or to ensure that there is a barrier or control
for each of the possible failure pathways. The Bow-tie analysis starts with the
risk at the “knot” of the tie, and then moves to the left to identify and
describe the events or circumstances that may cause the risk event to occur,
paying particular attention to root causes. Once those causes have been
identified, the analysis then identifies preventive measures that could be
implemented. At this point there could be an evaluation of the actual
preventive measures that the organization has in place to determine whether
additional measures should be implemented. The analysis then moves to the right
to look at the potential consequences that would result after the risk event
happens, and the plans the organization either has or should have in place to
minimize the negative effects of the risk.
Bow-tie Diagram Construction
1. Define the Hazard and the Top Event, which is the initial consequence.
"What happens when the hazard is released?"
2. Identify the Threats, which are the causes of the Top Event.
"What causes the release of the hazard?" "How can control be lost?"
3. Identify the existing Protection Barriers for each Threat.
These prevent the occurrence of the Top Event and can be independent or dependent.
"How can the controls fail?" "How can their effectiveness be compromised?"
4. Identify the Escalation Factors for each Barrier.
These are factors that make the Barrier fail.
"How can we avoid the hazard being released?" "How can we keep control?"
5. Identify the Escalation Factor Controls for each Barrier.
These are factors that prevent or minimize the possibility of the Barrier or the Recovery Measures becoming ineffective.
"How do we ensure that the controls will not fail?"
6. Identify the Consequences.
The Top Event could have several consequences.
7. Identify the Recovery Measures.
These are measures that limit the escalation of the Top Event into its consequences.
"How can we limit the severity of the event?" "How can we minimize the effects?"
8. For each Barrier, Recovery Measure and Escalation Factor Control, identify the Critical Safety Tasks.
Critical Safety Tasks are tasks that prevent and/or minimize the possibility of the Barrier, the Escalation Factor Control or the Recovery Measures failing or becoming ineffective. They span project engineering, operation, maintenance and management.
"What tasks can be taken to ensure that the control is working?"
"How can we ensure that these tasks are done?" "Who does these tasks?"
"How do you know when to do the tasks?" "How do you know what to do?"
"Is there a procedure, checklist or instruction?"
22. Reliability Centered Maintenance (RCM)
Reliability
centered maintenance (RCM) is a corporate-level maintenance strategy that is
implemented to optimize the maintenance program of a company or facility. RCM
is the process of determining the most effective maintenance approach, where
RCM philosophy employs Preventive Maintenance (PM), Predictive Maintenance
(PdM), Real-time Monitoring (RTM), Run-to-Failure (RTF- also called reactive
maintenance) and Proactive Maintenance techniques in an integrated manner to
increase the probability that a machine or component will function in the
required manner over its design life cycle with a minimum of maintenance. The
goal of the philosophy is to provide the stated function of the facility, with
the required reliability and availability at the lowest cost. RCM requires that
maintenance decisions be based on maintenance requirements supported by sound
technical and economic justification.
RCM allows you to
identify applicable and effective preventive maintenance requirements for
equipment “…in accordance with the safety, operational and economic
consequences of identifiable failures, and the degradation mechanism
responsible for those failures”. RCM uses a failure mode, effect and
criticality analysis (FMECA) type of risk assessment that requires a specific
approach to analysis in this context. From a quality management standpoint,
it’s worth being aware that RCM identifies required functions and performance
standards and failures of equipment and components that can interrupt those
functions. The final result of an RCM program is the implementation of a
specific maintenance strategy on each of the assets of the facility.
The maintenance strategies are optimized so that the productivity of
the plant is maintained using cost-effective maintenance techniques.
There are four
principles that are critical for a reliability centered maintenance program.
1.
The primary objective is to preserve system function
2.
Identify failure modes that can affect the system function
3.
Prioritize the failure modes
4.
Select applicable and effective tasks to control the failure modes
An effective
reliability centered maintenance implementation examines the facility as a
series of functional systems, each of which has inputs and outputs contributing
to the success of the facility. It is the reliability, rather than the
functionality, of these systems that is considered. There are seven basic
questions that need to be asked for each asset:
1.
What are the functions and desired performance standards of each asset?
2.
How can each asset fail to fulfill its functions?
3.
What are the failure modes for each functional failure?
4.
What causes each of the failure modes?
5.
What are the consequences of each failure?
6.
What can and/or should be done to predict or prevent each failure?
7.
What should be done if a suitable proactive task cannot be determined?
23. Sneak Circuit Analysis
Sneak Circuit
Analysis (SCA) is used in safety-critical systems to identify sneak (or hidden)
paths in electronic and electro-mechanical systems that may cause unwanted
action or inhibit desired functions. Sneak analysis can locate problems in both
hardware and software using any technology. The sneak analysis tools can
integrate several analyses such as fault trees, failure mode and effects
analysis (FMEA), reliability estimates, etc. into a single analysis saving time
and project expenses. The technique helps in identifying design errors
and works best when applied in conjunction with HAZOP. It is very good for
dealing with systems which have multiple states such as batch and semi-batch
plant.
The analysis is
based on identification of designed-in inadvertent modes of operation and is
not based on failed equipment or software. SCA is most applicable to circuits
that can cause irreversible events. These include:
a.
Systems that control or perform active tasks or functions
b.
Systems that control electrical power and its distribution
c.
Embedded code which controls and times system functions
The SCA process
differs depending on whether it is applied to electrical circuits, process
plants, mechanical equipment or software technology, and the method used is
dependent on establishing correct network trees. When evaluating sneak
circuits, the problem-causing path may consist of hardware, software, operator
actions, or combinations of these elements. Sneak circuits are not the result
of hardware failure but are latent conditions, inadvertently designed into the
system, coded into the software program, or triggered by human error. Four
categories of sneak circuits are:
1. Sneak Paths
Unexpected paths along which current, energy, or logical sequence flows in an unintended direction.
2. Sneak Timing
Events occurring in an unexpected or conflicting sequence.
3. Sneak Indications
Ambiguous or false displays of system operating conditions that may cause the system or an operator to take an undesired action.
4. Sneak Labels
Incorrect or imprecise labeling of system functions (e.g., system inputs, controls, display buses) that may cause an operator to apply an incorrect stimulus to the system.
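A sneak-path search can be sketched as a graph problem: model the circuit as directed connections and enumerate every path from a source to a load, then compare against the intended ones. The circuit topology below is illustrative:

```python
# Depth-first enumeration of all simple paths from src to dst in a directed
# graph of circuit connections; paths not in the intended set are sneak paths.
def find_paths(graph, src, dst, path=None):
    path = (path or []) + [src]
    if src == dst:
        return [path]
    paths = []
    for nxt in graph.get(src, []):
        if nxt not in path:                # avoid revisiting nodes (cycles)
            paths += find_paths(graph, nxt, dst, path)
    return paths

# Illustrative topology: the designer intends the lamp to be powered only
# through switch_A, but a relay creates an unintended second path.
graph = {"battery": ["switch_A", "switch_B"],
         "switch_A": ["lamp"],
         "switch_B": ["relay"],
         "relay": ["lamp"]}
intended = [["battery", "switch_A", "lamp"]]
sneaks = [p for p in find_paths(graph, "battery", "lamp") if p not in intended]
```

Here `sneaks` exposes the path through switch_B and the relay, the kind of latent, designed-in condition (not a hardware failure) that SCA is meant to surface.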