SMS: Finding trouble before trouble finds you
Faye Ackermans
Member, Transportation Safety Board of Canada
03 November 2015
Ottawa, ON
Check against delivery.
[Slide 1: Title page]
Good afternoon. Thank you for that kind introduction. It is always a pleasure to spend time with the rail industry, and I thank the organizers for providing me with the opportunity to speak with you today.
Every industry has had its moments of horror – a major accident involving many fatalities or extensive environmental damage – that leads to change. That change has one goal in mind – to ensure it never happens again. That is the reaction of us all in the rail industry to the accident at Lac-Mégantic in July 2013. How could this have happened, and what do we have to do to make sure it never happens again? The rail industry and the rail regulators in North America have moved swiftly with action on many fronts to ensure that the circumstances and consequences of that accident are not repeated.
Some of you here today may think those actions have gone too far or are unnecessary. After all, wasn't the mistake of one individual the cause? In fact, the TSB's report pointed to 18 different factors that contributed to the accident, and an additional 16 factors that added risk to the operation. It is critical in any accident investigation to look deeper—to understand why people make the decisions they make, because if those actions make sense to one individual, they may make sense to others. If too much focus is placed on individual error, we will miss out on understanding the context within which members of an organization function. We need to shift our focus to how the whole system functions, including the organization.
But how do you, the rail industry leaders, PREVENT such accidents in the first place? How do you find trouble before trouble finds you? Most accidents, whether they are big or small, can be attributed to a breakdown in the way an organization identifies and mitigates hazards and manages risk.
Why are some companies better at managing risk than others? The answers are not simple, but in my view, there are four features of an organization at play here: having a just culture; having systems and processes to learn from incidents; having information flow to proactively identify potential safety concerns; and having effective safety leadership. These are all related to each other. I will deal with each of these in turn in a moment, but first I want to give a bit of background on the development of Safety Management Systems and three different approaches to safety.
[Slide 2: SMS – A bit of history]
Safety Management Systems had their genesis in catastrophic accidents. There were three in particular that drove change in Europe. The first, a chemical explosion at a petrochemical facility in 1974 in Flixborough, U.K., created the first requirement for a “safety case” in land-based petrochemical facilities. The second, a release of six tons of chemicals including carcinogens in 1976 in Seveso, Italy, resulted in a major overhaul of European safety regulations. The third, an explosion and fire aboard the Piper Alpha oil rig in the North Sea in 1988, was the subject of a public enquiry. One of the major recommendations from that enquiry, led by Lord Cullen, was a requirement for “formal assessments of major hazards to be identified and mitigated”. In the U.K., this was known as the “safety case”. By the 1990s the discussion on safety cases was being picked up by other industrial sectors. For example, in 1995, the International Maritime Organization set forth requirements for Marine Safety Management Systems. And in 2001, Transport Canada published its initial requirements for SMS in the Canadian rail industry.
[Slide 3: Canadian rail SMS requirements]
The formal requirements for SMS in the Canadian rail industry are meant to act as a framework to help integrate safety into day-to-day operations and to consider safety in all decision-making. The current requirements are mostly about having certain processes in place – for example, processes to ensure compliance with regulations, processes to identify safety concerns, to report hazards, to implement and evaluate remedial actions, to perform risk assessments. All of these processes are needed to manage risks, but they are not sufficient by themselves. What is missing? Before I try to answer that, I want to provide a bit more background on safety management.
[Slide 4: Three approaches to safety management. The person model]
There are three approaches to safety in widespread use today. The oldest of these, the “person” model, dates to the early 1900s and is no doubt recognizable to all of you. There is a statistical relationship between each of the layers on the pyramid. For every fatality, there are usually 10 serious injuries, and so on. This model is characterized by an emphasis on the person and on unsafe acts, and tends to result in blaming, shaming and retraining the individual who makes a mistake, followed by the writing of another procedure to ensure that someone else does not make the same mistake. Sound familiar?
[Slide 5: The technical/engineering model]
The second approach is a “technical or engineering model” and dates from the mid-1940s and '50s. In this model, safety is engineered into the system. There is an emphasis on process safety and on reliability engineering. Humans are part of the system, and the man/machine interface is designed with this in mind. This approach incorporates human performance as part of the system. It has resulted in understanding why we make mistakes and introduced processes such as threat-and-error management, risk assessments and crew resource management. This approach lends itself to technical safety audits and assessments.
[Slide 6: The organization model]
The third model had its genesis in the 1980s and is basically an extension of the engineering model so as to encompass the whole organization (think SMS). In the “organization model,” errors are regarded as symptoms of latent conditions in the organization, stemming from management decisions and system design. Identification of hazards may come from examining non-catastrophic accidents and incidents, or from listening to the “weak” signals in the organization and then connecting the dots. Using this approach, organizations find proactive ways to identify and mitigate hazards to reduce the overall risk of the system. Safety-based decision-making is embedded throughout the organization.
So now let's look at the organization as a whole to start to understand how the pieces all fit together.
[Slide 7: Safety, leadership, and culture]
In this model, the working interface is where all aspects of an organization come together. A worker interacts with the processes, the facilities and the equipment to get the job done. On the one hand, every organization must have safety-enabling systems—that is, processes that enable safe outcomes. These include processes to provide training and knowledge, to reduce exposure to workplace hazards, various policies and standards and operating procedures, and processes to recognize hazards. A safety leader needs to understand these processes, how they are audited, and how effective they are.
On the other hand, organizations must also have processes that sustain the enabling systems – such as how people are selected and developed, how an organization is structured, how performance is managed, and so on. Just having the enabling systems is not enough. The organization must be capable of supporting and sustaining safe operations. For example, is safety given adequate emphasis through the structure of the organization? Does performance management adequately address the safety responsibilities of leadership? How are employees' mistakes handled? Are they treated as learning opportunities for the organization or are the individuals punished? The link between the two sets of systems is the culture of the organization – the mostly unwritten rules of how things really work. This is where, despite the best enabling and sustaining systems, a culture may negatively impact the workforce through mistrust, poor communication, or lack of credibility in management. And finally, it is the leaders of an organization who drive both sides and have the greatest impact on the culture. Leaders make decisions and demonstrate behaviours which are interpreted by the rest of the organization, thereby influencing the unwritten culture.
[Slide 8: What is safety culture?]
So what is safety culture? The simplest definition is “the way we do things around here.” Note: This is a statement that talks about behaviours – things you can see others doing. Culture is embedded deep in an organization and takes a long time to change. Just changing a policy, for example, changing the way discipline is assessed when employees do not follow operating instructions, does not change culture. It is absolutely a necessary first step, but it takes years for leaders to demonstrate that they are following a new policy to get workers to believe there has been real change. Once workers believe what management is saying, the culture will change.
In the academic safety world, the term that has come to be used to describe the best way to deal with employee errors is “just culture.”
[Slide 9: What is “just culture?”]
Just culture is a culture in which front-line operators and others are not automatically punished for actions, omissions or decisions taken by them which are commensurate with their experience and training—but where gross negligence, willful violations and destructive acts are not tolerated. Knowing the outcome of an error creates hindsight bias. In a just culture, organizations strive to understand why an employee's actions made sense at the time, rather than letting the outcome influence the determination of whether or not to punish.
[Slide 10: With a just culture]
In a just culture, an organization encourages openness. Employees willingly share information without fear of punishment.
[Slide 11: Without a just culture]
Without a just culture, safety-critical information is stifled for fear of reprisal. Organizations become defensive rather than learning from errors and seeking to improve. As a result, safety suffers.
[Slide 12: Systems and processes to learn from accidents]
There are several methods management uses to learn where risk exists in an organization. As I said earlier, I am going to speak to just two of them. The first is through actual experience such as accidents or incidents that get reported to the organization. In turn, the organization uses good processes to understand in-depth what has happened, why it has happened, what changes need to be made to procedures or equipment or software, or what new technology needs to be implemented … all to ensure the event does not happen again. You must then make those changes and monitor their effectiveness. That, in a nutshell, is what I meant earlier when I said organizations must have systems and processes to learn from incidents. All of this can take a great deal of time and effort, and of course there are always competing interests for your time and competing interests for the costs involved. In some cases, it has proven hard to get management's attention to small incidents. This speaks to the safety mindset or individual beliefs of managers. In the TSB's experience, the causes and contributing factors for most major accidents have existed for some time, and most catastrophic accidents were preceded by minor events whose importance was not recognized by the organization.
So what YOU do as safety leaders, what processes YOU have in place, what time YOU spend on understanding and dealing with the seemingly minor events, can have a major impact on risk reduction in your organizations.
[Slide 13: Information flow to proactively identify safety concerns]
The second way management can learn from what is going on is to have information flowing freely upwards in the organization. This information may come from various sources, such as employee self-reporting of hazards or close calls, or it may be available through technology such as on-board locomotive video, voice, or event recorders, or from automated inspection processes, or it may be from management observing how tasks are performed. Having this information flowing upwards in the organization, capturing it, systematically analyzing it, doing risk assessments based on the data, learning from it and making changes to procedures or the plant or equipment based on it, is what I referred to earlier as listening to the “weak” signals in the organization. While the rail industry has made great strides in the use of data from automated systems, there is a source of data that has not yet been adequately tapped – namely employees. Employees' trust that management will not react in a punitive way is critical to creating a culture that allows this type of data to flow freely. Without that trust, the organization cannot learn from these weak signals. Because really, it's all about understanding the why, and the more information managers have, the better they can do that.
So I have talked about just culture, having systems and processes to learn from incidents, and listening to the weak signals in an organization. The fourth feature of organizations that I set out to speak about today is having effective safety leadership. I have threaded this feature throughout the speech today. You are the ones who have the most direct influence on culture in your organizations. Your leadership behaviours—what decisions you make, how they are communicated and how they get interpreted by the workers—have a direct impact on the culture in your organizations. You are the ones who set out the policies and procedures that enable and sustain safety management in your organization. You are the ones who must decide what the future looks like and set out the changes necessary to make the future happen.
In my view, reliance on the historic approach to employee discipline for rule infractions is holding back the rail industry from effectively mitigating risks – both longstanding risks and emerging risks. Too many managers are still stuck in the old way of thinking that “if only everyone would follow the rules there would not be so many accidents.” And so they don't dig deep enough to understand the “why.” Where is the proof?
Let's take a look at some accident data.
Rule infractions play a large role in the minor accidents in yards and sidings, but I have chosen not to present that data since the changes in TSB reporting requirements over the years have impacted the statistics. But those changes have had very little impact on the reporting of main-track derailments.
[Slide 14: Inside the numbers]
Here is a quick summary of the number of main-track train derailments in Canada for the past 15 years. The data is presented as an average number over a five-year period, and the data is separated by the cause of the accidents. As you can see, accidents caused by track issues have fallen from 56.6 per year in the 2000 to 2005 period to 34.4 in the most recent five-year period. Similarly, accidents caused by equipment have fallen from an average of 59 per year to 29.8. Accidents caused by the actions of people have fallen from 24.4 per year to about 20 per year. When main-track collisions caused by the actions of people are added to this number…
[Slide 15: Inside the numbers]
… it shows that these actions accounted for an average of 26.8 main-track accidents each year during the past five years – in other words, almost as many as those caused by equipment.
Finally, all other causes of accidents have fallen from 16.8 per year to 10.2 per year.
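To put those figures side by side, the relative decline for each cause can be worked out from the averages quoted above. The following is a minimal sketch, not part of the original slides: the dictionary and function names are illustrative, and "about 20" for people-related derailments is assumed to be exactly 20.0.

```python
# Five-year average main-track derailments per year, as quoted in the speech:
# (earliest period, most recent period).
# Assumption: "about 20" for people-related causes is taken as 20.0.
averages = {
    "track": (56.6, 34.4),
    "equipment": (59.0, 29.8),
    "people": (24.4, 20.0),
    "other": (16.8, 10.2),
}


def percent_decline(early, recent):
    """Relative drop from the early average to the recent one, in percent."""
    return (early - recent) / early * 100.0


declines = {
    cause: round(percent_decline(early, recent), 1)
    for cause, (early, recent) in averages.items()
}

for cause, decline in declines.items():
    print(f"{cause:<10} {decline:.1f}% decline")
```

Under these assumptions, derailments attributed to the actions of people declined by roughly 18 percent, compared with roughly 39 percent for track, 49 percent for equipment and 39 percent for all other causes.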
So, this tells me the railways have made great strides in reducing the overall number of main-track train derailments BUT that not enough is being done to address the mistakes or actions taken by the people in the system. The only way progress can be made is for the industry to tackle this issue head-on. If employees believe they will be punished every time they make a mistake, they won't tell you about it. So you won't find out until the circumstances line up—like the holes in Swiss cheese—and you have an accident. If you THEN don't take the time to dig deep enough to understand how the systems, equipment, processes, training and so on created the context in which an error was made, then you won't be able to correct the real problems.
[Slide 16: SMS requirements for Canadian railways - Conclusions]
The rail industry and the regulator have now had about 14 years of experience with formal safety management systems. When I looked closely at the formal requirements, I realized that almost all of them will “enable” safety – that is, they are about processes. Therefore, while they may be necessary to improve overall management of safety, they are not sufficient. And the perception that the documentation requirements are bureaucratic has created a risk that although an SMS may exist on paper, it may not translate into the day-to-day operating environment.
For organizations to find trouble before trouble finds them, not only do they need the various processes I have spoken about, they also need a safety culture—a just culture—that encourages information to flow upwards. And culture change must be led by management.
[Slide 17: TSB Watchlist: Safety management and oversight]
At the TSB, we maintain a Watchlist to shine a light on those issues that pose the greatest risk to Canada's transportation system. Safety management and oversight is on our Watchlist. The TSB will continue to thoroughly investigate accidents, and we will continue to take a close look at how effectively organizations have implemented their safety management systems.
[Slide 18: Words to consider]
I want to leave you with a final thought, a quote from Lord Cullen on the 25th anniversary of the Piper Alpha accident:
“No amount of regulations for safety management can make up for deficiencies in the way in which safety is actually managed. The quality of safety management … depends critically, in my view, on effective safety leadership at all levels and the commitment of the whole work place to give priority to safety.”
Thank you.
[Slide 19: Questions?]
[Slide 20: Canada watermark]