According to Webster, a failure is “an omission of occurrence or a state of inability to perform a normal or specified function that ultimately leads to a lack of success.” In the context of operating systems, failures matter a great deal to both the customers and the designers of a system. A failure can trigger catastrophic events that are costly in both monetary and emotional terms. Faults also differ in nature: some continue to wreak havoc on the system, while others have no visible effect at all. Although individual failures in distributed and centralized systems can be very specific, most of them fall into a few general categories. Four types of failure can affect the functionality of a distributed system, and two of the four can affect centralized systems as well. The general categories are halting failures, Byzantine failures, omission failures, and network failures. All of them are important, but some affect a system more severely than others.

The first type of failure to discuss is the halting failure. These failures are frustrating because they affect most of the system and leave the operator unable to investigate without rebooting. In essence, the system freezes and accepts no action that might relieve the problem. A halting failure can have many causes, such as hardware or software malfunctions, memory problems, or even viruses. Many of these problems can be addressed by updating the system, adding memory so that the system has room to do its work, or installing antivirus or anti-spyware software to clean up the system so that it can continue processing.

Second in the list of possible failures are Byzantine failures. In a Byzantine failure the system keeps running but behaves incorrectly or inconsistently, and the usual effect is data corruption and loss. These failures have a wide range of causes, and many of them come in the form of malicious attacks. Interestingly, the biggest development of the computer age, the internet, has become a superhighway that lets attackers reach their victims. Stopping these attacks is increasingly difficult because the population depends so heavily on computers. Attackers find it easier to infiltrate corporations because they can send email to users in bulk, and an unsuspecting user who clicks on a single link can give away access to the entire system. This type of failure applies to centralized systems as well and can be just as damaging there.

The third type of failure that occurs in a distributed system is the omission failure. These failures relate directly to the buffer space a server has available for the messages it sends and receives. For example, a message may be sent or received, but a lack of buffer space causes it to be discarded without any notice to either the sender or the receiver. Security teams in corporations see these failures often, because an attacker can instigate a distributed denial of service by flooding the system with spam. The flood clogs the system and, as described above, causes errors that affect its users. An omission failure can be detected by sending test messages and waiting an ample time for a reply to each one. Once the affected component has been isolated, the failure can often be fixed by adding memory to it.
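
The test-message approach can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `BoundedInbox` class, the `probe` function, and the `max_polls` parameter are hypothetical names invented for this example, and a fixed number of polling steps stands in for "waiting an ample time" on a real network.

```python
import queue


class BoundedInbox:
    """Hypothetical receiver whose buffer silently drops messages when full."""

    def __init__(self, capacity):
        self._buffer = queue.Queue(maxsize=capacity)
        # Counted only so the demo can be inspected; a real sender cannot see this.
        self.dropped = 0

    def deliver(self, message):
        try:
            self._buffer.put_nowait(message)
        except queue.Full:
            # Omission failure: the message is discarded and nobody is notified.
            self.dropped += 1

    def process_one(self):
        """Handle one buffered message and return its acknowledgement, or None."""
        try:
            message = self._buffer.get_nowait()
        except queue.Empty:
            return None
        return ("ack", message)


def probe(inbox, probe_id, max_polls=3):
    """Send a test message and wait up to max_polls steps for its acknowledgement."""
    inbox.deliver(probe_id)
    for _ in range(max_polls):
        if inbox.process_one() == ("ack", probe_id):
            return True
    # No acknowledgement arrived in time, so the probe was probably dropped.
    return False
```

A probe sent to an empty inbox is acknowledged, while a probe sent to an inbox already filled with spam is dropped and never acknowledged, which is how the sender can infer that an omission failure has occurred.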

The final area of discussion is network failures, which can be described simply as a loss of network connectivity. This type of failure keeps systems from communicating properly with each other and creates two main problems between processors: one-way links and network partitions. A one-way link occurs when one side of a connection can send to the other but cannot receive from it, whereas a partition occurs when the lines connecting two areas of a network fail entirely. Many network issues can also be traced to the users themselves and turn out to be something very simple. Network issues can be resolved by simply using a network diagram and testing each individual system. It is often best to test by halves: the operator tests one half of the system, sets aside the half that works correctly, and repeats the process on the remaining half. Because each test eliminates half of the remaining candidates, this approach greatly reduces the time spent isolating a problem compared with testing every component in turn.
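
Testing by halves is a binary search, and a short sketch makes the saving concrete. The example assumes a single fault on a linear path of network segments; the function name `find_first_faulty` and the `path_ok` callback are hypothetical and stand in for whatever connectivity test an operator would actually run.

```python
def find_first_faulty(segments, path_ok):
    """Locate a single faulty segment on a linear path by repeated halving.

    path_ok(k) is an assumed test that returns True when the first k
    segments all work. The whole path is assumed to be broken already.
    Returns (index of the faulty segment, number of tests performed).
    """
    # Invariant: the first `low` segments work, the first `high` do not.
    low, high = 0, len(segments)
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1
        if path_ok(mid):
            low = mid   # the fault lies beyond the first `mid` segments
        else:
            high = mid  # the fault lies within the first `mid` segments
    return low, tests


# Usage: 16 segments with a simulated fault in segment 10.
segments = [f"link-{i}" for i in range(16)]
fault = 10
index, tests = find_first_faulty(segments, lambda k: k <= fault)
print(segments[index], tests)  # prints "link-10 4"
```

For a path of 16 segments, four tests are enough to isolate the fault, whereas checking the segments one by one could take up to sixteen. In general the number of tests grows with the base-2 logarithm of the number of components.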

Computers and their systems can be difficult to navigate and understand. Unfortunately, the same systems that are designed to increase people's daily productivity are also increasingly vulnerable. It falls to the user or the corporation to understand the differences between systems and to choose one that balances what they need against what they can protect it from.


Stallings, W. (2012). Operating Systems: Internals and Design Principles (7th ed.). Pearson Education, Inc.

