Hiperwall Fault Tolerance
March 14, 2017
Fault tolerance is a headline feature of the Hiperwall Premium Suite, and it will provide significant peace of mind for Hiperwall customers. When we designed fault tolerance, we looked at the failures our customers most commonly experienced, which turned out to be very few, and the failures that most concerned them, which made a longer list. We designed fault tolerance to address those issues and more. This article describes what the Hiperwall Premium Suite fault tolerance features do and why they do it.
Defining Fault Tolerance
We must be clear: fault tolerance is not fault proof. As the name states, it tolerates faults, and the goal is to make that tolerance as seamless and painless to the user as possible. The basic idea of Hiperwall Premium Suite fault tolerance is that if one of our controllers fails or goes away for any reason, a second controller will continue to operate the wall with as few noticeable transition effects as possible (more on this below).
Why did we focus on this kind of fault recovery? Our customers were concerned that if something happened to their Hiperwall-enabled controller PC, it could bring their system down. Since the controller PC is a commodity computer, it could suffer a hardware failure (typically fans failing or drives dying); a careless visitor could kick the power cord or knock over the PC if it isn’t in a rack; or, most likely, an operating system update could cause the system to reboot and stay offline for several minutes while the updates are applied.
The most obvious solution to a controller failure is to have a second controller. Luckily, with Hiperwall’s distributed visualization architecture, the HiperView-enabled computers driving the displays do much of the work, so the controller software runs on an ordinary PC and the added hardware expense is minimal. Adding a second controller has the further benefit of allowing a second user to actively control the system, an approach called “active-active” in distributed computing circles (as opposed to an active-standby approach). Thus, a Hiperwall Premium Suite system has two active controllers that operate in parallel and can both control the wall simultaneously.
Other fault tolerance techniques that we rejected include triple modular redundancy (TMR) and voting systems, like those used in the space shuttle. They solve different problems than our fault tolerance solution does: they deal with a cosmic ray flipping a bit and causing a calculation to come out wrong, thus breaking the trajectory of the vehicle. Our premise is that as long as the hardware is working, it is producing correct results, but when it stops working, we need to detect that and recover.
Detecting faults in networked computer systems is a challenge, because computer networks are not reliable when they get busy. Switches drop packets when their buffers fill up, so we can’t treat a single missed packet as evidence of a failure. Network stacks (the software in Windows that sends and receives traffic on the network and implements communications protocols) are often tolerant of network disruptions, so they don’t report connection failures for a long time. We had to develop techniques that detect a fault quickly, yet are not plagued by false detections. This required a multi-faceted approach with several monitoring channels and other low-level techniques to detect that the other controller has gone away and to take over the system if needed. In doing so, we had to define “primary” and “shadow” controllers, though the distinction is not apparent to users. The primary and shadow roles are determined by an election process at system startup, so neither controller PC is permanently primary or shadow, and neither is more capable than the other.
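The idea of combining several monitoring channels with a tolerance for missed packets can be sketched in a few lines of Python. This is an illustration only, not Hiperwall's actual detection code: the class name PeerMonitor, the channel names, the 100 ms heartbeat interval, and the limit of three missed heartbeats are all assumptions made for the example. The monitor reports the peer as failed only when every channel has been silent for longer than the allowed window, so one dropped packet on one channel never triggers a takeover.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Hypothetical multi-channel heartbeat monitor (all values are assumptions)."""
    channels: tuple
    interval_ms: int = 100      # assumed heartbeat period
    missed_limit: int = 3       # assumed number of missed heartbeats tolerated
    last_seen_ms: dict = field(default_factory=dict)

    def start(self, now_ms):
        # Treat startup as the moment every channel was last heard from.
        for name in self.channels:
            self.last_seen_ms[name] = now_ms

    def heartbeat(self, channel, now_ms):
        # Record that a heartbeat arrived on this channel.
        self.last_seen_ms[channel] = now_ms

    def silent_channels(self, now_ms):
        # A channel is silent once it has missed more than the allowed heartbeats.
        window_ms = self.interval_ms * self.missed_limit
        return [c for c in self.channels
                if now_ms - self.last_seen_ms[c] > window_ms]

    def peer_failed(self, now_ms):
        # Declare failure only when ALL channels are silent, to avoid false alarms.
        return len(self.silent_channels(now_ms)) == len(self.channels)
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test. A shorter window detects failures sooner but raises the risk of false detections on a busy network, which is the trade-off described above.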
Testing fault tolerance was a significant challenge, because we wanted to make sure we tested failures that could affect our customers and those that they are concerned about, not just arbitrary failures that make the fault tolerance mechanism look good. Some of the test conditions were easy to create and replicate: pulling the USB license key, for example, can be performed over and over with identical results. Quitting the controller software or restarting the computer is similarly repeatable and easy to detect and recover from. These produced nice, clear failure modes, because all the network connections were closed cleanly, so failure detection was quick and easy (determining what actually happened is more challenging, but at least the controller knows that the other one is no longer connected). Pulling the power plug from a computer or causing it to “blue screen” (which we wrote a test program to force) was more challenging to test. These are somewhat dangerous operations, because they could corrupt the PC operating system or drive contents, so performing them often could lead to time-consuming drive re-imaging. They also result in failures that are difficult to detect: the computer stops communicating, but the connections remain open for many seconds. This kind of failure was a great test for many of our timer-based fault detection algorithms.
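The difference between a cleanly closed connection and a peer that simply goes quiet can be seen with ordinary sockets. The following Python sketch is illustrative and is not taken from the Hiperwall code base; the function names probe_peer and demo and the 0.1 second timeout are assumptions. A clean close is reported immediately (the read returns an empty result), while a silent peer looks exactly like an idle one until a timer expires.

```python
import socket


def probe_peer(conn, timeout_s):
    """Classify the peer from one read attempt on an open connection."""
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(1024)
    except socket.timeout:
        # Nothing arrived, but the connection is still open. The peer may be
        # idle, powered off, or unplugged; only a timer can raise suspicion.
        return "silent"
    except ConnectionError:
        # The connection was reset or aborted.
        return "closed"
    # An empty read means the peer closed the connection cleanly.
    return "closed" if data == b"" else "alive"


def demo():
    results = []
    a, b = socket.socketpair()          # two connected local sockets
    b.sendall(b"heartbeat")
    results.append(probe_peer(a, 0.1))  # data waiting: peer is alive
    results.append(probe_peer(a, 0.1))  # b is open but quiet: timer expires
    b.close()
    results.append(probe_peer(a, 0.1))  # clean close is seen immediately
    a.close()
    return results
```

In the demo, the power-pull and blue-screen cases correspond to the "silent" result: the operating system on the surviving machine has no way to know the other side is gone, which is why timer-based detection is needed.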
Another kind of fault testing we perform is network failure. Network switches are extremely reliable, so we didn’t need to test for that kind of failure, but network cable pulls and intermittent connections can happen, so we test for those. A network cable pull is much like pulling the power cord, in that the computer stops communicating, but the network connections don’t time out for a while. Intermittent network connections, or cable pulls followed quickly by plugging the cable back in, are much more insidious problems and are hard to diagnose. A very short cable unplug/re-plug is something most networks can tolerate with no disruption, so Hiperwall tolerates it with little to no interruption. A longer interval (half a second to a few seconds) between unplugging and re-plugging is one of the trickiest faults to detect and recover from, because the network is clearly disrupted, but not all the connections are broken. In this case, we choose to fail over to the other controller, the one that did not have its network disrupted. This requires substantial coordination, because the controller that had the network failure needs to detect that it was interrupted, so that when it is re-connected, it does not try to assert its old state, but instead negotiates with the other controller. The controller that didn’t fail needs to detect that it is no longer connected to the other controller, but is still connected to the rest of the system. If it wasn’t already the primary controller, it then becomes the primary and takes over management of the system.
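The coordination rules in the paragraph above can be written down as a small decision function. This Python sketch is a simplified reading of those rules, not Hiperwall's actual state machine; the names Role, Action, and next_action, and the four inputs, are hypothetical and chosen for the example.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SHADOW = "shadow"


class Action(Enum):
    KEEP = "keep current role"
    TAKE_OVER = "become primary"
    NEGOTIATE = "negotiate role with the other controller"
    ISOLATED = "do not assert state until reconnected"


def next_action(role, peer_reachable, system_reachable, was_isolated):
    """Pick a controller's next step (illustrative rules, all names assumed)."""
    if not system_reachable:
        # This controller is the one whose network was disrupted.
        return Action.ISOLATED
    if was_isolated and peer_reachable:
        # Back online after a disruption: do not assert the old state.
        return Action.NEGOTIATE
    if not peer_reachable and role is Role.SHADOW:
        # The peer is gone but the rest of the system is still connected.
        return Action.TAKE_OVER
    return Action.KEEP
```

For example, next_action(Role.SHADOW, False, True, False) yields Action.TAKE_OVER, which is the surviving shadow controller becoming primary, while next_action(Role.PRIMARY, True, True, True) yields Action.NEGOTIATE for a controller that has just been plugged back in. How the negotiation itself works is not described here, so the sketch stops at naming that step.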
So what sort of behavior should a customer expect from a Hiperwall Premium Suite system when a controller fails? If the failed controller was the “shadow”, there won’t be any effect, other than that the failed controller can no longer control the wall. If it was the “primary”, there will be a short transition period as the old shadow controller becomes the primary and takes control of the system. On the displays, much of the content, including most HiperSource streaming content, will remain visible and continue updating without interruption. The exception is Sender content from outside the LAN, for example HiperCast Senders or other remote Senders. These Senders were connected directly to the failed controller, so they need a few seconds to change over to the new primary controller, and they typically resume updating within 4 to 5 seconds. This means that a mission-critical wall in a NOC or other control room will remain operating and experience only very minor disruption even if one of the system controllers fails. In a typical digital signage scenario, the failure of one of the system controllers won’t affect the displayed content at all, so there is no disruption.
What kind of faults don’t we tolerate? The network switch failure mentioned above is the biggest example. Switches are extremely reliable, however, and although our engineers have experimented with ways to fail over to a second switch, doing so requires very specific hardware. We obviously can’t do much about the failure of a display. We also don’t do anything about the failure of an individual computer running HiperView software and driving a display. As with a display failure, this doesn’t affect the functionality of the entire system; it affects just one display. Since the display PCs are typically commodity PCs, it is quick and easy to repair or replace them. Bringing a new HiperView PC online takes just a few minutes.
We designed the fault tolerance features of the Hiperwall Premium Suite to recover from the most common failures that can take a videowall system offline, including common hardware failures, mistakes, and accidents. It keeps the wall running in critical situations without requiring all the equipment to be duplicated, and without the enormous extra cost that duplication would entail. We designed it to be as easy to use as all previous Hiperwall systems, but with extra peace of mind that can help all our customers.