Reliable, or fault-tolerant, computing deals with techniques that allow a computer system to continue operating normally despite failures. A failure may be permanent, in which case the component cannot function properly afterward, or transient, in which case the component suffers a temporary failure (such as loss of data) but remains functional afterward. A failure may occur in a hardware component, or in a software component due to bugs in the code.
The goal of fault-tolerant computing is to provide high availability, measured as the percentage of time the system is functioning. Availability depends on the failure rate as well as on the time needed to recover from a failure. Designers of fault-tolerant computer systems must balance the target availability appropriate for the system's market against the cost of providing fault tolerance and its performance overhead.
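The dependence of availability on failure rate and recovery time can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The sketch below is illustrative only; the function names and the sample figures are assumptions, not measurements from any real system.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 hours, takes 1 hour to recover.
    a = availability(mttf_hours=999, mttr_hours=1)
    print(f"availability = {a:.4%}")
    print(f"downtime     = {downtime_hours_per_year(a):.2f} hours/year")
```

Halving the recovery time improves availability about as much as doubling the time between failures, which is why recovery speed is a design target in its own right.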
Failures can be masked by redundant execution, for example by having multiple components perform the same task and selecting the majority outcome as the correct one. Failures can be detected and corrected using error detection and correction codes. Failures can also be detected and recovered from using a roll-back recovery scheme, in which the system is rolled back to a known good state and computation is restarted from there.
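Two of these techniques, majority voting over redundant results and roll-back recovery, can be sketched in a few lines. This is a minimal software illustration under stated assumptions, not a description of how any particular system implements them: the names `majority_vote` and `run_with_rollback`, the `is_valid` failure detector, and the retry limit are all hypothetical.

```python
import copy
from collections import Counter


def majority_vote(outputs):
    """Mask a faulty replica by returning the value a strict majority agrees on.

    Assumes the replica outputs are hashable (ints, strings, tuples, ...).
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_with_rollback(state, step, is_valid, max_retries=3):
    """Run `step` from a checkpoint; on a detected failure, roll back and retry.

    `step` receives a private copy of the checkpointed state, so a failed
    attempt cannot corrupt the known good state.
    """
    checkpoint = copy.deepcopy(state)  # known good state
    for _ in range(max_retries + 1):
        candidate = step(copy.deepcopy(checkpoint))
        if is_valid(candidate):  # failure detection (assumed to be supplied)
            return candidate
        # Detected failure: discard the candidate and restart from the checkpoint.
    raise RuntimeError("failure persisted after rollback and retry")


if __name__ == "__main__":
    # Three replicas computed the same task; one produced a corrupted result.
    print(majority_vote([42, 42, 7]))
```

Retrying from a checkpoint helps with transient failures only; a permanent failure fails every retry, which is where redundancy and voting are needed instead.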
Computer security deals with techniques to keep computers secure from attacks. As computer systems become increasingly interconnected, security attacks are of increasing concern. The goal of a security attack is to modify the behavior of a computer system to benefit the attacker, for example by leaking or destroying valuable information or by making the system inoperable.
Attackers can target the software layer by exploiting vulnerabilities in application code or the operating system, or the hardware layer by exploiting unprotected hardware components in the system.
At North Carolina State University, we cover fault tolerance and computer security briefly in several undergraduate and introductory graduate courses, and extensively in an advanced graduate course. Our research program addresses fault tolerance and computer security concerns at several levels of the computer system, including the processor microarchitecture, the memory system architecture, and system software. Examples of past projects that demonstrate our pioneering research on the memory subsystem include: