Security and Reliable/Fault-Tolerant Computing

Computer Architecture and Systems

Reliable/fault tolerant computing deals with techniques to provide a computer system an ability to keep normal operation despite the occurrence of failures. A failure may be permanent in which a component cannot function properly after the failure, or transient in which a component suffers from a temporary failure (such as loss of data) but remains functional after the failure. A failure may be suffered by a hardware component or by software components due to bugs in code.

The goal of fault tolerant computing is to provide high availability, measured in the percent of time it is functioning. Availability is affected by the failure rate as well as by the time to recover from the failure. Designing fault tolerant computer systems must balance the target availability that is appropriate for the market of the systems, the cost of providing fault tolerance, and performance overheads.

Failures can be masked by using redundant execution, for example by having multiple components performing the same task and selecting the majority outcome as the correct outcome. Failures can be detected and corrected using error detection and correction coding. Failures can also be detected and recovered using a roll-back recovery scheme, in which the state of the system is rolled back to a known good state, and computation is restarted from there.

Computer security deals with techniques to keep computers secure from attacks. With the increasing interconnectedness of computer systems, security attacks are of increasing concerns. The goal of a security attack is to modify the behavior of the computer system in order to benefit the attacker, such as leaking or destroying valuable information, or making the system inoperational.

Attackers can attack different components of the software layer by exploiting vulnerabilities in application code or the operating system, or the hardware layer by exploiting unprotected hardware components in the system.

At North Carolina State University, we cover fault tolerance and computer security briefly in several courses at the undergraduate level and introductory graduate level, and cover them extensively in an advanced graduate level course. Our research program addresses fault tolerance and computer security concerns at various components of the computer system, such as at the processor microarchitecture level, memory system architecture level, and at system software level. Some examples of our past projects that have demonstrated our role in pioneering research effort in memory subsystem include:

  • Secure heap memory : a heap library that removes the link between vulnerabilities in the application code and the behavior of the heap library. Secure heap library is provided using the help of protection that comes naturally from separate address space between the application and a new heap management process. The library, known as "Heap Server library" has been released to the public, and is an example of how research and innovation at NCSU provides an immediate benefit to the community.
  • Secure processor architecture : plain text of data and code stored in the main memory and disk is vulnerable to an attacker who tries to obtain secret information stored in the memory. To protect against the possibility of such attacks, the memory can be encrypted and protected with authentication code. Data is only accessed in plaintext form inside the processor chip. The project achieves secure memory with low performance overheads and compatibility with modern system features such as virtual memory and inter-process communication.

Associated Courses