NC State University ECE Dept.

Architecture Research for PErformance, Reliability, and Security

Research

For a more up-to-date overview of current research activities at ARPERS, see the list of abstracts from recent publications.

 

 

imho

Intelligent Memory Hierarchy Optimization
Funded by: NSF Career Award, NCSU

Computer systems continue to face the memory wall problem, in which application performance is increasingly determined by memory latency. Processor speeds may continue to grow at 55% a year, whereas memory speeds grow at only 7% a year. Even today, a processor spends several hundred cycles on each access to main memory, and this latency is increasingly difficult to hide. Although running multiple threads on an SMT or CMP can help hide the latency, it often yields pathological performance because of interactions between the threads.

Our main approach is to explore how to deal with the memory wall by making caches and main memory more efficient and intelligent. Recent examples:

 

FAIR CACHING: hardware/software support that enforces fairness when several threads in a chip multiprocessor (CMP) system share a cache [PACT '04].

PRIME CACHE INDEXING: We reduce conflict misses in applications by using cache hashing functions based on prime numbers. Prime modulo indexing uses a prime number of cache sets. We show that it can be computed quickly, without integer division, using a set of narrow addition and shift operations. We also propose prime displacement indexing, in which the cache index is the traditional index plus a displacement, where the displacement is a prime number multiplied by some tag bits from the address. What is unique about these prime hashing functions is that, while they eliminate conflict misses, they do not cause extra conflict misses in applications with sequential access patterns, unlike competing techniques. This is important because, although a non-trivial fraction of applications suffer from conflict misses, the majority of applications do not suffer from many conflict misses and should not be penalized by an alternative cache hashing function [TC'05, HPCA'04].
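The two indexing functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design from the cited papers: the set count (2048, with 2039 as the prime), the displacement prime (17), the number of tag bits used (8), and all function names are hypothetical choices made for the example. The prime modulo function relies on the fact that 2048 is congruent to 9 modulo 2039, so the high bits can be folded into the low bits with shifts and adds.

```python
K = 11                    # log2 of the power-of-two set count (assumed: 2048 sets)
MASK = (1 << K) - 1       # selects the traditional index bits
PRIME = (1 << K) - 9      # 2039, the largest prime below 2048


def prime_modulo_index(block_addr: int) -> int:
    """Return block_addr mod 2039 using only shifts, masks, and adds."""
    x = block_addr
    while x > MASK:
        hi, lo = x >> K, x & MASK
        # 2**K is congruent to 9 (mod 2039), so hi * 2**K + lo == hi * 9 + lo.
        x = (hi << 3) + hi + lo
    # x now fits in K bits, so at most one subtraction is needed.
    return x - PRIME if x >= PRIME else x


def prime_displacement_index(block_addr: int, prime: int = 17,
                             tag_bits: int = 8) -> int:
    """Traditional index plus (prime * some tag bits), wrapped to 2**K sets."""
    traditional = block_addr & MASK
    tag = (block_addr >> K) & ((1 << tag_bits) - 1)   # assumed: low 8 tag bits
    return (traditional + prime * tag) & MASK


# Blocks that are 2048 apart all collide in one set under traditional indexing.
strided = [i * 2048 for i in range(8)]
traditional_sets = {a & MASK for a in strided}
prime_modulo_sets = {prime_modulo_index(a) for a in strided}
prime_displacement_sets = {prime_displacement_index(a) for a in strided}
```

In this example the eight strided blocks share a single set under traditional indexing but land in eight different sets under either prime scheme, while consecutive block addresses still map to consecutive sets.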

INTELLIGENT MEMORY PREFETCHING: Our approach focuses on distributing the computation across the main processor and the processors in memory (PIM). This approach utilizes embedded DRAM technology, which has recently been introduced into high-volume products such as the Sony PlayStation 2 and Nintendo GameCube. The advantage of PIM is that it provides low-latency, high-bandwidth access to the memory. Here, the memory processor runs a software handler that observes and learns the cache miss patterns of the main processor. It uses a correlation table to predict future cache misses and prefetches those blocks into the cache of the main processor. Because the main processor then finds the data in its cache, it does not need to access the memory for the data [ISCA'02, TPDS'03].
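The idea of a correlation table that learns miss patterns can be sketched as follows. This is a minimal illustration, not the handler from the cited papers: the class name, the choice of keeping the most recent successors per miss address, and the default of two successors are all assumptions made for the example.

```python
class CorrelationPrefetcher:
    """Hypothetical pair-based correlation table keyed by miss address."""

    def __init__(self, num_succ: int = 2):
        self.num_succ = num_succ   # successors remembered per miss (assumed)
        self.table = {}            # miss address -> recent successor misses
        self.last_miss = None

    def on_miss(self, addr: int) -> list:
        """Record a miss and return the block addresses predicted to miss next."""
        # Learn: remember that `addr` followed the previous miss.
        if self.last_miss is not None:
            successors = self.table.setdefault(self.last_miss, [])
            if addr in successors:
                successors.remove(addr)
            successors.insert(0, addr)        # most recent successor first
            del successors[self.num_succ:]    # keep the table entry bounded
        self.last_miss = addr
        # Predict: blocks that followed `addr` in the past are prefetch candidates.
        return list(self.table.get(addr, []))


prefetcher = CorrelationPrefetcher()
trace = [0x100, 0x240, 0x380, 0x100]
predictions = [prefetcher.on_miss(a) for a in trace]
```

On the second miss to 0x100 the table predicts 0x240, the block that followed it the first time, so a real system could push that block toward the main processor's cache before it is requested.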

MEMORY CO-EXECUTION: Application code is partitioned into sections. We propose a partitioning algorithm and a scheduling algorithm that schedules the compute-intensive code sections to run on the main processor, and the memory-intensive code sections to run on the memory processor [HPCA'01, TC'01].
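As a rough illustration of scheduling sections by their character, the sketch below assigns each section to whichever processor has the lower estimated run time. The cost model, the latency and slowdown constants, and every name here are hypothetical; the partitioning and scheduling algorithms in the cited papers are not reproduced.

```python
from dataclasses import dataclass

# Hypothetical machine parameters, chosen only for illustration.
MAIN_MEMORY_LATENCY = 200    # cycles per memory access from the main processor
PIM_MEMORY_LATENCY = 40      # cycles per memory access from the memory processor
PIM_COMPUTE_SLOWDOWN = 3     # memory processor assumed 3x slower at computation


@dataclass
class Section:
    name: str
    compute_cycles: int      # compute work, measured on the main processor
    memory_accesses: int     # accesses that go all the way to main memory


def estimated_cycles(section: Section, on_memory_processor: bool) -> int:
    """Simple additive cost model: compute time plus memory stall time."""
    if on_memory_processor:
        return (section.compute_cycles * PIM_COMPUTE_SLOWDOWN
                + section.memory_accesses * PIM_MEMORY_LATENCY)
    return section.compute_cycles + section.memory_accesses * MAIN_MEMORY_LATENCY


def schedule(sections: list) -> dict:
    """Map each section name to 'main' or 'memory', whichever is estimated faster."""
    placement = {}
    for s in sections:
        on_main = estimated_cycles(s, on_memory_processor=False)
        on_pim = estimated_cycles(s, on_memory_processor=True)
        placement[s.name] = "memory" if on_pim < on_main else "main"
    return placement


placement = schedule([
    Section("dense_kernel", compute_cycles=100_000, memory_accesses=100),
    Section("list_walk", compute_cycles=10_000, memory_accesses=1_000),
])
```

Under these assumed parameters the compute-intensive section stays on the main processor and the memory-intensive section moves to the memory processor.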




sods
Software support for On-Demand computing Servers

Funded by: NSF Next Generation Software, NCSU


Performance modeling. Performance prediction models are needed more than ever because the cost of running a realistic simulation of an architecture is increasing and because programmers lack good tools for understanding the performance bottlenecks in their programs. Recent examples:

CMP CACHE CONTENTION PREDICTION: Sharing the L2 or L3 cache among multiple cores in a CMP is common because it saves precious die area and increases cache utilization. Unfortunately, such sharing can also lead to severe performance degradation for some applications because they cannot obtain sufficient cache space for their computation. However, we found that the impact of cache sharing on performance is highly application-specific as well as thread-mix-specific. To understand precisely when an application suffers from cache sharing, we create a cache contention model that captures the contention behavior of shared caches. The model gives deep insight into how the impact of cache sharing on performance relates to the temporal reuse behavior of the affected applications [HPCA'05].
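One common way to summarize temporal reuse behavior is an LRU stack distance profile, which counts how deep in the LRU stack each reuse lands. The sketch below is an illustration for a single cache set and is an assumption about representation, not the contention model itself; the function names are hypothetical. It shows why reuse depth matters: an application whose reuses sit deep in the stack starts missing as soon as a co-runner takes some of its ways.

```python
def stack_distance_profile(trace, assoc):
    """Count reuses per LRU stack depth for one cache set.

    counters[d] for d < assoc is the number of accesses that hit at stack
    depth d (0 is most recently used); counters[assoc] counts cold accesses
    and reuses deeper than the associativity.
    """
    stack = []                       # most recently used line first
    counters = [0] * (assoc + 1)
    for line in trace:
        if line in stack:
            depth = stack.index(line)
            stack.remove(line)
            counters[min(depth, assoc)] += 1
        else:
            counters[assoc] += 1
        stack.insert(0, line)
    return counters


def misses_with_ways(counters, ways):
    """Misses the same trace would see if only `ways` ways were available."""
    return sum(counters[ways:])


profile = stack_distance_profile(["A", "B", "A", "C", "B", "A"], assoc=4)
misses_alone = misses_with_ways(profile, 4)    # all four ways available
misses_shared = misses_with_ways(profile, 2)   # a co-runner holds two ways
```

For this short trace the miss count rises from 3 to 5 when the available ways drop from four to two, because two of the reuses land at stack depth 2.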

SCAL-TOOL: A tool to pinpoint and quantify scalability bottlenecks of shared memory programs. It breaks the execution time of a program into four components: actual computation, plus extra time due to synchronization, load imbalance, and insufficient cache size. Scal-Tool was released in 1999 and is now part of NCSA's software repository [SC'99].

 


Architecture support for computer security and software reliability
Funded by: NSF, NCSU


HEAPMON: HeapMon is a helper thread that performs heap bug detection similar to Purify. Because the helper thread runs in parallel with the application, it offloads much of the overhead of run-time bug detection that would otherwise be interleaved with application execution. In addition, efficient filtering mechanisms significantly reduce the workload of the helper thread, resulting in an average slowdown of less than 5% [IBM Journal'06].

MEMORY ENCRYPTION: more to come...





Self-optimizing systems

Funded by: NCSU's Faculty Research and Development Award

We note that many parallel programmers spend much more time tuning their code for performance or scalability than producing correct parallel code. Our goal is to create an architecture and system that facilitate fast performance tuning. The advantages of this approach are that it relieves programmers of the bulk of the tuning effort and increases performance portability. Past examples:
Mutable functional units. We combined an integer ALU and a floating-point adder into a single unit that is configurable at run time to operate in integer mode or floating-point mode. This approach saves die area allocated to functional units [ICCD'01].