Research

Computer Architecture & Arithmetic Papers

Bit-Plane Compression: Transforming Data for Better Compression in Many-core Architectures (). Proceedings of the International Symposium on Computer Architecture (ISCA).
Abstract
As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep up with the demand growth is challenging, however, due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth. BPC targets homogeneously typed memory blocks, which are prevalent in many-core architectures, and applies a smart data transformation both to improve the inherent data compressibility and to reduce the complexity of compression hardware. We demonstrate that BPC provides a superior compression ratio of 4.1:1 for integer benchmarks and significantly reduces memory bandwidth requirements.
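The core delta-then-transpose idea can be sketched in a few lines of Python. This is a simplified illustration of bit-plane transformation (assuming 32-bit words and plain consecutive deltas), not the paper's exact encoding pipeline.

```python
def bit_plane_transform(words, width=32):
    """Sketch of BPC-style data transformation: delta consecutive words,
    then transpose so bit-plane i gathers bit i of every delta."""
    mask = (1 << width) - 1
    # Delta step: homogeneously typed data tends to have small differences
    deltas = [words[0]] + [(words[i] - words[i - 1]) & mask
                           for i in range(1, len(words))]
    # Transpose step: correlated high-order bits collapse into all-zero planes
    planes = []
    for bit in range(width):
        plane = 0
        for d in deltas:
            plane = (plane << 1) | ((d >> bit) & 1)
        planes.append(plane)
    return planes
```

On a simple ramp such as [100, 101, 102, 103], most of the 32 resulting planes are all-zero, which is what makes the transformed block easy to compress.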
All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory (). Proceedings of the International Symposium on Computer Architecture (ISCA).
Abstract
Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals. AIECC provides strong end-to-end protection of memory, detecting nearly 100% of CCCA errors and also preventing transmission errors from causing latent memory data corruption. AIECC provides these system-level benefits without requiring extra storage and transfer overheads and without degrading the effective level of data protection.
Low-Cost Duplicate Multiplication (). Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH).
Abstract
Rising levels of integration, decreasing component reliabilities, and the ubiquity of computer systems make error protection a rising concern. Meanwhile, the uncertainty of future fault and error modes motivates the design of strong error detection mechanisms that offer fault-agnostic error protection. Current concurrent hardware mechanisms, however, either offer strong error detection coverage at high cost, or restrict their coverage to narrow synthetic error modes. This paper investigates the potential for duplication using alternate number systems to lower the costs of duplicated multiplication without sacrificing error coverage. An example of such a low-cost duplication scheme is described and evaluated; it is shown that specialized carry-save checking can be used to increase the efficiency of duplicated multiplication.
Frugal ECC: Efficient and Versatile Memory Error Protection through Fine-Grained Compression (). Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC).
Abstract
Because main memory is vulnerable to errors and failures, large-scale systems and critical servers utilize error checking and correcting (ECC) mechanisms to meet their reliability requirements. We propose a novel mechanism, Frugal ECC (FECC), that combines ECC with fine-grained compression to provide versatile protection that can be both stronger and lower overhead than current schemes, without sacrificing performance. FECC compresses main memory at cache-block granularity, using any left over space to store ECC information. Compressed data and its ECC information are then frequently read with a single access even without redundant memory chips; blocks that do not compress sufficiently require additional storage and accesses. As examples of FECC, we present chipkill-correct and chipkill-level ECCs on a 64-bit non-ECC DIMM with ×4 DRAM chips. We also describe the first true chipkill-correct ECC for ×8 devices using conventional ECC DIMMs. FECC relies on a new Coverage-oriented Compression (CoC) scheme that we developed specifically for the modest compression needs of ECC and for floating-point data. CoC can sufficiently compress 84% of all accesses in SPEC Int, 93% in SPEC FP, 95% in SPLASH2X, and nearly 100% in the NPB suites. With such high compression coverage, the worst case performance degradation from FECC is less than 3.7% while reliability is slightly improved and energy consumption reduced by about 50% for true chipkill-correct.
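The placement decision described above can be sketched as follows; the sizes, return labels, and toy compressor are illustrative assumptions, not FECC's actual encoding.

```python
def place_block(data, compress, ecc_len=8, block_len=64):
    """Sketch of fine-grained compression for ECC: if the compressed cache
    block leaves enough room for its ECC symbols, data and ECC travel in a
    single access; otherwise the ECC spills to extra storage and accesses."""
    comp = compress(data)
    if len(comp) + ecc_len <= block_len:
        return ("single-access", comp)   # data + ECC fit in one transfer
    return ("extra-access", data)        # insufficient compression: ECC stored separately
```

For example, with a toy compressor that strips trailing zero bytes, a mostly-zero 64-byte block takes the single-access path while an incompressible one does not.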
Bamboo ECC: Strong, Safe, and Flexible Codes for Reliable Computer Memory (). Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
Abstract
Growing computer system sizes and levels of integration have made memory reliability a primary concern, necessitating strong memory error protection. As such, large-scale systems typically employ error checking and correcting codes to trade redundant storage and bandwidth for increased reliability. While stronger memory protection will be needed to meet reliability targets in the future, it is undesirable to further increase the amount of storage and bandwidth spent on redundancy. We propose a novel family of single-tier ECC mechanisms called Bamboo ECC to simultaneously address the conflicting requirements of increasing reliability while maintaining or decreasing error protection overheads.

Relative to the state-of-the-art single-tier error protection, Bamboo ECC codes have superior correction capabilities, all but eliminate the risk of silent data corruption, and can also increase redundancy at a fine granularity, enabling more adaptive graceful downgrade schemes. These strength, safety, and flexibility advantages translate to a significantly more reliable memory system. To demonstrate this, we evaluate a family of Bamboo ECC organizations in the context of conventional 72b and 144b DRAM channels and show the significant error coverage and memory lifespan improvements of Bamboo ECC relative to existing SEC-DED, chipkill-correct and double-chipkill-correct schemes.
Truncated Logarithmic Approximation (). Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH).
Abstract
The speed and levels of integration of modern devices have risen to the point that arithmetic can be performed very fast and with high precision. Precise arithmetic comes at a hidden cost: by computing results past the precision they require, systems use their resources inefficiently. Numerous designs over the past fifty years have demonstrated scalable efficiency by utilizing approximate logarithms. Many such designs are based on a linear approximation algorithm developed by Mitchell. This paper evaluates a truncated form of the binary logarithm as a replacement for Mitchell's algorithm. The truncated approximate logarithm simultaneously improves the efficiency and precision of Mitchell's approximation while remaining simple to implement.
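Mitchell's linear approximation is simple enough to state directly. The truncated variant below is a hypothetical illustration of the idea (keeping only a few fraction bits), not the paper's exact scheme.

```python
def mitchell_log2(x):
    """Mitchell's approximation: log2(x) ~= k + f, where k is the position
    of the leading one and f linearly interpolates the mantissa."""
    k = x.bit_length() - 1
    f = (x - (1 << k)) / (1 << k)
    return k + f

def truncated_log2(x, bits=4):
    """Hypothetical truncated form: keep only the top `bits` fraction bits."""
    k = x.bit_length() - 1
    frac = ((x - (1 << k)) << bits) >> k
    return k + frac / (1 << bits)
```

Mitchell's estimate never exceeds the true logarithm and its error is bounded by about 0.086; for instance, mitchell_log2(48) yields 5.5 against the exact 5.585.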
On Separable Error Detection for Addition (). Proceedings of the Asilomar Conference on Signals, Systems, and Computers.
Abstract
Addition is ubiquitous in computer systems, and rising error rates make error detection within adders increasingly important. This paper considers the best way to introduce strong, non-intrusive error detection to fixed-point addition within an existing, optimized machine datapath. A flexible family of separable error detection techniques, called carry-propagate/carry-free (CP/CF) duplication, is presented that offers superior error detection efficiency for a variety of adders.
Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (). Scientific Programming.
Abstract
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches.
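The preserve/detect/re-execute semantics can be sketched as a nested construct. The class, method names, and retry policy here are illustrative assumptions, not the actual containment-domains API.

```python
class ContainmentDomain:
    """Sketch of a weakly transactional, nested containment domain:
    preserve state on entry, detect errors on exit, re-execute locally
    on failure, and escalate to the enclosing domain as a last resort."""
    def __init__(self, preserve, body, check, max_retries=3):
        self.preserve = preserve        # capture only the state this domain needs
        self.body = body                # the contained computation
        self.check = check              # domain-local error detection
        self.max_retries = max_retries

    def run(self):
        saved = self.preserve()         # hierarchical state preservation
        for _ in range(self.max_retries):
            result = self.body(saved)   # may itself run nested domains
            if self.check(result):      # error contained: commit the result
                return result
        # Uncontained: recovery is delegated to a coarser-grained ancestor
        raise RuntimeError("escalate to enclosing containment domain")
```

A domain whose body fails a detection check is simply re-executed from its own preserved state rather than triggering a global rollback.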
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures (). Proceedings of the International Symposium on Microarchitecture (MICRO).
Abstract
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality, without sacrificing control simplicity or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy-efficiency, and memory throughput for a large range of applications.
Truncated Error Correction for Flexible Approximate Multiplication (). Proceedings of the Asilomar Conference on Signals, Systems, and Computers.
Abstract
Binary logarithms can be used to perform computer multiplication through simple addition. Exact logarithmic (and anti-logarithmic) conversion is prohibitively expensive for use in general multipliers; however, inexpensive estimate conversions can be used to perform approximate multiplication. Such approximate multipliers have been used in domain-specific applications, but existing designs offer either superior efficiency or superior flexibility, not both. This study proposes a flexible approximate multiplier with improved efficiency. Preliminary analysis indicates that this design provides up to a 50% efficiency advantage relative to prior flexible approximate multipliers.
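A Mitchell-style logarithmic multiplier illustrates the approach: add the approximate logs, then take an approximate antilog. This sketch omits the truncation-based error correction the paper proposes.

```python
def approx_multiply(a, b):
    """Approximate multiplication via approximate binary logarithms:
    a * b ~= antilog(log2(a) + log2(b)), with Mitchell's linear estimates."""
    ka, kb = a.bit_length() - 1, b.bit_length() - 1
    fa = (a - (1 << ka)) / (1 << ka)    # fractional part of log2(a)
    fb = (b - (1 << kb)) / (1 << kb)
    k, f = ka + kb, fa + fb             # "multiply" becomes an addition
    if f >= 1.0:                        # carry out of the fraction field
        k, f = k + 1, f - 1.0
    return (1 << k) * (1 + f)           # approximate antilogarithm
```

The result is exact when both operands are powers of two and never errs by more than about 11.1% otherwise.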
Towards Proportional Memory Systems (). Intel Technology Journal.
Abstract
Off-chip memory systems are currently designed to present a uniform interface to the processor. Applications and systems, however, have dynamic and heterogeneous requirements in terms of reliability and access granularity because of complex usage scenarios and differing spatial locality. We argue that memory systems should be proportional, in that the data transferred and overhead of error protection be proportional to the requirements and characteristics of the running processes. We describe two techniques and specific designs for achieving aspects of proportionality in the main memory system. The first, dynamic and adaptive granularity, utilizes conventional DRAM chips with minor DIMM modifications (along with new control and prediction mechanisms) to access memory with either fine or coarse granularity depending on observed spatial locality. The second, virtualized ECC, is a software/hardware collaborative technique that enables flexible, dynamic, and adaptive tuning between the level of protection provided for a virtual memory page and the overhead of bandwidth and capacity for achieving protection. Both mechanisms have a small hardware overhead, can be disabled to match the baseline, and provide significant benefits when in use.
The Dynamic Granularity Memory System (). Proceedings of the International Symposium on Computer Architecture (ISCA).
Abstract
Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.
Long Residue Checking for Adders (). Proceedings of the International Conference on Application-specific Systems, Architectures and Processors (ASAP).
Abstract
As system sizes grow and devices become more sensitive to faults, adder protection may be necessary to achieve system error-rate bounds. This study investigates a novel fault detection scheme for fast adders, long residue checking (LRC), which has substantive advantages over all previous separable approaches. Long residues are found to provide a ~10% reduction in complexity and ~25% reduction in power relative to the next most efficient error detector, while remaining modular and easy to implement.
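Separable residue checking, the family LRC belongs to, can be illustrated with a small mod-3 check: the checker predicts the sum's residue from the operands' residues and flags any mismatch. The modulus and structure here are illustrative, not the paper's long-residue design.

```python
def residue3(x):
    """Mod-3 residue of a value; the check channel operates only on residues."""
    return x % 3

def checked_add(a, b):
    """Residue-checked addition: residues are preserved by addition, so the
    sum's residue must equal the (mod-3) sum of the operands' residues."""
    s = a + b                                     # main (possibly faulty) adder
    predicted = (residue3(a) + residue3(b)) % 3   # cheap, independent prediction
    if residue3(s) != predicted:
        raise RuntimeError("adder error detected")
    return s
```

A fault that corrupted the sum 5 + 7 from 12 to 13 would be caught, since 13 mod 3 = 1 disagrees with the predicted residue 0.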
Hybrid Deterministic/Monte Carlo Neutronics Using GPU Accelerators (). Proceedings of the International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES).
Abstract
In this paper we discuss a GPU implementation of a hybrid deterministic/Monte Carlo method for the solution of the neutron transport equation. The key feature is using GPUs to perform a Monte Carlo transport sweep as part of the evaluation of the nonlinear residual and Jacobian-vector product. We describe the algorithm and present some preliminary numerical results which illustrate the effectiveness of the GPU Monte Carlo sweeps.
Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (). Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC).
Abstract
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches.
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems (). Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
Abstract
Modern memory systems rely on spatial locality to provide high bandwidth while minimizing memory device power and cost. The trend of increasing the number of cores that share memory, however, decreases apparent spatial locality because access streams from independent threads are interleaved. Memory access scheduling recovers only a fraction of the original locality because of buffering limits. We investigate new techniques to reduce inter-thread access interference. We propose to partition the internal memory banks between cores to isolate their access streams and eliminate locality interference. We implement this by extending the physical frame allocation algorithm of the OS such that physical frames mapped to the same DRAM bank can be exclusively allocated to a single thread. We compensate for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. This combined approach, unlike memory bank partitioning or sub-ranking alone, simultaneously increases overall performance and significantly reduces memory power consumption.
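The OS-level mechanism can be sketched as a frame allocator that respects a per-core bank partition; the frame-to-bank mapping and partitioning policy below are simplifying assumptions.

```python
class BankPartitionedAllocator:
    """Sketch of bank-aware physical frame allocation: each core may only
    receive frames that map to its private slice of the DRAM banks, so the
    cores' access streams never interleave within a bank."""
    def __init__(self, num_frames, num_banks, num_cores):
        self.num_banks = num_banks
        self.banks_per_core = num_banks // num_cores
        self.free = list(range(num_frames))

    def bank_of(self, frame):
        return frame % self.num_banks   # assumed frame-to-bank address mapping

    def alloc(self, core):
        lo = core * self.banks_per_core
        hi = lo + self.banks_per_core
        for i, frame in enumerate(self.free):
            if lo <= self.bank_of(frame) < hi:
                return self.free.pop(i)
        raise MemoryError("no free frame in this core's bank partition")
```

Sub-ranking would then multiply the number of independent banks each core sees, restoring bank-level parallelism within a partition.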
Hybrid Residue Generators for Increased Efficiency (). Proceedings of the Asilomar Conference on Signals, Systems, and Computers.
Abstract
In order for residue checking to effectively protect computer arithmetic, designers must be able to efficiently compute the residues of the input and output signals of functional units. Low-cost, single-cycle residue generators can be readily formed out of two’s complement adders in two ways, which present an area/delay tradeoff. A residue generator using adder-incrementers for end-around-carry adders is small but slow, and a design using carry-select adders is fast but large. It is shown that a hybrid combination of both approaches is more efficient than either.
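The adder-based residue generation these designs share can be sketched for a low-cost modulus of the form 2^k - 1, where summing k-bit digits with end-around carry yields the residue. This shows the arithmetic structure only; the hybrid adder-incrementer/carry-select organization itself is a hardware design choice not modeled here.

```python
def residue_mod_2k_minus_1(x, k=3):
    """Residue of x modulo 2^k - 1 (7 for k = 3), computed the way adder
    trees do it: sum the k-bit digits, folding each carry-out back into
    the low end (end-around carry). Works because 2^k is congruent to 1."""
    m = (1 << k) - 1
    r = 0
    while x:
        r += x & m               # accumulate the next k-bit digit
        x >>= k
    while r > m:
        r = (r & m) + (r >> k)   # end-around carry fold
    return 0 if r == m else r    # 2^k - 1 is itself congruent to 0
```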

Public Technical Reports

Survey of Error and Fault Detection Mechanisms v2 (). Technical report TR-LPH-2012-001, LPH Group, Department of Electrical and Computer Engineering, The University of Texas at Austin.
Abstract
This report describes diverse error detection mechanisms that can be utilized within a resilient system to protect applications against various types of errors and faults, both hard and soft. These detection mechanisms have different overhead costs in terms of energy, performance, and area, and also differ in their error coverage, complexity, and programmer effort.

In order to achieve the highest efficiency in designing and running a resilient computer system, one must understand the trade-offs among the aforementioned metrics for each detection mechanism and choose the most efficient option for a given running environment. To accomplish such a goal, we first enumerate many error detection techniques previously suggested in the literature.
Containment Domains: A Full-System Approach to Computational Resiliency (). Technical report TR-LPH-2011-001, LPH Group, Department of Electrical and Computer Engineering, The University of Texas at Austin.
Abstract
Operating the Echelon system at optimal energy efficiency under a wide range of environmental conditions and operating scenarios requires a comprehensive and flexible resiliency solution. Our research focuses on two main directions: providing mechanisms for proportional resiliency that use application-tunable hardware/software cooperative error detection and recovery, and enabling hierarchical state preservation and restoration as an alternative to non-scalable and inefficient generic checkpoint and restart. As an interface to these orthogonal research areas, we are developing and evaluating programming constructs based on the concept of Containment Domains. Containment Domains enable the programmer and software system to interact with the hardware and establish a hierarchical organization for preserving necessary state, multi-granularity error detection, and minimal re-execution. Containment Domains also empower the programmer to reason about resiliency and utilize domain knowledge to improve efficiency beyond what compiler analysis can achieve without such information. This report describes our initial framework and focuses on the aspects related to efficient state preservation and restoration.

Other Papers

An Analytical Model for Hardened Latch Selection and Exploration (). Workshop on Silicon Errors in Logic--System Effects (SELSE).
Abstract
Hardened flip-flops and latches are designed to be resilient to soft errors, maintaining high system reliability in the presence of energetic radiation. The wealth of different hardened designs (with varying protection levels) and the probabilistic nature of reliability complicates the choice of which hardened storage element to substitute where. This paper develops an analytical model for hardened latch and flip-flop design space exploration. It is shown that the best hardened design depends strongly on the target protection level and the chip that is being protected. Also, the use of multiple complementary hardened cells can combine the relative advantages of each design, garnering significant efficiency improvements in many situations.
Nanoprecipitation-Assisted Ion Current Oscillations (). Nature Nanotechnology.
Abstract
Nanoscale pores exhibit transport properties that are not seen in micrometre-scale pores, such as increased ionic concentrations inside the pore relative to the bulk solution, ionic selectivity and ionic rectification. These nanoscale effects are all caused by the presence of permanent surface charges on the walls of the pore. Here we report a new phenomenon in which the addition of small amounts of divalent cations to a buffered monovalent ionic solution results in an oscillating ionic current through a conical nanopore. This behaviour is caused by the transient formation and redissolution of nanoprecipitates, which temporarily block the ionic current through the pore. The frequency and character of ionic current instabilities are regulated by the potential across the membrane and the chemistry of the precipitate. We discuss how oscillating nanopores could be used as model systems for studying nonlinear electrochemical processes and the early stages of crystallization in sub-femtolitre volumes. Such nanopore systems might also form the basis for a stochastic sensor.