High Performance Computing

RISC Architecture, Optimization and Benchmarks

Kevin Dowd

Publisher: O'Reilly, 1993, 371 pages

ISBN: 1-56592-032-5

Keywords: Information Systems

Last modified: March 16, 2022, 12:07 a.m.

The latest generation of workstations, including IBM's RS/6000, DEC's Alpha AXP, Sun's SuperSPARC, HP's 700 series, and others, incorporates many advanced features: pipelines, RISC instruction sets, long instruction words, multiprocessing support, and more. These features aren't all new; they've been used on supercomputers for a while. What is new is that "supercomputer" features are now appearing on desktop computers.

What do these changes mean for us? Well, they've made workstations a lot more interesting for "armchair" architects. If you'd like to know how the hardware on your desk works, this book is a good place to start; your workstation is a lot more complicated than it was in 1980!

If you're a software developer, you probably know that getting the most out of a modern workstation can be tricky. Paying closer attention to memory reference patterns and loop structure can have a huge payoff. This book discusses how modern workstations get their performance, and how you can write code that makes optimal use of your hardware.
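
The payoff from loop and memory tuning is easy to see in a few lines of code. The sketch below is not from the book; it is a minimal C example of the loop-interchange idea the book develops. Both routines compute the same sum, but the interchanged version walks the array in the order C stores it, so its memory references stay sequential and cache-friendly.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    /* Column-order traversal: each inner-loop step jumps N doubles ahead,
     * so almost every access lands on a different cache line. */
    static double sum_by_columns(void)
    {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    /* Interchanged loops: row-order traversal matches C's row-major layout,
     * so consecutive iterations touch consecutive memory locations. */
    static double sum_by_rows(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int main(void)
    {
        /* Same answer either way; the difference shows up in run time. */
        printf("%f %f\n", sum_by_columns(), sum_by_rows());
        return 0;
    }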

If you're involved with purchasing or evaluating workstations, this book will help you make intelligent comparisons. You'll learn what the newest set of buzzwords really means, how caches and other architectural tricks affect performance, how to interpret the commonly quoted industry benchmarks, and how to run your own benchmarks.
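
On the do-it-yourself side, a benchmark is ultimately just a kernel wrapped in a timer. The harness below is a hypothetical sketch, not material from the book: it times a simple daxpy-style loop with gettimeofday and reports a MFLOPS figure, the kind of single-stream measurement the benchmarking chapters discuss. The kernel, problem size, and repeat count are placeholders you would swap for your own workload.

    #include <stdio.h>
    #include <sys/time.h>

    #define N    1000000
    #define REPS 100        /* repeat so the run is well above timer resolution */

    static double x[N], y[N];

    /* Placeholder kernel: y = a*x + y over the whole vector. */
    static void kernel(double a)
    {
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Wall-clock time in seconds. */
    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        double start = wall_seconds();
        for (int r = 0; r < REPS; r++)
            kernel(3.0);
        double elapsed = wall_seconds() - start;

        /* Two floating-point operations per element per repetition. */
        printf("time = %g s, rate = %g MFLOPS\n",
               elapsed, 2.0 * N * REPS / elapsed / 1.0e6);
        return 0;
    }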

Whatever you do, you'll find that this book is an indispensable guide to the workstations of the 90s. Topics covered include:

  • CPU and Memory Architecture for RISC Workstations
  • Optimizing Compilers
  • Timing and Profiling Programs
  • Understanding Parallelism
  • Loop and Memory Reference Optimization
  • Benchmarking
  • Parallel Computing and Multiprocessing
  1. Modern Computing Architectures
    1. What is High Performance Computing?
      • Why Worry About Performance?
      • Measuring Performance
      • The Next Step
    2. RISC Computers
      • Why CISC?
        • Space and Time
        • Beliefs About Complex Instruction Sets
        • Memory Addressing Modes
        • Microcode
      • Making the Most of a Clock Tick
        • Pipelines
        • Instruction Pipelining
      • Why RISC?
        • Characterizing RISC
      • A Few More Words About Pipelining
        • Memory References
        • Floating-Point Pipelines
      • Classes of Processors
        • Superscalar Processors
        • Superpipelined Processors
        • Long Instruction Word (LIW)
      • Other Advanced Features
        • Register Bypass
        • Register Renaming
        • Reducing Branch Penalties
      • Closing Notes
    3. Memory
      • Memory Technology
        • Random Access Memory
        • Access Time
      • Caches
        • Direct Mapped Cache
        • Fully Associative Cache
        • Set Associative Cache
        • Uses of Cache
      • Virtual Memory
        • Page Tables
        • Translation Lookaside Buffer
        • Page Faults
      • Improving Bandwidth
        • Large Caches
        • Interleaved Memory Systems
        • Software Managed Caches
        • Memory Reference Reordering
        • Multiple References
      • Closing Notes
  2. Porting and Tuning Software
    1. What an Optimizing Compiler Does
      • Optimizing Compiler Tour
        • Intermediate Language Representation
        • Basic Blocks
        • Forming a DAG
        • Uses and Definitions
        • Loops
        • Object Code Generation
      • Classical Optimizations
        • Copy Propagation
        • Constant Folding
        • Dead Code Removal
        • Strength Reduction
        • Variable Renaming
        • Common Subexpression Elimination
        • Loop Invariant Code Motion
        • Induction Variable Simplification
        • Register Variable Detection
      • Closing Notes
    2. Clarity
      • Under Construction
      • Comments
      • Clues in the Landscape
      • Variable Names
      • Variable Types
      • Named Constants
      • INCLUDE Statements
      • Use of COMMON
      • The Shape of Data
      • Closing Notes
    3. Finding Porting Problems
      • Problems in Argument Lists
        • Aliasing
        • Argument Type Mismatch
      • Storage Issues
        • Equivalenced Storage
        • Memory Reference Alignment Restrictions
      • Closing Notes
    4. Timing and Profiling
      • Timing
        • Timing a Whole Program
        • Timing a Portion of the Program
        • Using Timing Information
      • Subroutine Profiling
        • prof
        • gprof
        • gprof's Flat Profile
        • Accumulating the Results of Several gprof Runs
        • A Few Words About Accuracy
      • Basic Block Profilers
        • tcov
        • lprof
        • pixie
      • Closing Notes
    5. Understanding Parallelism
      • A Few Important Constants
        • Constants
        • Scalars
      • Vectors and Vector Processing
      • Dependencies
        • Data Dependencies
        • Control Dependencies
      • Ambiguous References
      • Closing Notes
    6. Eliminating Clutter
      • Subroutine Calls
        • Macros
        • Procedure Inlining
      • Branches
        • Wordy Conditionals
        • Redundant Tests
      • Branches Within Loops
        • Loop Invariant Conditionals
        • Loop Index Dependent Conditionals
        • Independent Loop Conditionals
        • Dependent Loop Conditionals
        • Reductions
        • Conditionals That Transfer Control
        • A Few Words About Branch Probability
      • Other Clutter
        • Data Type Conversions
        • Doing Your Own Common Subexpression Elimination
        • Doing Your Own Code Motion
        • Handling Array Elements in Loops
      • Closing Notes
    7. Loop Optimizations
      • Basic Loop Unrolling
      • Qualifying Candidates for Loop Unrolling
        • Loops with Low Trip Counts
        • Fat Loops
        • Loops Containing Procedure Calls
        • Loops with Branches in Them
        • Recursive Loops
      • Negatives of Loop Unrolling
        • Unrolling by the Wrong Factor
        • Register Trashing
        • Instruction Cache Miss
        • Other Hardware Delays
      • Outer Loop Unrolling
        • Outer Loop Unrolling to Expose Computations
      • Associative Transformations
        • Reductions
        • Dot Products and daxpys
        • Matrix Multiplication
      • Loop Interchange
        • Loop Interchange to Move Computations to the Center
      • Operation Counting
      • Closing Notes
    8. Memory Reference Optimizations
      • Memory Access Patterns
        • Loop Interchange to Ease Memory Access Patterns
        • Blocking to Ease Memory Access Patterns
      • Ambiguity in Memory References
        • Ambiguity in Vector Operations
        • Pointer Ambiguity in Numerical C Applications
      • Programs That Require More Memory Than You Have
        • Software-Managed, Out-of-Core Solutions
        • Virtual Memory
      • Instruction Cache Ordering
      • Closing Notes
    9. Language Support for Performance
      • Subroutine Libraries
        • Vectorizing Preprocessors
      • Explicitly Parallel Languages
        • Fortran 90
        • High Performance Fortran (HPF)
        • Explicitly Parallel Programming Environments
      • Closing Notes
  3. Evaluating Performance
    1. Industry Benchmarks
      • What is a MIP?
        • VAX MIPS
        • Dhrystones
      • Floating Point Benchmarks
        • Linpack
        • Whetstone
      • The SPEC Benchmark
        • Individual SPEC Benchmarks
        • 030.matrix300 Was Deleted
      • Transaction Processing Benchmarks
        • TPC-A
        • TPC-B
        • TPC-C
      • Closing Notes
    2. Running Your Own Benchmarks
      • Choosing What to Benchmark
        • Benchmark Run Time
        • Benchmark Memory Size
        • Kernels and Sanitized Benchmarks
        • Benchmarking Third Party Codes
      • Types of Benchmarks
        • Single Stream Benchmarking
        • Throughput Benchmarks
        • Interactive Benchmarks
      • Preparing the Code
        • Portability
        • Making a Benchmark Kit
        • Benchmarking Checklist
      • Closing Notes
  4. Parallel Computing
    1. Large Scale Parallel Computing
      • Problem Decomposition
        • Data Decomposition
        • Control Decomposition
        • Distributing Work Fairly
      • Classes of Parallel Architectures
      • Single Instruction, Multiple Data
        • SIMD Architecture
        • Mechanics of Programming a SIMD Machine
      • Multiple Instruction, Multiple Data
        • Distributed Memory MIMD Architecture
        • Programming a Distributed Memory MIMD Machine
        • A Few Words About Data Layout Directives
        • Virtual Shared Memory
      • Closing Notes
    2. Shared-Memory Multiprocessors
      • Symmetric Multiprocessing
        • Operating System Support for Multiprocessing
        • Multiprocessor Architecture
      • Shared Memory
        • Conservation of Bandwidth
        • Coherency
        • Data Placement
      • Multiprocessor Software Concepts
        • Fork and Join
        • Synchronization with Locks
        • Synchronization with Barriers
      • Automatic Parallelization
        • Loop Splitting
        • Subroutine Calls in Loops
        • Nested Loops
        • Manual Parallelism
      • Closing Notes
  1. Processor Overview
  2. How to Tell When Loops Can Be Interchanged
  3. Obtaining Sample Programs and Problem Set Answers
    • FTP
    • FTPMAIL
    • BITFTP
    • UUCP

Reviews

High Performance Computing

Reviewed by Roland Buresund

Very Good ******** (8 out of 10)

Last modified: May 21, 2007, 3:06 a.m.

Still valid and interesting. Wish I had an update.
