Asynchronous Heterogeneity: The Next Big Thing In HEC


A number of computer vendors have quietly moved into heterogeneous processing,
whether they use 1) heterogeneous cores within the processor architecture, or
2) heterogeneous processors within the system architecture (or both), as a
natural response to application diversity.  After all, given that different
applications in a workload, or different portions of the same application,
can have significant differences in algorithmic characteristics, it is just
common sense to offload portions of an application with special algorithmic
characteristics to a coprocessor that is ideally suited to computing portions
of that type.

Algorithmic variability within a single application arises naturally from that
application's dynamic mix of different forms of parallelism, different forms
of locality, different forms of communication, and/or different forms of
synchronization.

This offloading may be loosely or tightly coupled.  The standard textbook view
of coprocessors is that they extend the instruction-set architecture (ISA) of
the main processor to enrich its functionality.  In this view, offloading
essentially consists of a remote procedure call to the coprocessor, where the
main processor waits for the coprocessor to return.  But such tight coupling
results in sequential processing, which is not what we want.  To obtain the
benefits of parallel processing, we must configure these processors (or cores)
of different types as self-contained, autonomous computers, with distinct
instruction sets and distinct _uncoupled_ program counters.  We must also be
extremely wary of synchronization, which easily serializes the computation.
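
To make the coupling distinction concrete, here is a minimal sketch in C with
POSIX threads.  All names are invented for illustration; the "coprocessor" is
simply a second thread with its own program counter.  In the tightly coupled
version the main processor offloads and immediately waits; in the loosely
coupled version it keeps computing and joins only when it truly needs the
result.

Blocking Versus Asynchronous Offload (sketch)
_____________________________________________

/* Illustrative only.  Compile with: cc -pthread offload.c */
#include <pthread.h>
#include <stdio.h>

static double coprocessor_kernel(double x)    /* stands in for offloaded work */
{
    return x * x;
}

static double result;                  /* filled in by the "coprocessor" thread */

static void *copro_main(void *arg)     /* its own code, its own program counter */
{
    result = coprocessor_kernel(*(double *)arg);
    return NULL;
}

int main(void)
{
    pthread_t copro;
    double    x = 3.0;

    /* Tightly coupled view: offload, then wait.  The main processor idles,
     * so the two processors execute strictly one after the other. */
    pthread_create(&copro, NULL, copro_main, &x);
    pthread_join(copro, NULL);         /* serializes the computation */
    printf("blocking offload: %f\n", result);

    /* Loosely coupled view: offload, then keep computing.  The main
     * processor synchronizes only when it truly needs the result. */
    pthread_create(&copro, NULL, copro_main, &x);
    double local = 0.0;
    for (int i = 0; i < 1000000; i++)  /* useful scalar work in parallel */
        local += (double)i;
    pthread_join(copro, NULL);         /* one late, unavoidable join */
    printf("asynchronous offload: %f (local work %f)\n", result, local);
    return 0;
}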

Not every system uses asynchronous heterogeneity to accomplish the same goals.
Indeed, some design teams have not fully addressed one fundamental problem in
system-level asynchronous heterogeneity, viz., what styles of interaction,
i.e., what styles of communication and synchronization, minimize the
performance degradation of application portions of a given type that run on
processors of the corresponding type?  If large amounts of the computation run
on processors of a given type, you don't want to slow that type of processor
down, because of Amdahl's law.

Pragmatically, application portions are threads of a given type marshaled by
the parallelizing compiler from diverse sets of operations in the computation
according to well-specified, system-defined criteria.

In this article, the term "asynchronous heterogeneity" refers indifferently to
loosely coupled heterogeneous processing at either the processor or the system
level.  There is already a pre-thesis here: If portions of an application are
(somewhat uniformly) distributed across processors of different types, and if
each type is expected to perform a significant fraction of the computation,
then---given our concern with response times---there are clearly better and
worse styles of interaction between processors (or cores) of different types.
Quite simply, an interaction style is better if it does not reduce parallelism
that could otherwise have been exploited.

Heterogeneity is all around us.  Japan is spending close to a billion dollars
to build a "usable" 10-PFs/s system, which is due out in 2011, to simulate
climate change and galaxy formation, and to predict the behavior of new drugs.
This will be a specialized design, one reads, perhaps incorporating several
different breeds of processor in a single system.  "The architecture ... is a
hybrid [mixing vector processors, scalar processors, and special-purpose
compute engines]", says Jack Dongarra.  "To get to 10 PFs/s, I think you'd
need that".  Dongarra is quoted as saying that the planned system may even
include some chips that are designed to perform just one type of calculation,
albeit at lightning speed.

Whether conventional commodity clusters can scale to 10 PFs/s may be a
controversial question to some, but not to others.  We are already seeing
limits to "commodity scaling".  Recall that the phrase "commodity processor"
typically means a single-core RISC superscalar processor optimized for
single-thread performance and dependent for its performance both on its cache
and on complex, power-hungry control circuitry for out-of-order execution.

Recently, processor vendors trying to scale complex microarchitectures to
higher single-processor performance have been forced by the laws of physics
to move to multicore microarchitectures (2X, 4X, 8X, ...), where each core is
just a smaller instance of the _same_ conventional, complex, power-hungry
design.  This doesn't particularly suit cluster-vendor integrators.  Why not?
Vast agglomerations of multi-complex-core chips are proving to be extremely
difficult to cool.  Who knows?  Homogeneity may simply annihilate conventional
killer micros in large clusters.  (Your correspondent suspects that most
multi-simple-core processors will be heterogeneous---unless they are just one
processor type in a larger heterogeneous parallel system).

Standing somewhat apart, Blue Gene/L is not the best example of scaling
difficulties because its processors have been explicitly designed to be less
power-hungry.

Closer to home, Cray Inc., SGI, and SRC (among others) have introduced largely
autonomous FPGAs into some of their processor architectures.  For example,
in SGI Altix 3000 Series Superclusters (think of the large Columbia Project
supercluster at NASA Ames), the NUMAlink interconnect is used in several ways:
1) to scale memory, 2) to scale I/O (distinct I/O processors), 3) to scale
application-specific units (FPGAs), and 4) to scale visualization (GPUs).
These are just multichip heterogeneous processors.  For example, two Itanium2
CPUs are linked to an interface chip, two FPGAs are linked to a separate
interface chip, the two interface chips have a NUMAlink interconnect between
them, and finally the first interface chip is linked to memory.  This
heterogeneous processor is then used as the building block of a homogeneous
system architecture.

Without question, the most visible example of asynchronous heterogeneity is
the Sony-Toshiba-IBM (STI) CELL processor architecture, which was initially
designed for the PlayStation 3, but which is being adapted by IBM for more
general-purpose uses.  While the CELL is a heterogeneous processor
microarchitecture, rather than a heterogeneous system architecture, it vividly
illustrates the power of heterogeneous processing.

--- The IBM CELL Processor Architecture

In some sense, the CELL processor architecture is a scaled-down parallel
vector processor on a single chip.  Each CELL is a single-chip, nine-core
microarchitecture with a control processor (the "scalar core"), eight
workhorse vector processors (the "vector cores"), an off-chip local memory,
and high-speed I/O channels to send/receive messages (in part to/from other
CELLs).  One of the most striking features of the CELL is the way in which
both off-chip local memory and on-chip "local stores" are used to guarantee
a sustained high-bandwidth stream of data operands to each of the vector
cores.

A CELL's local memory is one (possibly more) GB of RAM connected to the CELL
by two Rambus XDR memory controllers.  This dual-bus memory channel can feed
the CELL with data at the rate of 25.6 GBs/s.  Now, can, say, a vector core
directly load and store from the CELL's local memory?  Not exactly.  Every
vector core has a local store consisting of 256 KBs of SRAM.  It also has
128 128-bit vector registers.  The vector loads and stores of the vector core
read and write the local store of that vector core.  In addition, a vector
core can replenish its local store by requesting a block transfer from the
CELL's local memory, with permissible block sizes ranging anywhere from 1 to
16 KBs.  (Of course, block transfer is bidirectional).  The local store of a
vector core is thus a virtual second-level vector register file that is
"paged" against the CELL's local memory.  Obviously, if CELLs are used as
building blocks in medium-scale parallel systems, the compiler will be
responsible for some sophisticated data distribution if this "paging" scheme
is to be made to work.
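
As a rough, self-contained illustration of this "paging" scheme, consider the
sketch below.  It is not the actual Cell DMA interface: the dma_get helper is
a hypothetical stand-in, stubbed with memcpy so the code runs anywhere.  The
idea is simply that a vector thread's working set is copied from the CELL's
local memory into the 256-KB local store in blocks of at most 16 KBs before
the thread computes on it.

Local-Store "Paging" (sketch)
_____________________________

#include <string.h>
#include <stdio.h>

#define LOCAL_STORE_BYTES (256 * 1024)
#define MAX_BLOCK_BYTES   ( 16 * 1024)

static char local_store[LOCAL_STORE_BYTES];   /* per-core SRAM */

/* Hypothetical stand-in for an asynchronous block-transfer engine. */
static void dma_get(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy 'bytes' of prestaged data from the CELL's local memory into the
 * local store, one block at a time, before the vector thread starts. */
static void prestage(const char *local_memory, size_t bytes)
{
    size_t done = 0;
    while (done < bytes && done < LOCAL_STORE_BYTES) {
        size_t n = bytes - done;
        if (n > MAX_BLOCK_BYTES)
            n = MAX_BLOCK_BYTES;
        dma_get(local_store + done, local_memory + done, n);
        done += n;
    }
}

int main(void)
{
    static char local_memory[64 * 1024];      /* stands in for off-chip RAM */
    prestage(local_memory, sizeof local_memory);
    printf("prestaged %zu bytes into the local store\n", sizeof local_memory);
    return 0;
}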

Indeed, the CELL is a genuine RISC processor because it leaves all the Really
Interesting Stuff to the Compiler.  :-)

The CELL chip will be built using 65-nm VLSI process technology.  The scalar
core is a dual-issue, dual-threaded, in-order-execution RISC processor with
a slightly modified conventional cache but without elaborate branch-prediction
support.  This much simpler design uses considerably less power.  The primary
function of the scalar core is to queue up vectorized tasks (i.e., vector
threads) for execution by the vector cores.  The single scalar core thus
offloads all (vectorized) compute-intensive work to the vector cores.  It also
keeps the "outermost" scalar code for itself.

Is the scalar performance of the scalar core an issue?  (As is the custom,
each vector processor has an embedded scalar control unit that is tightly
coupled to the execution of vector instructions, but this is not our focus).
To repeat, the scalar core is responsible for executing the nonvectorizable
portion of any application that runs on the CELL, although vector cores could
---in a pinch---execute scalar threads.  Without much ILP, the scalar core
must pray that very little application work falls on its shoulders, and fairly
regular work at that.

Is the scalar core responsible for keeping the CELL's local memory fresh?
The CELL does have 76.8 GBs/s of I/O bandwidth.  Is the scalar core constantly
requesting block transfers from outside the CELL to the CELL's local memory?
In any case, it appears---at least in the uniprocessor case---that having a
scalar thread running on the scalar core _wait_ for the completion of some
compute-intensive vector thread it has offloaded onto a vector core is not an
immediately obvious performance problem.  (Any scalar threads that might run
on vector cores are another matter entirely).

Each vector core is a self-contained vector processor.  Like the scalar core,
vector cores implement in-order execution.  Each vector core has a peak
arithmetic performance of 32 GFs/s, which gives all eight of them a combined
peak of 256 GFs/s.  Let us examine the memory bandwidth that is required to
support that much raw vector functional-unit performance.

Recall that the Cray-1 used vector registers primarily to reduce the demand on
limited memory bandwidth, and that the Cray-2 extended this to a two-level
scheme.  Specifically, the Cray-2 had 16K words of (processor) local memory
that had to be explicitly managed by the compiler/programmer.  Compare this to
the CELL.  Each CELL vector core has 128 128-bit vector registers, 256 KBs of
computational memory, and a backing store for this computational memory, which
is just the local memory of the CELL.  The local store of a vector core allows
us to do without a cache of any kind.

Imagine a vector thread that, over its lifetime, needs hundreds or thousands
of bytes of data to run to completion, and has perfect foreknowledge of what
those data are.  This much data will easily fit into a vector core's local
store.  (The local store must contain both the program and the data).  So, if
all of the data for the vector thread can be prestaged in the local memory, a
vector core can obtain a "prestaged" vector thread, use the DMAC to transfer
the thread's program and all the thread's data from the CELL's local memory to
the vector core's local store, and then run the vector thread to completion.
Since the local store is a computational memory and not a cache, it can
sustain high-bandwidth operand delivery over the entire lifetime of the vector
thread.  Specifically, it can sustain 64 GBs/s on perfectly vectorized codes,
which is one register per cycle.
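
For the record, these per-core figures are mutually consistent if one assumes
a 4-GHz clock (the clock rate is not stated here, so treat it as an
assumption): a 4-wide single-precision fused multiply-add gives 8 flops per
cycle, hence 32 GFs/s per core and 256 GFs/s across eight cores, and one
128-bit (16-byte) register per cycle gives 64 GBs/s.  A trivial arithmetic
check:

Per-Core Arithmetic Check (sketch)
__________________________________

#include <stdio.h>

int main(void)
{
    double clock_ghz   = 4.0;    /* assumed; not stated in this article    */
    double flops_cycle = 8.0;    /* 4-wide SIMD fused multiply-add         */
    double bytes_cycle = 16.0;   /* one 128-bit vector register per cycle  */

    printf("per-core peak:         %.0f GFs/s\n", clock_ghz * flops_cycle);
    printf("eight-core peak:       %.0f GFs/s\n", 8.0 * clock_ghz * flops_cycle);
    printf("local-store bandwidth: %.0f GBs/s\n", clock_ghz * bytes_cycle);
    return 0;
}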

Once started, vector threads do not stop, i.e., do not pause or wait.  They
do not synchronize in midstream.  The effective thread state of a vector
thread running on a vector core consists of the contents of the vector
register file _plus_ the contents of the vector core's local store.  A context
switch is unthinkable.  Vector threads do not start until all their data has
been localized (first in the local memory and then in their local store) and,
once they have started, they do not stop for anything.

--- System-Level Asynchronous Heterogeneity

The Cray Cascade supercomputer design effort is one of three projects left in
DARPA's HPCS program.  (The other vendors are IBM and Sun).  The competition
is heating up as each vendor must decide soon whether to commit to actually
building a general-purpose petascale machine.  HPCS program managers would
prefer a potential product line consistent with the vendor's market strategy
rather than a showpiece machine that leads nowhere.  The Japanese have
committed large amounts of money to build a heterogeneous petascale
architecture in roughly the same time frame as the HPCS program.

The purpose of building such a machine is to give real users a comprehensible
and programmable parallel system that would allow them to routinely achieve
sustained petascale computing on their important science, engineering, and
national-security applications.  The roadblock is that these (future)
petascale applications are themselves extremely heterogeneous in their
algorithmic characteristics, and---given the very real constraints on system
cost, power, size, and programmability---there is no single processor
architecture, no single clever hybrid of the dataflow and von Neumann
computing paradigms, no single heterogeneous processor architecture even,
that could be used as the _only_ processor building block in an efficient,
general-purpose system-level homogeneous petascale architecture.

Similar concerns echo in the reconfigurability community.  A recent conference
announcement states, "Reconfigurable high-performance computing systems have
the potential to exploit coarse-grained functional parallelism through
conventional parallel processing as well as fine-grained parallelism through
direct hardware execution".  Thus, many of us see value in decomposing (and
molding) applications into component computations of a given algorithmic type,
and then running these components (i.e., threads of a given type) on distinct
hardware subsystems, most naturally on processors of a given type.  But if
decomposition is good, what is the good decomposition?

What kind of heterogeneity might a single application have?  Well, is it
vectorizable or not?  More precisely, what is the respective potential for
vector-level and thread-level parallelism?  Is it communication intensive or
not?  More precisely, is it temporally local or not? spatially local or not?
engaged in frequent long-range communication or not? engaged in frequent
short-range communication or not? with what granularity? with what regularity
of its memory-access pattern?  Finally, is it loosely synchronized or not?
(The opposite of loose synchronization is fine-grained synchronization).  More
precisely, is the synchronization overhead, i.e., the context-switch cost,
ruinously expensive or not?  Synchronization latency is more a matter of an
application's being communication intensive.

Cascade is heterogeneous in that it combines efficient processing components
for heavyweight threads that exhibit high temporal locality and other
efficient processing components for lightweight threads that exhibit little
or no temporal locality.  Heavyweight processors are responsible for executing
the compute-intensive portions of an application, and combine the best aspects
of multithreading, vector computing, and stream processing for very
high-throughput execution of the compute-intensive portions.

Within each locale, a number of lightweight (multithreaded scalar) processors
will be near or within the locale's memory for high-bandwidth and low-latency
in-memory computation.  Locales, which are the fundamental Cascade building
blocks, contain one heavyweight-processor chip and eight multicore
lightweight-processor chips, which are either PIM chips properly so called or
"PIM equivalent" chips.  Each core is a multithreaded scalar processor.
Within the locale, PIM chips are backed by non-PIM DRAM.  Locales are
interconnected by next-generation hierarchical symmetric Cayley-graph networks
for high specific bandwidth.

In one sense, Cascade's asynchronous heterogeneity is contained within
individual locales; the Cascade system as a whole is then a homogeneous
interconnection of locales.  However, since processors of both weights have
long-range interactions, this gives us full system-level asynchronous
heterogeneity.

The main chip in a locale implements a heavyweight processor including its
caches, an interface to multiple DRAM memory chips with intermixed lightweight
processors, and a network router to interconnect to other locales.

Several design decisions were taken up front.  First, Cascade is a
shared-memory machine.  Second, Cascade is dependent on exceptional global
system bandwidth, although this is by far the most-expensive part of the
system.  Third, both heavyweight and lightweight processors are expected
to tolerate the latency of long-range communication.  In short, both
heavyweight and lightweight threads are presumed to be potentially
communication intensive.  Prestaging, i.e., localizing data in a locale's
memory for a locale's heavyweight processor, is not totally absurd, but there
is little point if the localization consumes as much global bandwidth as
simply doing the long-range communication in the first place.

That is to say, localization in Cascade is not absurd but it is subtle.  For
example, how perfect is our foreknowledge of what data we need?  Also, what is
the net effect of localization on "return on bandwidth" (see below)?

Some early conceptualizations of Cascade focused on locality, i.e., on
different communication (memory-access) patterns.  Today, we more accurately
describe the differences between heavyweight and lightweight threads, and
their respective capabilities, by focusing on the enormous difference in the
amount of thread state.  Communication still matters, but this new
conceptualization allows us to cleanly handle differences in synchronization
granularity.

Heavyweight processors adopt the conventional approach to exploiting temporal
locality, albeit with a different cache architecture and an abundance of
processor parallelism to generate memory-reference concurrency.  Temporal
locality comes in two forms.  In data reuse, a data value that has been loaded
into a processor store is used---somewhat promptly---many times before it is
discarded.  In intermediate-value use (aka "producer-consumer locality"),
intermediate values written to a processor store are read from that store---
somewhat promptly---and used as operands, possibly many times, before they are
discarded.

The conventional approach to exploiting temporal locality is to use a large
data cache and/or a large register set to provide significant processor
storage for live data values.  The bigger these stores are, the more temporal
locality one can exploit.  Exploiting temporal locality in compute-intensive
work is of great performance value because it can dramatically increase the
"return on bandwidth" (ROB).  Another term for this is "arithmetic intensity",
i.e., how many flops you can execute for each word you load.
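
To see why return on bandwidth rewards temporal locality, compare a dot
product, which offers essentially no reuse, with a dense matrix multiply,
where each matrix word can ideally be reused many times if processor storage
is large enough to hold blocks.  A toy accounting (illustrative sizes only):

Return On Bandwidth: A Toy Accounting (sketch)
______________________________________________

#include <stdio.h>

int main(void)
{
    double n = 1024.0;    /* problem size, chosen only for illustration */

    /* Dot product: 2n flops over 2n loaded words -> intensity of 1;
     * there is no reuse for a cache or register file to exploit. */
    double dot_intensity = (2.0 * n) / (2.0 * n);

    /* Square matrix multiply: 2n^3 flops over 3n^2 matrix words.  With
     * enough processor storage for blocking, each word is reused on the
     * order of n times, so achievable intensity grows with n. */
    double mm_intensity  = (2.0 * n * n * n) / (3.0 * n * n);

    printf("dot product:     %.1f flops/word\n", dot_intensity);
    printf("matrix multiply: %.1f flops/word (ideal reuse)\n", mm_intensity);
    return 0;
}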

The system's arithmetic return on bandwidth is perhaps its most important
objective function since global system bandwidth is, by far, the most critical
system resource.  In truth, there are many, many bandwidths in a computer
architecture, some more critical than others.

But there is a fatal flaw.  Allowing the thread state proper to grow without
limit, and remembering that rebuilding a thread's cache footprint after a
context switch has major performance implications, we find ourselves faced
with colossal effective-state computers, which should never be allowed to
synchronize, i.e., block---well, hardly ever.  The context-switch cost is
absurdly expensive.

If a compute-intensive, heavyweight thread absolutely must synchronize, so be
it.  However, we try to avoid this as much as possible.

Efficient execution of compute-intensive code goes with large thread state,
which in turn goes with real synchronization difficulty because of the
context-switch overhead.  Indeed, a homogeneous, pure multithreaded, cacheless
architecture
is unlikely to result in a usable petascale computer (of course, there are
wonderful terascale architectures that adopt this approach).  Together with
their compiler-directed data caches, heavyweight Cascade processors have
traded away their ability to synchronize well in exchange for exceptional
performance on compute-intensive codes.

--- The True Role Of Lightweight Processors

PIM processors (see qualification above) are often characterized by saying
that every part of memory has a processor that is _very_ nearby, resulting in
extremely high bandwidth and low latency between that processor and that part
of memory.  In Cascade, lightweight processors have two additional
architectural features.  First, they have been explicitly designed to have
very low thread state, i.e., there are very few words in the register set
owned by an individual thread.  Second, there is an extremely high degree of
processor multithreading.  (In MTA terms, we would say that the number of
processor-resident active threads is extremely high).

Lightweight and heavyweight processors are two intermediate design points on
an architectural continuum whose extrema are dataflow computing and von
Neumann computing (this will be explained in the conclusion).

In lightweight processors, an enormous number of program counters each
initiate a moderate number of concurrent operations per program counter.  In
heavyweight processors, a moderate number of program counters each initiate an
enormous number of concurrent operations per program counter.  In both cases,
the processor initiates an enormous number of concurrent operations per cycle.

We can see some minor differences in compiler strategy: a compiler can freely
unroll loops for a heavyweight processor without fear of register spill, but
should probably stick to software pipelining for a lightweight processor.
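
Here is a toy C sketch of that register-pressure point (illustrative only):
with registers to spare, the compiler can unroll and keep several independent
accumulators live; a thread with minimal register state keeps one accumulator
and relies on many sibling threads, or a pipelined schedule, to cover latency.

Unrolling Versus Minimal Thread State (sketch)
______________________________________________

#include <stdio.h>

/* Heavyweight style: four live partial sums exploit a large register set. */
static double sum_unrolled(const double *x, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];     s1 += x[i + 1];
        s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];
    return s0 + s1 + s2 + s3;
}

/* Lightweight style: one live accumulator; concurrency must come from
 * other threads rather than from per-thread register state. */
static double sum_simple(const double *x, int n)
{
    double s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%.0f %.0f\n", sum_unrolled(x, 8), sum_simple(x, 8));
    return 0;
}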

Lightweight threads were originally explained as a correction to the
conventional approach to exploiting spatial locality.  Today, microprocessors
use a long cache line to concurrently prefetch data.  (This is a
latency-tolerance scheme that does nothing to reduce the demand on bandwidth).
But an application, depending on its memory-access pattern, need not be cache
friendly: bandwidth is actually wasted if the fetched data aren't used, and
cache-coherence schemes quickly lead to the problem of false sharing.

Still, suppose there is a block of data localized in a part of memory
somewhere that is relevant to some computation.  Say this block is 'n' words
long.  A heavyweight thread that depends on the result of this computation
might well be tempted, not to load the 'n' words into its locale, but rather
to spawn a lightweight thread that migrates to the lightweight processor
closest to this memory part, and somehow obtains the result without causing
'n' words to be transferred to the original locale.  This process iterates, so
that if the lightweight thread falls off the end of a block, it simply migrates
to pick up at the start of the next block.

A simple example is computing the dot product of two vectors A and B that are
both distributed across memory in multiple independent contiguous blocks.  A
single lightweight thread can repeatedly migrate to track the blocks of vector
A, whose components it reads locally, while issuing remote loads to obtain the
corresponding elements of B.  This locality strategy reduces remote loads by a
factor of two, which more than offsets the cost of repeated migration.
Moreover, the remote bandwidth is not charged against the originating locale's
"bandwidth budget".

The use of lightweight PIM processors to exploit spatial locality seems very
natural, whether this is easily migratable threads pursuing spatial locality
across the memory, or somehow using lightweight threads for localization,
i.e., copying data nearby.  But the standard Cascade mantra is: The compiler
packages the temporally local loops into heavyweight threads and all other
loops into lightweight threads.  ("Loops" is a tad restrictive here).  So,
while spatial locality makes spawning lightweight threads worthwhile, there
are many, many other things that make it equally worthwhile.

Let's take a more abstract view.  Some (portions of) computations _require_
high thread state; hence, they must be scheduled on heavyweight processors.
Other (portions of) computations _require_ low thread state; hence, they must
be scheduled on lightweight processors.  Any (portion) of a computation that
requires neither should be scheduled on lightweight processors for efficient
execution.

For example, fine-grained dataflow problems are naturally programmed using
fine-grained synchronization, and only run effectively on lightweight
processors.  Here is a simple example.

The parallel prefix-sum problem is to compute the sum of the first j numbers
for each j from 1 to P.  Memory locations are equipped with full/empty bits to
enable fine-grained producer-consumer synchronization in memory.  The problem
is solved by migrating P lightweight threads, each initialized with one of the
original P numbers, to one or more PIM nodes with the collective capacity to
store a logP by P array.  After logP time steps, the P prefix sums have been
computed.  Here is the SPMD code for cyclic reduction (SPMD style is a rotten
way for humans to design code).

Cyclic Reduction
________________

 - code for thread j, which is assigned to column j of a logPxP matrix

/* upon termination, the jth prefix sum is in 'sum' */

'shared'  a   'future' 'int' 'array'[1..logP,1..P] := undefined;
'private' sum 'int' := j;
          hop 'int' := 1;

'do' level = 1,logP --->
    'if' j <= P-hop ---> a[level,j] := sum      'fi'  // set location full
    'if' j >    hop ---> sum +:= a[level,j-hop] 'fi'  // accumulate sum
     hop := 2*hop
'od'

Each component of the 'a' matrix is a _future_ variable.  Each thread
iterates logP times.  Each time through the loop it synchronizes, i.e.,
potentially blocks, waiting for the relevant future variable to become
defined.  It does this _every other instruction_!  This constant fine-grained
synchronization mandates execution of all thread code on one or more
lightweight processors.  Spatial locality is irrelevant.
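
For readers who want to run the idea, here is a self-contained C analogue of
the cyclic-reduction code that emulates full/empty bits with C11 atomic flags
and spin waits.  (A lightweight processor would block and switch threads
rather than spin, but the dependence structure is the same; P is fixed at 8
for the sketch.)

Cyclic Reduction With Emulated Full/Empty Bits (sketch)
_______________________________________________________

/* Compile with: cc -pthread prefix.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define P     8
#define LOGP  3

static double     a[LOGP][P + 1];       /* the logP x P matrix of "futures"   */
static atomic_int full[LOGP][P + 1];    /* 1 = location has been written      */
static double     prefix[P + 1];        /* prefix[j] = sum of the first j     */
static int        ids[P + 1];

static void *thread_main(void *arg)
{
    int    j   = *(int *)arg;           /* thread id, 1..P                    */
    double sum = (double)j;             /* element j of the input             */
    int    hop = 1;

    for (int level = 0; level < LOGP; level++) {
        if (j <= P - hop) {             /* publish: "set location full"       */
            a[level][j] = sum;
            atomic_store(&full[level][j], 1);
        }
        if (j > hop) {                  /* consume: wait until full           */
            while (atomic_load(&full[level][j - hop]) == 0)
                ;                       /* the fine-grained synchronization   */
            sum += a[level][j - hop];
        }
        hop *= 2;
    }
    prefix[j] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[P + 1];
    for (int j = 1; j <= P; j++) {
        ids[j] = j;
        pthread_create(&t[j], NULL, thread_main, &ids[j]);
    }
    for (int j = 1; j <= P; j++)
        pthread_join(t[j], NULL);
    for (int j = 1; j <= P; j++)
        printf("prefix[%d] = %.0f\n", j, prefix[j]);   /* 1, 3, 6, 10, ...    */
    return 0;
}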

Indeed, focusing on thread state opens new vistas: a group of lightweight
threads can exploit, not just pre-existing spatial locality, but spatial
localizability!  We may have to adjust the Cascade mantra.

We come now to one of the fundamental problems of system-level asynchronous
heterogeneity: What styles of interaction, i.e., what styles of communication
and synchronization, minimize the performance degradation of either
lightweight or heavyweight threads?

Imagine a set of heavyweight threads, at least one of which is dependent on
the result---which can be almost anything---of a subcomputation that most
properly runs as one or more lightweight threads executing on lightweight
processors.  How should heavyweight and lightweight threads coordinate?

The solution is straightforward: any executing heavyweight thread may spawn
any number of lightweight threads but cannot _itself_ wait for results to be
returned from any of them.  Imagine the performance hit if, say, an
executing compute-intensive, heavyweight thread were to RPC a parallel prefix
summation, and then patiently wait for the result to be returned.

In point of fact, lightweight threads have many, many ways to satisfy the
(initial) task or data dependences of heavyweight threads.  It is just that
this is done upfront, before the heavyweight thread has ever been scheduled to
run on a heavyweight processor.  Think of a dataflow model in which dataflow
constraints schedule the first instruction of a thread, but then the thread
runs normally.

If one takes an extremely abstract view of "prestaging", then the compiler can
identify many kinds of work---whether this be rapid updates of contiguous data
blocks, tree/graph walking, efficient and concurrent array transpose,
fine-grained manipulation of sparse and irregular data structures, parallel
prefix operations, in-memory data movement---in fact, any computation at all
that executes efficiently in a low thread-state environment---that should
be completed before a _particular_ heavyweight thread is first scheduled for
execution on a heavyweight processor.  Remember, the compiler chooses what to
put into each thread.

As a group, heavyweight processors offload work of various kinds---work that
suits them poorly---onto lightweight processors.  Moreover, as a group,
heavyweight threads offload work of various kinds onto lightweight threads.
The trick, which isn't exactly rocket science, is that one heavyweight thread
spawns a set of lightweight threads whose completion satisfies some initial
dependence of _another_ heavyweight thread.

Conversely, a group of lightweight threads can cause a compiler-specified set
of dependence constraints for the initial scheduling of a heavyweight thread
to be satisfied.  Mechanically, the lightweight threads place the newly
enabled heavyweight thread onto the work (or ready) queue, from which it is
later retrieved when the heavyweight processor becomes free.  This is dynamic
scheduling of compute-intensive tasks from the locale's central work queue.
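
Mechanically, this is just a dependence counter plus a ready queue.  Here is
a minimal sketch in C (all names invented; the "queue" is reduced to a single
slot): each completing lightweight task decrements the heavyweight thread's
count of unsatisfied initial dependences, and the task whose decrement
reaches zero pushes the now-enabled heavyweight thread onto the locale's
ready queue.

Enabling A Heavyweight Thread (sketch)
______________________________________

/* Compile with: cc -pthread enable.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NDEPS 4

typedef struct {                   /* a heavyweight thread awaiting its inputs */
    atomic_int  unsatisfied;       /* initial dependences still outstanding    */
    const char *name;
} hw_thread;

static hw_thread hw;

/* The locale's ready queue, reduced to a single slot for this sketch. */
static hw_thread      *ready_slot;
static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;

static void enqueue_ready(hw_thread *t)
{
    pthread_mutex_lock(&ready_lock);
    ready_slot = t;                /* a real queue would append here */
    pthread_mutex_unlock(&ready_lock);
}

static void *lightweight_task(void *arg)
{
    (void)arg;
    /* ... in-memory work goes here: prefix sums, tree walking, localization ... */
    if (atomic_fetch_sub(&hw.unsatisfied, 1) == 1)
        enqueue_ready(&hw);        /* last dependence satisfied: enable it */
    return NULL;
}

int main(void)
{
    pthread_t t[NDEPS];

    hw.name = "heavyweight kernel";
    atomic_init(&hw.unsatisfied, NDEPS);

    for (int i = 0; i < NDEPS; i++)
        pthread_create(&t[i], NULL, lightweight_task, NULL);
    for (int i = 0; i < NDEPS; i++)
        pthread_join(t[i], NULL);

    if (ready_slot)                /* the heavyweight processor would now pick it up */
        printf("%s is ready to run\n", ready_slot->name);
    return 0;
}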

--- Conclusion

The deeper meaning of system-level asynchronous heterogeneity is that it can
be viewed as the natural implementation of a new "ontological" model of
parallel computation that identifies the single most important dimension of
intra-application variability in petascale applications.  In a word, parallel
computing is an irreducible mixture of high thread-state and low thread-state
computing.

Historically, dataflow and von Neumann computing have been viewed as radically
different, and perhaps irreconcilable, models of computation.  However, it is
now understood that the models are simply the extrema of an architectural
continuum.  The von Neumann architecture can be extended into a multithreaded
model by replicating the program counter and register set, and by providing
efficient primitives to synchronize among several threads of control.

What was the main issue in trying to make dataflow hybrids work?  Just one,
really: how to provide new mechanisms to reduce the cost of fine-grained
synchronization while retaining the performance efficiencies of sequential
scheduling, i.e., control-flow threads and registers.  Dataflow computing
could synchronize well, but extended von Neumann computing could actually
compute efficiently.

Speaking _very_ loosely, fine-grained dataflow gives us efficient
synchronization, and coarse-grained dataflow, in which threads actually
compute something appreciable before synchronizing, gives us high-performance
computing.  Pure dataflow is zero thread-state computing.  Pure MTA-like
multithreading is moderate thread-state computing.  Vector processing and
conventional RISC superscalar processing are high thread-state computing
(although, of the two, only vectors support scalable latency tolerance).

Alas, no single choice of thread state can possibly satisfy all of computing's
needs.  No _single_ intermediate design point on this architectural continuum
efficiently supports general-purpose computing.

In order of increasing thread state, the architectural continuum contains:
1) dataflow computing, 2) lightweight processing, 3) heavyweight processing,
and 4) von Neumann computing.  This leaves many questions open.  How light
should lightweight be?  How heavy should heavyweight be?  Can we adjust the
effective thread state of processors dynamically?

Cascade's model of parallel computation is that petascale applications contain
a (somewhat balanced and possibly hidden) irreducible mixture of portions of
the computation suitable for coarse-grained dataflow computing, i.e., high
thread-state computing, and other portions suitable for fine-grained dataflow
computing, i.e., low thread-state computing.

Portions suitable for coarse-grained dataflow computing are executed as
compute-intensive, heavyweight threads that essentially "synchronize"
precisely once, before they are scheduled on the processor for the very first
time.  Portions suitable for fine-grained dataflow computing are executed as
agility-intensive, lightweight threads that have no problem stopping and
restarting any number of times _during_ thread execution because the thread
state is so low.  (Low thread state creates thread agility, which is essential
for either migration or synchronization).  By the way, low thread-state
computing is a fascinating engineering discipline, whose many possibilities
have only been hinted at here.

Cascade's system architecture follows from its model of parallel computation
in a natural way.  If parallel computation is an irreducible mixture of
high thread-state and low thread-state computing, a parallel system
architecture should be split into separate subsystems that support the global
parallel computation precisely by efficiently executing both the high
thread-state code and the low thread-state code separately in parallel.

There is complete interaction symmetry here.  Just as lightweight threads push
heavyweight threads onto the central work queue in a locale (from which they
are dynamically scheduled onto the heavyweight processor), so heavyweight
threads (via spawning) push lightweight threads onto the work queue of some
lightweight processor (from which they are dynamically scheduled onto that
lightweight processor).

The compiler splits the entire application into compute-intensive, heavyweight
portions and agility-intensive, lightweight portions, possibly with
dependences between portions of the same type, but with the rigid style of
interaction between portions of opposite type outlined above.

Your correspondent's sense is that such decomposition will, in general, always
be possible.  We are only beginning to understand this compilation process and
its effect on independent subsystem utilization.

There is a clear design choice here.  Should we build explicitly
reconfigurable components that morph into different execution modes when the
application changes, or should we design two separate processing subsystems
that operate in parallel and have been optimized separately to support the
efficient execution of the two basic kinds of computation we expect in
petascale applications?  In fairness, only time will tell.

On the architectural continuum from dataflow to von Neumann computing, there
is no single good intermediate design point for parallel computing; all
parallel computing is an irreducible mixture of dataflow and von Neumann
computing; hence, we must build separate subsystems to handle each portion
separately.  This is a liberating view that lets us see design options we
never saw before.

---

The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous.  He alone bears responsibility for
these commentaries.  Replies are welcome and may be sent to HPCwire editor
Tim Curns at tim@hpcwire.com.
