HPCwire
The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / May 5, 2006
Features:
Hard Questions While Waiting For The HPCS Downselect
Proposals from Cray Inc., IBM, and Sun Microsystems for Phase III
awards from DARPA's High-Productivity Computing Systems (HPCS) program
are due as these words are being written.  To differing degrees, the
vendors are understandably coy about precisely what their proposals
will contain.  In a fiercely competitive environment where perception
management is key, a loosely guarded detail might open up one of the
proposals to successful sniping by a rival vendor.

But an innovative petaflops system architecture that, in the most
optimistic case, might reinvigorate the stagnant field of high-end
computing is hardly in the same category as a stealth fighter aircraft
or a classified graph algorithm.

First, taxpayers have at least some right to know how their tax
dollars are being spent.  Second, one of the critical questions in
postmodern computing is this: Is there some form of thread migration,
in shared-memory and distributed-memory architectures, respectively,
that effectively solves---or fails to solve---the dynamic load-
balancing problem?  This question deserves to be answered in full view
of the entire HPC community.  While it may be naive to expect a final
resolution of this perennial question in short order, we can demand
more openness from vendors as the evidence accumulates.

After the awards are made, it is likely that the government will
engage in some perception management of its own, perhaps arguing that
the innovative system architectures of the two winners (people expect
two winners) are prospectively likely to be as good as, or even better
than, the forthcoming Japanese 10-PFs/s heterogeneous-processing
machine, which appears to this observer to be architecturally quite
impressive.

Notwithstanding the vendors' undisputed proprietary rights, your
correspondent would like to preemptively _shame_ each Phase III winner
into coming clean soon after the awards are made.  He asks each
successful PI, at such time, to explain the broad outlines---with at
least some details---of his company's system architecture.  Obviously,
there are major differences in the degree to which the various vendors
have been coy up to now, but in all three cases one can reasonably
ask: What are your aspirations for this machine?  What do you really
think it can do?  What are you really going to put into it?

--- A Problem And Its Solution

A significant source of performance and programmability shortfalls in
current "high-end" system architectures is inter-application and
intra-application diversity.  Since we can safely leave the
embarrassingly _local_ (not "parallel"!) applications to clusters and
their ilk, the applications we consider have the following
characteristics: 1) they compute over globally dispersed data
structures; 2) they engage in long-range communication, sometimes
frequently; 3) they contain a dynamic heterogeneous mix of algorithm
characteristics due to dynamic variation of a) the amount, kind, and
granularity of the application's parallelism, and b) the amount, kind,
and granularity of the application's locality; and 4) in both
classical and postmodern problems, there are frequently rather serious
load-balancing problems that generally require flexible, dynamic load
balancing mechanisms to resolve.

We can easily imagine an ideal heterogeneous system architecture that
would gracefully deal with all these forms of application diversity.
Such a machine would have the following characteristics: 1) it would
have high global system bandwidth---an expensive, critical, finite
system resource---a) to permit tolerating the latency of long-range
communication, etc., and b) to ease the programming burden by
mitigating the performance nonuniformity of memory accessing; 2) it
would have a wide variety of _combined_ compile-time, runtime, and
hardware parallelism mechanisms that would adapt to dynamic variation
of the amount, kind, and granularity of an application's parallelism;
3) it would have a wide variety of _combined_ compile-time, runtime,
and hardware locality mechanisms that would adapt to dynamic variation
in the amount, kind, and granularity of an application's locality; and
4) it would include---as mandated by your correspondent's
architectural tradition---various work queues of fine-grained and
medium-grained parallel activities (threads, chores, threadlets, etc.)
that would be dynamically self-scheduled by various instances of
various "virtual processors" (VPs).

This notion of VP is fully generic.  An individual VP could be, for
example, 1) a multithreaded-processor instruction-stream processing
resource, 2) a vector processor, 3) a RISC superscalar processor, or
3) _any_ processing element with an assignable  tuple, i.e.,
program counter plus frame pointer, which specifies the data (or
execution) context.  Work queues, i.e., unordered collections of
ready-to-execute continuations, were originally popularized by
dataflow folk and later enthusiastically recommended by (massively)
multithreaded folk.  Perhaps vector processors---which come from an
entirely different architectural tradition---should ease up on static
scheduling by their scalar units, and try a little dynamic self-
scheduling instead.  Just a suggestion.  Note: A vector threadlet can
be much smaller than a full vector thread.

In any case, work queues in shared-memory systems are known to allow
flexible, dynamic load balancing by using the _work-stealing_
technique popularized by Blumofe in the MIT Cilk project.

Ah, but any cost-effective system architecture must find some way to
use all these parallelism, locality, and load-balancing mechanisms to
achieve high performance without over-consuming the precious, finite
global system bandwidth resource.

--- Heterogeneous Parallelism and Heterogeneous Locality

Much of the misunderstanding of parallelism and locality comes from
the confusion between 1) parallelism and locality as _attributes_ of
programs and computations, and 2) parallelism and locality as various
architectural (and software) _mechanisms_ that allow us to extract,
generate, shape, and exploit parallelism and locality, respectively
(although sometimes---to really confuse us---parallelism mechanisms
are used to exploit locality).

Compounding this misunderstanding is a heap of partially obsolete and
potentially misleading terms such as "instruction-level parallelism"
(ILP), "data-level parallelism" (DLP), and "thread-level parallelism"
(TLP), not to mention such hoary favorites as "temporal locality" and
"spatial locality". While there is no hope of escaping this muddled
terminology, we might at least try to clean it up.

For parallelism, massive abstraction simplifies the discussion.
Consider the flow-dependence graph of a computation, where the graph's
vertices are elementary operations.  We can take this as the finest-
granularity view of the parallelism in the computation.  Absence of a
direct (or transitive) flow dependence between two operations means
that it is permissible to execute these two operations in parallel.
Similarly, grouping subgraphs of the flow-dependence graph into
supervertices of a new dependence graph allows us to view this
parallelism at coarser granularity.  At this level of abstraction,
reasonable people agree there is no way to identify any portion of
this parallelism as ILP, DLP, or TLP---or any other kind of P---
because all the parallelism specified in this dependence graph is
_fungible_.  This simply means that it could very well be exploited by
any one of several architectural parallelism mechanisms.

If you find this paradoxical, think of a parallelizing compiler
transforming TLP into ILP or vice versa.

For better or worse, "instruction-level parallelism" is the term we
use to denote exploiting parallelism by conventional RISC superscalar
pipelining with dynamic instruction scheduling and superscalar
instruction issue.  Code so executed is called _serial code_ (or
single-threaded scalar code).  Of course, even if all the instructions
in a thread are totally ordered by dependences, we can still execute
them "in parallel" by overlapping pipeline stages of different
instructions, where the dependences are now between pipeline stages.
Alas, since instructions can specify long-latency operations (e.g.,
loads), and since RISC superscalar processors do _very_ advanced
pipelining, it helps if the operations in the thread are as unordered
as possible.  If not, we may find it essentially impossible to fill
the instruction-issue slots.

Does serial code exist?  Practically speaking, it exists whenever
other forms of extracting and exploiting parallelism are more
expensive---for whatever reason (perhaps because we are unwilling to
reprogram a legacy application). Single-thread performance on a RISC
superscalar processor is often quite respectable when parallelism is
not an option.

"Vector-level parallelism" is the term we _should_ use to denote
exploiting parallelism by conventional or hypermodern vector
processors.  Code so executed is called _vectorized code_.  Precisely
what code is vectorizable is a long story but, roughly speaking, such
code 1) contains large groups of independent homogeneous operations
that can be performed in parallel on vast arrays of data, and 2) has
limited branch intensity and/or irregularity of control flow, which
might very well overwhelm the compiler and lead to diminishing
performance returns.  Modern vector processors have spatially parallel
execution units and spatially parallel datapaths.

When vectorization "succeeds", in several senses of that word, it is
by far the most cost-effective architectural mechanism for exploiting
parallelism.

Finally, "thread-level parallelism" is the term we use to denote
exploiting parallelism by conventional or hypermodern (massively)
multithreaded processors.   Code so executed is called _multithreaded
code_.  What makes code threadable but not vectorizable?  Classic
answers include: 1) frequent conditional branches, and short-vector or
even scalar arithmetic; 2) data-dependent loop exits, recursive
function invocations, and code selection based on data values; 3) non-
linear and pointer-based data structures in data-intensive
applications, such as postmodern graph applications; and 4)
unpredictable fine-grained synchronization.  But maybe a better
question is what code _should_ be threaded?

--- A Model of Heterogeneity

The ultimate payoff of a heterogeneous architecture is the ability to
construct a hybrid system architecture that combines characteristics
of both 1) parallel von Neumann machines, and 2) parallel non-von
Neumann machines, which are basically (meta)dataflow machines.  Any
reader who understands that facts can change and opinions can evolve
is referred to your correspondent's first attempt to articulate the
subtle distinction between von Neumann and non-von Neumann machines:
"Asynchronous Heterogeneity: The Next Big Thing in HEC", HPCwire,
August 12, 2005 [M450881]).  An early example of such a hybrid
architecture is the MIT *T machine.

What does this mean in the current context?  Broadly speaking, vector
(and serial) threads _necessarily_ have high thread state and thus are
suitable for some von Neumann computing tasks, while multithreaded
scalar threads _can_ have arbitrarily low thread state and thus can
_become_ suitable for other (possibly complementary) non-von Neumann
computing tasks.  This is not to criticize threaded von Neumann
computing, but merely to show that threading is more general purpose--
-it can do both!

There is no question that applications present with mixtures of vector
parallelism and scalar parallelism and that this often dictates
whether vectors or threading is more appropriate.  But an entirely
different question is whether high-state threads or low-state threads
_as such_ are more appropriate.  Thread state matters so much because
of 1) the frequent need for lightweight synchronization, and other
issues related to dynamic self-scheduling; and 2) the desire to use
inexpensive mechanisms for migrating low-state threads to exploit
novel forms of locality.

Already, to take one topical example, new attributed-graph algorithms
in health care, proteomics, counterterrorism, bioinformatics, etc.,
etc., are showing us that there are entirely novel styles of
computation that must be addressed.

For locality, in contrast to parallelism, a more concrete (indeed,
almost architecture-dependent) abstraction simplifies the discussion.
One useful distinction is this: 1) there are temporal relationships
among data operations, and 2) there are spatial relationships among
data items and also among executing threads and (thread-relevant) data
items.  What we really want is an efficient general-purpose mechanism
to leverage instances of the first kind of spatial relationship into
instances of the second kind, of which more later.

Intuitively, our model of (system) locality starts by abstracting the
rich complexity of the _intraprocessor_ memory hierarchy into a one-
level store.

Consider the flow-dependence graph of a computation, where---for
simplicity--- the graph's vertices are elementary operations of
precisely two types: either loads or arithmetic operations.  Each
vertex is labeled with its type.  Since marshaling operations into
threads strongly affects locality, we view this computation as
resulting from the execution of some thread (serial, vector, or
threaded).

Obviously, both the loads and the arithmetic operations are
_producers_ of data values, while the arithmetic operations are also
_consumers_.  We include a separate dependence edge every time a value
produced by one operation is (nondestructively) consumed, i.e., used
as an operand, by another operation. For even more simplicity, we will
shortly elide some data-producing operations from this dependence
graph.

Our performance metric here is the average amortized cost of producing
a data value.  If a produced value is reused, i.e., consumed several
times, we amortize its production cost over the number of its uses.

We can _attribute_ this labeled graph by specifying the data-
production cost of each data-producing operation.  To do this, we need
an architectural model. The simplest model is that the thread runs on
a processor with a local store (e.g., the register file plus the D-
cache) and is connected through a pipe to a memory.  Qualitatively, we
say that each load produces a data value at _high_ cost, while each
arithmetic operation produces a data value at _low_ cost.  Ultimately,
we must assign precise quantitative values to these data-production
costs, and design locality mechanisms to reduce them whenever
possible.

To keep the model simple, we elide all (low-cost) "local reads", i.e.,
all loads satisfied from the D-cache---as well as all (implicit) reads
from the register file, so only (expensive) loads and (cheap)
arithmetics are retained in the elided flow-dependence graph.  Reads
from nonarchitected pipeline registers are included as part of the
cost of an arithmetic operation.  Of course, in any real-world
implementation, we would have to reinstate all the (implicit and
explicit) elided vertices---and worry about each and every data-
production cost.  Here, we are merely engaged in conceptual
exposition.

In any execution schedule permitted by the flow-dependence graph, at
each moment of time, there will be a set of produced values whose
_final_ consumption lies in the future.  This is the _live set_ of the
thread.  Since we are trying to minimize communication (i.e., wire)
cost, things work best if the thread's live set fits into the thread's
state, which is just the size of the thread's effective local store
(possibly a small fraction of the processor's local store).

Two observations lead to an important punchline.  1) Temporal
relationships among high-cost and low-cost data operations are
determined by the topology of the _labeled_ dependence graph (after
all, each 'load' vertex has a given out-degree and there is a given
ratio of loads to arithmetics).  2) Spatial relationships among data
items---and among executing threads and (thread-relevant) data items--
-can strongly influence the data-production costs of the _attributed_
dependence graph (for example, a thread sitting at the _center_ of a
cluster of useful data items reduces the average cost of performing a
load).  The data-production cost attributes add _metric_ information
to what is otherwise (largely) a _topological_ structure.

Note: The data-production costs of elided data-producing operations
are generally controlled by minimizing wire distances within the
processor's microarchitecture---for example, by segmenting register
files, caches, and functional-unit groups.  Bill Dally's paper on the
Imagine stream processor explains all this in great detail.  We have
simply chosen to abstract from microarchitecture locality to focus on
_system-architecture_ locality.

Quite naturally, the architectural mechanisms for exploiting locality
fall into two classes: 1) _intraprocessor_ locality mechanisms
optimized to exploit dependence graphs with high "internal locality",
i.e., those with significant reuse of loaded data values and/or
significant local production of data values; and 2) _extraprocessor_
locality mechanisms optimized to exploit dependence graphs with low
internal locality, i.e., those with little or no data reuse and little
or no local production of data values.

What "optimization" of the second type of dependence graph is
possible?  The latter dependence graphs _may_ have potentially high
"external locality", in the sense that the communication costs of
their high-cost data-producing operations (viz., their expensive
loads) can potentially be significantly reduced.

In class 1, threads execute on a fixed processor with a local store
that can store the peak amount of the thread's live set.  In this (von
Neumann) model, threads are immobile and gradually build up their
thread state---in the processor's local store---to maximize their
ratio of arithmetic operations to communication operations (the
thread's "arithmetic intensity").  Obviously, to be properly
exploited, a massive live set inherent in a thread's flow-dependence
graph requires an equally massive local store.  Dependence graphs with
high internal locality _and_ small live sets do exist, but they tend
to be the exception rather than the rule.

An immobile thread that gradually accumulates thread state is the
essence of von Neumann computing.  Contrariwise, a thread denied
thread-state accumulation by algorithm constraints may seek
performance by engaging in non-von Neumann computing.

In class 2, how a thread---which has been marshaled from some flow-
dependence graph---is intended to behave depends on whether 1) the
application is control and data decomposable, 2) the application's
data is distributable, or 3) neither.  The first attribute is true of
_localizable_ applications, i.e., the ones we ceded at the beginning
of this piece to clusters and their ilk as undemanding applications
not requiring a true supercomputer architect's full attention.  For a
strong proponent of heterogeneity, this may sound a tad sectarian, but
clusters remain very cost effective for easily localizable
applications.

The problem is that, for demanding applications, most serious people
have lost hope that, given a moderate amount of programmer effort, the
compiler will automatically do a good job of allocating and
distributing data to suit the computation and minimize communication.
In a word, you can't assume that your thread will automatically run in
a "data-friendly" environment, where the bulk of the data you need is
close by.

Is there a more modest goal that might be achievable with modest
programmer effort and a reasonable amount of automatic help from the
compiler and the runtime system?  Yes, any time you can _physically_
group together data that _logically_ belong together.  This is not
manually fragmenting data structures and designing and coordinating
threads to compute individual fragments.  Rather, it is, at least in
part, simply the "degree 0" of affinity scheduling.

In non-von Neumann computing, when we encounter (or manufacture) a
dependence graph with low internal locality, we should ask: Can I
shape and decompose this dependence graph into fine-grained parallel
activities, i.e., low-state threads, with the property that each low-
state thread would perform well if it was put into a local storage
environment containing, say, at least 75% of the data the thread
needed?  If so, we should further ask: Can I program this application
so that, for each low-state thread, at least 75% of the relevant data
becomes concentrated _around_ some location in the computer's virtual
address space, before the thread executes?  (Since we only care about
average cost, this "concentration" can take many forms; for example,
we might shape the distribution function of thread-relevant data into
a crude approximation to the Dirac delta function).  If so, and if
thread migration is dirt cheap, we can freely launch countless low-
state threads, from many other (possibly high-state) threads, and have
them migrate to these small regions of data concentration.

If the bulk of the loads can be satisfied from a small neighborhood---
in virtual address space---of the low-state thread's new location, and
if we can get low-state threads to migrate cheaply to essentially
arbitrary virtual locations, we have clearly minimized the average
cost of that thread's loads. Obviously, however, the virtual-address
metric space cannot be _independent_ of the physical-location metric
space, since---in the general case---cost equals wire distance along
the global system interconnection network.

Agile, low-state threads 1) are freely spawned with minimal overhead
by many types of thread, 2) migrate cheaply to prestaged local
concentrations of data, and 3) engage in lightweight synchronization
as often as they please. Deployment of these agile threads, and
programmed construction of their data environments, is the essence of
non-von Neumann computing.

Note: In a homogeneous multithreaded multiprocessor with logically
flat shared memory, this problem does not arise.  Take one of the
famously unpartitionable graphs in postmodern graph computing.  There
is no need to partition or distribute or rebalance this graph because,
logically speaking, every portion of the graph is at the same distance
from every thread.  Threads don't need exotic migration machinery
because they gain no performance from migrating!

With an unlimited budget, you could build a rather large homogeneous
multithreaded multiprocessor.  You could do the same thing if you were
willing to cut corners on the system's global hardware bandwidth.  But
the future of computing demands that we back off---at least for a
generation---from arbitrarily large flat shared memories.  We must,
for a while, become more architecturally modest; we must say, "I can't
guarantee sufficient performance with parallelism alone or locality
alone, if I truly want to scale; rather, I need to exploit (almost)
every scrap of parallelism, of any kind, and (almost) every scrap of
locality, of any kind, that I can dig out of this application". It is
this new-found architectural modesty that is the driving force behind
heterogeneous processing.

Note: Programmability is simply the desire not to be forced to dig out
all this parallelism, and all this locality, with your teeth!  A
decent programming model, grounded in a decent programming language,
is simply one that allows the programmer, the compiler, and the
runtime system to extract almost all the parallelism, and almost all
the locality, without forcing the programmer to do hard labor.  That
being said, performance tuning may require a larger contribution from
the programmer.

--- Heterogeneous Systems vs. Heterogeneous Hardware

If the first hard question was, how well can we load balance in
shared-memory and distributed-memory system architectures,
respectively?, the second hard question is this: Even if a vendor
provides heterogeneous hardware to deal with different forms of
parallelism and---far less commonly---different forms of locality, how
committed is the vendor to the massive system-software "architecture"
research---in new compiler technology, in the design of new runtime
systems, and in getting Unix to live on what is essentially a new
planet---that is required if heterogeneous processing is to achieve
even a small fraction of its true potential?  More bluntly, what must
be added to even an impressive collection of heterogeneous system
hardware to produce a genuine heterogeneous system architecture?

Is this asking for too much?  Isn't it at least conceivable that
certain vendors---at least those with actual system architectures in
their HPCS proposals---are already stretched to their limits
(technical, financial, commercial, ideological, or whatever)?  Would
additional government pressure from HPCS on vendors at this time---
asking for more innovation or more-radical system design---really be
productive, or would it be counterproductive?

These are delicate questions.  But just as we have all scrutinized the
intentions of a potential wife (or husband), so may we scrutinize the
intentions of any vendor who is asking for a large chunk of public
money. Notionally, your correspondent can imagine at least two
possible vendor attitudes: 1) We fully agree that a thorough
exploitation of heterogeneous processing is the nation's _only_ way
out of the high-end crisis; we offer the best design we can produce at
this time given current constraints of human and financial resources;
we have considered---and postponed---more expensive designs; and 2)
While the notion of heterogeneous processing is intriguing, and might
lead to some interesting bragging rights down the road, we are not
totally sure that the heterogeneous architecture proposed by our
academic partners will fit well with our traditional business model,
i.e., the traditional way we make our money and the traditional "high-
end" brand we currently promote.

Not to put too fine a point on it, sometimes a suitor does not fully
understand his own intentions, much less those of his intended match.
And, throughout history, some suitors have been cads.

From a crusader's point of view, it makes sense to articulate the
full vision of heterogeneous processing.  After all, no system
architecture, and no government program, is cast in concrete.  At
present, HPCS is the only high-end computing program our government
has.  We can't reduce our support  for this utterly vital national
program just because---as several reputable persons have suggested---
it has watered down its original high expectations, indeed, its
original mandatory requirements, for an HPCS machine.  Should it be
the case that giant steps by U.S. vendors at this time exceed our
current capabilities, let us at least make sure that we are marching
in the right direction.

--- Conclusion

Heterogeneous processing, towards which your correspondent is
radically biased, is turning out to be a big tent.  It is not the only
future of computing, but it must be a large part of any possible
future.  The drivers of heterogeneous processing are inter-application
and intra-application diversity; the big tent is just the fact that we
are incrementally widening our diversity net.  In fact, some
innovative _homogeneous_ machines are targeted at important biomedical
and national-security challenges.

Still, choices abound.  What are the natural limits of homogeneous
processing? Are homogeneous systems in some sense essentially niche
machines?  Do shared memory and distributed memory have profound
effects on our ability to load balance?  How is this (alleged)
difference related to programmer mediation of data distribution?  Why
are designers comfortable with parallelism diversity but willfully
obtuse when it comes to locality diversity?  Did the regrettable PIM
mania destroy the credibility of non-von Neumann computing?

How much should one pay for a genuine heterogeneous system
architecture? Since it is a sensitive subject, your correspondent has
not rehashed arguments he made in late September showing that low-
state threads deserve a _distinct_ intraprocessor memory hierarchy
(essentially to lower the low-state thread-migration overhead).  Is a
polymorphic intraprocessor memory hierarchy even conceivable?  Is it
implementable?

The mission agencies have computational problems they need solved
yesterday. Crusaders---especially multithreaded crusaders---dream of
killer apps that will follow postmodern graph processing, and novel
styles of computation that will define the future paradigms of high-
end computing.  Computational biomedicine and computational national
security are converging towards massive data-intensive applications
where very limited (conventional) locality trashes all (conventional)
locality mechanisms.  This is the proper driver of non-von Neumann
computing.

---

The High-End Crusader, a noted expert in high-performance computing
and communications, shall remain anonymous.  He alone bears
responsibility for these commentaries.  Replies are welcome and may be
sent to HPCwire editor Michael Feldman at editor@hpcwire.com.