Many seasoned observers of high-end computing feel it in their bones
that heterogeneous processing is rising like the Fundy tide (to the
east of Maine and then north), a massive natural phenomenon of gravity
and hydrodynamics. Notably, in the United States, under the auspices of
DARPA's HPCS program, there is Cray's Cascade project and IBM's
infuriating "don't ask; don't tell" Percs project. Admittedly, IBM
_has_ made strong representations to DARPA concerning Percs'
heterogeneity content. In Japan, the heterogeneous 10-PFs/s Keisoku
Keisanki---the poetry is lost in translation---will be built. In
Japan, the money and political will are there.
So, can we just sit back and wait for good things to happen? Will
this inevitable transition to heterogeneous processing in high-end
computing usher in a refreshingly beneficial paradigm shift?
Ha! What have you been smoking?
Unlike the Fundy tide, which needs only natural law to work its
awesome magic, heterogeneous processing---which encompasses all
computing on any member of a widely diverse set of heterogeneous
system architectures---needs, in each and every nontrivial instance, a
revolution in system software. In the absence of appropriate system
software, a nontrivial heterogeneous-system project will just abort.
In Japan, where the proposed hardware architecture for the Keisoku
Keisanki is very clean, the Japanese inability to handle the software
revolution will---in all likelihood---simply stop the Keisoku Keisanki
dead in its tracks. But the U.S. has recognized expertise in
sophisticated system software, programming languages, language
processing, runtime systems, program-development tools, models of
parallel computation, and all that, does it not?
Perhaps the U.S. has such expertise in principle, but it clearly has
not yet worked the issues of developing system software to manage,
operate, and exploit next-generation heterogeneous systems. More
generally, in the area of heterogeneous processing for high-end
computing, there is a total absence of leadership---from vendors, from
academia, from government. Whether we consider heterogeneous system
architectures, hardware technology, language processing, etc., etc.,
there are no private-sector computer architects or government agencies
who either can, or are willing to, assume a leadership role; there are
no compelling visions of heterogeneous processing from which to
choose. We are lacking even a simple roadmap for heterogeneous
processing.
And DARPA has---or so it seems---bought into the Army's oxymoronic
concept of low-footprint counter-insurgency, i.e., it is not
pressuring vendors to bite the bullet of revolutionary heterogeneous
processing---notably, in the areas of developing sophisticated system
software and providing sufficient global system bandwidth.
What might a concerned citizen (say, a high-end crusader) do? Start
at the beginning, start with what we know, and slowly build up the
foundations until we see the clear choices in heterogeneous system
architectures and the deep challenges in system software that
accompany each one.
To anticipate, the key task for heterogeneous-system system software
lies in scheduling strategies and other system functions that maximize
the performance extracted from scarce system resources, notably the
heterogeneous system's limited global system bandwidth. This is the
punchline of this article.
--- Latency Avoidance and Latency Tolerance
In the von Neumann model, processors are separated from memories, from
which they fetch operand data to feed their arithmetic functional
units. We say that a high-value processor suffers from _latency
disease_ if it idles much of the time waiting for data operands to
arrive. For simplicity, we focus on memory latency (or network/memory
latency). Since arithmetic functional units are of exceedingly low
value, they cannot suffer from latency disease; no one gives a hoot
about their degree of utilization. Only _critical_ system resources
can suffer from either low _or_ foolishly extravagant utilization,
where "critical" means "costs a lot" or "is a primary performance
bottleneck" or both.
In the early 90s, when control-flow processors were reasonably
critical resources, the memory-latency story was simple. Latency is
avoided by copying data nearby. Latency is tolerated by doing
something else while waiting. Avoidance scales up if locality
increases with size. Tolerance scales up if parallelism increases with
size.
Multiprocessing (a/k/a multiprogramming) is based on the latter
premise. In multiprocessing, a job requests I/O and then blocks,
performing a context switch to another job. There is a dependence
because the first job cannot continue until its I/O has completed.
The compute processor offloads work onto the I/O processor. No deep
thinking is required because only the compute processor can handle
computation and only the I/O processor can handle I/O, assuming that
I/O is DMA. The compute processor _tolerates_ disk latency in order
to maximize total system throughput. There is no interest in
individual job latency, which often increases.
Multiprocessing doesn't scale down because of context-switch cost.
The heterogeneous multicore Cell processor avoids this problem by
using both nonpreemptive vector threads and software control of the
memory hierarchy. Think of a vector core's SRAM local store as if it
were a very nearby local memory (whose latency is not an issue), think
of the Cell processor's DRAM external memory as if it were the vector
core's disk, and think of the data movement between the Cell's DRAM
and its SRAM as if it were I/O. Using the "I/O-request (DMA)"
instruction in its ISA, the vector compute processor offloads work
onto the I/O processor, i.e., the scalar core. Moreover, this "I/O"
is orchestrated so that data from the "disk" is always present in the
local store well before it is required by the vector processor, so
this processor never needs to wait for data (and never _should_ be
made to wait).
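The double-buffered "I/O" pattern just described can be sketched in a
few lines. This is a toy Python simulation, not the Cell API: the names
(`process_stream`, `dma_fetch`) and the workload are mine. The point is
the overlap: while the compute side works on one local-store buffer, the
next "I/O" transfer fills the other, so computation never waits for data.

```python
def process_stream(dram, chunk, compute, dma_fetch):
    """Process `dram` in `chunk`-sized pieces using two local-store buffers."""
    results = []
    # Prime the pipeline: start the first "I/O" before computing anything.
    buffers = [dma_fetch(dram, 0, chunk), None]
    offset, current = chunk, 0
    while buffers[current] is not None:
        # Kick off the next transfer into the *other* buffer...
        nxt = 1 - current
        buffers[nxt] = dma_fetch(dram, offset, chunk) if offset < len(dram) else None
        # ...then compute on data that has already arrived.
        results.append(compute(buffers[current]))
        offset += chunk
        current = nxt
    return results

# Toy stand-ins for the DMA engine and the vector computation.
dma = lambda mem, off, n: mem[off:off + n] or None
print(process_stream(list(range(8)), 2, sum, dma))  # → [1, 5, 9, 13]
```

In the real processor the "fetch" is an asynchronous DMA command issued
well ahead of use; the serial simulation only shows the buffer rotation.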
Here, a _little_ thinking is required. After all, the scalar core and
the vector cores are _compute_ processors. A decision must be taken
that some work is best performed by offloading it from one processor
subsystem onto the other processor subsystem. In a heterogeneous
multicore system, it may be quite important that work be scheduled on
a core of appropriate type.
Memory-latency avoidance techniques include processor registers,
caches used temporally, and nearby memory. Memory-latency tolerance
techniques include vector pipelining, caches used spatially (i.e.,
long cache lines), prefetching (a/k/a precommunication), and
multithreading. Every parallel machine contains some mix of these, or
similar, techniques. For example, the MTA uses the following
memory-latency techniques: processor registers, nearby memory,
prefetching (i.e., explicit-dependence lookahead), and of course
multithreading. Note that the MTA predated any understanding of
heterogeneous processing as a key enabler of scalable high-end
computing, or the practical desire to scale to sustained petaflops and
beyond, for that matter.
In processor-based latency tolerance, the processor supplies a steady
stream of memory references that eventually fill the memory pipeline
(i.e., outgoing network, memory subsystem, incoming network). The
operand values returned by the memory satisfy dependences on these
values in requesting threads. Little's law prescribes how many memory
references must be outstanding in order to sustain, in the face of a
given network/memory latency, the desired bandwidth of returned operand
values.
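The arithmetic behind Little's law is a single multiplication; the
numbers below are purely illustrative.

```python
# Little's law for the memory pipeline: to sustain `bandwidth` returned
# words/cycle against a round-trip `latency` in cycles, the processor
# must keep bandwidth * latency memory references in flight.

def outstanding_refs(bandwidth_words_per_cycle, latency_cycles):
    return bandwidth_words_per_cycle * latency_cycles

# E.g., sustaining 1 word/cycle over a 400-cycle network/memory round
# trip requires 400 outstanding memory references.
print(outstanding_refs(1, 400))    # → 400
print(outstanding_refs(0.5, 600))  # → 300.0
```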
But what, abstractly, _is_ processor-based latency tolerance? (The
answer points the way to system-level latency tolerance). The
processor issues a steady stream of _dependence requests_. The
processor receives a steady stream of _dependence satisfiers_. This
stream of dependence satisfiers guarantees that a processor's work
queue is constantly stocked with ready threads, and hence that the
processor is always constructively occupied.
This abstraction _crashes and burns_ if there is not enough global
system bandwidth to transport these streams of dependence requests and
dependence satisfiers. No system software for a heterogeneous system
can begin to compensate for a too-significant underprovisioning of the
system's most critical resource---its global system bandwidth.
The reader should remember that, in this writer's view, high-end
computing properly targets the "difficult" applications, which both
resist easy localization and have other interesting attributes (see
"Hard Questions While Waiting For The HPCS Downselect", HPCwire, May).
For quite a few reasons, neither the MTA nor the MTA-2 had a D-cache.
Most of the heavy lifting was done by the processor parallelism, and
hence the memory-reference concurrency, generated by the MTA's
multithreaded processors. Although people use the term
"latency-tolerant processor", latency tolerance is actually the joint
result of processor parallelism and network bandwidth. In any case, as
scaling parallel systems to tens and hundreds of sustained petaflops
on nonlocalizable applications was contemplated, it became more and
more clear that the combination of processor parallelism and network
bandwidth---when used in isolation---simply does not scale to handle
the large system diameters in petascale systems. Obviously, a hybrid
approach incorporating an _extended_ mix of latency-tolerance and
latency-avoidance techniques is required.
The affordable, industrial-strength solution to the problem of scaling
parallel machines to tens and hundreds of sustained petaflops on
difficult applications, which are profoundly cluster-unsuitable, lies
in increasing the system's parallelism, for superior latency
tolerance, and increasing the system's locality, for superior latency
avoidance. There are feasible (heterogeneous) and infeasible
(homogeneous) ways of attempting to do this.
As heterogeneous processing began to be understood, it became clear
that we have to generate parallelism both inside and _outside_ of the
processors (at the system level, as it were) and to exploit
heterogeneity to create entirely new approaches to generating
locality. The amazing thing is that (both hardware-controlled and
software-controlled) intelligent _bidirectional_ offloading of work
onto the other of two complementary processor subsystems turns out to
be the key to both endeavors, and leads to generalized notions of
latency tolerance, latency avoidance, and dynamic (thread) scheduling.
--- Scheduling Multithreaded Computations
Scheduling of multithreaded computations on parallel machines is
complicated by the dynamic nature of such computations, which is only
fair since both hardware and software multithreading have been
proposed as general solutions to the problem of exploiting dynamic,
unstructured parallelism. Here, dynamically created threads cooperate
in solving the problem at hand. However, for efficient execution, we
must have efficient runtime thread placement and scheduling. Although
general thread placement for optimal processor utilization is NP-hard,
schedulers based on simple heuristics have been proposed that work
well for a broad class of applications.
Historically, implementations of the thread scheduling and placement
task at the very core of the runtime system have had two possibly
conflicting goals: 1) maintain related threads on the same processor
to minimize communication cost, and 2) migrate threads to other
processors as required for dynamic load balancing.
Previous work on runtime scheduling assumed that threads and
processors were homogeneous. There are two main approaches. In _work
sharing_, whenever a processor generates new threads, the scheduler
attempts to migrate some of them to (potentially) underutilized
processors. In _work stealing_, only processors that are actually
underutilized attempt to "steal" threads from other processors. Work
stealing minimizes the communication cost of thread migration---if
processors have enough work to do, no thread migration takes place.
In the multithreaded world, a work-stealing scheduler must also handle
"dataflow" computations, in which threads may stall due to a data
dependence. This usually involves dynamic scheduling of the threads
present in the processor's work (or ready) queue---when an execution
unit finishes a task, it automatically reaches into the work queue to
fetch another task.
What makes sense here critically depends on the precise communication
cost of thread migration, and on what performance advantages might
accrue from running a thread on a processor with special
characteristics. But this can mean many things. In "Hard Questions"
(op. cit.), much was made of threads extracted from either serial
code, vector code, or multithreaded code. Obviously, a given code
type mandates execution on a core of matching type (i.e., either
serial, vector, or multithreaded). In this section, we largely
abstract from "serial/vector/multithreaded" code heterogeneity---we
take this form of heterogeneity for granted, and seek to uncover and
exploit deeper forms.
A multithreaded computation may be represented as a directed acyclic
graph (dag) where the vertices are tasks (possibly elementary
operations) and the edges are either flow dependences or spawn
operations. During execution, a thread may create, or _spawn_, other
threads. In this representation, the only edges within a given thread
are flow dependences (computations arising from individual threads are
themselves partial orders). An edge that crosses from one thread to
another is either a spawn operation or a flow dependence between a
task in one thread and a task in another thread.
An important special case is where an inter-thread dependence edge (or
edges) functions as an _abstract_ spawn operation. Suppose that only
initial tasks of a given thread are flow dependent on tasks in other
threads. Until these dependences are satisfied, the thread is blocked
and is not uploaded to the processor's work queue. Only when all
initial dependences of the thread have been satisfied (when the
closure becomes full, as it were) does this uploading take place.
Essentially, a "fresh" thread has been created that now competes for
the processor's execution units. We have seen such threads before, in
our discussion of the Cell processor. Threads with only initial
dependences are instances of nonpreemptive threads, which---once
scheduled on an execution unit---run to completion without stalling or
blocking.
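The "abstract spawn" above can be sketched as follows. The names and
bookkeeping are mine: each blocked thread tracks its unsatisfied initial
dependences, and a thread is uploaded to the ready queue only when its
closure becomes full.

```python
from collections import deque

class BlockedThread:
    def __init__(self, name, initial_deps):
        self.name = name
        self.waiting_on = set(initial_deps)  # unsatisfied initial dependences

ready_queue = deque()   # threads competing for execution units
blocked = {}            # threads not yet uploaded

def block(thread):
    blocked[thread.name] = thread

def satisfy(dep):
    """Deliver one dependence satisfier; upload any thread whose closure fills."""
    for name, t in list(blocked.items()):
        t.waiting_on.discard(dep)
        if not t.waiting_on:               # closure became full: "fresh" thread
            ready_queue.append(blocked.pop(name))

block(BlockedThread("t1", {"a", "b"}))
satisfy("a")
print([t.name for t in ready_queue])       # → [] (still waiting on "b")
satisfy("b")
print([t.name for t in ready_queue])       # → ['t1']
```

Once uploaded, such a thread never stalls on these dependences again,
which is exactly what makes it safe to run nonpreemptively.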
Nonpreemptive threads are only justified in general-purpose computing
when the thread's state becomes so colossal that context switching is
not an option. These threads lack the expressiveness and flexibility
required to exploit dynamic, unstructured parallelism. More familiar
preemptive threads, which may stall or block many times during
execution, are often used exclusively; in any case, preemptive threads
are mandatory for the "dataflow" portions of a multithreaded
computation.
Heterogeneous processing's potential lies in providing mechanisms that
allow us to combine approaches that were once thought to be mutually
exclusive. So, task the system software with mixing and matching
(agile) preemptive threads _and_ (cumbersome) nonpreemptive threads.
Let the compiler and the runtime cooperate to mix and match
dependence-aware thread scheduling _and_ dependence-oblivious thread
scheduling. In fact, for performance beyond the limits of processor
parallelism and for dynamic load balancing, let us attempt to combine
work stealing _and_ work sharing.
But we are getting ahead of ourselves.
Multithreaded computations with arbitrary data dependences are often
impossible to schedule efficiently. Dynamic scheduling within and
across processors is normally dependence oblivious. The standard
scheduling heuristic organizes each processor work queue as a _ready
deque_ (double-ended queue) with a _top_ and a _bottom_. Threads are
inserted on the bottom, and can be removed from either end.
Briefly, an idle processor normally fetches its next thread from the
bottom of its ready deque. Locally spawned or enabled threads are
placed on the bottom of the ready deque. An idle processor confronted
with an empty ready deque attempts to steal the topmost thread from
the ready deque of a randomly chosen remote processor.
This "local depth-first, remote breadth-first" scheduling heuristic
captures only some of the dependence order of threads but is
computationally feasible as a scheduling algorithm.
--- System Software for Heterogeneous Parallelism
The grand challenge in this area is designing system software to
manage a heterogeneous system architecture with full-scale generation
and exploitation of locality. The software tasks are so revolutionary
that very few people even begin to understand them. On a more mundane
level, your correspondent finds it challenging, in a clear and
coherent fashion, to describe these software tasks, and the idealized
heterogeneous system architecture to which they correspond. One fact
that complicates matters is that there is no uniquely preferred
implementation of this idealized system. So much of the heated debate
about competing implementations has been misdirected! Design teams
have suffered major attrition. Well, there is no point crying over
spilt milk.
The idealized architecture does appear to provide the maximal
exploitable parallelism _and_ the maximal exploitable locality in the
presence of "difficult" applications (remember, this is a term of
art). There can be no doubt that different implementations have a
_major_ effect on the granularity with which we can exploit either
parallelism or locality, but a coarse-grained implementation of the
right architecture is better than any implementation of the wrong one.
After the excitements of 2005, your correspondent is resigned to
incremental implementation; that is why it is so important to be clear
about the idealized heterogeneous system architecture that we would
_all_ like to implement.
Given the difficulty of exposition, we pause to discuss system
software to manage heterogeneous parallelism before moving on to
discuss the more subtle problem of system software to manage
heterogeneous locality.
Homogeneous multicore processors, which are now becoming quite
standard in industry, create conflicting requirements in core
complexity. Ironically, for this very reason, even
"latency-intolerant" commodity processors will soon be forced to bite
the bullet of heterogeneous multicore.
Some applications do not parallelize (or at least have not been
parallelized yet). Other applications may have "serial" portions. In
such cases, single-thread performance becomes the relevant figure of
merit. When this is so, the appropriate core is a larger,
higher-power core that implements _out-of-order_ execution of
individual threads. But out-of-order cores, with associated major
increases in area, power, and design complexity, lead to only modest
increases in application performance.
Other applications can be easily decomposed into parallel threads.
When we replace single-thread performance as the figure of merit by
system throughput of the set of threads that have been extracted from
an application, the appropriate choice is a larger number (say, ten) of
smaller, lower-power cores, each of which implements _in-order_
execution of individual threads. Given a decomposition of our
application into ten threads, the ten in-order cores simply blow the
single out-of-order core out of the water. (The Achilles heel of
out-of-order cores is the use of register renaming, which increases
the length of the instruction pipeline).
In reality, applications dynamically change their stripes during
execution, sometimes behaving as serial code, at other times behaving
as threaded code. Since no single choice for core complexity makes
sense, we _should_ design heterogeneous multicore processors with both
one large out-of-order core and ten small in-order cores---in this
case, almost certainly on the same die. (Rescuing latency-intolerant
commodity processors with heterogeneous multicore is an example, not
your correspondent's dream microarchitecture. But the idea is good).
One possibility is to use the same ISA for both types of core, with
only performance differences (the power-hungry cores do run threads
faster). The compiler will extract threads from both the portions of
the code it deems serial and the portions of the code it deems
threadable. The compiler will decide on an appropriate core to
execute the thread at the moment of its extraction.
More interesting is to have different ISAs in different cores. This
may lead to having multiple versions of codes, where the compiler and
runtime working together make the final decision. The Cell processor
uses different ISAs: the scalar core runs scalar code and the vector
cores run vector code. In the applications that have been ported to
the Cell so far, scalar versus vector appears to be a fairly trivial
decision. The Cell processor is so cut-and-dried that it may not
actually need a software revolution.
Even this simple example provides a preliminary answer to the general
question: what software revolutions are required to bring out the
potential of heterogeneous processing? At first glance, it seems that
we only need to build tools for program development, design a decent
compiler that takes advantage of the possibilities of using threads
creatively, build a runtime system that solves the problem of thread
scheduling, and, oh yes, agree on a computational model of
heterogeneous processing that will allow us to integrate our hardware
and software efforts.
The example also points the way to a central theme of the next
section. Pretend that the single out-of-order core forms one processor
subsystem and that the ten in-order cores form a distinct processor
subsystem. In this way, we may view the transition from a serial
portion of the computation to a threaded portion as one thread running
on a heavyweight processor subsystem offloading work onto a
lightweight processor subsystem by spawning ten new threads. More
interesting, the decision to offload work is based not only on the
fact that the work can be done faster on the other subsystem, but also
on the fact that a critical system resource will be more moderately
consumed. Here, the system resource is the cooling capacity, which
must offset power dissipation (we are---reasonably---assuming that ten
in-order cores can be made to dissipate less power than one monster
out-of-order core).
Hardware and software control of general-purpose
heterogeneous-processing systems is all about scheduling heterogeneous
tasks wisely onto distributed, heterogeneous system resources.
Heterogeneous multicore can also mean a microarchitecture in which
(some of) the cores are full-custom designs of vector and/or
multithreaded latency-tolerant processors. With
serial/vector/multithreaded heterogeneity, the compiler must generate
code for all three types of core (execution unit). Obviously, these
cores have very different ISAs. But some of the decisions (initially)
taken by the compiler are decidedly nontrivial. For example, how do
you decide correctly, almost all of the time, whether a given loop
should be vectorized or whether it should be multithreaded? Perhaps
there is no single right answer at compile time. Again, this looks
like an area where it would be advantageous to generate multiple
versions of the code. The programmer has some knowledge of his
application and should not be ignored, but even the simple question of
code generation and core assignment will require closer cooperation
between the compiler and the hardware, with the runtime system often
making the final call.
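Multi-versioned code generation can be sketched as below. Everything
here is illustrative: the two "versions" stand in for compiler-generated
vector and multithreaded code, and the trip-count threshold is an
arbitrary placeholder for whatever evidence the runtime actually uses.

```python
def loop_vector(data):          # stands in for the vectorized version
    return ("vector", sum(x * x for x in data))

def loop_threaded(data):        # stands in for the multithreaded version
    return ("threaded", sum(x * x for x in data))

def dispatch(data, vector_min_trip=64):
    # Runtime makes the final call per invocation: long trip counts go
    # to the vectorized version here; the 64-iteration threshold is
    # arbitrary and purely for illustration.
    version = loop_vector if len(data) >= vector_min_trip else loop_threaded
    return version(data)

print(dispatch(list(range(4))))       # → ('threaded', 14)
print(dispatch(list(range(100)))[0])  # → 'vector'
```

The design point is that the decision need not be frozen at compile
time: both versions exist, and hardware instrumentation or runtime
feedback can flip the choice from one invocation to the next.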
Major advances in compilers, runtime systems, development tools,
debugging tools, etc., etc., are required just to manage heterogeneous
parallelism. Things get really interesting when we task the system
software with integrated joint management of heterogeneous parallelism
_and_ heterogeneous locality. This is the theme of the next section.
--- Heterogeneous Parallelism and Locality
Consider a _virtual_ heterogeneous system architecture (i.e., don't
ask if distinct processor types imply the existence of distinct
full-custom chip designs or polymorphic processors or whatever). Make
the stipulation that global system bandwidth is the heterogeneous
system's most precious, and most critical, resource. Further
stipulate that global system bandwidth is _somewhat_ underprovisioned,
by necessity, not by choice. The goal of heterogeneity is to make
the best possible use of this limited global system bandwidth by
extracting the maximum possible performance from each unit of consumed
bandwidth. In a heterogeneous system, the hardware provides distinct
mechanisms, and the system software makes sophisticated decisions
about which mechanisms to use.
Processors, i.e., execution units (indeed, individual cores), are
divided into two types: _heavyweight processors_ (HWPs) and
_lightweight processors_ (LWPs). (Do not ask if a given individual
core can morph at runtime from being an HWP into being an LWP---these
are virtual processors and such questions are inappropriate and/or
proprietary). HWPs have the property that they allow executing
threads to accumulate a colossal amount of thread state, which is
normally stored in a D-cache and/or a large register set. In
contrast, LWPs have the property that they can accommodate executing
threads that do not, or should not, accumulate large amounts of thread
state.
Logically, we imagine both an HWP processor subsystem and an LWP
processor subsystem. Physically, both the set of HWPs and the set of
LWPs are distributed throughout system memory, at the same or
different granularities. We assume nonuniform-memory-access (NUMA)
shared memory. More importantly, we suppose a memory and bandwidth
hierarchy in which processors enjoy more bandwidth to closer ("more
local") memory and less bandwidth to farther ("more global") memory.
This memory and bandwidth hierarchy may be quantized, e.g., by
providing special high-bandwidth interconnect within well-defined
locality regions (locales or places).
Furthermore, we assume that the bulk of the HWPs, and _all_ the LWPs,
support processor-based latency tolerance. This means that, given the
need for global communication, both HWPs and LWPs can be _starved_ by
insufficient availability of global bandwidth. (LWPs starve more
easily). Lack of global bandwidth can also starve system-level
latency tolerance in a heterogeneous system, simply because it
inhibits both thread migration and the return of dependence satisfiers.
The desire for distinct processor types is in reality the desire for
distinct thread types: high-state threads (a/k/a heavyweight or
immobile threads) and low-state threads (a/k/a lightweight or mobile
threads). Your correspondent also uses "cumbersome" and "agile" to
make this distinction. Heterogeneous systems exploit these two
distinct thread types to make more productive use of limited global
system bandwidth. System software decides which type of thread to run
and where to run it.
So, what is this $64,000 question that system software needs to
answer? Recall that a processor that sustains an average network
bandwidth of 'b' words/cycle over an average network distance of 'd'
links consumes system bandwidth at the rate of (b * d) network
words/cycle. This must be matched against some sustained performance,
say, 's' flops/cycle. For efficient execution, we need to maximize
the ratio of bang to buck, i.e., the value of 's' divided by (b * d).
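The figure of merit just defined is a one-liner; the numbers below are
invented solely to show how a shorter network distance can beat higher
arithmetic throughput.

```python
# Bang per buck: sustained s flops/cycle divided by consumed global
# bandwidth b*d network-words/cycle.

def bang_per_buck(s_flops_per_cycle, b_words_per_cycle, d_links):
    return s_flops_per_cycle / (b_words_per_cycle * d_links)

# Running a thread "here" at high throughput over a long distance...
here = bang_per_buck(4.0, 1.0, 16)
# ...versus near its data: less arithmetic, far shorter distance.
there = bang_per_buck(1.0, 1.0, 2)
print(here, there)  # → 0.25 0.5
```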
Consider a hypothetical daemon who combines the powers of an
incremental compiler, a sophisticated runtime system, and perhaps even
has access to hardware instrumentation that gives him tips about
network bandwidth and network distance. We will place one daemon at
each HWP, leaving unsaid whether there is also a daemon at each LWP.
This daemon is constantly asking himself: At this point in my (local)
computation, do I have access to a ready thread with high internal
locality (a/k/a temporal locality; see "Hard Questions", op. cit.)?
If so, does the arithmetic intensity (the arithmetic per delivered
operand) justify the still significant consumption of global
bandwidth? What is the ratio of bang to buck? In other words, should
I schedule this high-state thread to execute here, at this
(heavyweight) processor? In contrast, do I have access to a ready
(spawnable) thread with low internal locality? Is there a
(programmed) concentration of thread-relevant data somewhere in the
system to where I could migrate this thread? At the remote
(lightweight) processor, it might not enjoy much arithmetic intensity,
but the average network distance would be considerably reduced, thus
maintaining an acceptable ratio of bang to buck.
The decision tree is exceedingly complex. Very crudely, considering
1) a potential thread's internal locality, 2) the bandwidth to the
thread's data structures, 3) the network distance, and 4) whether
there is a favorable nonuniformity of data distribution, should work
'xyz' be accomplished by running a marshaled high-state thread here or
migrating a marshaled low-state thread to execute there?
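A very crude rendering of the daemon's choice: compare bang per buck
for executing a high-state thread here against migrating a low-state
thread to a remote concentration of its data. The real decision tree
weighs all four factors above; this sketch collapses them into two
ratios with made-up numbers.

```python
def decide(intensity_here, dist_here, intensity_there, dist_there, bw=1.0):
    # Bang per buck on each side: arithmetic intensity per unit of
    # consumed global bandwidth (bw * network distance).
    ratio_here = intensity_here / (bw * dist_here)
    ratio_there = intensity_there / (bw * dist_there)
    return "run here" if ratio_here >= ratio_there else "migrate there"

# High internal locality justifies the global traffic: keep the thread.
print(decide(8.0, 16, 1.0, 4))   # → run here
# Low internal locality, data clumped remotely: migrate the thread.
print(decide(1.0, 16, 1.0, 2))   # → migrate there
```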
Both threads and processors morph. For example, a lightweight thread
arriving at a rich clump of data puts on weight, i.e., has its further
migration inhibited, until it has consumed the clump---at which point
it promptly regains its former weight. Or, a multithreaded processor
might temporarily shrink the size of its execution contexts,
guaranteeing threads with very low thread state, which synchronize
cheaply.
In our heterogeneous system, we imagine separate per-processor virtual
work queues of high-state and low-state threads; some of this work
will be done locally and some will be offloaded to run elsewhere. The
tough decision is: do I perform a given task by running this thread
here or migrating that thread there? Some of the time, both choices
will be equally good.
Now, dream deeply. Imagine a daemon with a rich supply of
nonpreemptive high-state threads with only initial dependences, which
the daemon keeps in a blocked queue. Imagine that the daemon also has
a rich supply of preemptive low-state threads that satisfy these
dependences, and that he can easily migrate these low-state threads to
system regions where _each_ thread is physically colocated near the
center of mass of a reasonably compact set of thread-relevant data.
My goodness! We have turned thread migration into system-level
latency tolerance. By offloading work onto the LWP processor
subsystem, the daemon directs the HWP processor to issue a steady
stream of _dependence requests_. These requests may circulate
transitively within the LWP processor subsystem for some length of
time. However, the LWP processor subsystem eventually returns a
steady stream of _dependence satisfiers_ to this HWP processor. As a
result, the HWP's work queue is constantly stocked with ready
high-state threads, and hence the HWP is always constructively
occupied.
In a word, we have moved from tolerating memory latency to tolerating
work latency, and we have done so in a fashion that minimizes the
consumption of global system bandwidth. We have also augmented memory
pipelining with work pipelining.
In and of itself, processor-based latency tolerance consumes precious,
limited global system bandwidth. A locality-aware heterogeneous
system mitigates foolishly extravagant utilization of global bandwidth
by optimizing bandwidth usage to extract the maximum possible
performance from each unit of consumed bandwidth. This requires quite
sophisticated system software to schedule heterogeneous threads onto
heterogeneous execution resources.
Heterogeneous systems are required both to support different styles of
computation and to make better use of critical resources. The
significance of bidirectional offloading of work between two distinct
types of processor subsystems is that we thereby allow performance
scaling beyond the limits of processor-based latency tolerance. The
system-level (work) streams of dependence requests and dependence
satisfiers bring the latency that must be tolerated down to the point
where it can be handled by the processor's parallelism (and possibly
its D-cache), because some of the work has already been accomplished
or for other reasons. But moving from memory pipelining to work
pipelining changes everything.
There are performance considerations. To migrate a thread, we send a
full continuation no larger than a network packet. To receive a
dependence satisfier, we should receive something of roughly the same
size. If either 1) you do not have bite-sized (work) traffic in both
directions, or 2) your network does not have the bisection bandwidth
required to handle both the "dependence" traffic _and_ the "operand"
traffic that is required to feed your latency-tolerant (i.e., vector
and multithreaded) processors, then the whole system falls apart, or
rather limps along when it should be charging.
Caveat: Treating the receipt of a dependence satisfier as transporting
a large amount of data up close and personal to an HWP
has not enjoyed unalloyed success in previous designs, from HTMT
forward. If only for system balance, processor-based latency
tolerance and system-level latency tolerance should meet each other
half way. If the cycle time is exceedingly small, this is an open
question.
HPCS must accommodate different computation styles because of the need
to compute informatics graphs. It must accommodate performance
scaling for reasons too numerous to mention.
What is HPCS' proper focus? If we agree that thread migrations are
bite-sized dependence requests, and that dependence satisfiers are
bite-sized responses to thread migrations, then HPCS should _demand_
1) global system bandwidth sufficient to carry both dependence and
operand traffic, and 2) sophisticated system software that extracts
appropriate heterogeneous threads from difficult applications and
dynamically schedules them onto heterogeneous execution resources in
order to use limited global system bandwidth well. This is the _key
problem_, for Pete's sake!
Stop talking only about flops and megawatts per dollar.
The High-End Crusader, a noted expert in high-performance computing
and communications, shall remain anonymous. He alone bears
responsibility for these commentaries. Replies are welcome and may be
sent to HPCwire editor Michael Feldman at firstname.lastname@example.org.