The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / June 29, 2006
The End of Moore's Law
Increasing the architectural complexity and clock frequency of single-core microprocessors has come to an end (see what has happened to the Intel Pentium 4 successor project). Instead, multi-core microprocessor chips are emerging from the same vendors. But just putting more CPUs on the chip is not the way to go for very high performance. We have learned this lesson from the supercomputing community, which has paid an extremely high price for monstrous installations by following the wrong road map for decades. Such fundamental bottlenecks in computer science will necessitate new breakthroughs. Instead of the traditional reductionism, we need transdisciplinary approaches, such as those heralded by the current revival of Cybernetics, sometimes labeled as Integrated Design & Process Technology or Organic Computing. To reanimate the stalled progress in HPC for a breakthrough in very-high-performance computing, we need a transdisciplinary approach for bridging the hardware/software chasm, which has turned into a configware/software chasm. For much more successful efforts, we need a transdisciplinary paradigm shift to a fundamentally new model. An example of such a shift is represented by the Reconfigurable Computing community dealing with configware engineering as a counterpart to software engineering.
Classical Parallelism Does Not Scale
This is a very expensive lesson that we have already learned from the supercomputing community, which has massively increased the number of processors by going to cheap COTS (commodity off-the-shelf) components. With a growing degree of parallelism, programmer productivity goes down drastically ("The Law of More"). The classical fundamental paradigm is still based on concurrent sequential processes and message passing through shared memory -- both being massively overhead-prone and extremely memory-cycle-hungry. As long as this reductionistic monopoly of the von Neumann mindset is not broken, it is an illusion to believe that scalability will get massively better once all these processors reside on a single chip. Rescue should not be expected from threads, although Intel has pre-announced some tools intended to keep programmers from shying away. In his cover article for Computer magazine, Edward A. Lee of the University of California, Berkeley, claims that for concurrent programming to become mainstream, we must discard threads as a programming model. Nondeterminism is the overhead-prone problem, and not only where it is hidden behind methodologies described as "speculative." Threads, by the way, perfectly illustrate the von Neumann-based software paradigm trap.
Escape the Software Development Paradigm Trap
So says IRIS director Mark Bereit, who refutes the assumption that software development will always be difficult and bug-ridden, noting that this is due "solely to the software development paradigm that we've followed, unchallenged, for decades." The same paradigm is also the reason for the poor scalability and poor programmer productivity of classical parallelism. Bereit proposes reworking the model and studying other engineering disciplines for inspiration. He proposes to study mechanical engineering, but it would be much better to study reconfigurable computing. Tensilica senior vice president Beatrice Fu said that reconfigurable computing offers the option of direct processor-to-processor communications without going through either memory or a bus. This paradigm shift is old hat, but until recently it was mostly ignored, and not only by the supercomputing community. Buses cause multiplexing overhead, and the dominant instruction-stream-only fundamental model is extremely memory-cycle-hungry: the reason for the "memory wall." The alternative offered by reconfigurable computing is data stream parallelism through highly parallel, distributed, and fast local memory. This memory parallelism is simpler and more straightforward than, for example, the interleaved memory access known from vector computers.
Crooked Labeling
The difference between parallel computing and reconfigurable computing is often blurred by projects labeled "reconfigurable" which, in fact, are based on classical concurrency on a single chip. To avoid confusion: switching the multiplexers or addressing the registers at runtime is not "reconfiguration." At runtime, real reconfigurable computing never has an instruction fetch; only data streams are moving around. This should not be confused with dynamically reconfigurable systems, a mixed-mode approach that switches back and forth between reconfiguration mode and execution mode, and which should be avoided in introductory courses.
FPGAs Became Mainstream Years Ago in Embedded Systems
This is a new computing paradigm based on configware instead of software. Configware is not instruction-stream-based and needs compilation methods that are simple but fundamentally different from compiling software. Compared to software solutions, speed-up factors of up to four orders of magnitude have been obtained, due to a different form of parallelism that is drastically less overhead-prone than the classical parallelism of concurrent, communicating (von Neumann) sequential processes. A few of the many embedded FPGA application examples are: control, signal processing, automotive, multimedia, video, wireless, music, vision, coding, defense, image processing, crypto, pattern recognition, HDTV, manufacturing, aerospace, computer graphics and many others.
FPGA-based Scientific Computing
More recently, FPGAs have also become highly popular for scientific computing in many application areas. A few of these application examples are: medical, physics, defense, environmental, chemical, evolution, mathematics, fluid dynamics, astrophysics, bio, genetic, weather, chemistry, molecular, mechanics, neural network, DNA, pattern recognition, computer graphics, materials science, nuclear weaponry, data mining, combustion, crash simulation, black hole, petroleum, oil and gas, and many others.
CPUs Outperformed by FPGAs
The worldwide total running compute power of FPGAs outperforms that of CPUs. Most of the total MIPS running worldwide have been migrated from CPUs to accelerators, often onto FPGAs. The FPGA market, at almost $4 billion, is the fastest growing segment of the integrated circuit market. Gartner Dataquest predicts almost $7 billion for the year 2010. Xilinx and Altera currently dominate this market with a combined share of 84 percent. The rapidly growing number of FPGA-based design starts is now an order of magnitude higher than the shrinking number of ASIC-based design starts. By this software to configware migration, enormous speed-up factors are obtained: one or more orders of magnitude (OoM) in astrophysics; two or more OoM in molecular biology and other bioinformatics applications; three OoM in cryptography; substantially more than three OoM in DSP and wireless communication; and almost four OoM in pattern matching and image processing. By this software to configware migration, most CPUs are replaced by FPGAs.
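To make the orders of magnitude quoted above more tangible, the following minimal Python sketch converts them into plain factors and into run times. The figures are only the approximate values stated in the text ("one or more," "almost four," and so on, rounded to whole numbers); the dictionary and function names are hypothetical and chosen purely for illustration.

```python
# Approximate speed-up figures as quoted in the text, in orders of magnitude
# (OoM). They are rounded to whole numbers for illustration only.
QUOTED_SPEEDUP_OOM = {
    "astrophysics": 1,
    "molecular biology and bioinformatics": 2,
    "cryptography": 3,
    "DSP and wireless communication": 3,
    "pattern matching and image processing": 4,
}


def oom_to_factor(oom: int) -> int:
    """Convert orders of magnitude into a plain multiplicative factor."""
    return 10 ** oom


def migrated_runtime(software_seconds: float, oom: int) -> float:
    """Run time after migration, assuming the whole job speeds up uniformly."""
    return software_seconds / oom_to_factor(oom)


if __name__ == "__main__":
    one_hour = 3600.0
    for domain, oom in QUOTED_SPEEDUP_OOM.items():
        seconds = migrated_runtime(one_hour, oom)
        print(f"{domain}: x{oom_to_factor(oom)}, 1 h of software -> {seconds} s")
```

Under the assumption of a uniform speed-up, a job that runs for one hour in software would take 3.6 seconds at three OoM.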
Massively Slashing the Electricity Bill and Equipment Cost
Major supercomputer vendors have moved to reconfigurable computing, where a highly welcome side effect of migration to FPGAs is the massive reduction of the electricity bill (by one order of magnitude) and of the equipment cost. Already, one supercomputer vendor, SGI, has filed for Chapter 11 bankruptcy protection. By supporting FPGAs, SGI has torpedoed its own market, since these new customers do not need a hangar full of equipment (nor air conditioning). Saving electricity by software to configware migration might become a strategic issue at the national or global level. The press tells us that 25 percent of Amsterdam's electricity consumption goes into server farms (Amsterdam is an Internet hub) and that Google's yearly electricity bill amounts to $50 million -- more than the value of its equipment.
The von Neumann Paradigm is Losing its Dominance
Our curricula still mostly ignore that the de facto current model is not von Neumann, but a symbiosis of CPU (or multiple CPUs) and non-von Neumann accelerators -- in fact, a dual-paradigm model. Today, most compute power comes from non-von Neumann accelerators attached to the CPU -- a mixture of (a) hardwired accelerators and (b) reconfigurable accelerators, such as FPGAs. The CPU (central processing unit) loses its central role and becomes an auxiliary processing unit, mainly to run legacy code. Many more MIPS are running configware than software, although most engineers do not yet know how to spell "configware."
The Reconfigurable Computing Paradox
FPGAs have bad technology parameters. Because of massive overhead like wiring, reconfigurability, and routing congestion, their effective integration density (transistors per chip really serving the application -- DeHon's first law) is less than 0.01 percent of that of the Gordon Moore curve. Further negative factors are that FPGAs are very power-hungry (compared to hardwired accelerators) and that their clock frequency (substantially less than 1 GHz) is massively lower than that of microprocessors or hardwired accelerators. There are still more negative factors around FPGAs, such as very poor application development support, implementation languages and tools that are unacceptable to software people, and extremely poor reconfigurable computing education -- or none at all -- because it is ignored by computer science curricula. What is the reason for the rapid market growth and these massive speed-up factors, although the parameters and other factors around FPGAs are so massively bad? The reasons for this paradox are: the paradigm trap (the wrong mindset using the wrong model), severe educational deficits (graduates having the wrong educational background, missing the de facto job market) and management deficits. Removing these educational deficits will help to remove this paradox.
Educational Deficits
Less welcome side effects of the paradigm shift are educational deficits that require on-the-job training, since typical CS or CE curricula ignore reconfigurable computing -- still following the dead-end road of the von Neumann-only mindset. A new IEEE workshop series on reconfigurable computing education has been founded to cope with this problem. Within the several hundred pages of all volumes of the 2004 joint ACM/AIS/IEEE-CS curriculum recommendations, a text search found zero occurrences of the term "FPGA" and its synonyms, or of other terms pointing to reconfigurable computing. This is criminal. The only reaction to my complaint by e-mail has been the addition of the term "FPGA" just once, in front of "etc.," in the 2005 version of the reports. This does not help: it is still criminal because these recommendations do not match the current IT-based job market. These recommendations completely fail to accept the transdisciplinary responsibility of computer science to combat the fragmentation into many application domain-specific, tricky reconfigurable computing methodologies.
Advantages of Coarse-Grained Reconfigurability
Because it comes with a more convenient abstraction level than FPGAs, coarse-grained reconfigurability makes the educational gap smaller. Another advantage of coarse-grained reconfigurability is its much higher computational density. A computational density four OoM higher than that of an FPGA is obtained by using an rDPA (reconfigurable DataPath Array) with rDPUs instead of CPUs. For software people, this is much easier to understand than FPGAs because DPUs share the same abstraction level with CPUs. To software people, the configuration of FPGAs looks more like logic design on a strange platform. But in contrast to a CPU, a DPU is not instruction-driven and has no program counter; its operation is transport-triggered by the arrival of operand data. This new machine paradigm (the counterpart of von Neumann) is based on free-form large pipe networks of rDPUs (without a memory wall, and with easy compilation), not on concurrent sequential processes. There is no instruction fetch overhead at runtime because these pipe networks, generalizations of the systolic array, are configured before runtime. This new paradigm is based on data streams generated by highly parallel, distributed, small but fast on-chip local memory that consists of auto-sequencing memory (ASM) blocks using reconfigurable generic address generators (GAG), which provide even complex address computations without needing memory cycles. This kind of memory parallelism is simpler and more straightforward than the interleaved memory access known from vector computers.
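The data-stream model described above can be mimicked in software, with the caveat that a sequential Python simulation shows only the behavior and none of the parallelism. The following is a minimal sketch under stated assumptions, not the design of any actual rDPA or GAG: the names `address_generator`, `data_stream`, `dpu` and `run_pipe_network` are hypothetical, the address generator is limited to a simple two-dimensional scan, and each DPU "fires" once one operand has arrived on every input.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def address_generator(base: int, rows: int, cols: int,
                      row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Hypothetical stand-in for a generic address generator (GAG).

    Its parameters are fixed before run time; it then emits a 2-D scan of
    addresses on its own, with no instruction stream driving it.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


def data_stream(memory: Sequence[int], addresses: Iterable[int]) -> Iterator[int]:
    """Auto-sequencing memory block: turns an address sequence into a data stream."""
    for address in addresses:
        yield memory[address]


def dpu(op: Callable[..., int], *inputs: Iterable[int]) -> Iterator[int]:
    """One datapath unit: no program counter, it fires when operands arrive."""
    for operands in zip(*inputs):
        yield op(*operands)


def run_pipe_network(memory_a: Sequence[int], memory_b: Sequence[int]) -> List[int]:
    # Configuration phase (before run time): wire generators, memories and DPUs.
    xs = data_stream(memory_a, address_generator(0, rows=2, cols=2, row_stride=4))
    ys = data_stream(memory_b, address_generator(0, rows=2, cols=2, row_stride=4))
    products = dpu(lambda x, y: x * y, xs, ys)   # first pipeline stage
    results = dpu(lambda p: p + 1, products)     # second pipeline stage
    # Execution phase: only data moves through the already configured pipe network.
    return list(results)


if __name__ == "__main__":
    print(run_pipe_network(list(range(8)), [10] * 8))
```

The split into a configuration phase and an execution phase is the point of the sketch: once the pipe network is wired, nothing resembling an instruction fetch occurs, and the address sequences are produced by the generators rather than computed by the datapath.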
The Personal Supercomputer is Near
The Munich-based startup PACT has demonstrated that a 56-core 16-bit rDPA running at less than 500 MHz can simultaneously host everything needed for a world TV controller, such as multiple standards, all types of conversions, (de)compaction, image improvements and repair, all sizes and technologies of screens, and all kinds of communication -- including wireless. This means higher performance with fewer CPUs, by using reconfigurable units instead of CPUs. With this methodology, a single-chip game console is feasible on such a coarse-grained reconfigurable platform. A highly promising vision would be a super Pentium with multiple dual-mode PUs, which could individually run in CPU mode or in rDPU mode (not using the program counter). Choosing the right distributed on-chip memory and the right reconfigurable interconnect between these PUs is the key issue.
Management Deficits
Because of educational background problems, the impact of a paradigm shift can hardly be explained to managers in a one-page executive summary. A well-known historical example is the windows methodology of OS user interfaces, which was pioneered on the Alto computers running at Xerox PARC in the late 1970s. At that time, the very rich Xerox Corp. failed to launch this as a product, so a startup called Apple Computer was needed to open this market with the Macintosh. So, it is not very likely that one of the major microprocessor vendors now going toward multi-core microprocessor chips will adopt the coarse-grained reconfigurable array methodology. Another vision, for example, would be that Xilinx inserts an rDPA onto a new platform FPGA targeting the scientific computing market. The RAMP project's proposal to run the operating system on an FPGA sounds like "Xilinx Inside" instead of "Intel Inside."