HPCwire
The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / November 15, 2006
Features:
AMD versus Intel: The Compiler as Referee

The divergence of AMD and Intel x86 implementations has created certain "challenges" for compiler vendors. Application developers would like to deliver a single binary that can execute optimally on both architectures. PGI's solution lets users build separate versions of code for each chip and package them into a single PGI Unified Binary. The Portland Group's Michael Wolfe describes the rationale and implementation of this technology, which he presented in an Exhibitor Forum this week at SC06.

HPCwire: Today, what are the main differences between EM64T and AMD64 that a compiler writer needs to be aware of? Are there still ISA differences or is it all micro-architecture?

Wolfe: The main differences are in the chip implementations, the detailed micro-architectures of the processor cores. For instance, Intel EM64T processors have typically been implemented with deeper instruction pipelines and higher clock rates. This increases the importance of good scheduling by the compiler in order to avoid pipeline stalls and extract maximum performance from the chip. With respect to streaming SIMD extensions (SSE) instructions, Intel EM64T chips use parallel floating point pipelines, which provide higher performance for packed arithmetic but no advantage for scalar code.

The AMD64 implementation uses separate pipelined floating point units. This allows for faster double-precision scalar performance, but essentially means that AMD64 has the same peak performance for double-precision scalar or packed SSE instructions. There is also a wide variety of cache sizes and configurations, which are significant to how and when a compiler should generate parallel code on a multi-core processor.

There are also temporal ISA incompatibilities. Intel introduced SSE2 instructions to the Pentium 4, and these were later adopted by AMD as part of AMD64. AMD introduced 64-bit extensions and extended register sets to the x86 architecture with AMD64, and these were eventually adopted by Intel. Intel introduced SSE3 instructions with EM64T, which AMD adopted soon thereafter, and Supplemental SSE3 instructions with Core 2, which create binary incompatibilities between Core 2 processors and current generation AMD64 processors. The PGI Unified Binary allows users to leverage these innovations as they occur, without generating code that is sub-optimal or simply does not work on competing processors.

For the compiler, we see five distinct categories. First, as mentioned, some instructions are introduced by one vendor, so there is a time period where the ISA is different. Second, the scheduling rules for instructions differ among the processors; typically, the schedule is more critical for the Intel processors, with their deeper pipelines. Third, instruction selection can also be important. We have cases where two or more instructions or instruction sequences give the same result; due to the micro-architectural differences, different instructions or sequences will be faster for the two vendors. Fourth, the various choices for vectorizing for the packed SSE arithmetic can be very specific to the chip; this includes instruction selection and scheduling, but also involves tradeoffs in the breakpoint between scalar and vector code, whether to optimize for aligned operands, and so on. Lastly, and also related to vectorization, cache optimizations can depend on the cache size, which differs between the chips and even between different revisions of the same processor.
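As a rough illustration of the fourth and fifth categories, a compiler might keep a small table of per-target tuning parameters and consult it when deciding between scalar and packed code or when choosing a blocking factor. The sketch below is not PGI's implementation; the structure, the function names, the two anonymous targets, and every number in it are hypothetical and chosen only to show the shape of the decision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-target tuning record; all fields and values are invented. */
typedef struct {
    const char *name;
    size_t vector_breakpoint;  /* shortest loop worth running as packed code */
    size_t cache_bytes;        /* cache capacity assumed when blocking loops */
} tuning_t;

/* Two made-up targets that differ in both parameters. */
const tuning_t TUNING_A = { "target-a", 8, 512 * 1024 };
const tuning_t TUNING_B = { "target-b", 4, 1024 * 1024 };

/* Nonzero if a loop of n iterations should take the packed (vector) path. */
int use_vector_path(const tuning_t *t, size_t n)
{
    assert(t != NULL);
    return n >= t->vector_breakpoint;
}

/* Elements of type double per block so that `operands` arrays together
   occupy at most half of the assumed cache. */
size_t block_elems(const tuning_t *t, size_t operands)
{
    assert(t != NULL && operands > 0);
    return (t->cache_bytes / 2) / (operands * sizeof(double));
}
```

With these invented values, a four-iteration loop would stay scalar on target-a but use packed code on target-b, and the same loop nest would be blocked at different sizes on each. That is the kind of divergence that makes a single generic code path a compromise.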

HPCwire: In general, do you see the AMD64 and EM64T architectures diverging or converging in the future? For example, are the latest Intel Xeon and AMD Opteron Rev F processors more different from each other than previous generations?


Wolfe: This is a very difficult question. Certainly it has been the case over the past few years that the processors from AMD and Intel have never been 100 percent binary compatible. As noted above, there are significant micro-architectural differences that affect optimization, even for features such as SSE1/SSE2/SSE3 where they are now compatible. Our assumption is that at the very least the processors will continue to have ISA and micro-architecture differences, with both AMD and Intel adopting successful innovations by the other over time. For the HPC community, we expect optimizing for specific processors will continue to be quite important. From our perspective, that is, from the compiler's point of view, the latest Intel and AMD chips require more target-specific tuning with each new revision.

HPCwire: Can you describe the PGI Unified Binary capability and its principal advantages?

Wolfe: As always, PGI is looking to deliver user-centric HPC solutions. We designed the PGI Unified Binary to address the needs of HPC developers who build applications on either Intel or AMD processor-based hardware platforms. Suppose you have a performance-critical application with many users, or one that must run in an environment where there are workstations and clusters based on both AMD and Intel processors. You want to optimize for the target platform, but you don't really know which platform your users are running. You could just deliver source code, letting application users build for their own systems, but that assumes the code is not proprietary and that users of the application are willing and able to build and deploy the application from source. You could choose to target one hardware platform, requiring customers or users to conform to that choice, but that restricts your customer base and limits users in their upgrade paths. You could build a generic, non-vendor-specific version of the code, but this leaves some performance on the table. You could build separate binaries for each platform, but this increases the costs of building, testing, packaging, installation, and support, including the problems facing users when they upgrade their systems.

The PGI Unified Binary solution is to optionally build two (or more) versions of each function in the application, each version highly optimized for a specific target processor. At runtime, the program automatically tests the processor type on which it is running and executes the version best optimized for that type. In particular, this means that optimized AMD and Intel versions of the program can be packaged into one executable file. This can significantly reduce the cost of building, testing, producing and supporting performance-sensitive applications that will run on both Intel and AMD processor-based systems.
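The general idea of runtime version selection can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the mechanism the PGI compilers generate: the function names (`target_from_vendor`, `select_dot`, `dot`) are hypothetical, the two loop bodies merely stand in for code a compiler would tune per target, and the mapping of targets to loop shapes is arbitrary. The vendor check uses the `__get_cpuid` helper from the GCC/Clang `<cpuid.h>` header and falls back to a generic version elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>              /* GCC/Clang wrapper around the CPUID instruction */
#define HAVE_CPUID 1
#endif

typedef enum { TARGET_GENERIC, TARGET_AMD, TARGET_INTEL } target_t;
typedef double (*dot_fn)(const double *, const double *, size_t);

/* Map a CPUID vendor string to one of the targets code was built for. */
target_t target_from_vendor(const char *vendor)
{
    assert(vendor != NULL);
    if (strcmp(vendor, "AuthenticAMD") == 0) return TARGET_AMD;
    if (strcmp(vendor, "GenuineIntel") == 0) return TARGET_INTEL;
    return TARGET_GENERIC;
}

/* Fill out[] with the vendor string; it stays empty if CPUID is unavailable. */
void read_cpu_vendor(char out[13])
{
    memset(out, 0, 13);
#ifdef HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        memcpy(out + 0, &ebx, 4);   /* leaf 0 returns the name in EBX, EDX, ECX */
        memcpy(out + 4, &edx, 4);
        memcpy(out + 8, &ecx, 4);
    }
#endif
}

/* Stand-in for one tuned version: a plain scalar loop. */
static double dot_scalar(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Stand-in for another tuned version: two accumulators, unrolled by two. */
static double dot_unrolled(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n) s0 += x[i] * y[i];   /* leftover element when n is odd */
    return s0 + s1;
}

/* Choose a version for a target; the mapping here is illustrative only. */
dot_fn select_dot(target_t t)
{
    switch (t) {
    case TARGET_INTEL: return dot_unrolled;
    case TARGET_AMD:   return dot_scalar;
    default:           return dot_scalar;
    }
}

/* Public entry point: resolves the version once, on the first call.
   (Not thread-safe as written; a real dispatcher would guard this.) */
double dot(const double *x, const double *y, size_t n)
{
    static dot_fn impl = NULL;
    if (impl == NULL) {
        char vendor[13];
        read_cpu_vendor(vendor);
        impl = select_dot(target_from_vendor(vendor));
    }
    return impl(x, y, n);
}
```

Callers simply invoke `dot(x, y, n)`. The processor test happens once, and every later call pays only for an indirect call, which is one way the selection overhead can stay small.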

HPCwire: How would you characterize performance and memory use for PGI Unified Binary code compared to code generated for specific x64 processors?

Wolfe: We've been very pleased, even pleasantly surprised, at how low the overhead is for the processor selection. In the great majority of cases, the overhead of using a PGI Unified Binary ranges from unmeasurable to a couple of percent, relative to using a processor-specific binary. As you might expect, we are looking at additional schemes to reduce this further in the future. The memory footprint is the next natural question, and yes, the program can take up nearly twice as much space if every function in the program is optimized for a specific processor. This includes both the space for the binary on disk and the virtual memory footprint when loaded. Unlike the Apple universal binary, we don't have control over the loader, so we can't load just the versions that we want. However, we are looking at methods to reduce the virtual memory and instruction cache pollution from having multiple versions of certain functions.

HPCwire: How long has the PGI Unified Binary capability been available in PGI compilers? Do you have a sense that customers are using this capability or are you still trying to get the word out?

Wolfe: We first introduced the PGI Unified Binary in our 6.1 compilers, released in February 2006. Leading edge ISVs became quite interested, for exactly the reasons we mentioned above. It's a tool we expect the average HPC user to take more advantage of over time as performance and features seesaw back and forth between AMD and Intel. HPC users typically edit/compile/run on their own hardware. As more sites end up with systems based on both types of processors, we expect it to become the norm for a customer to want to build a binary that runs optimally on either an Intel Core 2 Duo laptop or an AMD Opteron cluster. We expect that the importance and attraction of the PGI Unified Binary will also grow as the performance of each chip generation becomes more sensitive to the generated code.

HPCwire: Anything else you'd like to add?


Wolfe: As a former academic, I like to view the relation between compiler and processor as a classical software vs. hardware tradeoff. The RISC revolution in the early 1980s moved many decision factors away from the hardware, the processor control unit, to the software, the compiler. Superscalar and VLIW processors attempted to solve the same problem, multiple instruction issue: one using hardware in the control unit, the other using software in the compiler. The advantage of a hardware solution is the presence of actual runtime information; a hardware branch prediction unit can take advantage of actual branch histories. The advantage of a software solution is a big window; a compiler can optimize across functions and even across the whole program, perhaps removing the branch altogether. In a broad sense, these are competing solutions to performance, but in fact, they must work in concert; superscalar processors, for instance, deliver much higher performance when the compilers are tuned to schedule independent instructions together.

Now we've got two very aggressive chip vendors providing two different hardware solutions for executing the same system software. Tuning your software for either potentially sacrifices performance on the other. PGI has developed a software solution to deliver the advantages of each, leveling the playing field between the hardware vendors to the benefit of HPC end-users.