AMD | PAPPP's Rambling

Post-SC, I sat down to do some deeper reading on the HSA (Heterogeneous System Architecture) stuff. This is AMD/ARM (and many friends)’s plan for the future, and it is pretty fucking exciting (in an obscure technical sort of way).

The best starting point I found is this year-old whitepaper [PDF warning]. They’re using slightly odd terminology; the important bits are: LCU = Latency Compute Unit = conventional MIMD CPU core; TCU = Throughput Compute Unit = accelerator, typically SIMD-engine-ish like a GPU; HSAIL = HSA Intermediate Language = an IR that can be compiled at install/run time to the accelerators’ ISAs. The hardware-side implementation details are nowhere to be found, but there are a lot of seriously exciting model-affecting things detailed on the software side. The general model, with things broken into grid, work group, work item, and wavefront, is FAR more sane than most of the parallel schemes (I’m thinking specifically of the awful CUDA nomenclature). Internally, the exciting stuff includes requiring a limited sort of preemption on the accelerators, a relaxed consistency model across memory shared over a whole system (nice thread-like shared memory), an intermediate low-level language/VM for portability, and assurances about barrier capability in the TCU. The actual objects are basically FAT ELFs with a complete copy of the program for the LCU, plus the HSAIL representation for the parts that can be shipped to TCUs. I’m pleased that there seems to be a clever run-time that does a bunch of platform enumeration and controls where parts run in a rule-automated-but-overridable way.
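To make the grid / work group / work item / wavefront nesting concrete, here is a minimal Python sketch of how a flat 1-D grid index could be split into those levels. The sizes (256 work items per work group, 64 per wavefront) and the names `locate` and `work_groups_needed` are my own assumptions for illustration, not numbers or APIs taken from the whitepaper.

```python
from dataclasses import dataclass

# Assumed sizes, for illustration only (not taken from the whitepaper).
WORK_GROUP_SIZE = 256  # work items per work group (hypothetical)
WAVEFRONT_SIZE = 64    # work items per wavefront (hypothetical)


@dataclass(frozen=True)
class WorkItemId:
    work_group: int  # which work group within the grid
    wavefront: int   # which wavefront within that work group
    lane: int        # position of the work item within its wavefront


def locate(global_id: int) -> WorkItemId:
    """Split a flat 1-D grid index into (work group, wavefront, lane)."""
    if global_id < 0:
        raise ValueError("global_id must be non-negative")
    work_group, local_id = divmod(global_id, WORK_GROUP_SIZE)
    wavefront, lane = divmod(local_id, WAVEFRONT_SIZE)
    return WorkItemId(work_group, wavefront, lane)


def work_groups_needed(grid_size: int) -> int:
    """Number of work groups needed to cover grid_size work items (rounds up)."""
    return -(-grid_size // WORK_GROUP_SIZE)


if __name__ == "__main__":
    print(locate(300))
    print(work_groups_needed(1000))
```

The point is only that every work item has one address at each level; a real runtime would also deal with multi-dimensional grids and partially filled work groups.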

I had some folks at SC tell me they’d try to get me a more implementation-focused whitepaper on the hardware side at AMD but they weren’t sure if/when details would be clear for distribution. On the software side, the details are in a published draft of the ISA/Model/Compiler Writer’s Guide that I browsed around in a bit and found very enlightening. The reference tool-chain seems to be mostly built on LLVM and OpenCL.

I have some other SC-related thoughts to share, but I want to get them a little bit more refined (and decide which are for public consumption) before I post.

I learned some really interesting things at SC this year, and now that I’ve had a day to process, I want to share. Many of these observations come from first- or second-hand conversations, or justifiable interpretations of press releases, so I don’t promise they are correct, but they are plausible, explanatory, and interesting. I apologize for the 1,000-word wall of text, but there is a lot of good stuff.

This is the big one: I’m pretty sure I understand the current long term architecture plan being pursued by Intel, AMD, and Nvidia. This plan signals the end of the current style of monolithic symmetric processor cores.
They are all apparently pursuing designs with a small N of large integer units, coupled to M >> N SIMD engines.
- Nvidia’s “Project Denver” is a successor/big sibling to the Tegra design, and appears to be the beginning of a line with 2-8 64-bit (probably) ARM cores tightly integrated with a big honking GPU-like SIMD structure for FP. The stale press release about this stuff is kind of nauseating to read, but it looks like they’re betting the farm on that design.
- Intel’s HPC efforts are going to be based on a lot of MIC (Many Integrated Core, successor to the Larrabee stuff) parts coupled with a few big cores like the current Xeons. The MIC chips are basically large numbers of super-Atoms: tiny, simple, dumb integer units attached to big SWAR (SIMD Within a Register) units focused on SSE/AVX performance. This is less speculative than most of these observations; they made a pretty good press push (this, for example) on this idea.
  The ring interconnects and higher per-“thread” hardware complexity are probably not a good idea in the long run (IMHO), but having an integer unit for every big SWAR engine will be a major advantage in terms of programming environment and code generation. I suspect the more cautious approach is because Intel doesn’t want/can’t afford another Itanic, where the tools couldn’t generate good code for the programming model on their intended high-end part.
- AMD’s two current products are stepping stones to a design similar to Nvidia’s. Bulldozer is a design with some ridiculously powerful x86-64 integer units decoupled from a smaller number of shared FPUs. The APU (I haven’t heard the “Fusion” name in a while) designs are CPUs tightly coupled to GPU structures. The successor parts will be a hybrid of the two: a few big, Bulldozer-style integer units, with a large number of wide next-gen GPU SIMD structures coupled to them.
I think this is generally a good design direction, particularly with current directions in computing in mind, but it is going to make the compiler/concurrent programming world exciting for a while.
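Since SWAR comes up in the Intel item above, here is a minimal Python sketch of the idea in its classic integer-register form: four 8-bit lanes packed into one 32-bit word and added with ordinary integer operations. This is not how the SSE/AVX units do it in hardware, a Python integer is only standing in for a 32-bit register, and the helper names (`pack`, `unpack`, `swar_add`) are made up for illustration.

```python
LANES = 4              # number of packed lanes (assumed for this sketch)
LANE_BITS = 8          # width of each lane in bits
HIGH = 0x80808080      # top bit of every 8-bit lane
LOW = 0x7F7F7F7F       # low 7 bits of every 8-bit lane


def pack(values):
    """Pack four 0..255 values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]


def swar_add(a, b):
    """Lane-wise 8-bit add (wrapping mod 256) using one 32-bit integer add."""
    # Add only the low 7 bits of each lane, so a carry can reach a lane's
    # own top bit but can never spill into the neighboring lane.
    low = (a & LOW) + (b & LOW)
    # Fold the top bits in with XOR, i.e. add them and drop the carry-out.
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


if __name__ == "__main__":
    x = pack([1, 2, 3, 4])
    y = pack([10, 20, 30, 40])
    print(unpack(swar_add(x, y)))
```

The masking is there so that an overflow in one lane wraps within that lane instead of corrupting its neighbor; dedicated SIMD hardware gives you that lane isolation for free.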
AMD appears to be gearing up to abandon a fifth generation of GPGPU products. CTM, CAL, Brook+, and OpenCL on 4000-series cards have all been deprecated while still shipping, and indications are that OpenCL (and general driver) support for the current architecture (4-wide VLIW SIMDs, like in the 5- and 6-series) has been relegated to second-class citizen status while they work on a next-generation architecture. The rumor is the next-gen parts will be 4 independent banks of SIMD engines instead of 4-wide VLIW SIMD engines, which should be both nicer to program and generate code for, and more similar to Nvidia.
Nvidia is going to open source their CUDA environment. One of the primary objections to CUDA in a lot of circles is reluctance to use a proprietary single-vendor programming environment (people who have been in super/scientific computing for long have all been burnt on that in the past), and the Integer+SIMD model is going to require that not be an issue. This is assembled from information from several places, including PGI, Nvidia, and various scientific compute facilities, much of it second hand or further, but it would make sense.
I still don’t exactly know what went down at Infiscale, but the impression that the Perceus community was abandoned by the company, the developers fled, and it was a bad scene seems to be correct. No one I know that was there seems to be talking, but they’re all on their way to other interesting things, especially Greg Kurtzer’s Warewulf3 project at LBL.
The dedicated high performance compute nodes in Amazon’s EC2 cloud are actually connected as a few large partitionable clusters, users just can’t (nominally, don’t need to) see and instrument the topology like they could with a normal cluster. This is from interpreting press releases, because the people manning Amazon’s booth really didn’t want to chat (and, in fact, were kind of dicks when we tried). This explains how they’ve been getting performance out of a loosely coupled cloud — which is to say they aren’t, they just have a huge cluster attached to their cloud that shares the interface.
The current hard drive production problems have given SSDs the opportunity they need to become first-class citizens. Talking to OEMs, the wholesale cost per capacity on HDDs almost tripled, and the supply lines aren’t all that stable, so everyone is scrambling to make things work with mostly SSDs. I saw a lot of interesting new form factors for SSDs, and several flavors of flash- or battery-backed “nonvolatile” DRAM floating about as well, so the nature of storing data-sets is changing.
I saw motherboards with 32 DIMM slots (mostly AMD Interlagos based) on the floor. I saw 32GB DIMMs on the floor. I saw some shared-memory systems with multiple Terabytes of RAM in them. The standard for high memory machines has roughly quadrupled in the last year or two.
The number of women (not booth babes, real technical people, especially younger ones) and educators on the show floor this year was way higher than in the past. This is very good for the field.

I think that covers most of the really good stuff coming off the floor this year, although I am still processing and may come up with some other insights when I’ve had more sleep and discussion.
Also, Pictures! WOO! (Still sorting and uploading the last batch at time of posting).


Tag Archives: AMD

Heterogeneous System Architecture

SC’11 Lessons
