Posting up my notes from SC13 is another thing I didn’t get to during the end of the semester. Remedying now.
The main chain of takeaways from conversations on the floor goes like this:
- The era of single-core performance gains is already over.
- Furthermore, the era of usable single-die performance for MIMD machines is coming to an end.
- Therefore, big machines are going to be getting physically bigger… to the point where connection lengths are a problem (everything is InfiniBand, and InfiniBand doesn’t tolerate long runs well).
- There is a LOT of cooling effort to make the necessary density happen – central large fan systems, immersion cooling, closed-circuit water gear, etc.
The other really exciting thing is that it seems AMD is going to make it, and more. Their lean period finished when the payoff on the XBone/PS4 came in, and they have a VERY good plan for the next >2 years. It works with the premise above about single-core/die MIMD performance ending, and points in the HSA direction: these are the crazy parts with MMUs that let a CPU and GPU share memory without skew penalty and such. ARM and partners are also generally pointed that way, and have been for some time. Apparently AMD isn’t getting out of the x86 game, but it does look like they are getting out of the fat core game.
Post-SC, I sat down to do some deeper reading on the HSA (Heterogeneous System Architecture) stuff. This is AMD/ARM (and many friends)’s plan for the future, and it is pretty fucking exciting (in an obscure technical sort of way).
The best starting point I found is this year-old whitepaper [PDF warning]. They’re using slightly odd terminology; the important bits are:
- LCU = Latency Compute Unit = conventional MIMD CPU core
- TCU = Throughput Compute Unit = accelerator, typically SIMD-engine-ish like a GPU
- HSAIL = HSA Intermediate Language = IR that can be compiled at install/run time to each accelerator’s ISA
The hardware-side implementation details are nowhere to be found, but there are a lot of seriously exciting model-affecting things detailed on the software side. The general model, with things broken into grid, work group, work item, and wavefront, is FAR more sane than most of the parallel schemes (I’m thinking specifically of the awful CUDA nomenclature). Internally, the exciting stuff includes requiring a limited sort of preemption on the accelerators, a relaxed consistency model across memory shared over a whole system (nice thread-like shared memory), an intermediate low-level language/VM for portability, and assurances about barrier capability in the TCU.
The actual objects are basically FAT ELFs with a complete copy of the program for the LCU, plus the HSAIL representation for the parts that can be shipped to TCUs. I’m pleased that there seems to be a clever run-time that does a bunch of platform enumeration and controls where parts run in a rule-automated-but-overrideable way.
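To make the grid / work group / work item / wavefront breakdown a bit more concrete, here is a rough sketch of how a single work item's position falls out of its flat ID. This is my own illustration, not anything lifted from the whitepaper: the `locate` function and `WorkItemPosition` type are hypothetical names, the grid is flattened to one dimension for simplicity, and the wavefront size of 64 is an assumption (it is a property of the particular hardware, not something the model fixes).

```python
from dataclasses import dataclass

# Assumption: 64 work items per wavefront. The real value depends on the
# hardware; it is only a default here so the example has numbers to show.
WAVEFRONT_SIZE = 64


@dataclass(frozen=True)
class WorkItemPosition:
    """Where one work item sits in the hierarchy (hypothetical helper type)."""
    work_group: int  # which work group within the grid
    local_id: int    # index of the work item inside its work group
    wavefront: int   # which wavefront inside the work group
    lane: int        # position inside that wavefront


def locate(global_id, grid_size, work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Split a flat work-item ID into work group, wavefront, and lane.

    Simplification: treats the grid as one-dimensional.
    """
    if not 0 <= global_id < grid_size:
        raise ValueError("global_id is outside the grid")
    # The grid is carved into equal-sized work groups...
    work_group, local_id = divmod(global_id, work_group_size)
    # ...and each work group is carved into wavefronts that run in lockstep.
    wavefront, lane = divmod(local_id, wavefront_size)
    return WorkItemPosition(work_group, local_id, wavefront, lane)


if __name__ == "__main__":
    # Work item 300 in a grid of 1024 items with 256-item work groups.
    print(locate(300, grid_size=1024, work_group_size=256))
```

The point is just that the hierarchy is plain nested division: the grid is the whole problem, work groups are the chunks that can synchronize with each other, and wavefronts are the hardware-sized slices of a work group.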
I had some folks at SC tell me they’d try to get me a more implementation-focused whitepaper on the hardware side at AMD but they weren’t sure if/when details would be clear for distribution. On the software side, the details are in a published draft of the ISA/Model/Compiler Writer’s Guide that I browsed around in a bit and found very enlightening. The reference tool-chain seems to be mostly built on LLVM and OpenCL.
I have some other SC-related thoughts to share, but I want to get them a little bit more refined (and decide which are for public consumption) before I post.
I will be at SC’13 November 16-21 with the aggregate.org/University of Kentucky research exhibit again this year in booth 629. Media and impressions should appear somewhere in my ‘net presence during and after the conference; it is always a good show.
Edit: Pushing photos from the show floor into this album.