SSE code 6 times slower without VZEROUPPER on Skylake (2016)

Source: Hacker News

Article note: Hey look, that lesson we supposedly learned in the 80s about obscenely, incomprehensibly complicated architectures being a bad idea is back for its regular visit. The plethora of vector extensions in x86 interact in implementation-dependent, non-local ways, with massive performance implications.
