Monthly Archives: July 2024

Intel Addresses Desktop Raptor Lake Instability Issues: Faults Excessive Voltage from Microcode, Fix Coming in August

Source: AnandTech Articles

Article note: Can't wait to hear what the performance regressions from a microcode patch that prevents a localized power problem will be.

What started last year as a handful of reports about instability with Intel's Raptor Lake desktop chips has, over the last several months, grown into a much larger saga. Facing their biggest client chip instability impediment in decades, Intel has been under increasing pressure to figure out the root cause of the issue and fix it, as claims of damaged chips have stacked up and rumors have swirled amidst the silence from Intel. But, at long last, it looks like Intel's latest saga is about to reach its end, as today the company announced that they've found the cause of the issue, and will be rolling out a microcode fix next month to resolve it.

Officially, Intel has been working to identify the cause of desktop Raptor Lake’s instability issues since at least February of this year, if not sooner. In the interim they have discovered a couple of correlating factors – telling motherboard vendors to stop using ridiculous power settings for their out-of-the-box configurations, and finding a voltage-related bug in Enhanced Thermal Velocity Boost (eTVB) – but neither factor was the smoking gun that set all of this into motion. All of which had left Intel to continue searching for the root cause in private, with lots of awkward silence filling the gaps in public.

But it looks like Intel’s search has finally come to an end – even if Intel isn’t putting the smoking gun on public display quite yet. According to a fresh update posted to the company’s community website, Intel has determined the root cause at last, and has a fix in the works.

Per the company’s announcement, Intel has tracked down the cause of the instability issue to “elevated operating voltages”, which, at its heart, stems from a flawed algorithm in Intel’s microcode that requested the wrong voltage. Consequently, Intel will be able to resolve the issue through a new microcode update, which, pending validation, is expected to be released in the middle of August.

Based on extensive analysis of Intel Core 13th/14th Gen desktop processors returned to us due to instability issues, we have determined that elevated operating voltage is causing instability issues in some 13th/14th Gen desktop processors. Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor.

Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages. We are continuing validation to ensure that scenarios of instability reported to Intel regarding its Core 13th/14th Gen desktop processors are addressed. Intel is currently targeting mid-August for patch release to partners following full validation.

Intel is committed to making this right with our customers, and we continue asking any customers currently experiencing instability issues on their Intel Core 13th/14th Gen desktop processors reach out to Intel Customer Support for further assistance.
-Intel Community Post

And while there’s nothing good for Intel about Raptor Lake’s instability issues or the need to fix them, that the problem can be ascribed to (or at least fixed by) microcode is about the best possible outcome the company could hope for. Across the full spectrum of potential causes, microcode is the easiest to fix at scale – microcode updates are already distributed through OS updates, and all chips of a given stepping (millions in all) run the same microcode. Even a motherboard BIOS-related issue would be much harder to fix given the vast number of different boards out there, never mind a true hardware flaw that would require Intel to replace even more chips than they already have.

Still, we’d also be remiss if we didn’t note that microcode is regularly used to paper over issues further down in the processor, as we’ve most famously seen with the Meltdown/Spectre fixes several years ago. So while Intel is publicly attributing the issue to microcode bugs, there are several more layers to the onion that is modern CPUs that could be playing a part. In that respect, a microcode fix grants the least amount of insight into the bug and the performance implications of its fix, since microcode can be used to mitigate so many different issues.

But for now, Intel’s focus is on communicating that they have a fix and establishing a timeline for distributing it. The matter has certainly caused them a lot of consternation over the last year, and it will continue to do so for at least another month.

In the meantime, we’ve reached out to our Intel contacts to see if the company will be publishing additional details about the voltage bug and its fix. “Elevated operating voltages” is not a very satisfying answer on its own, and given the unprecedented nature of the issue, we’re hoping that Intel will be able to share additional details as to what’s going on, and how Intel will be preventing it in the future.

Intel Also Confirms a Via Oxidation Manufacturing Issue Affected Early Raptor Lake Chips

Tangential to this news, Intel has also made a couple of other statements regarding chip instability to the press and public over the last 48 hours that also warrant some attention.

First and foremost, leading up to Intel’s official root cause analysis of the desktop Raptor Lake instability issues, one possibility that couldn’t be written off at the time was that the root cause of the issue was a hardware flaw of some kind. And while the answer to that turned out to be “no,” there is a rather important “but” in there, as well.

As it turns out, Intel did have an early manufacturing flaw in the enhanced version of the Intel 7 process node used to build Raptor Lake. According to a post made by Intel to Reddit this afternoon, a “via Oxidation manufacturing issue” was addressed in 2023. However, despite the suspicious timing, according to Intel this is separate from the microcode issue driving instability issues with Raptor Lake desktop processors up to today.

Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.

Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.

For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed.
-Intel Reddit Post

Ultimately, Intel says that they caught the issue early-on, and that only a small number of Raptor Lake chips were affected by the via oxidation manufacturing flaw. Which is hardly going to come as a comfort to Raptor Lake owners who are already worried about the instability issue, but if nothing else, it’s helpful that the issue is being publicly documented. Typically, these sorts of early teething issues go unmentioned, as even in the best of scenarios, some chips inevitably fail prematurely.

Unfortunately, Intel’s revelation here doesn’t offer any further details on what the issue is, or how it manifests itself beyond further instability. Though at the end of the day, as with the microcode voltage issue, the fix for any affected chips will be to RMA them with Intel to get a replacement.

Laptops Not Affected by Raptor Lake Microcode Issue

Finally, ahead of the previous two statements, Intel also released a statement to Digital Trends and a few other tech websites over the weekend, in response to accusations that Intel’s 13th generation Core mobile CPUs were also impacted by what we now know to be the microcode flaw. In the statement, Intel disputed those claims, stating that laptop chips were not suffering from the same instability issue.

Intel is aware of a small number of instability reports on Intel Core 13th/14th Gen mobile processors. Based on our in-depth analysis of the reported Intel Core 13th/14th Gen desktop processor instability issues, Intel has determined that mobile products are not exposed to the same issue. The symptoms being reported on 13th/14th Gen mobile systems – including system hangs and crashes – are common symptoms stemming from a broad range of potential software and hardware issues. As always, if users are experiencing issues with their Intel-powered laptops we encourage them to reach out to the system manufacturer for further assistance.
-Intel Rep to Digital Trends

Instead, Intel attributed any laptop instability issues to typical hardware and software issues – essentially claiming that laptops weren’t experiencing elevated rates of instability. Whether this statement accounts for the via oxidation manufacturing issue is unclear (in large part because not all 13th Gen Core Mobile parts are Raptor Lake), but this is consistent with Intel’s statements from earlier this year, which have always explicitly cited the instability issues as desktop issues.

Posted in News | Leave a comment

How to use the new counted_by attribute in C (and Linux)

Source: Hacker News

Article note: Ooohh. That's an immediately obviously useful feature.
Posted in News | Leave a comment

CrowdStrike broke Debian and Rocky Linux months ago, but no one noticed

Source: Hacker News

Article note: LOL, they had a similar fuckup with their less-widely-used Linux client a while back that they managed to keep quiet. Bolt-on, liability-shifting security bullshit being bullshit is a decades-old story. Late-stage capitalism min-maxing reaching the "reap" stage (or the "find out" stage in the modern formulation) is dominating the news. Random loosely-affiliated groups making software that is better in almost every way than the products of billion-dollar companies is becoming a refrain. What an era.
Posted in News | Leave a comment

CrowdStrike issue is causing massive computer outages worldwide

Source: OSNews

Article note: Oh man, again? Bolt-on third-party "security" company, of the appeals-to-C-suite-types-for-outsourcing-liability style (run by a former McAfee exec; the hustle never changes for these people), has a kernel driver on all their WinNT clients to enable file-scanning and monitoring (and remote shell and...). Apparently their Linux client is also failing, but in a slightly less absurd way. This time (as opposed to when it was SolarWinds. Or Okta. Or...), instead of getting their infrastructure hacked in a multilevel supply-chain attack, they're apparently just grossly incompetent and pushed an automated update to the scanner definition file which breaks the parser - which is running as privileged code - killing the kernel module and blue-screening then bootlooping the system. 'Somehow' they didn't catch this in testing before deploying to half of the global enterprise market, because their test setup is probably to spin up a reference VM, apply the update, see that it applied, then automatically wipe the whole thing, because more than that would be expensive. And all their customers, because they're primarily a compliance tool, have automatic updates turned on so they don't have to explain their update test/hold/deploy scheme to regulators, so everyone, everywhere, all at once got this update. I've been hearing years of "Maximize homogeneity", "Continuous, Silent, Automatic update everything" and "Outsource your monitoring and Auth to security professionals" as best practice and uh... how's that goin? Minor global catastrophe? Again? Yea. Presumably Zscaler, their largest competitor, will have a good time until they inevitably do the same kind of bullshit, because the whole product category is mostly a scam. Glad I'm not working in IT this week.

Well, this sure is something to wake up to: a massive worldwide outage of computer systems due to a problem with CrowdStrike software. Payment systems, airlines, hospitals, governments, TV stations – pretty much anything or anyone using computers could be dealing with bluescreens, bootloops, and similar issues today. Open-heart surgeries had to be stopped mid-surgery, planes can’t take off, people can’t board trains, shoppers can’t pay for their groceries, and much, much more, all over the world.

The problem is caused by CrowdStrike, a sort of enterprise AV/monitoring software that uses a Windows NT kernel driver to monitor everything people do on corporate machines and log it for… Security purposes, I guess? I’ve never worked in a corporate setting so I have no experience with software like this. From what I hear, software like this is deeply loathed by workers the world over, as it gets in the way and slows systems down. And, as can happen with a kernel driver, a bug can cause massive worldwide outages which is costing people billions in damages and may even have killed people.

There is a workaround, posted by CrowdStrike:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it. 
  4. Boot the host normally. 
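Step 3 of the workaround boils down to a filename match followed by a delete. As a minimal sketch, assuming a Python 3 interpreter can run against the affected system's drive (which usually isn't the case from Safe Mode or the Windows Recovery Environment, and which does nothing about BitLocker), the match could look like this. The helper name `remove_bad_channel_files` and its dry-run default are hypothetical illustrations, not anything CrowdStrike published; only the directory and the `C-00000291*.sys` pattern come from the posted steps.

```python
from pathlib import Path

# Glob pattern taken from CrowdStrike's posted workaround (step 3).
BAD_CHANNEL_FILE_GLOB = "C-00000291*.sys"

# Directory from step 2; on a recovery boot the Windows drive letter may differ.
DEFAULT_DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"


def remove_bad_channel_files(driver_dir=DEFAULT_DRIVER_DIR, dry_run=True):
    """Hypothetical helper: list (and optionally delete) the faulty channel files.

    Returns the sorted list of matching paths. Nothing is deleted unless
    dry_run is False, so the match can be checked before acting on it.
    """
    matches = sorted(
        p for p in Path(driver_dir).glob(BAD_CHANNEL_FILE_GLOB) if p.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()  # remove the matching .sys file
    return matches
```

Calling it once with the default `dry_run=True` shows what would be removed; passing `dry_run=False` actually deletes. It only automates step 3 on a single machine, so it does nothing for the "70k endpoints" problem.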

This is a solution for individually fixing affected machines, but I’ve seen responses like “great, how do I apply this to 70k endpoints?”, indicating that this may not be a practical solution for many affected customers. Then there’s the issue that this may require a BitLocker recovery key, which not everyone has on hand either. To add insult to injury, CrowdStrike’s advisory about the issue is locked behind a login wall. A shitshow all around.

Do note that while the focus is on Windows, Linux machines can run CrowdStrike software too, and I’ve heard from Linux kernel engineers who happen to also administer large numbers of Linux servers that they’re seeing a huge spike in Linux kernel panics… Caused by CrowdStrike, which is installed on a lot more Linux servers than you might think. So while Windows is currently the focus of the story, the problems are far more widespread than just Windows.

I’m sure we’re going to see some major consequences here, and my – misplaced, I’m sure – hope is that this will make people think twice about one, using these invasive anti-worker monitoring tools, and two, employing kernel drivers for this nonsense.

Posted in News | Leave a comment

Valve runs its massive PC gaming ecosystem with only about 350 employees

Source: Ars Technica

Article note: Holy _fuck_ is Valve a better functioning company than what you hear about everywhere else. It's a $6.5B company with like 350 employees, and only about 10% of them are administrative. They're chiefly an ownership-attacking middleman, but they're the least gross of a spread of such players.
Artist's conception of Valve's micro-employees hard at work inside your Steam installation. (credit: Getty Images)

As a private and generally secretive company, Valve doesn't offer much outside visibility into its inner workings. So when years' worth of data on the company's employee and aggregate payroll numbers leaked recently, we were eager to take a deep dive to see what those numbers could tell us about the operation and evolution of a company that has a hand in the majority of PC gaming transactions.

The recent data comes from a poorly redacted document in Wolfire's antitrust lawsuit against Steam, as first noticed over the weekend by SteamDB's Pavel Djundik. While the key data in the document has now been properly hidden in the court docket, The Verge captured the raw numbers from a table labeled "Employee Headcount and Gross Pay Data, 2003-2021."

Breaking down that data by year and department with some simple graphs and statistics, seen below, gives us outsiders a rare partial glimpse into Valve's organization. All told, it's a bit hard to believe that this lynchpin of the PC gaming world has rested on the work of just a few hundred people for many years now.


Posted in News | Leave a comment

How the Stream Deck rose from the ashes of a legendary keyboard

Source: The Verge - All Posts

Article note: Oh man, I remember the hype around the Art Lebedev Optimus keyboards, and tracing this line is the kind of thing I do for fun.
3D render of a keyboard with LED screens for keys.
Image: Richard Parry for The Verge

Back in 2005, a small firm offered a tantalizing vision of the future of computer keyboards.

What if your keyboard was filled with tiny screens that showed you exactly what any given press would do, each built into a crystal-clear key? The keys would morph and shift as you needed, transforming from letters and numbers to full-color icons and app shortcuts, depending on what you were doing.

Readers and tech bloggers adored the idea. “It’s about time someone shook up this stagnant keyboard market,” declared Engadget. “The concept is fantastic,” wrote Gizmodo. Slashdot lit up.

The keyboard was just a concept, dreamed up by Art Lebedev, a Russian design firm, and it was an ambitious idea at that: called the Optimus Maximus, it would require...


Posted in News | Leave a comment

The Mafia of Pharma Pricing

Source: Hacker News

Article note: Everything I learn about the modern healthcare system makes it look worse.
Posted in News | Leave a comment

A bit more regarding UTM SE on the iPad

Source: Hacker News

Article note: Sigh. Apple finally approved (for sale in notionally third-party markets) a version of UTM... and they had to cripple it so thoroughly as to be useless to get it accepted. Terrible performance, jank integration, etc. An iPad with a keyboard is so close to a compelling computer, but they'll bait international regulatory agencies to make sure it stays a coercive consumption device.
Posted in News | Leave a comment

Pretty pictures, bootable floppy disks, and the first Canon Cat demo?

Source: Hacker News

Article note: Oh neat. The experiments and demos answer a bunch of questions I've had since I read about the Cat.
Posted in News | Leave a comment

Gpu.cpp: A lightweight library for portable low-level GPU computation

Source: Hacker News

Article note: Neat. Single-header wrapper around the WebGPU bindings (WebGPU was a terrible name choice for a generic mid-level GPU API) for doing compute. Less vendor-specific lock-in, less boilerplate.
Posted in News | Leave a comment