AMD EPYC 7002 Rome CPUs Hang After Less Than 3 Years of Uptime

While there are many bugs in processors given their complexity, this one is particularly interesting. Thanks to a reader who sent in this Reddit post, we were alerted that AMD EPYC 7002 "Rome" series cores can hang after just under 3 years of uptime, or around 1044 days.

This is not just speculation; it is an official AMD erratum, Errata 1474 in 56323-PUB_1.01:

A core will fail to exit CC6 after about 1044 days after the last system reset. The time of failure may vary depending on the spread spectrum and REFCLK frequency. Either disable CC6 or reboot the system before the projected time of failure. (Source: AMD Revision Guide for AMD Family 17h Models 30h-3Fh Processors)

For most of our readers, machines will be rebooted every so often for things like security patches or other maintenance windows. At the same time, this is a fairly big deal since the remedy is effectively rebooting a system.

We checked the STH lab, and it appears we actually had an HPE AMD EPYC 7002 Rome system that we forgot about hit 2 years and 261 days, or 991 days of total uptime, running Proxmox VE before the system was decommissioned. The reason the system had such high uptime is that it was part of a lab project outside our normal management tools, and we apparently forgot it was there.

Then again, a number of our readers are going to think this is silly given regular security patches. If a typical server lifecycle is 5 years these days, then one might need to do a minimum of a single reboot over its lifetime to avoid this bug, so long as the single reboot happens between days 9.
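As a back-of-the-envelope check, the errata arithmetic is simple enough to script. The sketch below is a hypothetical helper, not an official AMD or STH tool: it assumes a Linux host (it reads /proc/uptime) and uses the approximate 1044-day figure and the 991-day lab example quoted above.

```python
# Hypothetical helper (not an official tool): estimate how close a Linux
# host is to the ~1044-day CC6 hang window described in AMD Errata 1474.

ERRATA_1474_LIMIT_DAYS = 1044  # approximate days-since-reset figure from the errata
SECONDS_PER_DAY = 86400

def days_until_limit(uptime_seconds: float) -> float:
    """Days remaining before the projected failure window, given uptime in seconds."""
    return ERRATA_1474_LIMIT_DAYS - uptime_seconds / SECONDS_PER_DAY

def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """The first field of /proc/uptime is seconds since boot (Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])

# Example using the article's lab system, decommissioned at 991 days of uptime:
print(f"{days_until_limit(991 * SECONDS_PER_DAY):.0f} days of headroom remained")
```

On a live system one would pass `read_uptime_seconds()` into `days_until_limit` instead of the fixed example; for the 991-day lab machine this works out to 53 days of headroom remaining before the projected window.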
There was one common thing with the problem boards, though – KrisFix notes that they were all running the same graphics driver, the latest one from AMD.

We've reached out to AMD for comment on what might be going on here, and will update this story if we receive a response.

Analysis: Is mining mayhem a possibility?

As KrisFix points out, it's difficult to say what might be going on here with any certainty, but given the commonality with the driver, he theorizes that maybe the latest Adrenalin version (released a month ago) is pushing performance a bit harder, and perhaps some RX 6000 GPUs can't handle this.

At the same time, he makes it very clear that he doesn't yet have enough information to make that judgement, and it's pretty obvious we can't leap to any firm conclusions yet on whether the graphics driver is really at fault here.

At the moment, we just have the word of one German repair outfit, with KrisFix himself admitting that we must be very cautious and that we need further evidence. Indeed, if there really were an issue with RX 6000 GPUs, surely we'd be hearing a lot more about it by now, a month after the driver came out? On the other hand, you could argue about how many folks update their graphics driver regularly. Certainly there are a good many people who stick with the same GPU driver for a while, and definitely don't take on every single version released.

Another possibility is that the problem is limited to certain graphics cards, and not all of them. Here's where another suggestion comes into play – that the affected models are ex-mining GPUs. As mining operations threw in the towel last year, they sold off the graphics cards in their farms, and these are GPUs which have literally been flogged almost to death 24/7. So, it's not a stretch to think that a good many gamers have picked up second-hand stock of these ex-mining AMD GPUs – quite possibly without knowing they are ex-mining models, as sellers often hide this for obvious reasons – and these have become problematic.

Perhaps the issue is with those ex-mining cards and the new driver. If the latter does indeed push for more performance – which anecdotally, some owners have mentioned getting slight frame rate boosts – maybe this is tipping some GPUs, which were almost dead anyway, over the edge.

All of this is just speculation, naturally. It's also worth noting that there are scattered reports of coil whine after updating to the latest AMD driver, and of higher hotspot temperatures, which could again point to the GPU being pushed a bit harder. But we really need to wait and see whether more reports of this problem with RX 6000 cards come in, now that the purported issue is out in the open.

If there does turn out to be a problem with the graphics driver itself, that won't be a good look for AMD, considering that 2023 is hardly underway and Team Red has already undergone a GPU crisis with its new RX 7900 XTX flagship and a cooling problem. A second crisis this early in the year would be pretty embarrassing, of course.