ECC error in PNOR flash in section offset 0x00091000 #5823

Open
adeelleo opened this issue Dec 17, 2023 · 7 comments
@adeelleo

Hi Gurus,

I am trying to bring an S822LC (8335-GTB) back to life and use it for AI workloads.
The system has 2 x POWER8 10-core processors, 512 GB RAM and 4 x Nvidia P100 GPUs.

After a power failure, the machine gets stuck at boot with the error message below:

ECC error in PNOR flash in section offset 0x00091000

System shutting down with error status 0x60F
System shutting down with error status 0x90000A79

Can anyone suggest how to recover from this?

I am willing to compensate anyone who puts in the effort to help me resolve this for their time.

[Image: screenshot of the boot failure message]

@IlyaSmirnov91

There are very few people around the last couple of weeks of the year, and unfortunately nobody who's familiar with Power 8.

I can give you a couple of things to try though:

  1. Re-flash the PNOR image. There is a chance that the ECC error will go away.
  2. Guard the PNOR device or replace it.
  3. Or re-flash the BMC image if you're running with a BMC.

@adeelleo
Author

Thanks for the reply.

The suggestions you gave should technically resolve the issue.

Any idea how I would re-flash the PNOR image?

Replacement is not an option, since this part is not readily available and the few replacement options I found cost more than the server itself.

I have downloaded the latest firmware package, which contains the PNOR and BMC firmware. Unfortunately, I cannot access the machine through ipmitool to flash the firmware because I don't remember the machine's IP address. Any idea how I can find the IP address so that I can connect through IPMI?

I tried sniffing for the IP with Wireshark but was not successful.

Thanks for your time,

@IlyaSmirnov91

You could ping the machine if you remember the alias - that will give you its IP.

I found this in our P8 documentation to flash the new images:

ipmitool -H <IP> -z 20000 -I lanplus -U <user> -P <password> hpm upgrade <image> component <0|1|2>

Components 0 and 1 are the BMC images; 2 is the PNOR.
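
In case it helps, here is a rough sketch of that command wrapped in a small shell function, so the component number is picked by name rather than typed by hand. The function name `hpm_upgrade`, the `bmc0`/`bmc1`/`pnor` labels and the `DRY_RUN` variable are my own assumptions for illustration; only the ipmitool invocation itself and the 0/1/2 numbering come from the P8 documentation quoted above. It defaults to printing the command (password included) instead of running it.

```shell
#!/bin/sh
# Sketch only: wraps the ipmitool HPM upgrade command quoted above.
# Assumed names (not from any documentation): hpm_upgrade, DRY_RUN,
# and the bmc0/bmc1/pnor labels. Component numbering (0, 1 = BMC
# images; 2 = PNOR) is taken from the comment above.

hpm_upgrade() {
    # usage: hpm_upgrade <bmc-ip> <user> <password> <image> <bmc0|bmc1|pnor>
    if [ "$#" -ne 5 ]; then
        echo "usage: hpm_upgrade <bmc-ip> <user> <password> <image> <bmc0|bmc1|pnor>" >&2
        return 2
    fi

    # Map the human-readable target to the component number.
    case "$5" in
        bmc0) component=0 ;;
        bmc1) component=1 ;;
        pnor) component=2 ;;
        *) echo "unknown target: $5" >&2; return 2 ;;
    esac

    # Rebuild the positional parameters as the full command line.
    set -- ipmitool -H "$1" -z 20000 -I lanplus -U "$2" -P "$3" \
        hpm upgrade "$4" component "$component"

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only print the command so it can be checked first.
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `hpm_upgrade <IP> <user> <password> <image> pnor` prints the command for the PNOR component; running the same call with `DRY_RUN=0` set in the environment would actually invoke ipmitool.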

@adeelleo
Author

adeelleo commented Dec 27, 2023 via email

@dcrowell77
Contributor

Without the BMC's IP address your options are pretty limited. The entire service model is based around the BMC. Note that the BMC should have a completely separate ethernet connection compared to the "system" itself. The PDF at https://public.dhe.ibm.com/systems/power/docs/hw/p8/p8eik_install_8335.pdf has a good diagram in Figure 17. Use the left Ethernet port for the BMC/IPMI interface (as eth0). Use the right Ethernet port for any direct OS usage (as eth1). Once you get BMC access again there are a few things you can try.

Do you see multiple failed boot attempts on each power on? There are multiple sides to the PNOR and a golden side fallback that is supposed to kick in to recover from failures like this.

@adeelleo
Author

adeelleo commented Jan 3, 2024

Thanks for your time.

I am aware of the separate BMC port, and that is the one I am connected to. I know because this port provides the display output over the serial connection with the machine.

The only issue is that I am unable to establish an IPMI connection, since I don't remember the IP address or hostname of the machine. I tried sniffing the network connection with Wireshark but was not successful in detecting any IP address.

I only see the same boot failure message shown in the screenshot attached to my first message.

Is there a way to manually switch to the golden side of the PNOR image on this machine?

@dcrowell77
Contributor

The BMC is where all of the control is; there are no other external interfaces. If you can't get into the BMC somehow, there isn't much you can do. Have you gone through all of the service documents at the page I posted? There might be some other way of getting into the BMC. I'm pretty sure there is a raw serial port somewhere that you can use for BMC (vs. host) access.
