sandra_henrystocker
Unix Dweeb

Unix tip: Finding a fault when your server’s shut down

Analysis
Feb 11, 2009 (7 mins)
Computers and Peripherals, Data Center, IT Leadership

One of the most useful commands to run when you suspect a failing component on your Sun server is showboards, which displays the status of every board in your system. With the -ev arguments, the command provides more complete information and also tells you about empty slots. The showboards -ev output starts with a display like the one shown below. In this case, you can see that the PCI I/O Board has failed (look at the next-to-last line of the output).

lom>showboards -ev

Slot     Pwr Component Type                 State      Status     
----     --- --------------                 -----      ------     
SSC1     On  System Controller              Main       Passed     
/N0/SCC  -   System Config Card             Assigned   OK         
/N0/BP   -   Baseplane                      Assigned   Passed     
/N0/SIB  -   Indicator Board                Assigned   Passed     
/N0/SPDB -   System Power Distribution Bd.  Assigned   Passed     
/N0/PS0  On  D142 Power Supply              -          OK         
/N0/PS1  On  D142 Power Supply              -          OK         
/N0/PS2  On  D142 Power Supply              -          OK         
/N0/PS3  On  D142 Power Supply              -          OK         
/N0/FT0  On  Fan Tray                       Auto Speed Passed     
/N0/RP0  On  Repeater Board                 Assigned   OK         
/N0/RP2  On  Repeater Board                 Assigned   OK         
/N0/SB0  On  CPU Board                      Assigned   Passed     
/N0/SB2  On  CPU Board                      Assigned   Passed     
/N0/SB4  On  CPU Board                      Assigned   Passed     
/N0/IB6  On  PCI I/O Board                  Active     Failed
/N0/MB   -   Media Bay                      Assigned   Passed 
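
If you capture this output to a file, a few lines of Python can pick out the rows that need attention. This is only a rough sketch: the function name is made up for illustration, the sample is an abridged copy of the listing above, and it assumes that any status other than Passed or OK is worth a look.

```python
# Abridged copy of the showboards -ev listing shown above.
SAMPLE = """\
Slot     Pwr Component Type                 State      Status
----     --- --------------                 -----      ------
SSC1     On  System Controller              Main       Passed
/N0/SCC  -   System Config Card             Assigned   OK
/N0/FT0  On  Fan Tray                       Auto Speed Passed
/N0/SB0  On  CPU Board                      Assigned   Passed
/N0/IB6  On  PCI I/O Board                  Active     Failed
/N0/MB   -   Media Bay                      Assigned   Passed
"""

# Assumption: these are the only "healthy" values in the Status column.
HEALTHY = {"Passed", "OK"}


def unhealthy_boards(text):
    """Return (slot, status) pairs for rows whose status is not healthy."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed rule under it.
        if len(fields) < 4 or fields[0] in ("Slot", "----"):
            continue
        slot, status = fields[0], fields[-1]
        if status not in HEALTHY:
            problems.append((slot, status))
    return problems


if __name__ == "__main__":
    for slot, status in unhealthy_boards(SAMPLE):
        print(slot, status)
```

The sketch reads only the first and last fields of each row, so it does not care that the component type and state columns contain embedded spaces.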

The showboards display is followed by a series of additional tables. For example, the memory in the system might be displayed in a series of lines like those below that describe its configuration:

Component         J-No.   Size      Reason                                  
---------         -----   ----      ------                                  
/N0/SB0/P0/B0/D0  J13300  512 MB                                            
/N0/SB0/P0/B0/D1  J13400  512 MB                                            
/N0/SB0/P0/B0/D2  J13500  512 MB                                            
/N0/SB0/P0/B0/D3  J13600  512 MB                                            
/N0/SB0/P0/B1     -       -         DRAM DIMM Group 1 Empty
...

Additional components are listed as well. The first few lines of the following table contain another reference to the failed I/O board:

Component   Segment Compatible In Date       Time  Build Version                
---------   ------- ---------- -- ----       ----  ----- -------                
SSC1/FP0    -       -          -  -          -     -     RTOS version: 38       
SSC1/FP1    ScApp   Reference  12 03/05/2004 11:32 6.4   5.17.0                 
SSC1/FP1    Ver     -          -  03/05/2004 11:32 6.4   5.17.0                 
/N0/IB6     -       -          -  -          -     -     Skipping failed board

Some of the tables list part numbers. In the table below, you can see the part numbers for the CPU boards. The output for this particular system included 64 “SDRAM DIMM” lines like the last line shown below; at 512 MB each, that adds up to 32 GB of memory.

Component         Part #         Serial #  Description                         
---------         ------         --------  -----------                         
/N0/SB0           540-5467-02-51 A24505    CPU Board (1280)                    
/N0/SB2           540-5467-02-51 A23126    CPU Board (1280)                    
/N0/SB4           540-5467-02-51 A26432    CPU Board (1280)                    
/N0/SB0/P0/B0/D0  501-5030-03-50 734814    512 MB NG SDRAM DIMM
...
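
The memory arithmetic is easy to automate if you have saved the full listing. The sketch below is an illustration, not a tested tool: the function name is hypothetical, and because the listing above is truncated, the one real DIMM line is simply repeated 64 times as a stand-in for the other rows.

```python
import re

# Matches the size on a DIMM row such as "... 512 MB NG SDRAM DIMM".
DIMM_RE = re.compile(r"(\d+)\s+MB\b.*SDRAM DIMM")


def total_dimm_mb(text):
    """Add up the sizes (in MB) of every SDRAM DIMM line in the text."""
    total = 0
    for line in text.splitlines():
        match = DIMM_RE.search(line)
        if match:
            total += int(match.group(1))
    return total


# The only DIMM row visible in the article's listing.
real_line = "/N0/SB0/P0/B0/D0  501-5030-03-50 734814    512 MB NG SDRAM DIMM"

# Stand-in data: the same row repeated 64 times, NOT the real inventory.
stand_in = "\n".join([real_line] * 64)

if __name__ == "__main__":
    total_mb = total_dimm_mb(stand_in)
    print(total_mb // 1024, "GB")
```

Run against a real capture, the same function would also cope with a mix of DIMM sizes, since it sums whatever size each line reports.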

The showenvironment command reports temperatures, voltages, fan status and so on. In this partial listing, we again see evidence of the I/O board failure: several temperature sensors on IB6 show no reading (????) and a status of failed.

lom>showenvironment -ltuvw
...
PCI I/O Board 6
Slot    Device    Sensor    Min    LoWarn Value  HiWarn Max    Units     Age     Status
------- --------- --------- ------ ------ ------ ------ ------ --------- ------- ------
/N0/IB6 Board 0   1.5 VDC 0   1.35   1.42   1.50   1.57   1.65 Volts DC    6 sec OK
/N0/IB6 Board 0   3.3 VDC 0   2.97   3.13   3.31   3.46   3.63 Volts DC    6 sec OK
/N0/IB6 Board 0   5 VDC 0     4.50   4.75   4.98   5.25   5.50 Volts DC    6 sec OK
/N0/IB6 Board 0   Temp. 0      -12     -2     30     82     87 Degrees C   6 sec OK
/N0/IB6 Board 0   Temp. 1      -12     -2     31     82     87 Degrees C   6 sec OK
/N0/IB6 Board 0   12 VDC 0   10.80  11.40  11.88  12.60  13.20 Volts DC    6 sec OK
/N0/IB6 Board 0   3.3 VDC 1   2.97   3.13   3.30   3.47   3.63 Volts DC    6 sec OK
/N0/IB6 Board 0   3.3 VDC 2   2.97   3.13   3.28   3.47   3.63 Volts DC    6 sec OK
/N0/IB6 Board 0   Core 0      1.62   1.71   1.84   1.89   1.98 Volts DC    6 sec OK
/N0/IB6 Board 0   2.5 VDC 0   2.25   2.37   2.53   2.62   2.75 Volts DC    6 sec OK
/N0/IB6 Fan 0     Cooling 0               High                             6 sec OK
/N0/IB6 Fan 1     Cooling 0               High                             6 sec OK
/N0/IB6 SDC 0     Temp. 0      -12     -2     74    102    107 Degrees C   7 sec OK
/N0/IB6 AR 0      Temp. 0      -12     -2   ????    102    107 Degrees C  25 min failed
/N0/IB6 DX 0      Temp. 0      -12     -2     70    102    107 Degrees C   7 sec OK
/N0/IB6 DX 1      Temp. 0      -12     -2     62    102    107 Degrees C   7 sec OK
/N0/IB6 SBBC 0    Temp. 0      -12     -2   ????    102    107 Degrees C  25 min failed
/N0/IB6 IOASIC 0  Temp. 0      -12     -2   ????    102    107 Degrees C  25 min failed
/N0/IB6 IOASIC 1  Temp. 1      -12     -2   ????    102    107 Degrees C  25 min failed
...
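
With a long environment listing, the failed sensors are easy to miss by eye. Here is a minimal Python sketch that lists them from saved output. The function name is invented for this example, the sample rows are a subset of the listing above, and the code makes two simplifying assumptions: sensor rows start with a slot name beginning with "/", and the device name is always two words (such as "AR 0").

```python
# A few rows copied from the showenvironment listing above.
SAMPLE = """\
/N0/IB6 Board 0   Temp. 0      -12     -2     30     82     87 Degrees C   6 sec OK
/N0/IB6 Fan 0     Cooling 0               High                             6 sec OK
/N0/IB6 SDC 0     Temp. 0      -12     -2     74    102    107 Degrees C   7 sec OK
/N0/IB6 AR 0      Temp. 0      -12     -2   ????    102    107 Degrees C  25 min failed
/N0/IB6 SBBC 0    Temp. 0      -12     -2   ????    102    107 Degrees C  25 min failed
"""


def failed_sensors(text):
    """Return (slot, device) pairs for sensor rows whose status is not OK."""
    failed = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: sensor rows begin with a slot path such as /N0/IB6.
        # Headers, rules, section titles and "..." lines are skipped.
        if len(fields) < 4 or not fields[0].startswith("/"):
            continue
        if fields[-1].lower() != "ok":
            # Assumption: the device name is two words, e.g. "AR 0".
            failed.append((fields[0], " ".join(fields[1:3])))
    return failed


if __name__ == "__main__":
    for slot, device in failed_sensors(SAMPLE):
        print(slot, device)
```

Checking only the final Status field keeps the sketch independent of the column spacing, which tends to get mangled when console output is pasted into email or a ticket.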

The showlogs command displays logged messages, as in the excerpt shown below.

lom>showlogs -v

Sat Feb 07 13:14:18 pinto2alom lom: [ID 739794 local0.error] SBBC Port is disabled: IB6.sbbc0.sram.388  
(11900388)
Sat Feb 07 13:14:46 pinto2alom lom: [ID 953312 local0.error] RepeaterEpld.maskEpldErrors:  
sun.serengeti.CommException: SBBC Port is disabled: IB6.epld.a (118e000a)
Sat Feb 07 13:14:47 pinto2alom lom: [ID 863407 local0.error] RepeaterEpld.maskEpldErrors:  
sun.serengeti.CommException: SBBC Port is disabled: IB6.epld.16 (118e0016)
Sat Feb 07 13:14:47 pinto2alom lom: [ID 390126 local0.error] Keyswitch.detachBoards:  
sun.serengeti.HpuFailedException: RepeaterHpu.setArbSync: sun.serengeti.FailedHwException: AR Port is  
disabled: IB6.ar.60 (12c80060)
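
When the log is long, counting how often each board is named in the messages can point you toward the likely culprit. The following Python sketch does that for the excerpt above. Treat it as an assumption-laden illustration: the function name is hypothetical, and the pattern only looks for names that start with IB, SB or RP followed by digits and a dot, which is a guess based on the slot names in this article rather than a documented log format.

```python
import re
from collections import Counter

# The showlogs -v excerpt from above, with its line wrapping left as is.
SAMPLE = """\
Sat Feb 07 13:14:18 pinto2alom lom: [ID 739794 local0.error] SBBC Port is disabled: IB6.sbbc0.sram.388
(11900388)
Sat Feb 07 13:14:46 pinto2alom lom: [ID 953312 local0.error] RepeaterEpld.maskEpldErrors:
sun.serengeti.CommException: SBBC Port is disabled: IB6.epld.a (118e000a)
Sat Feb 07 13:14:47 pinto2alom lom: [ID 863407 local0.error] RepeaterEpld.maskEpldErrors:
sun.serengeti.CommException: SBBC Port is disabled: IB6.epld.16 (118e0016)
Sat Feb 07 13:14:47 pinto2alom lom: [ID 390126 local0.error] Keyswitch.detachBoards:
sun.serengeti.HpuFailedException: RepeaterHpu.setArbSync: sun.serengeti.FailedHwException: AR Port is
disabled: IB6.ar.60 (12c80060)
"""

# Assumption: boards appear as IBn, SBn or RPn followed by a dot.
BOARD_RE = re.compile(r"\b(?:IB|SB|RP)\d+(?=\.)")


def board_mentions(text):
    """Count how many times each board name appears in the log text."""
    return Counter(BOARD_RE.findall(text))


if __name__ == "__main__":
    for board, count in board_mentions(SAMPLE).most_common():
        print(board, count)
```

Because the pattern is case-sensitive and requires digits right after the prefix, words like "SBBC" and "sbbc0" in the messages are not mistaken for board names.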

If you’re not sure what troubleshooting commands are available at the lom prompt on your system, type help. You’ll see a list something like this one, with a brief description of each command:

lom>help

bootmode           -- configure the way Solaris boots at the next reboot
break              -- send break to the Solaris console
console            -- connect to the Solaris console
disablecomponent   -- add a component to the blacklist
enablecomponent    -- remove a component from the blacklist
flashupdate        -- update firmware
help               -- show help for a command or list of commands
history            -- show command history
inventory          -- show seprom contents of a FRU/system
logout             -- logout from this connection
password           -- set the system controller (LOM) access password
poweroff           -- power off system or components 
poweron            -- power on system or components
reset              -- reset the Solaris system
resetsc            -- reset the system controller (LOM)
setalarm           -- set the alarm leds
setdate            -- set the date and time for the system
setescape          -- set system controller (LOM) escape sequence
seteventreporting  -- set event reporting
setlocator         -- set the system locator led
setls              -- set FRU location status
setupnetwork       -- setup system controller (LOM) network settings
setupsc            -- configure the system controller (LOM)
showalarm          -- show state of system alarms leds
showboards         -- show board information
showcomponent      -- show state of a component
showdate           -- show the current date and time for the system
showenvironment    -- show environmental information
showerrorbuffer    -- show the contents of the error buffer
showescape         -- show system controller (LOM) escape sequence
showeventreporting -- show status of event reporting
showfault          -- show state of system fault led
showhostname       -- show hostname
showlocator        -- show state of system locator led
showlogs           -- show the logs
showmodel          -- show the platform model
shownetwork        -- show system controller (LOM) network settings and MAC addresses
showresetstate     -- show CPU registers after reset
showsc             -- show system controller (LOM) version and uptime
shutdown           -- shutdown solaris and take to standby mode
testboard          -- test a CPU/Memory board

The showboards, showcomponent, showenvironment, showerrorbuffer and showlogs commands will display a lot of information about your system and are likely to highlight any hardware component on your system that is failing. Whether you send this information on to someone who services your equipment or use it yourself, it’s very useful to know how much data you have at your disposal when the system is powered down.


Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
