Chapter 9. Troubleshooting, hints, and tips
This chapter contains information about how to troubleshoot your SONAS system. It includes
ways to check your system health through the GUI and to collect logs with CLI commands.
526 SONAS Implementation and Best Practices Guide
9.1 Introduction to troubleshooting
SONAS provides mechanisms to monitor system health and perform health
checks. You can view the logs and collect them for further analysis to understand and resolve
any problems. The SONAS GUI and CLI display the system logs, which consist of system alerts,
warnings, and events. With SONAS 1.3, the audit logs are also included.
Both the GUI and CLI have provisions for a health check. You can monitor the health of the
nodes to see whether they have any warning or alert messages, and then check the logs for
more detailed debugging information.
SONAS also allows you to dump all the logs into a single package, which can then be
uploaded and sent to IBM support for analysis. This is described in more detail in this chapter.
Additional troubleshooting methods such as Assist on Site (AOS) and Call Home are also
explained.
In this chapter, we cover the following topics:
Monitoring SONAS system details:
- Event logs
- Audit logs
Troubleshooting the SONAS System
Assist on Site
Call Home
Collect Logs
Upload Logs
9.2 Monitoring SONAS system in detail
In the SONAS system there are multiple ways to monitor the system. Most of the monitoring
tools can be found in the GUI under the Monitoring navigation panel. In this chapter, we describe
some of the ways to monitor your SONAS solution.
9.2.1 Event logs
The event log displays events collected using CIM agents or SNMP traps. The event log
collects CIM and SNMP event data from all the components in the system, and stores it
on the active Management node. It represents the history of all of the events that have
occurred in the system. The log displays the events, which are managed in the health center.
Events include successful and failed login and logout attempts by the GUI, external SSH,
keyboard, and modem. A filter mask can be used to reduce the number of events that are
displayed. The available filter attributes provide the ability to filter by:
Severity
Time period
Source accessed by GUI and CLI
GUI access
Navigate to Monitoring > Events. In the upper right corner, you can find the filter options.
You can filter by originating device. If you have issues with Storage node 1, you can filter the
view to show events only from that node, which makes searching through events easier. In
Figure 9-1, the event log shows current Critical/Warning events from all nodes.
Figure 9-1 Listing current Critical/Warning events from all nodes
CLI access
Event logs can also be accessed by the CLI. After you log in to the active Management node,
issue the lslog command. Example 9-1 shows the help option for the lslog command.
Example 9-1 Help option for lslog command
Because the event log listing in the CLI can be hard to read, we recommend using certain
parameters to reduce clutter on the screen. All of these parameters are listed in the help
example. For instance, if there is a problem with Storage node 1, we recommend looking at the
event log for that node in particular, using the -n parameter followed by the node name.
Example 9-2 shows how to filter the log for a specific node, in this case Storage node 1.
Example 9-2 Listing event log entries for Storage node 1
For Storage node 1 we can see four entries. The event log is very useful in directing IBM
support to the root of the problem.
9.2.2 Audit logs
Audit logs provide detailed information about which actions were performed on the system and
by which type of user. They can be accessed by the GUI or CLI.
GUI access
Follow this procedure:
Navigate to Access and select Audit Log. Figure 9-2 shows Audit log entries. Entries can be
filtered by time: hours, minutes, or days. In the Audit log, we can see commands executed by
users. If the origin is the GUI, it shows the corresponding CLI command for actions made
in the GUI. It also shows the result of each action.
Figure 9-2 Audit log entries shown in GUI
CLI access
Follow this procedure:
Audit logs can also be accessed by the CLI. After we log in to the active Management node,
we issue the lsaudit command. Example 9-3 shows the help option for the lsaudit
command.
Example 9-3 Help option for lsaudit command
Audit logs can also be deleted. However, this is highly inadvisable, because the logs can provide
valuable clues to IBM support personnel in case of any problems.
9.3 Troubleshooting: System details
Here we describe checking system health by the GUI and CLI. You can check node state
details and important cluster components such as CTDB and GPFS.
9.3.1 System details in the GUI
In the GUI we can monitor system state through different menus. Under Monitoring >
System Details we can see details regarding the entire SONAS system. In this view a
navigational panel is opened, as shown in Figure 9-3. From this panel, select the part of the
SONAS system that you would like to check. For easier explanation we divide this view into
three categories:
Interface and Management nodes
Storage nodes
InfiniBand/Ethernet switches
Figure 9-3 Navigational panel in Monitoring System details
9.3.2 Details for Interface nodes and Management nodes
For these nodes, we can monitor the basic hardware state, the operating system, and the
cluster-connected services. When we click a node name, we see an overview of that node,
as shown in Figure 9-4 for Management node 1. This overview
includes:
Name: Node name
Status: Latest status
Rack identifier: Rack number where the node is located
Unit location: Rack position
Serial number: Node serial number
Build version: SONAS code level
Event view: View of events associated with this node
Figure 9-4 Basic view of mgmt001st001 node view
For each Interface and Management node we can see details for Hardware, Operating
System, and SONAS-related services. Under Hardware, we can view details of:
Motherboard
CPU
Fan
HDD
Memory Modules
Power
Network Card
In the Operating System section:
Computer System Details: Displays details for the computer system
Operating System Details: Displays details for the operating system
Local File System: Displays details for the file systems that reside locally on this node
In SONAS-related services, the view is divided into three sections:
Network: All network connections, internal and external, are displayed here. We can also
check current throughput and more. See Figure 9-5 for details.
Figure 9-5 Network details for internal and external connections
NAS services:
Figure 9-6 shows the details of the listed services.
Figure 9-6 NAS Services details listed
Status:
In Figure 9-7 you see the listed services needed for normal SONAS operations from the
Interface and Management nodes.
Figure 9-7 Status view for interface and Management node
Details for Storage nodes:
The detailed information for Storage nodes is similar to the information presented for the
Interface and Management nodes (this applies to the Hardware and Operating System
information). The main difference is in the Status view, as shown in Figure 9-8. Note that
the Storage node runs neither CTDB nor the other share-related services that are needed
on Interface and Management nodes.
Figure 9-8 Storage node status view
Under Services you see only GPFS status as shown in Figure 9-9.
Figure 9-9 Storage nodes Services view
Another main difference is that Storage nodes are grouped together in Storage Building
Blocks. Each Storage Building Block is made up of two Storage nodes and the storage disk
systems connected to those nodes. Here you can also check the health of the connected storage
systems and disks.
If you select the Storage nodes option in the System Details menu, you see output as shown
in Figure 9-10.
Figure 9-10 Storage nodes basic view
Under the Status tab we see the combined status of the Storage node pair, in our case
Storage nodes 1 and 2, as shown in Figure 9-11.
Figure 9-11 Status view for both Storage nodes
In addition to Storage node details, you can also check details of the connected storage
controller and the disks it serves. Figure 9-12 shows basic Storage Controller information.
Figure 9-12 Storage controller basic information view
By selecting the controller name, you see a detailed view of the Storage Controller as shown
in Figure 9-13.
Figure 9-13 Status view for Storage Controller
For the disk view, select Disks from the navigational panel in Monitoring under System
Details. See Figure 9-14 for details. This figure shows only a small portion of the large amount
of data available in the GUI.
Figure 9-14 Disk view
Details for InfiniBand and Ethernet switches:
Here we can check the status of the InfiniBand and Ethernet switches. The view is divided into
a basic component information part and a detailed view for a particular component. See
Figure 9-15 for Ethernet switch and Figure 9-16 for InfiniBand switch basic information.
Figure 9-15 Ethernet switch basic information
Figure 9-16 InfiniBand switch basic information
In addition, by selecting the Status option below the switch, we can see a detailed view for
every switch. Figure 9-17 shows details for the Ethernet switch and Figure 9-18 for the
InfiniBand switch.
Figure 9-17 Detailed view for Ethernet switch
Figure 9-18 Detailed view for InfiniBand switch
9.4 Assist on Site (AOS)
Assist On-Site is a lightweight remote support program intended primarily for help desks and
support engineers to diagnose and fix problems without the need of any external
dependencies. Assist On-Site is based on the IBM Tivoli Remote Control technology.
Figure 9-19 is the main window for AOS. Here we click the Create new session option.
Figure 9-19 AOS pop-up window to create a new session
9.4.1 Creating a new session
We have multiple options for creating a new session:
List AOS Targets: This option can be used when the system triggers a call home, but it has
to be configured first.
Create HTTP Link: This option creates an HTTP link and a pass code. This information
is then provided to the person on site. It connects to the Relay Server.
Join a session: This option lets us join an already running session.
Create new session: This option generates only a pass code. This pass code is
provided to the client to enter on the Assist On-Site Support website.
9.4.2 AOS session modes
Assist On-Site can establish remote connections for support sessions in different modes. You
choose the session mode after joining the support session or the support engineer can
request that you change the mode during the support session. The type of session or the
permissions associated with the support engineer also determine the session modes that are
available during sessions.
When Assist On-Site administrators create a team or a user, they can select the default
permissions for that team or user including the set of session modes that are available. For
example, a team might have default permissions to run sessions in View Only and Chat Only
session modes. Customers can further restrict the session mode when they consent to
sessions:
1. Chat Only mode: This session mode allows the support engineer to chat with the
customer in the Chat window, but does not allow the support engineer to view the target
system or have any control of the target mouse or keyboard.
The Chat window allows the support engineer to chat with you within another session
mode and provides an additional form of contact. The IBM support engineer can also
request, or you can change to the Chat Only mode during a support session.
2. View Only mode: In this session mode the support engineer can view the target system,
but cannot control the target mouse or
keyboard. In the View Only mode, the support engineer can select and mark areas of the
target desktop using the Remote Support Console tools. The support engineer can also
request, or you can change to the View Only mode during a support session.
3. Guidance mode: This session mode allows the support engineer to view the target system
and direct the client to perform tasks on the target system, but does not allow the support
engineer to have any control of the target mouse or keyboard. The support engineer can
use the Guidance mode symbols, Remote Support Console tools, and the chat function to
direct you through any task to perform on the target. The Guidance mode is often used in
training situations and in workplaces of very high sensitivity.
4. Shared Control mode: This session mode allows the support engineer to view the target
system and to have input control of the target mouse and keyboard. During a support
session, the support engineer can turn on local input control to perform actions on the
support engineer's machine rather than the target machine. The actions of the customer
take precedence over the actions performed through the Remote Support Console. When
you use the mouse or the keyboard, the input control icon changes to indicate that input
control in the Remote Support Console is temporarily blocked until you stop using the
mouse or the keyboard. The support engineer can use the Remote Support Console tools,
such as the drawing tools, to select and mark areas of the target desktop. The support
engineer can request, or you can change to the Shared Control mode during a support
session.
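The four session modes differ mainly in whether the support engineer can see the target system and whether the engineer can drive its mouse and keyboard. As a compact summary, the following Python sketch encodes that matrix as described above. The class, field, and function names are illustrative only and are not part of any Assist On-Site interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionMode:
    """Capabilities granted to the support engineer (illustrative model)."""
    name: str
    can_view_target: bool      # engineer sees the target desktop
    can_control_input: bool    # engineer drives the target mouse and keyboard


# Capability matrix taken from the mode descriptions in this section.
SESSION_MODES = {
    mode.name: mode
    for mode in (
        SessionMode("Chat Only", can_view_target=False, can_control_input=False),
        SessionMode("View Only", can_view_target=True, can_control_input=False),
        SessionMode("Guidance", can_view_target=True, can_control_input=False),
        SessionMode("Shared Control", can_view_target=True, can_control_input=True),
    )
}


def modes_without_remote_control():
    """Return the names of modes in which the engineer cannot control input."""
    return [m.name for m in SESSION_MODES.values() if not m.can_control_input]
```

Only Shared Control gives the engineer input control, which is why it is the mode a customer is most likely to restrict when consenting to a session.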
9.5 Call home
The IBM SONAS cluster currently supports automatic, electronic call home messaging by
using a configured data path and the Management node.
The IBM SONAS cluster initiates call home messages against the machine type/model and
serial number of the hardware component that triggered the error. The call home messages
contain error codes that provide specifics on the problem. The following list shows the valid
machine types and models for call home messaging:
2851-SIx - Interface nodes
2851-SM1 - Management nodes
2851-SSx - Storage nodes
2851-DR1 - Storage controller
2851-I36 - 36-port InfiniBand switch
2851-I96 - 96-port InfiniBand switch
9.5.1 Call home caveats
The SONAS call home messaging process has the following caveats:
2851-DE1 storage expansion unit enclosure errors issue a call home message against the
parent 2851-DR1 storage controller enclosure. Ethernet switch errors issue a call home
message against the active Management node's machine type/model and serial number.
Frame assembly components lack an interface that supports call home messaging. Other
than the Ethernet switches, the only frame hardware that could be tied to system issues
is the power distribution units (PDUs), through power failures. In the event of a PDU
failure, a call home message is generated against a component that is plugged in to the
failing PDU.
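When reading a call home message, these caveats mean that the component named in the message is not always the one that failed. The following Python sketch restates the redirection rules above as a lookup. The dictionary keys and the function name are hypothetical labels chosen for illustration, and the sketch assumes that any component not listed calls home under its own identity.

```python
# Components that cannot raise a call home under their own identity,
# mapped to the component the message is raised against (per the caveats).
REPORTED_AGAINST = {
    "2851-DE1 storage expansion unit": "parent 2851-DR1 storage controller",
    "Ethernet switch": "active Management node",
    "PDU": "a component plugged in to the failing PDU",
}


def call_home_target(failing_component: str) -> str:
    """Return the component a call home message is issued against.

    Components not listed in REPORTED_AGAINST are assumed to call home
    against their own machine type/model and serial number.
    """
    return REPORTED_AGAINST.get(failing_component, failing_component)
```

For example, a call home raised against a 2851-DR1 controller can originate either from the controller itself or from one of its 2851-DE1 expansion enclosures, so both should be checked.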
9.5.2 Enabling and disabling call home
SONAS call home messages include an 8-character error code that indicates the problem
that occurred. The call home feature is not mandatory. It can be turned on or off.
1. Turning call home on or off by the GUI: To check and, if needed, change the call home
setting in the GUI, navigate to Settings and select Support. In the panel shown in Figure 9-20
we can enable or disable the call home feature and enter the parameters that call home needs.
Figure 9-20 Call home options in SONAS GUI
2. Turning call home on or off by the CLI: To change the configuration of the call home
feature, issue the cfgcallhome command. Example 9-4 shows the help option for this command.
Example 9-4 Help option for cfgcallhome command in CLI
9.6 Collecting logs
SONAS allows you to collect a dump of the logs from all the nodes in the cluster into a
single package. This package can be sent to IBM support for further analysis and debugging
of any issue.
9.6.1 Collecting logs using the CLI
The srvdump command is used to manage dump files, including generation, listing, deletion,
and sending to call home or media devices. Dump files are used to assist IBM support in
performing problem determination and resolution.
The help for the srvdump command is shown in Example 9-5.
Example 9-5 CLI command srvdump to collect dump of logs
The final package, which is a tarball, is created, zipped, and stored in the /ftdc directory.
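Before sending a package, it can be useful to confirm what it contains without unpacking it. The following Python sketch lists the top-level entries of a tar package using the standard tarfile module. The function name is hypothetical, and the package file name and compression type are not specified here, so the sketch lets tarfile detect the compression.

```python
import tarfile


def top_level_entries(package_path):
    """Return the sorted top-level names inside a (compressed) tar package."""
    names = set()
    # Mode "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(package_path, "r:*") as archive:
        for member in archive.getnames():
            # Ignore empty and "." path components such as a leading "./".
            parts = [p for p in member.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)
```

Called with the path of a package in /ftdc, the function returns one name per top-level directory or file in the archive.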
9.6.2 Collecting logs using the GUI
To collect the logs using the GUI, access the Support menu by clicking the Support
submenu under Settings, as shown in Figure 9-21.
Figure 9-21 Support Menu option in GUI
When you click the link, you see three different options, as shown in Figure 9-22. To
collect logs, click the Download Logs option in the left frame. The other options in
that frame are explained in the previous sections. When you click the link to download
logs, you see a page with a Download Support package button. Click this button to
start the download of the logs.
Figure 9-22 Download Support Package using GUI
The collection of logs is initiated on every node. The CLI command is run to gather all the
logs. The final package, which is a tarball, is created, zipped, and downloaded to the client.
9.7 Overview of logs
The package that is prepared is a collection of logs from different components of SONAS.
The package contains different directories for each of the nodes. Example 9-6 shows the first
level of logs where you can see each node is represented as a directory. All the logs for each
of the nodes are collected under this directory. There are some logs which are collected for
each node and some which are node-specific. We see them in detail in the sections to follow.
Also, along with node-specific logs, there are some common logs which are cluster-wide.
Details for each are provided later.
Example 9-6 Directories in the srvdump package
Looking closer, the files in Example 9-7 are node-specific logs. All Interface nodes,
Management nodes, and Storage nodes are seen as directories named
"z_<node-name>".
Each of these directories has all the logs in the same directory structure as on the
node. You also see the log of the command that was run to collect the logs on each node.
These logs have "*.cndump_node_log" and "*.cngetlogs_log" as part of their names. See
Example 9-7.
Example 9-7 List of node-specific logs in the srvdump package
The files in Example 9-8 are the cluster-wide logs. These logs are not per node. Most of them
are cluster or system details and cluster manager details. We provide information about each
in detail in later sections.
Example 9-8 List of cluster-wide logs in the srvdump package
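Once a package has been unpacked, the split between per-node directories and cluster-wide files can be checked quickly. The following is a minimal Python sketch, assuming the package was extracted to a local directory and that node directories follow the z_<node-name> convention described above. The function name is hypothetical.

```python
from pathlib import Path


def split_package_entries(package_root):
    """Split top-level entries into node directories and cluster-wide files.

    Node directories are assumed to be named "z_<node-name>"; every other
    top-level entry is treated as cluster-wide.
    """
    node_dirs = {}
    cluster_wide = []
    for entry in sorted(Path(package_root).iterdir()):
        if entry.is_dir() and entry.name.startswith("z_"):
            # Key the directory by node name, without the "z_" prefix.
            node_dirs[entry.name[len("z_"):]] = entry
        else:
            cluster_wide.append(entry.name)
    return node_dirs, cluster_wide
```

The first return value maps each node name to its directory, so a missing node (for example, one that was down during collection) shows up as a missing key.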
9.7.1 Node-specific logs
In this section, we look at the details of the node-specific logs, one by one. Some of them are
specific to Interface nodes, Storage nodes, or Management nodes. Some can be seen for every
type of node.
GPFS logs
The GPFS logs from the cluster are collected. GPFS logs are usually stored in the path
/var/adm/ras/mmfs.log.* on each node. The same directory structure is used in the
package, inside the directory that is created for each node.
GUI logs
The GUI is mostly used by administrators to perform administrative tasks and to
monitor the system. If an operation run from the GUI fails, the GUI logs
can be examined for problem determination. These logs are SONAS-specific logs.
The GUI logs are a collection of logs for operations or events in the GUI. They can contain
errors or warnings that occurred in the software code base, the database, or external commands.
We can modify the log level to increase or decrease the amount of logging.
Some of the key elements are the SONAS databases, CIM listeners, and the business logic that
handles administrative tasks.
The GUI logs are found only on the Management node. The path for the logs is:
/var/log/sofsgui/logs
CLI logs
The CLI is used by administrators to carry out their routine administrative tasks. If an
operation run from the CLI fails, the CLI logs can be examined for problem
determination. Like the GUI logs, these logs are SONAS-specific.
The CLI logs are a collection of logs for operations or events in the CLI. They can contain
errors or warnings that occurred in the software code base, the database, or external commands.
You can modify the log level to increase or decrease the amount of logging.
As with the GUI logs, some of the key elements are the SONAS databases, CIM listeners, and
the business logic that handles administrative tasks.
The CLI logs are found in the path "/var/log/sofsgui/logs". These logs are only found on the
Management node.
CTDB logs
The CTDB logs are logs from the CTDB component. CTDB logs per node, so each
Interface node and Management node that runs CTDB logs its own messages. In case of any
issue in the cluster manager, you can analyze these logs. Note that CTDB can become
unhealthy or banned because of issues in other components. The cluster manager tries to
manage the cluster, so if any component is not working well or has a critical error,
CTDB can become unhealthy. The problem is not always in CTDB itself. However, the logs
help you to see what might have gone wrong and why CTDB behaved a certain way.
The CTDB logs can be found as part of /var/log/messages on each Interface node and
Management node. You cannot find these logs on the Storage nodes, because they do not run the
CTDB service.
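Because the CTDB messages share /var/log/messages with other services, it helps to isolate them first. The following Python sketch filters a syslog file for lines that mention CTDB. The marker string "ctdb" is an assumption about how the daemon tags its lines, and the function names are hypothetical; adjust the marker to match what appears in your own log.

```python
def ctdb_lines(lines, marker="ctdb"):
    """Yield syslog lines that mention CTDB (case-insensitive match).

    The marker string is an assumption; adjust it to whatever tag the
    CTDB daemon uses in your /var/log/messages.
    """
    needle = marker.lower()
    for line in lines:
        if needle in line.lower():
            yield line.rstrip("\n")


def read_ctdb_messages(path="/var/log/messages"):
    """Return the CTDB-related lines from a syslog file."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, errors="replace") as handle:
        return list(ctdb_lines(handle))
```

The same functions work on the copy of /var/log/messages stored under a node directory in a collected package, by passing that path to read_ctdb_messages.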
Samba logs
Samba logs are stored at "/var/log/cnlog/samba/*". You can find the file log.smbd there. These
logs can have different levels, and the "log level" parameter in the Samba configuration file can
be modified to get more or less logging. These logs are found on each of the Management nodes
and Interface nodes. You can also find some Samba logs in the /var/log/messages file.
Winbind logs
The winbind logs are stored at /var/log/cnlog/samba/*. The files have winbindd in their names.
Some are related to authentication and ID mapping. These logs are found on each of the
Management nodes and Interface nodes. You can also find winbind logs in /var/log/messages
on the respective node.
System logs and kernel messages
The system logs and kernel messages are in the "/var/log/messages" file. These logs are on
all the nodes, namely, Management nodes, Interface nodes, and Storage nodes.
NFS server logs
The NFS server logs can be seen in the /var/log/messages file. These logs are only on the
Interface nodes.
httpd logs
The httpd logs can be found in the /var/log/httpd folder. These logs are found only on the
Interface nodes.
SSHD logs
These logs are collected from all the nodes, namely, Management nodes, Interface nodes, and
Storage nodes. You can find sshd logs in "/var/log/messages" and also in
"/var/log/secure".
vsftpd logs
The vsftpd logs can be found in the /var/log/messages file. These logs are found only on the
Interface nodes.
System check-out logs
The system check-out is run against the DDN, the InfiniBand switches, and the Ethernet switches.
You can check for warnings, errors, or even status logs. A log entry is written each time a
check is made and each time a state change occurs. This helps in tracking status and in
determining what happened in case of any errors.
Call home modules logs
These modules are used to send call home information to Retain. These logs are stored in
/var/log/cnlog.
Installation logs
The installation logs are written when the cluster or a node is being installed or
upgraded. The logs can be found in "/var/log/messages", "/var/log/anaconda*",
"/var/sonas/log", "/var/log/yum.log", and "/var/sonas/log/platform.log".
These logs are written on all nodes: Management nodes, Interface nodes, and Storage
nodes. These are all SONAS-specific logs.
CIM servers and providers logs
The CIM servers spawn SONAS CIM providers. These logs are written on all Interface nodes,
Storage nodes, and Management nodes. The log is saved at "/var/log/cnlog/ras/cim/*". You
can find two files, "ibmnas_cim.log" and "cimserver.trc".
The CIM provider gives status on SONAS hardware and software components and generates
indications. Some of the components it monitors are Network, Multipath, Service, Disk, CPU,
Samba, Checkout, SNMP, and VPD.
SNMP logs
The SNMP logs are used to collect traps from all hardware devices, typically IBM xSeries
iMM HW traps, Voltaire traps, DDN HW traps, Enet switch traps, and netsnmp (trap program).
The logs are "/var/log/ras/traphandlerlog", "/var/log/ras/snmptraplog",
"/var/log/ras/snmp2cim.log", and "/var/log/ras/snmp2cim_backlog.xml".
These are SONAS-specific logs and are collected from Storage nodes and the Management
node.
TSM and HSM client logs
For the TSM client or HSM client, the logs are at "/var/log/dsmerr.log". These logs are written
on the Interface nodes and Management nodes.
Other logs seen in cndump per node
If you look in the z_<node-name> directory in the cndump, you can see some additional logs. They
give you memory information, cluster manager status and statistics, registry content, some
GPFS configuration logs, and more. See Example 9-7.
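The per-component locations listed in this section can be collected into a small lookup, which makes it easier to see which logs to expect under a given node directory. The following Python sketch covers a subset of the components, using the paths and node roles stated above. The role labels and the function name are hypothetical.

```python
ALL_ROLES = frozenset({"interface", "management", "storage"})

# component -> (log path, node roles that hold the log), from this section.
LOG_LOCATIONS = {
    "GPFS": ("/var/adm/ras/mmfs.log.*", ALL_ROLES),
    "GUI": ("/var/log/sofsgui/logs", frozenset({"management"})),
    "CTDB": ("/var/log/messages", frozenset({"interface", "management"})),
    "Samba": ("/var/log/cnlog/samba/*", frozenset({"interface", "management"})),
    "httpd": ("/var/log/httpd", frozenset({"interface"})),
    "CIM": ("/var/log/cnlog/ras/cim/*", ALL_ROLES),
    "TSM/HSM client": ("/var/log/dsmerr.log", frozenset({"interface", "management"})),
}


def logs_for_role(role):
    """Return {component: path} for the logs expected on a node role."""
    if role not in ALL_ROLES:
        raise ValueError(f"unknown node role: {role!r}")
    return {
        component: path
        for component, (path, roles) in LOG_LOCATIONS.items()
        if role in roles
    }
```

For a Storage node, the lookup returns only the GPFS and CIM entries from this subset, which reflects that Storage nodes do not run CTDB, Samba, or the other share-related services.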
9.7.2 Cluster-wide logs
The cluster-wide logs that are also contained in cndump are the cluster manager logs, such as
the CTDB status, CTDB statistics, CTDB uptime, and more.
You can also see SONAS-specific information such as the list of nodes and their roles, along
with the GPFS cluster and the version number of the cluster. See Example 9-8.
9.8 Uploading logs to IBM support
When IBM support requests a cndump, it needs to be provided to IBM. One way is to
upload it to the IBM ECuRep data portal. ECuRep was established as a data repository for
system problems. From the ECuRep portal, IBM support personnel working on a reported
problem can access the information and logs provided by the customer.
To upload data to ECuRep, use your web browser and navigate to: