Skip to content

Latest commit

 

History

History
451 lines (296 loc) · 29.9 KB

35-Lustre File System Troubleshooting.md

File metadata and controls

451 lines (296 loc) · 29.9 KB

Lustre File System Troubleshooting

This chapter provides information about troubleshooting a Lustre file system, submitting a bug to the Jira bug tracking system, and Lustre file system performance tips. It includes the following sections:

Lustre Error Messages

Several resources are available to help troubleshoot an issue in a Lustre file system. This section describes error numbers, error messages and logs.

Error Numbers

Error numbers are generated by the Linux operating system and are located in /usr/include/asm-generic/errno.h. The Lustre software does not use all of the available Linux error numbers. The exact meaning of an error number depends on where it is used. Here is a summary of the basic errors that Lustre file system users may encounter.

Error Number Error Name Description
-1 -EPERM Permission is denied.
-2 -ENOENT The requested file or directory does not exist.
-4 -EINTR The operation was interrupted (usually CTRL-C or a killing process).
-5 -EIO The operation failed with a read or write error.
-19 -ENODEV No such device is available. The server stopped or failed over.
-22 -EINVAL The parameter contains an invalid value.
-28 -ENOSPC The file system is out-of-space or out of inodes. Use lfs df (query the amount of file system space) or lfs df -i (query the number of inodes).
-30 -EROFS The file system is read-only, likely due to a detected error.
-43 -EIDRM The UID/GID does not match any known UID/GID on the MDS. Update etc/hosts and etc/group on the MDS to add the missing user or group.
-107 -ENOTCONN The client is not connected to this server.
-110 -ETIMEDOUT The operation took too long and timed out.
-122 -EDQUOT The operation exceeded the user disk quota and was aborted.

Viewing Error Messages

As Lustre software code runs on the kernel, single-digit error codes display to the application; these error codes are an indication of the problem. Refer to the kernel console log (dmesg) for all recent kernel messages from that node. On the node, /var/log/messages holds a log of all messages for at least the past day.

The error message initiates with "LustreError" in the console log and provides a short description of:

  • What the problem is
  • Which process ID had trouble
  • Which server node it was communicating with, and so on.

Lustre logs are dumped to /proc/sys/lnet/debug_path.

Collect the first group of messages related to a problem, and any messages that precede "LBUG" or "assertion failure" errors. Messages that mention server nodes (OST or MDS) are specific to that server; you must collect similar messages from the relevant server console logs.

Another Lustre debug log holds information for a short period of time for action by the Lustre software, which, in turn, depends on the processes on the Lustre node. Use the following command to extract debug logs on each of the nodes, run

$ lctl dk filename

Note

LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.

Reporting a Lustre File System Bug

If you cannot resolve a problem by troubleshooting your Lustre file system, other options are:

  • Post a question to the lustre-discuss email list or search the archives for information about your issue.
  • Submit a ticket to the Jira* bug tracking and project management tool used for the Lustre project. If you are a first-time user, you'll need to open an account by clicking on Sign up on the Welcome page.

To submit a Jira ticket, follow these steps:

  1. To avoid filing a duplicate ticket, search for existing tickets for your issue. For search tips, see the section called “Searching Jirafor Duplicate Tickets”*.

  2. To create a ticket, click +Create Issue in the upper right corner. Create a separate ticket for each issue you wish to submit.

  3. In the form displayed, enter the following information:

    • Project - Select Lustre or Lustre Documentation or an appropriate project.
    • Issue type - Select Bug.
    • Summary - Enter a short description of the issue. Use terms that would be useful for someone searching for a similar issue. A LustreError or ASSERT/panic message often makes a good summary.
    • Affects version(s) - Select your Lustre release.
    • Environment - Enter your kernel with version number.
    • Description - Include a detailed description of visible symptoms and, if possible, how the problem is produced. Other useful information may include the behavior you expect to see and what you have tried so far to diagnose the problem.
    • Attachments - Attach log sources such as Lustre debug log dumps (see the section called “ Diagnostic and Debugging Tools”), syslogs, or console logs. Note: Lustre debug logs must be processed using lctl df prior to attaching to a Jira ticket. For more information, see the section called “Using the lctl Tool to View Debug Messages”.

Other fields in the form are used for project tracking and are irrelevant to reporting an issue. You can leave these in their default state.

Searching Jira*for Duplicate Tickets

Before submitting a ticket, always search the Jira bug tracker for an existing ticket for your issue. This avoids duplicating effort and may immediately provide you with a solution to your problem.

To do a search in the Jira bug tracker, select the Issues tab and click on New filter. Use the filters provided to select criteria for your search. To search for specific text, enter the text in the "Contains text" field and click the magnifying glass icon.

When searching for text such as an ASSERTION or LustreError message, you can remove NIDs, pointers, and other installation-specific and possibly version-specific text from your search string such as line numbers by following the example below.

Original error message:

"(filter_io_26.c: 791:filter_commitrw_write()) ASSERTION(oti->oti_transno<=obd->obd_last_committed) failed: oti_transno 752 last_committed 750 "

Optimized search string

filter_commitrw_write ASSERTION oti_transno obd_last_committed failed:

Common Lustre File System Problems

This section describes how to address common issues encountered with a Lustre file system.

OST Object is Missing or Damaged

If the OSS fails to find an object or finds a damaged object, this message appears:

OST object missing or damaged (OST "ost1", object 98148, error -2)

If the reported error is -2 (-ENOENT, or "No such file or directory"), then the object is no longer present on the OST, even though a file on the MDT is referencing it. This can occur either because the MDT and OST are out of sync, or because an OST object was corrupted and deleted by e2fsck.

If you have recovered the file system from a disk failure by using e2fsck, then unrecoverable objects may have been deleted or moved to /lost+found in the underlying OST filesystem. Because files on the MDT still reference these objects, attempts to access them produce this error.

If you have restored the filesystem from a backup of the raw MDT or OST partition, then the restored partition is very likely to be out of sync with the rest of your cluster. No matter which server partition you restored from backup, files on the MDT may reference objects which no longer exist (or did not exist when the backup was taken); accessing those files produces this error.

If neither of those descriptions is applicable to your situation, then it is possible that you have discovered a programming error that allowed the servers to get out of sync. Please submit a Jira ticket (see the section called “Reporting a Lustre File System Bug”).

If the reported error is anything else (such as -5, "I/O error"), it likely indicates a storage device failure. The low-level file system returns this error if it is unable to read from the storage device.

Suggested Action

If the reported error is -2, you can consider checking in lost+found/ on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can using dd conv=noerrorand delete it using the unlink command.

If the reported error is anything else, then you should immediately inspect this server for storage problems.

OSTs Become Read-Only

If the SCSI devices are inaccessible to the Lustre file system at the block device level, then ldiskfs remounts the device read-only to prevent file system corruption. This is a normal behavior. The status in /proc/fs/lustre/health_check also shows "not healthy" on the affected nodes.

To determine what caused the "not healthy" condition:

  • Examine the consoles of all servers for any error indications
  • Examine the syslogs of all servers for any LustreErrors or LBUG
  • Check the health of your system hardware and network. (Are the disks working as expected, is the network dropping packets?)
  • Consider what was happening on the cluster at the time. Does this relate to a specific user workload or a system load condition? Is the condition reproducible? Does it happen at a specific time (day, week or month)?

To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk.

Identifying a Missing OST

If an OST is missing for any reason, you may need to know what files are affected. Although an OST is missing, the files system should be operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as 'unavailable' so clients and the MDS do not time out trying to contact it.

  1. Generate a list of devices and determine the OST's device number. Run:

    $ lctl dl 
    

    The lctl dl command output lists the device name and number, along with the device UUID and the number of references on the device.

  2. Deactivate the OST (on the OSS at the MDS). Run:

    $ lctl --device lustre_device_number deactivate
    

    The OST device number or device name is generated by the lctl dl command.

    The deactivate command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.

    Note

    If the OST later becomes available it needs to be reactivated, run:

    # lctl --device lustre_device_number activate
    
  3. Determine all files that are striped over the missing OST, run:

    # lfs find -O {OST_UUID} /mountpoint
    

    This returns a simple list of filenames from the affected file system.

  4. If necessary, you can read the valid parts of a striped file, run:

    # dd if=filename of=new_filename bs=4k conv=sync,noerror
    
  5. You can delete these files with the unlink command.

    # unlink filename {filename ...} 
    

    Note

    When you run the unlink command, it may return an error that the file could not be found, but the file on the MDS has been permanently removed.

If the file system cannot be mounted, currently there is no way that parses metadata directly from an MDS. If the bad OST does not start, options to mount the file system are to provide a loop device OST in its place or replace it with a newly-formatted OST. In that case, the missing objects are created and are read as zero-filled.

Fixing a Bad LAST_ID on an OST

Each OST contains a LAST_ID file, which holds the last object (pre-)created by the MDS. The MDT contains a lov_objid file, with values that represent the last object the MDS has allocated to a file.

During normal operation, the MDT keeps pre-created (but unused) objects on the OST, and normally LAST_IDshould be larger than lov_objid. Any small difference in the values is a result of objects being precreated on the OST to improve MDS file creation performance. These precreated objects are not yet allocated to a file, since they are of zero length (empty).

However, in the case where lov_objid is larger than LAST_ID, it indicates the MDS has allocated objects to files that do not exist on the OST. Conversely, if lov_objid is significantly less than LAST_ID (by at least 20,000 objects) it indicates the OST previously allocated objects at the request of the MDS (which likely contain data) but it doesn't know about them.

Introduced in Lustre 2.5Since Lustre 2.5 the MDS and OSS will resync the lov_objid and LAST_ID files automatically if they become out of sync. This may result in some space on the OSTs becoming unavailable until LFSCK is next run, but avoids issues with mounting the filesystem.

Introduced in Lustre 2.6Since Lustre 2.6 the LFSCK will repair the LAST_ID file on the OST automatically based on the objects that exist on the OST, in case it was corrupted.

In situations where there is on-disk corruption of the OST, for example caused by the disk write cache being lost, or if the OST was restored from an old backup or reformatted, the LAST_ID value may become inconsistent and result in a message similar to:

"myth-OST0002: Too many FIDs to precreate,
OST replaced or reformatted: LFSCK will clean up"

A related situation may happen if there is a significant discrepancy between the record of previously-created objects on the OST and the previously-allocated objects on the MDT, for example if the MDT has been corrupted, or restored from backup, which would cause significant data loss if left unchecked. This produces a message like:

"myth-OST0002: too large difference between
MDS LAST_ID [0x1000200000000:0x100048:0x0] (1048648) and
OST LAST_ID [0x1000200000000:0x2232123:0x0] (35856675), trust the OST"

In such cases, the MDS will advance the lov_objid value to match that of the OST to avoid deleting existing objects, which may contain data. Files on the MDT that reference these objects will not be lost. Any unreferenced OST objects will be attached to the .lustre/lost+found directory the next time LFSCK layout check is run.

Handling/Debugging "Bind: Address already in use" Error

During startup, the Lustre software may report a bind: Address already in use error and reject to start the operation. This is caused by a portmap service (often NFS locking) that starts before the Lustre file system and binds to the default port 988. You must have port 988 open from firewall or IP tables for incoming connections on the client, OSS, and MDS nodes. LNet will create three outgoing connections on available, reserved ports to each client-server pair, starting with 1023, 1022 and 1021.

Unfortunately, you cannot set sunprc to avoid port 988. If you receive this error, do the following:

  • Start the Lustre file system before starting any service that uses sunrpc.

  • Use a port other than 988 for the Lustre file system. This is configured in /etc/modprobe.d/lustre.conf as an option to the LNet module. For example:

    options lnet accept_port=988
    
  • Add modprobe ptlrpc to your system startup scripts before the service that uses sunrpc. This causes the Lustre file system to bind to port 988 and sunrpc to select a different port.

    Note

You can also use the sysctl command to mitigate the NFS client from grabbing the Lustre service port. However, this is a partial workaround as other user-space RPC servers still have the ability to grab the port.

Handling/Debugging Error "- 28"

A Linux error -28 (ENOSPC) that occurs during a write or sync operation indicates that an existing file residing on an OST could not be rewritten or updated because the OST was full, or nearly full. To verify if this is the case, run on a client:

client$ lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
myth-MDT0000_UUID          12.9G        1.5G       10.6G  12% /myth[MDT:0]
myth-OST0000_UUID           3.6T        3.1T      388.9G  89% /myth[OST:0]
myth-OST0001_UUID           3.6T        3.6T       64.0K 100% /myth[OST:1]
myth-OST0002_UUID           3.6T        3.1T      394.6G  89% /myth[OST:2]
myth-OST0003_UUID           5.4T        5.0T      267.8G  95% /myth[OST:3]
myth-OST0004_UUID           5.4T        2.9T        2.2T  57% /myth[OST:4]

filesystem_summary:        21.6T       17.8T        3.2T  85% /myth
        

To address this issue, you can expand the disk space on the OST, or use the lfs_migrate command to migrate (move) files to a less full OST. For details on both of these options see the section called “ Adding a New OST to a Lustre File System”.

Introduced in Lustre 2.6

In some cases, there may be processes holding files open that are consuming a significant amount of space (e.g. runaway process writing lots of data to an open file that has been deleted). It is possible to get a list of all open file handles in the filesystem from the MDS:

mds# lctl get_param mdt.*.exports.*.open_files 
[email protected]_files= 
[0x200003ab4:0x435:0x0] 
[0x20001e863:0x1c1:0x0] 
[0x20001e863:0x1c2:0x0] 
: 
:         

These file handles can be converted into pathnames on any client via the lfs fid2path command (as root):

client# lfs fid2path /myth [0x200003ab4:0x435:0x0] [0x20001e863:0x1c1:0x0] [0x20001e863:0x1c2:0x0] 
lfs fid2path: cannot find '[0x200003ab4:0x435:0x0]': No such file or directory /myth/tmp/4M 
/myth/tmp/1G 
: 
:         

In some cases, if the file has been deleted from the filesystem, fid2path will return an error that the file is not found. You can use the client NID (192.168.20.159@tcp in the above example) to determine which node the file is open on, and lsof to find and kill the process that is holding the file open:

# lsof /myth 
COMMAND   PID   USER  FD TYPE    DEVICE      SIZE/OFF               NODE NAME logger  13806 mythtv  0r REG  35,632494 1901048576384 144115440203858997 /myth/logs/job.1283929.log (deleted) 	`

A Linux error -28 (ENOSPC) that occurs when a new file is being created may indicate that the MDT has run out of inodes and needs to be made larger. Newly created files are not written to full OSTs, while existing files continue to reside on the OST where they were initially created. To view inode information on the MDT, run on a client:

lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
myth-MDT0000_UUID        1910263     1910263           0 100% /myth[MDT:0]
myth-OST0000_UUID         947456      360059      587397  89% /myth[OST:0]
myth-OST0001_UUID         948864      233748      715116  91% /myth[OST:1]
myth-OST0002_UUID         947456      549961      397495  89% /myth[OST:2]
myth-OST0003_UUID        1426144      477595      948549  95% /myth[OST:3]
myth-OST0004_UUID        1426080      465248     1420832  57% /myth[OST:4]

filesystem_summary:      1910263     1910263           0 100% /myth
        

Typically, the Lustre software reports this error to your application. If the application is checking the return code from its function calls, then it decodes it into a textual error message such as No space left on device. The numeric error message may also appear in the system log.

For more information about the lfs df command, see the section called “Checking File System Free Space”.

You can also use the lctl get_param command to monitor the space and object usage on the OSTs and MDTs from any client:

lctl get_param {osc,mdc}.*.{kbytes,files}{free,avail,total}      

Note

You can find other numeric error codes along with a short name and text description in /usr/include/asm/errno.h.

Triggering Watchdog for PID NNN

In some cases, a server node triggers a watchdog timer and this causes a process stack to be dumped to the console along with a Lustre kernel debug log being dumped into /tmp (by default). The presence of a watchdog timer does NOT mean that the thread OOPSed, but rather that it is taking longer time than expected to complete a given operation. In some cases, this situation is expected.

For example, if a RAID rebuild is really slowing down I/O on an OST, it might trigger watchdog timers to trip. But another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, it may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).

Lustre: 0:0:(watchdog.c:122:lcw_cb()) 

The above message indicates that the watchdog is active for pid 933:

It was inactive for 100000ms:

Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack()) 

Showing stack for process:

933 ll_ost_25     D F896071A     0   933      1    934   932 (L-TLB)
f6d87c60 00000046 00000000 f896071a f8def7cc 00002710 00001822 2da48cae
0008cf1a f6d7c220 f6d7c3d0 f6d86000 f3529648 f6d87cc4 f3529640 f8961d3d
00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001

Call trace:

filter_do_bio+0x3dd/0xb90 [obdfilter]
default_wake_function+0x0/0x20
filter_direct_io+0x2fb/0x990 [obdfilter]
filter_preprw_read+0x5c5/0xe00 [obdfilter]
lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
ost_brw_read+0x18df/0x2400 [ost]
ost_handle+0x14c2/0x42d0 [ost]
ptlrpc_server_handle_request+0x870/0x10b0 [ptlrpc]
ptlrpc_main+0x42e/0x7c0 [ptlrpc]

Handling Timeouts on Initial Lustre File System Setup

If you come across timeouts or hangs on the initial setup of your Lustre file system, verify that name resolution for servers and clients is working correctly. Some distributions configure /etc/hosts so the name of the local machine (as reported by the 'hostname' command) is mapped to local host (127.0.0.1) instead of a proper IP address.

This might produce this error:

LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie
0xe74021a4b41b954e from nid 0x7f000001 (0:127.0.0.1)

Handling/Debugging "LustreError: xxx went back in time"

Each time the MDS or OSS modifies the state of the MDT or OST disk filesystem for a client, it records a per-target increasing transaction number for the operation and returns it to the client along with the reply to that operation. Periodically, when the server commits these transactions to disk, the last_committed transaction number is returned to the client to allow it to discard pending operations from memory, as they will no longer be needed for recovery in case of server failure.

In some cases error messages similar to the following have been observed after a server was restarted or failed over:

LustreError: 3769:0:(import.c:517:ptlrpc_connect_interpret())
testfs-ost12_UUID went back in time (transno 831 was previously committed,
server now claims 791)!

This situation arises when:

  • You are using a disk device that claims to have data written to disk before it actually does, as in case of a device with a large cache. If that disk device crashes or loses power in a way that causes the loss of the cache, there can be a loss of transactions that you believe are committed. This is a very serious event, and you should run e2fsck against that storage before restarting the Lustre file system.
  • As required by the Lustre software, the shared storage used for failover is completely cache-coherent. This ensures that if one server takes over for another, it sees the most up-to-date and accurate copy of the data. In case of the failover of the server, if the shared storage does not provide cache coherency between all of its ports, then the Lustre software can produce an error.

If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.

If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded, then lose the Data Device corruption or Disk Errors.

Lustre Error: "Slow Start_Page_Write"

The slow start_page_write message appears when the operation takes an extremely long time to allocate a batch of memory pages. Use these pages to receive network traffic first, and then write to disk.

Drawbacks in Doing Multi-client O_APPEND Writes

It is possible to do multi-client O_APPEND writes to a single file, but there are few drawbacks that may make this a sub-optimal solution. These drawbacks are:

  • Each client needs to take an EOF lock on all the OSTs, as it is difficult to know which OST holds the end of the file until you check all the OSTs. As all the clients are using the same O_APPEND, there is significant locking overhead.
  • The second client cannot get all locks until the end of the writing of the first client, as the taking serializes all writes from the clients.
  • To avoid deadlocks, the taking of these locks occurs in a known, consistent order. As a client cannot know which OST holds the next piece of the file until the client has locks on all OSTS, there is a need of these locks in case of a striped file.

Slowdown Occurs During Lustre File System Startup

When a Lustre file system starts, it needs to read in data from the disk. For the very first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object pre-creation. This causes a slowdown to occur when the file system starts up.

After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache.

Log Message 'Out of Memory' on OST

When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre file system. If insufficient memory is available, an 'out of memory' message can be logged.

During normal operation, several conditions indicate insufficient RAM on a server node:

  • kernel "Out of memory" and/or "oom-killer" messages
  • Lustre "kmalloc of 'mmm' (NNNN bytes) failed..." messages
  • Lustre or kernel stack traces showing processes stuck in "try_to_free_pages"

For information on determining the MDS memory and OSS memory requirements, see the section called “Determining Memory Requirements”.

Setting SCSI I/O Sizes

Some SCSI drivers default to a maximum I/O size that is too small for good Lustre file system performance. we have fixed quite a few drivers, but you may still find that some drivers give unsatisfactory performance with the Lustre file system. As the default value is hard-coded, you need to recompile the drivers to change their default. On the other hand, some drivers may have a wrong default set.

If you suspect bad I/O performance and an analysis of Lustre file system statistics indicates that I/O is not 1 MB, check /sys/block/*device*/queue/max_sectors_kb. If the max_sectors_kb value is less than 1024, set it to at least 1024 to improve performance. If changing max_sectors_kb does not change the I/O size as reported by the Lustre software, you may want to examine the SCSI driver code.