Release 6.5
Technical support
For technical assistance, visit http://entsupport.symantec.com and select phone or email support. Use the Knowledge Base search feature to access resources such as TechNotes, product alerts, software downloads, hardware compatibility lists, and our customer email notification service.
Contents
Section I
Chapter 1
Chapter 2
Using NOM to monitor jobs ................................................ 52
Disaster recovery testing and job scheduling ............................. 52
Miscellaneous considerations ............................................. 53
Selection of storage units ............................................... 53
Disk staging ............................................................. 55
File system capacity ..................................................... 55
NetBackup catalog strategies ............................................. 56
Catalog backup types ..................................................... 57
Guidelines for managing the catalog ...................................... 57
Catalog backup not finishing in the available window ..................... 58
Catalog compression ...................................................... 59
Merging, splitting, or moving servers .................................... 59
Guidelines for policies .................................................. 60
Include and exclude lists ................................................ 60
Critical policies ........................................................ 61
Schedule frequency ....................................................... 61
Managing logs ............................................................ 61
Optimizing the performance of vxlogview .................................. 61
Interpreting legacy error logs ........................................... 61
Chapter 3
Chapter 4
Chapter 5
Best practices
Best practices: SAN Client ............................................... 76
Best practices: Flexible Disk Option ..................................... 77
Best practices: new tape drive technologies .............................. 77
Best practices: tape drive cleaning ...................................... 78
Best practices: recoverability ........................................... 80
Suggestions for data recovery planning ................................... 81
Best practices: naming conventions ....................................... 82
Policy names ............................................................. 83
Schedule names ........................................................... 83
Section II
Chapter 6
Performance tuning
Measuring performance
Overview ................................................................. 86
Controlling system variables for consistent testing conditions ........... 86
Server variables ......................................................... 86
Network variables ........................................................ 87
Client variables ......................................................... 88
Data variables ........................................................... 88
Evaluating performance ................................................... 89
Evaluating UNIX system components ........................................ 94
Monitoring CPU load ...................................................... 94
Measuring performance independent of tape or disk output ................. 94
Evaluating Windows system components ..................................... 95
Monitoring CPU load ...................................................... 96
Monitoring memory use .................................................... 97
Monitoring disk load ..................................................... 97
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Changing the HP-UX kernel parameters ..................................... 175
Kernel parameters on Linux ............................................... 176
Adjusting data buffer size (Windows) ..................................... 176
Other Windows issues ..................................................... 178
Appendix A
Additional resources
Performance tuning information at Vision online .......................... 179
Performance monitoring utilities ......................................... 179
Freeware tools for bottleneck detection .................................. 179
Mailing list resources ................................................... 180
Index .................................................................... 181
Section I
NetBackup capacity planning
Master server configuration guidelines
Media server configuration guidelines
Media configuration guidelines
Database backup guidelines
Best practices
Note: For a discussion of tuning factors and general recommendations that may be applied to an existing 6.5 installation, see Section II.
Chapter 1   NetBackup capacity planning
Purpose on page 12
Disclaimer on page 12
Introduction on page 12
Analyzing your backup requirements on page 14
Designing your backup system on page 17
Questionnaire for capacity planning on page 44
Purpose
Veritas NetBackup is a high-performance data protection application. Its architecture is designed for large and complex distributed computing environments. NetBackup provides scalable storage servers (master and media servers) that can be configured for network backup, recovery, archiving, and file migration services.

This manual is for administrators who want to analyze, evaluate, and tune NetBackup performance. It is intended to answer questions such as the following:

How big should the NetBackup master server be?
How can the server be tuned for maximum performance?
How many CPUs and disk or tape drives are needed?
How can backups be configured to run as fast as possible?
How can recovery times be improved?
What tools can characterize or measure how NetBackup is handling data?

Note: The most critical factors in performance are based in hardware rather than software. Hardware selection and configuration have roughly four times the weight that software has in determining performance. Although this guide provides some hardware configuration assistance, it is assumed for the most part that your devices are correctly configured.
Disclaimer
It is assumed you are familiar with NetBackup and your applications, operating systems, and hardware. The information in this manual is advisory only, presented in the form of guidelines. Changes to an installation undertaken as a result of the information contained herein should be verified in advance for appropriateness and accuracy. Some of the information contained herein may apply only to certain hardware or operating system architectures. Note: The information in this manual is subject to change.
Introduction
The first step toward accurately estimating your backup requirements is a complete understanding of your environment. Many performance issues can be traced to hardware or environmental issues. A basic understanding of the entire backup data path is important in determining the maximum performance you can expect from your installation.
Every backup environment has a bottleneck. It may be a fast bottleneck, but it will determine the maximum performance obtainable with your system.
Example:
Consider the configuration in Figure 1-1. In this environment, backups run slowly (in other words, they are not completing in the scheduled backup window). Total throughput is 80 MB per second. What makes the backups run slowly? How can NetBackup or the environment be configured to increase backup performance in this situation?

Figure 1-1   Dedicated NetBackup environment
The explanation is as follows: the four LTO gen 3 tape drives have a combined throughput of 320 MB (megabytes) per second, and the 2 Gbit SAN connection from the server to the tape drives has a theoretical throughput of 200 MB per second. The LAN, however, with a speed of 1 gigabit per second, has a theoretical throughput of 125 MB per second. In practice, 1-gigabit throughput is unlikely to exceed 70% utilization. Therefore, the best delivered data rate is about 90 MB per second to the NetBackup server. The throughput can be even lower when TCP/IP packet headers, TCP window-size constraints, router hops (packet latency for ACK packets delays the sending of the next data packet), host CPU utilization, file system overhead, and other LAN users' activity are considered. Since the LAN is the slowest element in the backup path, it is the first place to look to increase backup performance in this configuration.
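The bottleneck reasoning above can be expressed as a small script. This is only a sketch: the rates are the theoretical figures quoted in the text, and the 0.70 factor is the ~70% LAN utilization rule of thumb mentioned above.

```python
# Sketch: find the slowest component in the backup data path of Figure 1-1.
# Rates are the theoretical MB/s figures quoted in the text; the 0.70
# factor is the ~70% LAN utilization rule of thumb mentioned above.
components = {
    "4 x LTO gen 3 tape drives": 320.0,   # combined theoretical rate
    "2 Gbit SAN connection": 200.0,       # theoretical rate
    "1 Gbit LAN": 125.0 * 0.70,           # ~70% of theoretical 125 MB/s
}

bottleneck = min(components, key=components.get)
best_rate = components[bottleneck]
print(f"Bottleneck: {bottleneck}, about {best_rate:.1f} MB/s delivered")
```

Whichever component has the lowest delivered rate caps the throughput of the whole path, which is why the LAN is the first place to look here.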
Which systems need to be backed up?
Identify all systems that need to be backed up. List each system separately so that you can identify any that require more resources to back up. Document which machines have disk drives, tape drives, or libraries attached, and write down the model type of each drive or library. Identify any applications on these machines that need to be backed up, such as Oracle, DB2, or MS-Exchange. In addition, record each host name, operating system and version, database type and version, network technology (for example, 1000BaseT), and location.

How much data will be backed up?
Calculate how much data you need to back up. Include the total disk space on each individual system, including that for databases. Remember to add the space on mirrored disks only once. By calculating the total size for all disks, you can design a system that takes future growth into account. You should also consider the future by estimating how much data you will need to back up in six months to a few years from now.
Do you plan to back up databases or raw partitions? If you are planning to back up databases, you need to identify the database engines, their version numbers, and the method that you will use to back them up. NetBackup can back up several database engines and raw file systems, and databases can be backed up while they are online or offline. To back up any database while it is online, you need a NetBackup database agent for your particular database engine. If you use NetBackup Snapshot Client to back up databases using raw partitions, you are actually backing up as much data as the total size of your raw partition. Also, remember to add the size of your database backups to your final calculations when figuring out how much data you need to back up.
Will you be backing up special application servers such as MS-Exchange or Lotus Notes? If you are planning on backing up any application servers, you will need to identify their types and application release numbers. As previously mentioned, you may need a special NetBackup agent to properly back up your particular servers.
What types of backups are needed and how often should they take place?
The frequency of your backups has a direct impact on your restore opportunities. To properly size your backup system, you must decide on the type and frequency of your backups. Will you perform daily incremental and weekly full backups? Monthly or bi-weekly full backups?
How much time is available to run each backup?
It is important to know the window of time that is available to complete each backup. The length of a window dictates several aspects of your backup strategy. For example, you may want a larger window of time to back up multiple, high-capacity servers. Or you may consider the use of advanced NetBackup features such as synthetic backups, a local snapshot method, or FlashBackup.

How long should backups be retained?
An important factor in designing your backup strategy is your policy for backup expiration. The amount of time a backup is kept is also known as the retention period. A fairly common policy is to expire incremental backups after one month and full backups after six months. With this policy, you can restore any daily file change from the previous month and restore data from full backups for the previous six months. The length of the retention period depends on your own unique requirements and business needs, and perhaps regulatory requirements. Keep in mind, however, that the length of your retention period has a directly proportional effect on the number of tapes you need and the size of your NetBackup catalog database. The catalog keeps track of all the information on all your disk drives and tapes, and its size is closely tied to your retention period and the frequency of your backups. Also, database management daemons and services may become bottlenecks.

If backups are sent off site, how long must they remain off site?
If you plan to send tapes to an off-site location as a disaster recovery option, you must identify which tapes to send off site and how long they remain off site. You might decide to duplicate all your full backups, or only a select few. You might also decide to duplicate certain systems and exclude others. As tapes are sent off site, you will need to buy new tapes to replace them until they are recycled back from off-site storage. If you forget this simple detail, you will run out of tapes when you most need them.
What is your network technology?
If you plan to back up any system over a network, note the network types that you will use. The next section, Designing your backup system, explains how to calculate the amount of data you can transfer over those networks in a given time. Depending on the amount of data that you want to back up and the frequency of those backups, you might want to consider installing a private network just for backups.

What new systems will be added to your site in the next six months?
It is important to plan for future growth when designing your backup system. By analyzing the potential growth of your current or future systems, you can ensure that your backup solution accommodates the kind of environment you will have in the future. Remember to add any resulting growth factor to your total backup solution.

Will user-directed backups or restores be allowed?
Allowing users to do their own backups and restores can reduce the time it takes to initiate certain operations. However, user-directed operations can also result in higher support costs and the loss of some flexibility. User-directed operations can monopolize media and tape drives when you most need them. They can also generate more support calls and training issues while users become familiar with the new backup system. You need to decide whether allowing user access to some of your backup system's functions is worth the potential costs.

Data type: What are the types of data: text, graphics, database? How compressible is the data? How many files are involved? Will the data be encrypted? (Note that encrypted backups may run slower. See Encryption on page 144 for more information.)

Data location: Is the data local or remote? What are the characteristics of the storage subsystem? What is the exact data path? How busy is the storage subsystem?
Change management: Because hardware and software infrastructure will change over time, is it worth the cost to create an independent test-backup environment to ensure your production environment will work with the changed components?
Designing your backup system
Calculate the required data transfer rate for your backups on page 17
Calculate how long it will take to back up to tape on page 18
Calculate how many tape drives are needed on page 20
Calculate the required data transfer rate for your network(s) on page 22
Calculate the size of your NetBackup catalog on page 23
Calculate the size of the EMM database on page 26
Calculate media needed for full and incremental backups on page 28
Calculate the size of the tape library needed to store your backups on page 30
Estimate your SharedDisk storage requirements on page 30
Design your master server based on your previous findings on page 33
Estimate the number of master servers needed on page 35
Design your media server on page 37
Estimate the number of media servers needed on page 38
Design your NOM server on page 39
Summary on page 43
If you are running cumulative-incremental backups, you need to take into account which data is changing, since that affects the size of your backups. For example, if the same 20% of the data is changing daily, your cumulative-incremental backup will be much smaller than if a completely different 20% changes every day.
Example: Calculating your ideal data transfer rate during the week
Assumptions:
Amount of data to back up during a full backup = 500 gigabytes
Amount of data to back up during an incremental backup = 20% of a full backup
Daily backup window = 8 hours

Solution 1:
Full backup = 500 gigabytes
Ideal data transfer rate = 500 gigabytes / 8 hours = 62.5 gigabytes/hour

Solution 2:
Incremental backup = 100 gigabytes
Ideal data transfer rate = 100 gigabytes / 8 hours = 12.5 gigabytes/hour

To calculate your ideal data transfer rate during the weekends, divide the amount of data that needs to be backed up by the length of the weekend backup window.
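The ideal-rate arithmetic above can be sketched in a few lines of Python, using the figures from the example:

```python
# Sketch of the ideal data transfer rate calculation from the example above.
full_backup_gb = 500                      # full backup size (GB)
incremental_gb = 0.20 * full_backup_gb    # incremental = 20% of a full backup
window_hours = 8                          # daily backup window

full_rate = full_backup_gb / window_hours    # GB/hour needed for full backups
incr_rate = incremental_gb / window_hours    # GB/hour for incrementals

print(f"Full: {full_rate} GB/hour, incremental: {incr_rate} GB/hour")
```

The same division applies to the weekend window: data to back up divided by window length.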
Calculate how long it will take to back up to tape
Backup time = (Amount of data to back up) / ((Number of drives) * (Tape drive transfer rate))

Table 1-1   Tape drive data transfer rates

Drive
LTO gen 1
LTO gen 2
LTO gen 3
SDLT 320
LTO gen 4
SDLT 600
STK 9940B
STK T10000
TS1120
Solution 1:
Tape drive type = LTO gen 2
Tape drive transfer rate = 105 gigabytes/hour
Number of drives = 1000 gigabytes / ((8 hours) * (105 gigabytes/hour)) = 1.19 = 2 drives

Solution 2:
Tape drive type = LTO gen 3
Tape drive transfer rate = 288 gigabytes/hour
Number of drives = 1000 gigabytes / ((8 hours) * (288 gigabytes/hour)) = 0.43 = 1 drive

Although it is quite straightforward to calculate the number of drives needed to perform a backup, it is difficult to spread the data streams evenly across all drives. To effectively spread your data, you have to experiment with various backup schedules, NetBackup policies, and your hardware configuration. See Basic tuning suggestions for the data path on page 101 to determine your options.

Another important aspect of calculating how many tape devices you need is calculating how many tape devices you can attach to a drive controller. When calculating the maximum number of tape drives that you can attach to a controller, you must know the drive and controller maximum transfer rates as published by their manufacturers. Failure to use maximum transfer rates for your calculations can result in saturated controllers and unpredictable results. Table 1-2 displays the transfer rates for several drive controllers. In practice, your transfer rates might be slower because of the inherent overhead of several variables, including your file system layout, system CPU load, and memory usage.

Table 1-2   Drive controller data transfer rates
Drive controller              Theoretical gigabytes/hour
ATA-5 (ATA/ATAPI-5)           237.6
Wide Ultra 2 SCSI             288
iSCSI                         360
1 Gigabit Fibre Channel       360
SATA/150                      540
Ultra-3 SCSI                  576
2 Gigabit Fibre Channel       720
SATA/300                      1080
Ultra320 SCSI                 1152
4 Gigabit Fibre Channel       1440
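The drive-count arithmetic from Solutions 1 and 2 above can be checked with a short script. This is a sketch: the 105 and 288 gigabytes/hour figures are the LTO gen 2 and gen 3 rates quoted in the solutions.

```python
import math

# Sketch of the drive-count calculation from Solutions 1 and 2 above.
def drives_needed(data_gb, window_hours, drive_rate_gb_per_hour):
    """A fractional result means one more whole drive, so round up."""
    return math.ceil(data_gb / (window_hours * drive_rate_gb_per_hour))

lto2_drives = drives_needed(1000, 8, 105)   # LTO gen 2 at 105 GB/hour
lto3_drives = drives_needed(1000, 8, 288)   # LTO gen 3 at 288 GB/hour
print(lto2_drives, lto3_drives)
```

Rounding up matters: 1.19 drives means a second physical drive is required, as in Solution 1.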
Table 1-3   Network data transfer rates

Network technology            Typical gigabytes/hour
100BaseT (switched)           25
1000BaseT (switched)          250
10000BaseT (switched)         2500
Additional information is available on the importance of matching network bandwidth to your tape drives. See Network and SCSI/FC bus bandwidth on page 66.
rate of 62.5 gigabytes/hour. In this case, you would have to explore some other options, such as:
Backing up your data over a faster network (1000BaseT)
Backing up large servers to dedicated tape drives (SAN media servers)
Backing up over SAN connections using SAN Client
Performing off-host backups using Snapshot Client to present a snapshot directly to a media server
Performing your backups during a longer time window
Performing your backups over faster dedicated private networks
Solution 2:
Network technology = 1000BaseT (switched)
Typical transfer rate = 250 gigabytes/hour

Using the values from Table 1-3, a single 1000BaseT network has a transfer rate of 250 gigabytes/hour. This network technology will be able to handle your backups with room to spare.

Calculating the data transfer rates for your networks can help you identify your potential bottlenecks by looking at the transfer rates of your slowest networks. Several solutions for dealing with multiple networks and bottlenecks are available. See Basic tuning suggestions for the data path on page 101.
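As a quick check, the network comparison can be sketched in a few lines. The rates are the typical gigabytes/hour values from Table 1-3, and the 62.5 gigabytes/hour requirement comes from the earlier example.

```python
# Sketch: check which networks from Table 1-3 can sustain the required
# backup rate of 62.5 GB/hour from the earlier example.
required_gb_per_hour = 62.5
typical_rates = {
    "100BaseT (switched)": 25,
    "1000BaseT (switched)": 250,
    "10000BaseT (switched)": 2500,
}

adequate = {name: rate >= required_gb_per_hour
            for name, rate in typical_rates.items()}
print(adequate)
```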
The size of the NetBackup catalog depends on the following:
The number of files being backed up
The frequency and retention period of the backups
You can use either of two methods to calculate the NetBackup catalog size (described below). In both cases, since data volumes grow over time, you should factor in expected growth when calculating total disk space used. Note that catalog compression can reduce the size of your catalog. NetBackup automatically decompresses the catalog only for the images and time period that are needed for the restore. You can also use catalog archiving to reduce the space requirements for the catalog. More information is available on catalog compression and catalog archiving. See the NetBackup Administrator's Guide, Volume I.
Note: If you select NetBackup's True Image Restore option, your catalog will be larger than a catalog without this option selected. True Image Restore collects the information required to restore directories to their contents at the time of any selected full or incremental backup. Because the additional information that NetBackup collects for incremental backups is the same as that of a full backup, incremental backups take much more disk space when you collect True Image Restore information.
First method
This method for calculating catalog size can be quite precise. It requires certain details, such as the number of files held on-line and the number of backups (full and incremental) that are retained at any time. To calculate the catalog size in gigabytes for a particular backup, use the following formula:

Catalog size = (120 bytes * number of files in all backups) / 1 GB

To use this method, you must determine the approximate number of copies of each file held in backups. The number of copies can usually be estimated as follows:

Number of copies of each file held in backups = number of full backups + 10% of the number of incremental backups held

Example: Calculating the size of your NetBackup catalog

Assumptions:
Number of full backups per month: 4
Retention period for full backups: 6 months
Total number of full backups retained: 24
Number of incremental backups per month: 25
Total number of files held on-line (total number of files in a full backup): 17,500,000

Solution:
Number of copies of each file retained: 24 + (25 * 10%) = 26.5
NetBackup catalog size for each file retained: 120 bytes * 26.5 copies = 3180 bytes
Total catalog space required: (3180 bytes * 17,500,000 files) / 1 GB = 54 GB
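The first method can be sketched in a few lines, using the figures from the example above. Note that the result depends on whether 1 GB is taken as 10^9 or 2^30 bytes, so the computed value differs slightly from the rounded figure quoted in the text.

```python
# Sketch of the first catalog-sizing method, using the example's figures.
# Here 1 GB is taken as 10**9 bytes; the guide's own rounding differs slightly.
full_backups_retained = 24
incrementals_held = 25           # incremental backups held
files_online = 17_500_000        # files in one full backup

copies_per_file = full_backups_retained + 0.10 * incrementals_held   # 26.5
bytes_per_file = 120 * copies_per_file                               # 3180 bytes
catalog_gb = bytes_per_file * files_online / 10**9
print(f"About {catalog_gb:.1f} GB of catalog space")
```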
Second method
An alternative method for calculating the catalog size is to multiply the total amount of data maintained in the production environment (not the total size of all backups) by a small percentage (such as 2%). Note that 2% is only an example; this section helps you calculate a percentage that is appropriate for your environment.

Note: Calculating catalog size based on a small percentage works only for environments in which it is easy to determine the typical file size, typical retention policies, and typical incremental change rates. In some cases, the catalog size obtained using this method may vary significantly from the eventual catalog size.

To use this method, you must determine the approximate number of copies of each file held in backups and the typical file size. The number of copies can usually be estimated as follows:

Number of copies of each file held in backups = number of full backups + 10% of the number of incremental backups held

The multiplying percentage can be calculated as follows:

Multiplying percentage = (120 bytes * number of copies of each file held in backups / average file size) * 100%

Then, the size of the catalog can be estimated as:

Size of the catalog = total disk space used * multiplying percentage

Example: Calculating the size of your NetBackup catalog

Assumptions:
Number of full backups per month: 4
Retention period for full backups: 6 months
Total number of full backups retained: 24
Number of incremental backups per month: 25
Average file size: 70 KB
Total disk space used on all servers in the domain: 1.4 TB

Solution:
Number of copies of each file retained: 24 + (25 * 10%) = 26.5
NetBackup catalog size for each file retained: 120 bytes * 26.5 copies = 3180 bytes
Multiplying percentage: (3180 / 70,000) * 100% = 4.5%
Total catalog space required: (1,400 GB * 4.5%) = 63 GB
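The second method can be sketched as follows, using the figures from the example above:

```python
# Sketch of the second catalog-sizing method, using the example's figures.
copies_per_file = 24 + 0.10 * 25    # full backups + 10% of incrementals held
avg_file_size_bytes = 70_000        # 70 KB average file size

multiplying_pct = (120 * copies_per_file / avg_file_size_bytes) * 100
total_disk_gb = 1400                # total disk space used (1.4 TB)
catalog_gb = total_disk_gb * multiplying_pct / 100
print(f"{multiplying_pct:.1f}% of {total_disk_gb} GB = about {catalog_gb:.0f} GB")
```

Carrying the unrounded percentage through gives a slightly larger figure than the 63 GB obtained with the rounded 4.5%, which is why catalog estimates of this kind should be treated as approximations.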
Recommendation
The size of the catalog depends on the number of files in the backups and the number of copies of the backups that are retained. As a result, the catalog has the potential to grow quite large. When estimating the ultimate size of the catalog, however, you should consider two additional factors: whether it can be backed up in an acceptable time window, and whether the general housekeeping activities can complete within their execution windows. The time required to complete a catalog backup depends on the amount of space the catalog occupies. The time required for the housekeeping activities depends on the number of entries in the catalog. Note that NetBackup's catalog archiving feature can be used to move older catalog data to other disk or tape storage. Catalog archiving can reduce the size of the catalog on disk and thus reduce the backup time. Catalog archiving, however, does not decrease the amount of time required for housekeeping activities. In general, Symantec recommends that you plan your environment so that the following criteria are met:
The amount of data held in the on-line catalog should not exceed 750 GB. Catalog archiving can be used to keep the on-line portion of the catalog below this value.
The total number of catalog entries should not exceed 1,000,000. This number equals the total of all retained copies of all backups from all clients held both on-line and in the catalog archive.
Note that the actual limits of acceptable catalog performance are influenced by the speed of the storage and the power of the server. Your performance may vary significantly from the guideline figures provided in this section.

Note: If you anticipate that your catalog will exceed these limits, consider deploying multiple NetBackup domains in your environment. More information on catalog archiving is available in the NetBackup Administrator's Guide, Volume I.
Calculate the size of the EMM database
The amount of space needed for the EMM database is determined by the size of the NetBackup database (NBDB), as explained below. Note: This space must be included when determining size requirements for a master or media server, depending on where the EMM database is installed. Space for the NBDB on the EMM database is required in the following two locations: UNIX
/usr/openv/db/data
/usr/openv/db/staging
Windows
install_path\NetBackupDB\data
install_path\NetBackupDB\staging
Calculate the required space for the NBDB in each of the two directories, as follows:

Space for NBDB = 60 MB + (2 KB * number of volumes configured for EMM) + (5 KB * number of images in disk storage other than BasicDisk) + (5 KB * number of disk volumes * number of media servers)

where EMM is the Enterprise Media Manager, and volumes are NetBackup (EMM) media volumes. Note that 60 MB is the default amount of space needed for the NBDB database used by the EMM database. It includes pre-allocated space for configuration information for devices and storage units.

Note: During NetBackup installation, the install script looks for 60 MB of free space in the above data directory. If there is insufficient space, the installation fails. The space in the staging directory is only required when a hot catalog backup is run.

A portion of the space required in the data directory is occupied by the EMM transaction log. This transaction log is truncated (not removed) only when a hot catalog backup is performed. The log continues to grow indefinitely if a hot catalog backup is not made at regular intervals.
Note: The above 60 MB of space is pre-allocated. It is derived from the following separate databases that were consolidated into the EMM database in NetBackup 6.0: globDB, ltidevs, robotic_def, namespace.chksum, ruleDB, poolDB, volDB, mediaDB, storage_units, stunit_groups, SSOhosts, and media errors database. See the NetBackup Release Notes, in the section titled Enterprise Media Manager Databases, for additional details on files and database information included in the EMM database.
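The space formula above can be sketched as a short function. The four input counts below are hypothetical examples for illustration, not values taken from this guide.

```python
# Sketch of the NBDB space formula above. All four input counts below are
# hypothetical examples, not values taken from this guide.
def nbdb_space_mb(emm_volumes, non_basicdisk_images, disk_volumes, media_servers):
    """Space (MB) needed in each of the data and staging directories."""
    extra_kb = (2 * emm_volumes                       # 2 KB per EMM volume
                + 5 * non_basicdisk_images            # 5 KB per disk image
                + 5 * disk_volumes * media_servers)   # 5 KB per volume/server pair
    return 60 + extra_kb / 1024                       # 60 MB pre-allocated base

space = nbdb_space_mb(emm_volumes=1000, non_basicdisk_images=5000,
                      disk_volumes=10, media_servers=4)
print(f"About {space:.1f} MB in each directory")
```

Remember that the result applies to each of the data and staging directories separately.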
Calculate media needed for full and incremental backups

The number of tapes needed to store your backups depends on the following:
The amount of data that you are backing up
The frequency of your backups
The planned retention periods
The capacity of the media used to store your backups
If you expect your site's workload to increase over time, you can ease the pain of future upgrades by planning for expansion. Design your initial backup architecture so that it can evolve to support more clients and servers. Invest in the faster, higher-capacity components that will serve your needs beyond the present.

A simple formula for calculating your tape needs is the following:

Number of tapes = (Amount of data to back up) / (Tape capacity)

To calculate how many tapes are needed based on all your requirements, the above formula can be expanded to the following:

Number of tapes = ((Amount of data to back up) * (Frequency of backups) * (Retention period)) / (Tape capacity)

Table 1-4   Tape capacities

Drive        Capacity without compression    Capacity with 2:1 compression
LTO gen 1    100 gigabytes                   200 gigabytes
LTO gen 2    200 gigabytes                   400 gigabytes
LTO gen 3    400 gigabytes                   800 gigabytes
LTO gen 4    800 gigabytes                   1600 gigabytes
Example: Calculating how many tapes are needed to store all your backups
Preliminary calculations:
Size of full backups = 500 gigabytes * 4 (per month) * 6 months = 12 terabytes
Size of incremental backups = (20% of 500 gigabytes) * 30 * 1 month = 3 terabytes
Total data tracked = 12 terabytes + 3 terabytes = 15 terabytes

Solution 1:
Tape drive type = LTO gen 2
Tape capacity without compression = 200 gigabytes
Tape capacity with compression = 400 gigabytes

Without compression:
Tapes needed for full backups = 12 terabytes / 200 gigabytes = 60
Tapes needed for incremental backups = 3 terabytes / 200 gigabytes = 15
Total tapes needed = 60 + 15 = 75 tapes

With 2:1 compression:
Tapes needed for full backups = 12 terabytes / 400 gigabytes = 30
Tapes needed for incremental backups = 3 terabytes / 400 gigabytes = 7.5 ~= 8
Total tapes needed = 30 + 8 = 38 tapes

Solution 2:
Tape drive type = LTO gen 3
Tape capacity without compression = 400 gigabytes
Tape capacity with compression = 800 gigabytes

Without compression:
Tapes needed for full backups = 12 terabytes / 400 gigabytes = 30
Tapes needed for incremental backups = 3 terabytes / 400 gigabytes = 7.5 ~= 8
Total tapes needed = 30 + 8 = 38 tapes

With 2:1 compression:
Tapes needed for full backups = 12 terabytes / 800 gigabytes = 15
Tapes needed for incremental backups = 3 terabytes / 800 gigabytes = 3.75 ~= 4
Total tapes needed = 15 + 4 = 19 tapes
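The compressed LTO gen 3 case above can be sketched as follows, with partial tapes rounded up to whole tapes:

```python
import math

# Sketch of Solution 2 above: LTO gen 3 with 2:1 compression (800 GB/tape).
def tapes_needed(data_tb, tape_capacity_gb):
    """Partial tapes still occupy a physical tape, so round up."""
    return math.ceil(data_tb * 1000 / tape_capacity_gb)

full_tapes = tapes_needed(12, 800)   # 12 TB of full backups
incr_tapes = tapes_needed(3, 800)    # 3 TB of incremental backups
total_tapes = full_tapes + incr_tapes
print(total_tapes)
```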
Calculate the size of the tape library needed to store your backups
To calculate how many robotic library tape slots are needed to store all your backups, take the number of tapes calculated in Calculate media needed for full and incremental backups on page 28 and add tapes for catalog backup and cleaning:

Tape slots needed = (Number of tapes needed for backups) + (Number of tapes needed for catalog backups) + 1 (for a cleaning tape)

A typical example of tapes needed for catalog backup is 2. Additional tapes may be needed for the following:
- If you plan to duplicate tapes or to reserve some media for special (non-backup) use, add those tapes to the above formula.
- Add tapes needed for future data growth. Make sure your system has a viable upgrade path as new tape drives become available.
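The slot formula above can be sketched as follows; the default of two catalog tapes matches the typical example in the text, and the helper name is illustrative:

```python
def tape_slots_needed(backup_tapes, catalog_tapes=2, cleaning_tapes=1):
    """Robotic library slots = backup tapes + catalog backup tapes + a cleaning tape."""
    return backup_tapes + catalog_tapes + cleaning_tapes

# 38 backup tapes (LTO gen 3, no compression, from the earlier example)
print(tape_slots_needed(38))  # 41 slots
```

Tapes set aside for duplication, special use, or future growth would simply be added to `backup_tapes` before the call.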
- The number of media servers to use as storage servers in each SharedDisk disk pool. A storage server mounts the storage, and writes data to and reads data from disk storage. A SharedDisk disk pool is the storage destination of a NetBackup storage unit. See the NetBackup Shared Storage Guide.
- The number of disk volumes to use in each SharedDisk disk pool.
- The more clients that use the disk pool, the more storage servers that are needed per disk pool.
- The more Fibre Channel connections that are configured on the media servers, the more storage servers that are needed.
- Client data has to be transferred both in and out of the media server. The lower the average bandwidth of each media server, the more storage servers that are needed.
- The more often backups are run, and the larger each backup is, the greater the number of storage servers that are needed.
- NetBackup device selection takes longer with a larger number of storage servers, especially for backups that use the Multiple copies option (Inline Copy).
- As the number of storage servers increases, they may consume more resources from the disk arrays. The hardware and software implementation of the arrays may have to be adjusted accordingly.
- As the number of storage servers increases, background monitoring communication traffic also increases, which causes more database activity. The communication traffic consists of the following:
- The Disk Polling Service and the Media Performance Monitor Service within nbrmms on the media server.
- The Resource Event Manager service and the Disk Service Manager within nbemm on the EMM server.
Recommendation
Symantec recommends a conservative estimate for the number of storage servers per disk pool. For most NetBackup installations, a rough guideline is approximately 10 to 50 storage servers per disk pool. Note that it is easy to add or remove storage servers from a disk pool.
- Limitations in disk volume size imposed by the disk array. If the maximum volume size allowed by the array is relatively small, more volumes may be required.
- Limitations in file system size imposed by the operating system of the media server that is used as a storage server. A volume cannot exceed the size of the file system. The smaller each file system, the more volumes required to store a given amount of data.
- The number of storage servers in each SharedDisk disk pool. The more storage servers, the more disk volumes that are needed.
You can use the following to determine, roughly, the minimum and maximum number of disk volumes needed in each SharedDisk disk pool:
- Minimum: you should have at least two disk volumes per storage server. (Note that a disk volume can be used by only one storage server at a time.) If the total number of disk volumes is smaller than the number of storage servers, NetBackup runs out of disk volumes to mount.
- Maximum: if all clients connected to the storage servers use Fibre Channel, the number of disk volumes in a SharedDisk disk pool may not have to exceed the total number of Fibre Channel connections available. That is, NetBackup does not require more SharedDisk disk volumes than the total number of Fibre Channel connections available. Other site requirements, however, may suggest a higher number of disk volumes.
- Storage sharing: during media and device selection, NetBackup tries to find the best storage server. A larger number of disk volumes improves the probability of finding an eligible disk volume for the storage server. The result is better load balancing across media servers.
- Raw I/O performance: a larger number of disk volumes is more likely to allow a disk volume configuration that addresses the physical details of the disk array. The configuration could allow the use of separate spindles for separate backup jobs, to eliminate disk head contention.
- Image fragmentation: given a constant amount of storage space, fewer disk volumes result in less image fragmentation, because images are less likely to span volumes. As a result, fewer multiple-disk-volume restores are required, which improves restore performance.
- Mount time: mount times for disk volumes are in the range of one to three seconds. Fewer disk volumes make it more likely that a new backup job starts on a disk volume that is already mounted, which saves time. The use of LUN masking increases mount times, whereas the use of SCSI persistent reservation results in comparatively much shorter mount times. Note: the length of the mount time is not related to the size of the backup. The longer the backup job, the less noticeable is the delay due to mount time.
- Media and device selection: as with a larger number of storage servers, a larger number of disk volumes makes media and device selection more time consuming. This is especially true for backups that use the Multiple copies option (Inline Copy).
- System monitoring and administration: disk pool monitoring and administration activities increase with the number of disk volumes in a disk pool. Fewer disk volumes require less time to monitor or administer.
Recommendation
The recommended number of disk volumes to have in a disk pool depends on whether the NetBackup clients are connected to the storage server by means of a LAN (potentially slow) or Fibre Channel (considerably faster).
- NetBackup clients connected to the storage server over the LAN: the number of disk volumes should be between two and four times the number of storage servers.
- NetBackup clients connected to the storage server over Fibre Channel (SAN clients): you should have at least one disk volume per logical client connection on the Fibre Channel. If four SAN clients use two Fibre Channel connections (two clients per connection), then four disk volumes are recommended. In general, to avoid running out of volumes, the number of volumes in a SharedDisk disk pool should equal the number of storage servers multiplied by the number of Fibre Channel connections per storage server. For example: if you have 20 storage servers, with 8 Fibre Channel client connections on each storage server, a guideline for the number of disk volumes is 20 * 8 = 160. In this case, if the number of disk volumes is less than 160, NetBackup may run out of disk volumes.
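The two guidelines above can be sketched in Python. The function name and the choice of returning a range for LAN clients are illustrative, not part of NetBackup:

```python
def recommended_disk_volumes(storage_servers, fc_connections_per_server=None):
    """Guideline sketch: LAN clients call for 2x-4x the number of storage
    servers; SAN clients call for storage_servers * FC connections per server."""
    if fc_connections_per_server is None:
        # LAN clients: between two and four times the number of storage servers
        return (2 * storage_servers, 4 * storage_servers)
    # SAN clients on Fibre Channel
    return storage_servers * fc_connections_per_server

print(recommended_disk_volumes(20, 8))  # 160, matching the example in the text
print(recommended_disk_volumes(10))     # (20, 40) for LAN clients
```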
Further information
Further information is available on SharedDisk performance considerations and best practices. See the tech note Best practices for disk layout with the NetBackup Flexible Disk option. (This tech note should be available by September 2008.) http://seer.entsupport.symantec.com/docs/305161.htm Further information on how to configure SharedDisk is available. See the NetBackup Shared Storage Guide.
Perform an initial backup requirements analysis, as outlined in the section Analyzing your backup requirements on page 14. Perform the calculations outlined in the previous steps of the current chapter.
Designing a master server becomes a simple task once the basic design constraints are known:
- Amount of data to back up
- Size of the NetBackup catalog
- Number of tape drives needed
- Number of disk drives needed
- Number of networks needed
Given the above, a simple approach to designing your master server can be outlined as follows:
- Acquire a dedicated server
- Add tape drives and controllers (for saving your backups)
- Add disk drives and controllers (for saving your backups, and for the OS and NetBackup catalog)
- Add network cards
- Add memory
- Add CPUs
In some cases, it may not be practical to design a server to back up all of your systems. You might have one or more large servers that cannot be backed up over a network within your backup window. In such cases, it is best to back up those servers by means of their own locally-attached drives or drives that use the Shared Storage Option. Although this section discusses how to design a master server, you can still use its information to properly add the necessary drives and components to your other servers. The next example shows how to configure a master server using the design elements gathered from the previous sections.
Solution:

CPUs needed for network cards = 1
CPUs needed for tape drives = 1
CPUs needed for OS = 1
Total CPUs needed = 1 + 1 + 1 = 3

Memory needed for network cards = 16 megabytes
Memory needed for tape drives = 256 megabytes
Memory needed for OS and NetBackup = 1 gigabyte
Total memory needed = 16 + 256 + 1000 = 1.272 gigabytes

The values in this solution are based on these tables:
Table 1-6 CPUs needed per master/media server component Table 1-7 Memory needed per master/media server component
Based on the above solution, your master server needs 3 CPUs and 1.272 gigabytes of memory. In addition, you need 60 gigabytes of disk space to store your NetBackup catalog, along with the necessary disks and drive controllers to install your operating system and NetBackup (2 gigabytes should be ample for most installations). This server also requires one SCSI card, or another faster adapter, for use with the tape drive (and robot arm), and a single 100BaseT card for network backups. When designing your master server, begin with a dedicated server for optimum performance. In addition, consult with your server's hardware manufacturer to ensure that the server can handle your other components. In most cases, servers have specific restrictions on the number and mixture of hardware components that can be supported concurrently. Overlooking this last detail can cripple even the best of plans.
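The example solution above reduces to simple addition; a sketch follows. The component tallies and the per-component figures are taken from the example and Tables 1-6 and 1-7; the dictionary names are illustrative only:

```python
# Per-CPU and per-component figures from the worked example
# (one network card, one tape drive, plus OS/NetBackup overhead).
CPUS_PER_COMPONENT = {"network_cards": 1, "tape_drives": 1, "os": 1}
MEMORY_MB = {"network_card": 16, "tape_drive": 256, "os_and_netbackup": 1000}

cpus = sum(CPUS_PER_COMPONENT.values())
memory_gb = sum(MEMORY_MB.values()) / 1000

print(cpus, memory_gb)  # 3 CPUs, 1.272 GB, matching the solution in the text
```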
The master server must be able to periodically communicate with all its media servers. If there are too many media servers, master server processing may be overloaded. Consider business-related requirements. For example, if an installation has different applications which require different backup windows, a single master may have to run backups continually. As a result, resources for catalog cleaning, catalog backup, or other maintenance activity may be insufficient. If you use the legacy offline catalog backup feature, the time window for the catalog backup may not be adequate.
- If possible, design your configuration with one master server per firewall domain. In addition, do not share robotic tape libraries between firewall domains.
- As a rule, the number of clients (separate physical hosts) per master server is not a critical factor for NetBackup. The backup processing that clients perform has little or no impact on the NetBackup server. Exceptions do exist: for example, if all clients have database extensions, or all clients run ALL_LOCAL_DRIVES backups at the same time, server performance may be affected.
- Plan your configuration so that it contains no single point of failure. Provide sufficient redundancy to ensure high availability of the backup process. Having more tape drives or media may reduce the number of media servers needed per master server.
- Do not plan to run more than about 10,000 jobs per 12-hour period on a single master server. See Limits to scalability on page 37.
- Consider limiting the number of media servers handled by a master to the lower end of the estimates in Table 1-5. Although a well-managed NetBackup environment can handle more media servers than the numbers listed in this table, you may find your backup operations more efficient and manageable with fewer but larger media servers. The variation in the number of media servers per master server for each scenario in the table depends on the following: number of jobs submitted, multiplexing, multi-streaming, and network capacity.
Table 1-5 provides an estimate of processor and memory requirements on a master server based on number of media servers and number of jobs the master server must support. This table is for guidance purposes only; it is based on test data and customer experiences. The amount of RAM and number of processors may need to be increased based on other site-specific factors. In this table, a processor is defined as a state-of-the-art CPU (such as a 3GHz processor for an x86 system or the equivalent on a RISC platform such as Sun SPARC). Note: For a NetBackup master server, Symantec recommends using multiple discrete processors instead of a single multi-core processor. The individual cores in a multi-core processor lack the resources to support some of the CPU intensive processes running on a master server. It is better, for example, to have two physical dual core processors (for a total of four processors) rather than a single quad core processor.
Table 1-5 Media servers per master server

Maximum number of jobs per day    Maximum number of media servers per master server
500                               5
2000                              20
5000                              50
10000                             100
Definition of a job
For the purposes of Table 1-5, a job is defined as an individual backup stream. Note that database and application backup policies, and policies that use the ALL_LOCAL_DRIVES directive, usually launch multiple backup streams (thus multiple jobs) at the same time.
Limits to scalability
Note the following limits to scalability, regardless of the size of the master server:
- The maximum rate at which jobs can be launched is about 1 job every 4 seconds. Therefore, a domain cannot run much more than 10,000 jobs in a 12-hour backup window.
- A single domain cannot support significantly more than 100 media servers and SAN media servers.
Table 1-6 CPUs needed per master/media server component

Component type      How many and what kind of component (per CPU)
Network cards       2-3 100BaseT cards
                    1 ATM card
                    1-2 Gigabit Ethernet cards with coprocessor
Tape drives         2 LTO gen 3 drives
                    2-3 SDLT 600 drives
                    2-3 LTO gen 2 drives
OS and NetBackup    1 CPU for the operating system and NetBackup

Table 1-7 Memory needed per master/media server component

Type of component    Memory per component
Network card         16 MB
LTO gen 3 drive      256 MB
SDLT 600 drive       128 MB
LTO gen 2 drive      128 MB
OS                   1 GB
NetBackup            1 or more GB
Multiplexing         8 MB * (# streams) * (# drives)
The information in the above tables is a rough estimate only, intended as a guideline for initial planning. In addition to the above media server components, you must also add the necessary disk drives to store the NetBackup catalog and your operating system. The size of the disks needed to store your catalog depends on the calculations explained earlier under Calculate the size of your NetBackup catalog on page 23.
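The multiplexing entry in Table 1-7 (8 MB per stream per drive) is the one line item that scales with configuration rather than hardware count. A sketch of that calculation follows; the function name is illustrative:

```python
def multiplex_buffer_mb(streams_per_drive, drives, mb_per_stream=8):
    """Memory for multiplexing buffers, per Table 1-7:
    8 MB * (# streams) * (# drives)."""
    return mb_per_stream * streams_per_drive * drives

# Hypothetical configuration: 4 streams multiplexed onto each of 2 drives
print(multiplex_buffer_mb(4, 2))  # 64 MB
```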
- I/O performance is generally more important than CPU performance.
- Consider CPU, I/O, and memory expandability when choosing a server.
- Consider how many CPUs are needed (see CPUs needed per master/media server component on page 37). Here are some general guidelines:
Experiments (with Sun Microsystems) have shown that a useful, conservative estimate is 5MHz of CPU capacity per 1MB/second of data movement in and out of the NetBackup media server. Keep in mind that the operating system and other applications also use the CPU. This estimate is for the power available to NetBackup itself. Example: A system backing up clients over the network to a local tape drive at the rate of 10MB/second would need 100MHz of available CPU power:
- 50MHz to move data from the network to the NetBackup server
- 50MHz to move data from the NetBackup server to tape
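The 5MHz-per-1MB/second rule of thumb above can be sketched as follows. The capacity is counted once inbound and once outbound, which is why the 10MB/second example needs 100MHz; the function name is illustrative:

```python
def cpu_mhz_needed(throughput_mb_per_sec, mhz_per_mb=5):
    """Conservative estimate: 5 MHz of CPU per 1 MB/s of data movement,
    counted separately for the inbound and outbound legs."""
    inbound = mhz_per_mb * throughput_mb_per_sec   # network to NetBackup server
    outbound = mhz_per_mb * throughput_mb_per_sec  # NetBackup server to tape
    return inbound + outbound

print(cpu_mhz_needed(10))  # 100 MHz for a 10 MB/second backup stream
```

Remember that this estimate covers only the power available to NetBackup itself; the operating system and other applications consume CPU on top of it.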
Consider how much memory is needed (see Memory needed per master/media server component on page 38). At least 512 megabytes of RAM is recommended if the server is running a Java GUI. NetBackup uses shared memory for local backups. NetBackup buffer usage will affect how much memory is needed. See the Tuning the NetBackup data transfer path chapter for more information on NetBackup buffers. Keep in mind that non-NetBackup processes need memory in addition to what NetBackup needs. A media server moves data from disk (on relevant clients) to storage (usually disk or tape). The server must be carefully sized to maximize throughput. Maximum throughput is attained when the server keeps its tape devices streaming. (For an explanation of streaming, see Tape streaming on page 138.)
Media server factors to consider for sizing include:
- Disk storage access time
- Number of SharedDisk storage servers
- Adapter (for example, SCSI) speed
- Bus (for example, PCI) speed
- Tape or disk device speed
- Network interface (for example, 100BaseT) speed
- Amount of system RAM
- Other applications, if the host is non-dedicated
The platform chosen must be able to drive all network interfaces and keep all tape devices streaming.
See the NetBackup Operations Manager Guide. Some of the considerations are the following:
- The NOM server should be configured as a fixed host with a static IP address.
- Symantec recommends that you not install the NOM server software on the same server as NetBackup master or media server software. Installing NOM on a master server may impact security and performance.
Sizing considerations
The size of your NOM server depends largely on the number of NetBackup objects that NOM manages. The NetBackup objects that determine the NOM server size are the following:
- Number of master servers to manage
- Number of policies
- Number of jobs run per day
- Number of media
Based on the above factors, the following NOM server components should be sized accordingly:
- Disk space (for the installed NOM binaries plus the NOM database, described below)
- Type and number of CPUs
- RAM
The next section describes the NOM database and how it affects disk space requirements, followed by a description of sizing guidelines for NOM.
NOM database
The Sybase database used by NOM is similar to that used by NetBackup and is installed as part of the NOM server installation.
Once you configure and add master servers to NOM, the disk space occupied by NOM depends on the volume of data initially loaded on the NOM server from the managed NetBackup servers. The initial data load on the NOM server is in turn dependent on the following data present in the managed master servers:
The rate of NOM database growth depends on the quantity of managed data. This data can be policy data, job data, or media data.
For optimal performance and scalability, Symantec recommends that you manage approximately a month of historical data. Information is available on adjusting database values for better NOM performance. See under NetBackup Operations Manager (NOM) on page 149.
Sizing guidelines
The following guidelines are presented in groups based on the number of objects that your NOM server manages. The guidelines are intended for basic planning purposes, and do not represent fixed recommendations or restrictions. It is assumed that your NOM server is a standalone host (the host is not acting as a NetBackup master server). Note: Installation of NOM server software on the NetBackup master server is not recommended.
Table 1-8 NetBackup installation categories

Category  Maximum master  Maximum jobs  Maximum number of     Total     Maximum  Maximum
          servers         per day       jobs in the database  policies  alerts   media
A         3               1000          100000                5000      1000     10000
B         10              10000         250000                50000     10000    300000
C         40              50000         500000                50000     200000   300000

Note: If your NetBackup installation is larger than those listed here (as to the number of NetBackup master servers, number of jobs per day, and so forth), the behavior of NOM is unpredictable. In this case, Symantec recommends using multiple NOM servers.

Table 1-9 lists the minimum hardware requirements and recommended settings for installing NOM 6.5 based on your NetBackup installation category (A, B, or C).

Table 1-9 Minimum hardware requirements and recommended settings for NOM 6.5

Category  Platform           CPU type                    Number    RAM   Average database  Recommended cache  Recommended heap size
                                                         of CPUs         growth per day    size for Sybase    (Web server and NOM server)
A         Windows 2000/2003  Pentium III or higher/Xeon  1         2 GB  3 MB              512 MB             512 MB
A         Solaris 8/9/10     Sun SPARC                   1         2 GB  3 MB              512 MB             512 MB
B         Windows 2000/2003  Pentium III or higher/Xeon  2         4 GB  30 MB             1 GB               1 GB
B         Solaris 8/9/10     Sun SPARC                   2         4 GB  30 MB             1 GB               1 GB
C         Windows 2000/2003  Pentium III or higher/Xeon  -         8 GB  150 MB            2 GB               2 GB
C         Solaris 8/9/10     Sun SPARC                   -         8 GB  150 MB            2 GB               2 GB
For example, if your NetBackup setup falls in installation category B (Windows environment), you must ensure that your system on which NOM is installed meets the following minimum hardware requirements:
- CPU type: Pentium III or higher / Xeon
- Number of CPUs required: 2
- RAM: 4 GB
For optimal performance, the average database growth rate on your NOM system should be 30 MB per day or lower. The recommended Sybase cache size is 1 GB. The recommended heap size for the NOM server and Web server is 1 GB.

Related topics:
- See Adjusting the NOM server heap size on page 149.
- See Adjusting the NOM web server heap size on page 150.
- Symantec recommends that you adjust the Sybase cache size after installing NOM. After you install NOM, the database size can grow rapidly as you add more master servers. See Adjusting the Sybase cache size on page 150.
Summary
Using the guidelines provided in this chapter, design a solution that can do a full backup and incremental backups of your largest system within your time window. The remainder of the backups can happen over successive days. Eventually, your site may outgrow its initial backup solution. By following these guidelines, you can add more capacity at a future date without having to redesign your basic strategy. With proper design and planning, you can create a backup strategy that will grow with your environment. As outlined in the previous sections, the number and location of the backup devices are dependent on a number of factors.
- The amount of data on the target systems
- The available backup and restore windows
- The available network bandwidth
- The speed of the backup devices
If one drive causes backup-window time conflicts, another can be added, providing an aggregate rate of two drives. The trade-off is that the second drive imposes extra CPU, memory, and I/O loads on the media server. If you find that you cannot complete backups in the allocated window, one approach is to either increase your backup window or decrease the frequency of your full and incremental backups. Another approach is to reconfigure your site to speed up overall backup performance. Before you make any such change, you should understand what determines your current backup performance. List or diagram your site network and systems configuration. Note the maximum data transfer rates for all the components of your backup configuration and compare these against the rate you must meet for your backup window. This will identify the slowest
components and, consequently, the cause of your bottlenecks. Some likely areas for bottlenecks include the networks, tape drives, client OS load, and filesystem fragmentation.
Chapter 2
- Managing NetBackup job scheduling on page 48
- Miscellaneous considerations on page 53
- NetBackup catalog strategies on page 56
- Merging, splitting, or moving servers on page 59
- Guidelines for policies on page 60
- Managing logs on page 61
- Host Properties > Master Server > Properties > Global Attributes > Maximum jobs per client (should be greater than 1).
- Policy attribute Limit jobs per policy (should be greater than 1).
- Policy schedule attribute Media multiplexing (should be greater than 1).

Check the storage unit properties:
- Is the storage unit enabled to use multiple drives (Maximum concurrent write drives)? If you want to increase this value, remember to set it to fewer than the number of drives available to this storage unit. Otherwise, restores and other non-backup activities cannot run while backups to the storage unit are running.
- Is the storage unit enabled for multiplexing (Maximum streams per drive)? You can write a maximum of 32 jobs to one tape at the same time.
For example, you can run multiple jobs to a single storage unit if you have multiple drives (Maximum concurrent write drives set to greater than 1). Or, you can set up multiplexing to a single drive if Maximum streams per drive is set to greater than 1. If both Maximum concurrent write drives and Maximum streams per drive are greater than 1, you can run multiple streams to multiple drives, assuming Maximum jobs per client is set high enough.
Tape jobs
1. The NetBackup Job Manager (nbjm) requests resources for the job from the NetBackup Resource Broker (nbrb).
2. nbrb allocates the resources and gives the resources to nbjm.
3. nbjm makes the job active.
4. nbjm starts bpbrm, which in turn starts bptm; bptm mounts the tape medium in the drive.
Shared disks
1. The NetBackup Job Manager (nbjm) requests job resources from nbrb.
2. nbrb allocates the resources and initiates the shared disk mount.
3. When the mount completes, nbrb gives the resources to nbjm.
4. nbjm makes the job active.
knows the media server is down and the NetBackup Resource Broker (nbrb) queues the request to be retried later. If a media server is configured in EMM but has been physically removed, powered off, or disconnected from the network, or if the network is down for any reason, the media and device selection logic of EMM will queue the job if no other media servers are available. The Activity Monitor should display the reason for the job queuing, such as media server is offline. Once the media server is online again in EMM, the job will start. In the meantime, if other media servers are available, the job runs on another media server. If a media server is not configured in EMM (removed from the configuration), regardless of the physical state of the media server, EMM will not select that media server for use. If no other media servers are available, the job fails. To permanently remove a media server from the system, consult the following document:
The Decommissioning a media server section in the NetBackup Administrator's Guide, Volume II.
submission of jobs is staggered, the NetBackup resource broker (nbrb) can allocate job resources more quickly. Use the following formula to calculate the number of jobs to submit at one time:

Number of jobs to submit at one time = (number of physical drives * MPX value * 1.5) + (number of shared disk volumes * 4)

This formula provides a basic guideline. Further tuning of your system may be required.
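The staggered-submission formula above can be sketched as follows. The drive count, MPX value, and shared disk volume count in the usage line are hypothetical, and the function name is illustrative:

```python
def jobs_to_submit(physical_drives, mpx, shared_disk_volumes):
    """Guideline from the text:
    (number of physical drives * MPX value * 1.5) + (shared disk volumes * 4)."""
    return int(physical_drives * mpx * 1.5 + shared_disk_volumes * 4)

# Hypothetical site: 10 drives with MPX 4, plus 5 shared disk volumes
print(jobs_to_submit(10, 4, 5))  # 80 jobs per submission batch
```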
For an explanation of the CONNECT_OPTIONS values, refer to the NetBackup Administrators Guide for UNIX and Linux, Volume II.
The NetBackup Troubleshooting Guide also provides information on network connectivity issues.
- Web-based interface for efficient, remote administration across multiple NetBackup servers from a single, centralized console.
- Policy-based alert notification, using predefined alert conditions to specify typical issues or thresholds within NetBackup.
- Operational reporting on issues such as backup performance, media utilization, and rates of job success.
- Consolidated job and job policy views per server (or group of servers), for filtering and sorting job activity.
For more information on the capabilities of NOM, refer to the NOM online help by clicking Help from the title bar of the NOM console. Or see the NetBackup Operations Manager Guide. Information is available on adjusting NOM performance. See the topics under NetBackup Operations Manager (NOM) on page 149.
On Windows, the NOexpire touch file resides in:

install_path\NetBackup\bin

To create it:

cd install_path\NetBackup\bin
echo 0 > NOexpire
See Remove the NOexpire file when it is not needed on page 53.

To prevent backups from starting:
1. Shut down bprd (NetBackup Request Manager) to suspend scheduling of new jobs by nbpem. You can use the Activity Monitor in the NetBackup Administration Console.
2. Restart bprd to resume scheduling.
Miscellaneous considerations
Consider the following issues when planning for or troubleshooting NetBackup.
Any Available
For most backup operations, the default is to let NetBackup choose the storage unit. This is done by specifying Any Available for the storage destination. Any Available may work well in small configurations that include relatively few storage units and media servers. However, Any Available is NOT recommended for the following:
- Configurations with many storage units and media servers.
- Configurations with more sophisticated disk technologies (such as SharedDisk, AdvancedDisk, PureDisk, or OpenStorage). With these newer disk technologies, Any Available causes NetBackup to analyze all options to choose the best one available.
Note: In the above cases, Symantec recommends NOT using Any Available. In general, if the configuration includes many storage units, many volumes within many disk pools, and many media servers, the deep analysis that Any Available requires can delay job initiation when many jobs (backup or duplication) are requested during busy periods of the day. It is much better to specify a particular storage unit, or to narrow NetBackup's search by using storage unit groups (depending on how storage units and groups are defined). For more details on Any Available, see the NetBackup Administrator's Guide, Volume I. In addition, note the following about Any Available:
- When Any Available is specified, NetBackup operates in prioritized mode, as described in the next section. In other words, NetBackup selects the first available storage unit in the order in which they were originally defined.
- Do not specify Any Available if you are creating multiple copies (Inline Copy) from a backup or from any method of duplication. The methods of duplication include Vault, staging disk storage units, lifecycle policies, and manual duplication through the Administration Console or command line. Instead, specify a particular storage unit.
- Prioritized: choose the first storage unit in the list that is not busy, down, or out of media.
- Failover: choose the first storage unit in the list that is not down or out of media.
- Round robin: choose the least recently selected storage unit in the list.
- Media server load balancing: NetBackup avoids sending jobs to busy media servers. This option is not available for storage unit groups that contain a BasicDisk storage unit.
For the above options, NetBackup gives preference to a storage unit that a local media server can access. For more information on the above options, see the NetBackup online help for storage unit groups, or the NetBackup Administrator's Guide, Volume I. In addition, note the following about storage unit groups: the more simply and narrowly your storage units and storage unit groups are defined, the less time NetBackup takes to select a resource to start a job. In complex environments with large numbers of jobs required, the following are good choices:
- Fewer storage units per storage unit group.
- Fewer media servers per storage unit. In the storage unit definition, avoid using Any Available media server when drives are shared among multiple media servers.
- Fewer disk volumes in a disk pool.
- Fewer concurrent jobs.
Disk staging
With disk staging, images can be created on disk initially, then copied later to another media type (as determined in the disk staging schedule). The media type for the final destination is typically tape, but could be disk. This two-stage process leverages the advantages of disk-based backups in the near term, while preserving the advantages of tape-based backups for long term. Note that disk staging can be used to increase backup speed. For more information, refer to the NetBackup Administrators Guide, Volume I.
- Image database: contains information about what has been backed up. It is by far the largest part of the catalog.
- NetBackup data stored in relational databases: includes the media and volume data describing media usage and volume information, which is used during backups.
- NetBackup configuration files: policy, schedule, and other flat files used by NetBackup.
For more information on the catalog, refer to Catalog Maintenance and Performance Optimization in the NetBackup Administrator's Guide Volume 1. The NetBackup catalogs on the master server tend to grow large over time and eventually fail to fit on a single tape. Here is the layout of the first few directory levels of the NetBackup catalogs on the master server:
Figure 2-1: Layout of the first few directory levels of the NetBackup catalogs on the master server. The figure (not reproduced here) shows, under the installation directory: /Netbackup/db and /db/data (relational database files such as NBDB.db, EMM_DATA.db, EMM_INDEX.db, NBDB.log, BMRDB.db, BMRDB.log, BMR_DATA.db, BMR_INDEX.db, and vxdbms.conf), /var and /var/global (license key and authentication information, server.conf, databases.conf), /Netbackup/vault, and the image database with its configuration directories (/class, /class_template, /client, /config, /error, /images, /jobs, /media, /vault).
NetBackup catalog pathnames (cold catalog backups)

When defining the file list, use absolute pathnames for the locations of the NetBackup and Media Manager catalog paths, and include the server name in the path, in case the media server performing the backup is changed.
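As an illustration, a cold catalog backup file list with absolute, server-qualified paths might look like the following (the server name master1 is a placeholder, and the default install paths are assumptions; substitute your own):

```
master1:/usr/openv/netbackup/db
master1:/usr/openv/volmgr/database
master1:/usr/openv/var
```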
Guidelines for managing the catalog

Back up the catalog using the online, hot catalog backup
This type of catalog backup is for highly active NetBackup environments in which continual backup activity occurs. It is considered an online, hot method because it can be performed while regular backup activity takes place. This type of catalog backup is policy-based and can span more than one tape. It also allows for incremental backups, which can significantly reduce catalog backup times for large catalogs.

Store the catalog on a separate file system (UNIX systems only)
The NetBackup catalog can grow quickly depending on backup frequency, retention periods, and the number of files being backed up. Storing the NetBackup catalog data on its own file system ensures that other disk resources, root file systems, and the operating system are not impacted by catalog growth. Information is available on how to move the catalog (UNIX systems only); see Catalog compression on page 59.

Change the location of the NetBackup relational database files
The location of the NetBackup relational database files can be changed, or the files can be split into multiple directories, for better performance. For example, placing the transaction log file, NBDB.log, on a physically separate drive gives better protection against disk failure and increases the efficiency of writes to the log file. Refer to the procedure in the section Moving NBDB database files after installation in the NetBackup relational database appendix of the NetBackup Administrator's Guide, Volume I.

Set a delay to compress the catalog
The default value for this parameter is 0, which means that NetBackup does not compress the catalog. As your catalog increases in size, you may want to set this parameter to a value between 10 and 30 days. When you restore old backups, which requires reading catalog files that have been compressed, NetBackup automatically uncompresses the files as needed, with minimal performance impact.
For information on how to compress the catalog, refer to Catalog compression on page 59.
Catalog backup not finishing in the available window

Use catalog archiving. Catalog archiving reduces the size of online catalog data by relocating the large catalog .f files to secondary storage. NetBackup administration still requires regularly scheduled catalog backups, but without the large amount of online catalog data, the backups are faster.
Split the NetBackup domain to form two or more smaller domains, each with its own master server and media servers. Off-load some policies, clients, and backup images from the current master server to a master in one of the new domains, so that each master has a window large enough to allow its catalog backup to finish. Because a media server can be connected to only one master server, additional media servers may be needed. For assistance in splitting the current NetBackup domain, contact Symantec Consulting.

If the domain includes any NetBackup 5.x media servers (or has included any since the master server was upgraded to NetBackup 6.x), make sure that these media servers still exist and are online and reachable from the master server. Unreachable 5.x media servers can cause the image cleaning operation to time out repeatedly, slowing down the backup. Upgrade all media servers to NetBackup 6.x as soon as possible after upgrading the master server, to avoid such issues.
Catalog compression
When the NetBackup image catalog becomes too large for the available disk space, there are two ways to manage this situation:
- Compress the image catalog
- Move the image catalog (UNIX systems only)
For details, refer to Moving the image catalog and Compressing and uncompressing the Image Catalog in the NetBackup Administrators Guide, Volume I. Note that NetBackup compresses images after each backup session, regardless of whether or not any backups were successful. This happens right before the execution of the session_notify script and the backup of the catalog. The actual backup session is extended until compression is complete.
Merging, splitting, or moving servers

Centralized management, reporting, and maintenance are the benefits of working in a centralized NetBackup environment. Once a master server has been established, it is possible to merge its databases with another master server, giving control over its set of server backups to the new master server. Conversely, if the backup load on a master server has grown to the point where backups are not finishing in the backup window, it may be desirable to split that master server into two master servers.

It is possible to merge or split NetBackup master servers or EMM servers. It is also possible to convert a media server to a master server, or a master server to a media server. However, the procedures to accomplish this are complex and require detailed knowledge of NetBackup database interactions. Merging or splitting NetBackup and EMM databases to another server is not recommended without involving a Symantec consultant to determine the changes needed, based on your specific configuration and requirements.
Guidelines for policies

Include and exclude lists
- Do not use excessive wildcards in file lists. When wildcards are used, NetBackup compares every file name against the wildcards, which decreases NetBackup performance. Instead of placing /tmp/* (UNIX) or C:\Temp\* (Windows) in an include or exclude list, use /tmp/ or C:\Temp.
- Use exclude lists to exclude large, unneeded files. Reduce the size of your backups by using exclude lists for the files your installation does not need to preserve. For instance, you may decide to exclude temporary files. Use absolute paths for your exclude list entries, so that valuable files are not inadvertently excluded. Before adding files to the exclude list, confirm with the affected users that their files can be safely excluded. Should disaster (or user error) strike, not being able to recover files costs much more than backing up extra data.
- When a policy specifies that all local drives be backed up (ALL_LOCAL_DRIVES), nbpem initiates a parent job (nbgenjob) that connects to the client and runs bpmount -i to get a list of mount points. Then nbpem initiates a job with its own unique job identification number for each mount point. For each job, the client bpbkar starts a stream. Only then does NetBackup read the exclude list. When the entire job is excluded, bpbkar exits with status 0, reporting that it sent 0 of 0 files to back up. The resulting image files are treated just like any other successful backup's image files: they expire in the normal fashion when the expiration date in the image header files specifies they are to expire.
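For example, a UNIX client exclude list built along these lines might contain entries such as the following (illustrative paths only; the file location /usr/openv/netbackup/exclude_list is the conventional default, and each entry should be confirmed with the affected users first):

```
/tmp/
/var/tmp/
/usr/openv/netbackup/logs/
```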
Critical policies
For online, hot catalog backups, make sure to identify those policies that are crucial to recovering your site in the event of a disaster. For more information on hot catalog backup and critical policies, refer to the NetBackup Administrators Guide, Volume I.
Schedule frequency
To minimize the number of times you back up files that have not changed, and to minimize your consumption of bandwidth, media, and other resources, consider limiting the frequency of your full backups to monthly or even quarterly, followed by weekly cumulative incremental backups and daily incremental backups.
Managing logs
Optimizing the performance of vxlogview
As explained in the NetBackup Troubleshooting Guide, the vxlogview command is used for viewing logs created by unified logging (VxUL). The vxlogview command delivers optimum performance when a file ID is specified in the query. For example, when viewing messages logged by the NetBackup Resource Broker (nbrb) for a given day, you can filter out the library messages while viewing the nbrb logs. To achieve this, run vxlogview as follows:

vxlogview -o nbrb -i nbrb -n 0
Note that -i nbrb specifies the file ID for nbrb. Specifying the file ID improves the performance, because the search is confined to a smaller set of files. More vxlogview examples are available in the NetBackup Troubleshooting Guide.
Interpreting legacy error logs

For a description of legacy logging and unified logging (VxUL), refer to the NetBackup Troubleshooting Guide. Here is a sample message from an error log:
1021419793 1 2 4 nabob 0 0 0 *NULL* bpjobd TERMINATED bpjobd
The fields in this message are delimited by blanks. Their meanings are defined in Table 2-1. Table 2-2 lists the values for the message type, which is the third field in the log message.

Table 2-1    Fields of the log message

Field  Value (example)      Meaning
1      1021419793           Time of the error (seconds since 1970)
2      1                    Error database entry version
3      2                    Type of message
4      4                    Severity of error:
                            1 = Unknown, 2 = Debug, 4 = Informational,
                            8 = Warning, 16 = Error, 32 = Critical
5      nabob                Server on which the error was reported
6      0                    Job ID (included if pertinent to the log entry)
7      0                    (optional entry)
8      0                    (optional entry)
9      *NULL*               Client on which the error occurred, if applicable; otherwise *NULL*
10     bpjobd               Process that generated the error message
11     TERMINATED bpjobd    Text of the error message
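As a quick illustration, the sample message above can be decoded with standard UNIX tools. This is only a sketch (the use of awk and GNU date -d is an assumption, not part of NetBackup):

```shell
# Decode selected fields of the sample legacy error-log line, using the
# field layout from Table 2-1 (fields are delimited by blanks).
line='1021419793 1 2 4 nabob 0 0 0 *NULL* bpjobd TERMINATED bpjobd'

epoch=$(echo "$line"    | awk '{print $1}')   # field 1: seconds since 1970
severity=$(echo "$line" | awk '{print $4}')   # field 4: severity code
server=$(echo "$line"   | awk '{print $5}')   # field 5: reporting server
process=$(echo "$line"  | awk '{print $10}')  # field 10: generating process

when=$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC')   # GNU date
echo "$when severity=$severity server=$server process=$process"
```

Here a severity of 4 maps to Informational in the table above.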
Chapter

Media server configuration guidelines

This chapter includes the following sections:
- Network and SCSI/FC bus bandwidth on page 66
- How to change the threshold for media errors on page 66
- How to reload the st driver without rebooting Solaris on page 69
- Media Manager drive selection on page 70
Windows: install_path\NetBackup\bin\goodies\available_media
How to change the threshold for media errors

The NetBackup Media List report may show that some media is frozen and therefore cannot be used for backups. One of the reasons NetBackup freezes media is recurring I/O errors. The NetBackup Troubleshooting Guide describes the recommended approaches for dealing with this issue, for example, under NetBackup error code 96. It is also possible to configure the NetBackup error threshold value, as described in this section. Each time a read, write, or position error occurs, NetBackup records the time, media ID, type of error, and drive index in the EMM database. NetBackup then scans to see whether that media has had m of the same errors within the past n hours. The variable m is a tunable parameter known as media_error_threshold; its default value is 2 errors. The variable n is known as time_window; its default value is 12 hours. If a tape volume has more than media_error_threshold errors, NetBackup takes the appropriate action:
If the volume has not been previously assigned for backups, NetBackup:
- sets the volume status to FROZEN
- selects a different volume
- logs an error
If the volume is in the NetBackup media catalog and has been previously selected for backups, NetBackup:
- sets the volume to SUSPENDED
- aborts the current backup
- logs an error
Adjusting media_error_threshold
To configure the NetBackup media error thresholds, use the nbemmcmd command on the media server as follows. NetBackup freezes a tape volume or downs a drive for which these values are exceeded. For more detail on the nbemmcmd command, refer to the man page or to the NetBackup Commands Guide. UNIX
/usr/openv/netbackup/bin/admincmd/nbemmcmd -changesetting -time_window unsigned integer -machinename string -media_error_threshold unsigned integer -drive_error_threshold unsigned integer
Windows
<install_path>\NetBackup\bin\admincmd\nbemmcmd.exe -changesetting -time_window unsigned integer -machinename string -media_error_threshold unsigned integer -drive_error_threshold unsigned integer
For example, if the -drive_error_threshold is set to the default value of 2, the drive is downed after 3 errors in 12 hours. If the -drive_error_threshold is set to a value of 6, it would take 7 errors in the same 12 hour period before the drive would be downed.
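For instance, a hypothetical invocation that allows up to 5 media errors and 5 drive errors within a 24-hour window on a media server named mserver1 might look like this (the values and host name are examples only; verify the options against the NetBackup Commands Guide for your release):

```
/usr/openv/netbackup/bin/admincmd/nbemmcmd -changesetting \
    -machinename mserver1 \
    -time_window 24 \
    -media_error_threshold 5 \
    -drive_error_threshold 5
```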
Note: The following description has nothing to do with the number of times NetBackup retries a backup/restore that fails. That situation is controlled by the global configuration parameter Backup Tries for backups and the bp.conf entry RESTORE_RETRIES for restores. This algorithm merely deals with whether I/O errors on tape should cause media to be frozen or drives to be downed. When a read/write/position error occurs on tape, the error returned by the operating system does not distinguish between whether the error is caused by the tape or the drive. To prevent the failure of all backups in a given timeframe, bptm tries to identify a bad tape volume or drive based on past history, using the following logic:
Each time an I/O error occurs on a read/write/position, bptm logs the error in the file /usr/openv/netbackup/db/media/errors (UNIX) or install_path\NetBackup\db\media\errors (Windows). The error message includes the time of the error, media ID, drive index and type of error. Examples of the entries in this file are these:
07/21/96 04:15:17 A00167 4 WRITE_ERROR
07/26/96 12:37:47 A00168 4 READ_ERROR
Each time an entry is made, the past entries are scanned to determine if the same media ID and/or drive has had this type of error in the past n hours. n is known as the time_window. The default time window is 12 hours. When performing the history search for the time_window entries, EMM notes past errors that match the media ID, the drive, or both the drive and the media ID. The purpose of this is to determine the cause of the error. For example, if a given media ID gets write errors on more than one drive, it is assumed that the tape volume is bad and NetBackup freezes the volume. If more than one media ID gets a particular error on the same drive, it is assumed the drive is bad and the drive goes to a down state. If only past errors are found on the same drive with the same media ID, then EMM assumes that the volume is bad and freezes it. Freezing or downing does not occur on the first error. There are two other parameters, media_error_threshold and drive_error_threshold. The default value of both of these parameters is 2. For a freeze or down to happen, more than the threshold number of errors must occur (by default, at least three errors must occur) in the time window for the same drive/media ID.
Note: If either media_error_threshold or drive_error_threshold is 0, freezing or downing occurs the first time any I/O error occurs. media_error_threshold is evaluated first, so if both values are 0, freezing overrides downing. Setting these values to 0 is not recommended. Changing the default values is not recommended unless there is a good reason to do so. One possible change would be to set very large threshold values, effectively disabling the mechanism so that tapes are never frozen and drives are never downed. Freezing and downing are primarily intended to benefit backups. If read errors occur on a restore, freezing media has little effect: NetBackup still accesses the tape to perform the restore. In the restore case, downing a bad drive may help.
Use devfsadm to recreate the device nodes in /devices and the device links in /dev for tape devices, by running any one (not all) of the following commands:

/usr/sbin/devfsadm -i st
/usr/sbin/devfsadm -c tape
/usr/sbin/devfsadm -C -c tape (use this command to enforce cleanup if dangling logical links are present in /dev)
Chapter
Dedicated or shared backup environment on page 72 Pooling on page 72 Disk versus tape on page 72
Pooling
Here are some useful conventions for media pools (formerly known as volume pools):
- Configure a scratch pool for management of scratch tapes. If a scratch pool exists, EMM can move volumes from that pool to other pools that do not have volumes available.
- Use the available_media script in the goodies directory. You can put the available_media report into a script which redirects the report output to a file and emails the file to the administrators daily or weekly. This helps track which tapes are full, frozen, suspended, and so on. By means of a script, you can also filter the output of the available_media report to generate custom reports. To monitor media, you can also use the NetBackup Operations Manager (NOM). For instance, NOM can be configured to issue an alert if there are fewer than X number of media available, or if more than X% of the media is frozen or suspended.
- Use the none pool for cleaning tapes.
- Do not create more pools than you need. In most cases, you need only 6 to 8 pools, including a global scratch pool, catalog backup pool, and the default pools created by the installation. The existence of too many pools causes the library capacity to become fragmented across the pools. Consequently, the library becomes filled with many tapes that are partially full.
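As a sketch of such a wrapper (the report path, schedule, recipient address, and the mailx command are assumptions; adapt them to your site and run the script from cron):

```
#!/bin/sh
# Capture the available_media report, then mail it to the backup team.
REPORT=/tmp/available_media.$(date +%Y%m%d)
/usr/openv/netbackup/bin/goodies/available_media > "$REPORT"
mailx -s "NetBackup available_media report $(date +%F)" \
      backup-admins@example.com < "$REPORT"
```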
Disk versus tape

Disk-based storage can be useful if you have many incremental backups and the percentage of data change is small. If the volume of data in incremental copies is insufficient to ensure efficient writing to tape, consider disk storage. After writing the data to disk, you can use staging or storage lifecycle policies to copy batches of the images to tape. This arrangement can produce faster backups and prevent wear and tear on your tape drives. Here are some factors to consider when choosing whether to back up a given dataset to disk or tape:
- Short or long retention period: disk is well suited for short retention periods; tape is better suited for longer retention periods.
- Intermediate (staging) or long-term storage: disk is suited for staging; tape for long-term storage.
- Incremental or full backup: disk is better suited to low-volume incremental backups.
- Synthetic backups: synthetic full backups are faster when incrementals are stored on disk.
- Data recovery time: restore from disk is usually faster than from tape.
- Multi-streamed restore: will a restore of the data need to be multi-streamed from tape? If so, do not stage the multi-streamed backup to disk prior to writing it to tape.
- Speed of the backups: if client backups are too slow to keep the tape moving, send the backups to disk. Later, staging or lifecycle policies can move the backup images to tape.
- Size of the backups: if the backups are too small to keep the tape moving (such as incrementals, or frequent backups of small database log files), send the backups to disk. Staging or lifecycle policies can later move the backup images to tape.
Here are some benefits of backing up to disk rather than tape:

No need to multiplex
Backups to disk do not need to be multiplexed. Multiplexing is important with tape because it creates a steady flow of data that keeps the tape moving efficiently (called tape streaming). However, multiplexing to tape slows down a subsequent restore. More information is available on tape streaming; see Tape streaming on page 138.

Instant access to data
Most tape drives have a time to data of close to two minutes: time is required to move the tape from its slot, load it into the drive, and seek to an appropriate place on the tape. Disk has an effective time to data of zero seconds. Restoring a large file system from 30 different tapes could add almost two hours to the restore: a two-minute delay per tape for load and seek, and a possible two-minute delay per tape for rewind and unload.
Fewer full backups
With tape-based systems, full backups must be done regularly because of the time-to-data issue described above. Otherwise, a restore from many incrementals requires many tapes, which significantly increases both the time to restore and the chance that a single tape will cause the restore to fail.
Chapter
Best practices
This chapter describes an assortment of best practices, and includes the following sections:
- Best practices: SAN Client on page 76
- Best practices: Flexible Disk Option on page 77
- Best practices: new tape drive technologies on page 77
- Best practices: tape drive cleaning on page 78
- Best practices: recoverability on page 80
- Best practices: naming conventions on page 82
- Best practices: duplication on page 83
Best practices: SAN Client

A good candidate for a SAN Client is a client that:
- Contains critical data that requires high bandwidth for backups
- Is not a candidate for converting to a media server
A SAN Client performs a fast backup over a Fibre Channel SAN to a media server. Media servers that have been enabled for SAN Client backups are called Fibre Transport Media Servers.

Note: The FlashBackup option should not be used with SAN Clients. Restores of FlashBackup backups always use the LAN path rather than the SAN path from the media server to the client. The LAN may not be fast enough to handle a full volume (raw partition) restore of a FlashBackup backup.

A Symantec technote contains information on SAN Client performance parameters and best practices: http://entsupport.symantec.com/docs/293110. The main point of the technote is that effective use of a SAN Client depends on the proper configuration of the correct hardware. (Refer to the technote for in-depth details.) In brief, the technote contains the following information:
- A list of the HBAs that NetBackup 6.5 supports. Further information is available on supported hardware; see the NetBackup 6.5 hardware compatibility list: http://entsupport.symantec.com/docs/284599
- Tips on how to deploy a Fibre Channel SAN between the SAN Client and the Fibre Transport Media Server.
- A list of supported operating systems and HBAs for the Fibre Transport Media Server, plus a list of tuning parameters that affect the media server performance.
- A similar list of supported operating systems and tuning parameters for the SAN Client.
- A description of recommended architectures as a set of best practices. In brief:
  - To make best use of high-speed disk storage, or to manage a pool of SAN Clients for maximum backup speed, dedicate Fibre Transport Media Servers to SAN Client backups only; do not share those media servers for LAN-based backups.
  - For a more cost-effective use of media servers, Fibre Transport Media Servers can be shared for both SAN Client and LAN-based backups.
  - Use multi-streaming for higher data transfer speed from the SAN Client to the Fibre Transport Media Server. Multi-streaming can be combined with multiple HBA ports on the Fibre Transport Media Server. The maximum number of simultaneous connections to the Fibre Transport Media Server is 16.
Further information is available on SAN Client. See the NetBackup SAN Client and Fibre Transport Troubleshooting Guide: http://entsupport.symantec.com/docs/288437
Frequency-based cleaning
NetBackup does frequency-based cleaning by tracking the number of hours a drive has been in use. When this time reaches a configurable parameter, NetBackup creates a job that mounts and exercises a cleaning tape. This cleans the drive in a preventive fashion. The advantage of this method is that typically there are no drives unavailable awaiting cleaning. There is also no limitation on platform or robot type. On the downside, cleaning is done more often than necessary. This adds system wear and consumes time that could be used to write to the drive. Another limitation is that this method is hard to tune. When new tapes are used, drive cleaning is needed less frequently; the need for cleaning increases as the tape inventory ages. This increases the amount of tuning administration needed and, consequently, the margin of error.
How TapeAlert works

To understand NetBackup's behavior with drive-cleaning TapeAlerts, it is important to understand the TapeAlert interface to a drive. The TapeAlert interface to a tape drive is via the SCSI bus, based on a Log Sense page that contains 64 alert flags. The conditions that cause a flag to be set and cleared are device-specific and are determined by the device vendor. The Log Sense page is configured via a Mode Select page. The Mode Sense/Select configuration of the TapeAlert interface is compatible with the SMART diagnostic standard for disk drives.

NetBackup reads the TapeAlert Log Sense page at the beginning and end of a write/read job. TapeAlert flags 20 to 25 are used for cleaning management, although some drive vendors' implementations vary. NetBackup uses TapeAlert flag 20 (Clean Now) and TapeAlert flag 21 (Clean Periodic) to determine when it needs to clean a drive.

When NetBackup selects a drive for a backup, bptm reviews the Log Sense page for status. If one of the clean flags is set, the drive is cleaned before the job starts. If a backup is in progress and one of the clean flags is set, the flag is not read until a tape is dismounted from the drive. If a job spans media and one of the clean flags is set during the first tape, the cleaning light comes on and the drive is cleaned before the second piece of media is mounted in the drive.

The implication is that the present job concludes its ongoing write despite a TapeAlert Clean Now or Clean Periodic message. That is, the TapeAlert does not require the loss of what has been written to tape so far. This is true regardless of the number of NetBackup jobs involved in writing out the rest of the media. Note that the behavior described here may change in the future.

If a large number of media become FROZEN as a result of having implemented TapeAlert, there is a strong likelihood of underlying media or tape drive issues.
Disabling TapeAlert

To disable TapeAlert, create a touch file called NO_TAPEALERT:

UNIX: /usr/openv/volmgr/database/NO_TAPEALERT
Windows: install_path\volmgr\database\NO_TAPEALERT
Robotic cleaning
Robotic cleaning is reactive rather than proactive, and is not subject to the limitations detailed above. Because it is reactive, unnecessary cleanings are eliminated, frequency tuning is not an issue, and the drive can spend more time moving data rather than in maintenance operations. EMM does not support library-based cleaning for most robots, because robotic library and operating system vendors have implemented this type of cleaning in many different ways.
Best practices: recoverability

The following publications may be helpful:
- The Resilient Enterprise: Recovering Information Services from Disasters, by Symantec and industry authors, published by Symantec Software Corporation.
- Blueprints for High Availability: Designing Resilient Distributed Systems, by Evan Marcus and Hal Stern, published by John Wiley and Sons.
- Implementing Backup and Recovery: The Readiness Guide for the Enterprise, by David B. Little and David A. Chapa, published by Wiley Technology Publishing.
Always use a regularly scheduled hot catalog backup
Refer to Catalog Recovery from an Online Backup in the NetBackup Troubleshooting Guide.

Review the disaster recovery plan often
Review your site-specific recovery procedures and verify that they are accurate and up-to-date. Also verify that the more complex systems, such as the NetBackup master and media servers, have current procedures for rebuilding the machines with the latest software.

Perform test recoveries on a regular basis
Implement a plan to perform restores of various systems to alternate locations. This plan should include selecting random production backups and restoring the data to a non-production system. A checksum can then be performed on one or more of the restored files and compared to the actual production data. Be sure to include offsite storage as part of this testing. The end-user or application administrator can also be involved in determining the integrity of the restored data.

Support NetBackup recoverability:
- Back up the NetBackup catalog to two tapes. The catalog contains information vital for NetBackup recovery. Its loss could result in hours or days of recovery time through manual processes. The cost of a single tape is a small price to pay for the added insurance of rapid recovery in the event of an emergency.
- Back up the catalog after each backup. If a hot catalog backup is used, an incremental catalog backup can be done after each backup session. Extremely busy backup environments should also use a scheduled hot catalog backup, since their backup sessions end infrequently. In the event of a catastrophic failure, the recovery of images is slowed by not having all images available. If a manual backup occurs just before the master server or the drive that contains the backed-up files crashes, the manual backup must be imported to recover the most recent version of the files.
- Record the IDs of catalog backup tapes. Record the catalog tapes in the site run book or another public location to ensure rapid identification in the event of an emergency. If the catalog tapes are not identified ahead of time, a significant amount of time may be lost by scanning every tape in a library to find them. The vmphyinv utility can be used to mount all tapes in a robotic library and identify the catalog tape(s); it identifies cold catalog tapes.
- Designate label prefixes for catalog backups. Make it easy to identify the NetBackup catalog data in times of emergency. Label the catalog tapes with a unique prefix, such as DB, on the tape barcodes, so your operators can find the catalog tapes without delay.
- Place NetBackup catalogs in specific robot slots. Place a catalog backup tape in the first or last slot of a robot to more easily identify the tape in an emergency. This also allows for easy tape movement if manual tape handling is necessary.
- Put the NetBackup catalog on different online storage than the data being backed up. In the case of a site storage disaster, the catalogs of the backed-up data should not reside on the same disks as production data. The reason is straightforward: if a disk drive that loses production data also loses the catalog of that data, downtime increases.
- Regularly confirm the integrity of the NetBackup catalog. On a regular basis, such as quarterly or after major operations or personnel changes, walk through the process of recovering a catalog from tape. This essential part of NetBackup administration can save hours in the event of a catastrophe.
Best practices: naming conventions

Policy names
One good naming convention for policies is platform_datatype_server(s).

Example 1: w2k_filesystems_trundle
This policy name designates a policy for a single Windows server doing file system backups.

Example 2: w2k_sql_servers
This policy name designates a policy for backing up a set of Windows 2000 SQL servers. Several servers may be backed up by this policy. Servers that are candidates for being included in a single policy are those running the same operating system and with the same backup requirements. Grouping servers within a single policy reduces the number of policies and eases the management of NetBackup.
Schedule names
Create a generic scheme for schedule naming. One recommended set of schedule names is daily, weekly, and monthly. Another recommended set of names is incremental, cumulative, and full. This convention keeps the management of NetBackup at a minimum. It also helps with the implementation of Vault, if your site uses Vault.
Best practices: duplication

When duplicating an image, specify a volume pool (in the Setup Duplication Variables dialog of the NetBackup Administration Console) that is different from the volume pool of the original image. If multiple duplication jobs will be active at the same time (such as duplication jobs started by Vault), specify a different storage unit or volume pool for each job. Using different destinations may prevent media swapping among the duplication jobs.
Section II
Performance tuning
Section II explains how to measure your current NetBackup performance, and gives general recommendations and examples for tuning NetBackup. Section II includes these chapters:
Measuring performance
Tuning the NetBackup data transfer path
Tuning other NetBackup components
Tuning disk I/O performance
OS-related tuning factors
Additional resources
Chapter
Measuring performance
This chapter provides suggestions for measuring NetBackup performance. This chapter includes the following sections:
Overview on page 86
Controlling system variables for consistent testing conditions on page 86
Evaluating performance on page 89
Evaluating UNIX system components on page 94
Evaluating Windows system components on page 95
Overview
The final measure of NetBackup performance is the length of time required for backup operations to complete (usually known as the backup window), or the length of time required for a critical restore operation to complete. However, measuring and improving performance calls for metrics that are more reliable and reproducible than simple wall clock time. This chapter discusses these metrics in more detail. After establishing accurate metrics as described here, you can measure the current performance of NetBackup and your system components to compile a baseline performance benchmark. With a baseline, you can apply changes in a controlled way. By measuring performance after each change, you can accurately measure the effect of each change on NetBackup performance.
Server variables
It is important to eliminate all other NetBackup activity from your environment when you are measuring the performance of a particular NetBackup operation. One area to consider is the automatic scheduling of backup jobs by the NetBackup scheduler. When policies are created, they are usually set up to allow the NetBackup scheduler to initiate the backups. The NetBackup scheduler will initiate backups based on the traditional NetBackup frequency-based scheduling or on certain days of the week, month, or other time interval. This process is called calendar-based scheduling. As part of the backup policy definition, the Start Window is used to indicate when the NetBackup scheduler can start backups using either frequency-based or calendar-based scheduling. When you perform backups for the purpose of performance testing, this setup might interfere since the NetBackup scheduler may initiate backups unexpectedly, especially if the operations you intend to measure run for an extended period of time.
The simplest way to prevent the NetBackup scheduler from running backup jobs during your performance testing is to create a new policy specifically for use in performance testing and to leave the Start Window field blank in the schedule definition for that policy. This prevents the NetBackup scheduler from initiating any backups automatically for that policy. After creating the policy, you can run the backup on demand by using the Manual Backup command from the NetBackup Administration Console.

To prevent the NetBackup scheduler from running backup jobs unrelated to the performance test, you may want to set all other backup policies to inactive by using the Deactivate command from the NetBackup Administration Console. Of course, you must reactivate the policies to start running backups again.

You can use a user-directed backup to run the performance test as well. However, using the Manual Backup option for a policy is preferred. With a manual backup, the policy contains the entire definition of the backup job, including the clients and files that are part of the performance test. Running the backup manually, straight from the policy, means there is no doubt which policy will be used for the backup, and makes it easier to change and test individual backup settings from the policy dialog.

Before you start the performance test, check the Activity Monitor to make sure there is no NetBackup processing currently in progress. Similarly, check the Activity Monitor after the performance test for unexpected activity (such as an unanticipated restore job) that may have occurred during the test. Additionally, check for non-NetBackup activity on the server during the performance test and try to reduce or eliminate it.

Note: By default, NetBackup logging is set to a minimum level. To gather more logging information, set the legacy and unified logging levels higher and create the appropriate legacy logging directories.
For details on how to use NetBackup logging, refer to the logging chapter of the NetBackup Troubleshooting Guide. Keep in mind that higher logging levels consume more disk space.
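Creating the legacy logging directories mentioned above amounts to a few mkdir calls. The sketch below uses /tmp for illustration; the real locations are /usr/openv/netbackup/logs on UNIX or install_path\NetBackup\logs on Windows, and the set of process directories you need depends on what you are debugging.

```shell
# Sketch: create legacy debug log directories for a few NetBackup
# processes. /tmp/openv is used here purely for illustration; the real
# UNIX path is /usr/openv/netbackup/logs.
LOGROOT=/tmp/openv/netbackup/logs
for proc in bpbkar bptm bpdm; do
  mkdir -p "$LOGROOT/$proc"
done
ls "$LOGROOT"
```

Each process writes its log only if its directory exists, which is why creating them is an explicit step.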
Network variables
Network performance is key to achieving optimum performance with NetBackup. Ideally, you would use a completely separate network for performance testing to avoid the possibility of skewing the results by encountering unrelated network activity during the course of the test. In many cases, a separate network is not available. Ensure that non-NetBackup activity is kept to an absolute minimum during the time you are evaluating performance. If possible, schedule testing for times when backups are not active. Even occasional short bursts of network activity may be enough to skew
the results during portions of the performance test. If you are sharing the network with production backups occurring for other systems, you must account for this activity during the performance test. Another network variable you must consider is host name resolution. NetBackup depends heavily upon a timely resolution of host names to operate correctly. If you have any delays in host name resolution, including reverse name lookup to identify a server name from an incoming connection from a certain IP address, you may want to eliminate that delay by using the HOSTS (Windows) or /etc/hosts (UNIX) file for host name resolution on systems involved in your performance test environment.
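A static hosts entry for each system in the test environment is a simple way to take DNS latency out of the picture. The entries below are illustrative (made-up addresses and names, written to a scratch file); the real file is /etc/hosts on UNIX or the HOSTS file under %SystemRoot%\system32\drivers\etc on Windows.

```shell
# Sketch: static name resolution entries sidestep DNS delays (including
# reverse lookups) during performance testing. Names and addresses here
# are hypothetical; the real file is /etc/hosts on UNIX.
cat > /tmp/hosts.sample <<'EOF'
10.1.1.10   mediaserver1   mediaserver1.example.com
10.1.1.20   client1        client1.example.com
EOF
grep -w client1 /tmp/hosts.sample
```

Include both forward entries for every server and client in the test, so that connections in either direction resolve locally.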
Client variables
Make sure the client system is in a relatively quiescent state during performance testing. A lot of activity, especially disk-intensive activity such as virus scanning on Windows, will limit the data transfer rate and skew the results of your tests. One possible mistake is to allow another NetBackup server, such as a production server, to have access to the client during the course of the test. NetBackup may attempt to back up the same client to two different servers at the same time. The results of a performance test that is in progress at the same time could be severely affected. Different file systems have different performance characteristics. For example, comparing data throughput results from operations on a UNIX VxFS or Windows FAT file system to those from operations on a UNIX NFS or Windows NTFS system may not be valid, even if the systems are otherwise identical. If you do need to make such a comparison, factor the difference between the file systems into your performance evaluation testing, and into any conclusions you may draw from that testing.
Data variables
Monitoring the data you are backing up improves the repeatability of performance testing. If possible, move the data you will use for testing backups to its own drive or logical partition (not a mirrored drive), and defragment the drive before you begin performance testing. For testing restores, start with an empty disk drive or a recently defragmented disk drive with ample empty space. This will reduce the impact of disk fragmentation on the NetBackup performance test and yield more consistent results between tests. Similarly, for testing backups to tape, always start each test run with an empty piece of media. You can do this by expiring existing images for that piece of media through the Catalog node of the NetBackup Administration Console, or by
running the bpexpdate command. Another approach is to use the bpmedia command to freeze any media containing existing backup images so that NetBackup selects a new piece of media for the backup operation. This step will help reduce the impact of tape positioning on the NetBackup performance test and will yield more consistent results between tests. It will also reduce the impact of tape mounting and unmounting of media that has NetBackup catalog images and that cannot be used for normal backups. When you test restores from tape, always restore from the same backup image on the tape to achieve consistent results between tests. In general, using a large data set will generate a more reliable and reproducible performance test than a small data set. A performance test using a small data set would probably be skewed by startup and shutdown overhead within the NetBackup operation. These variables are difficult to keep consistent between test runs and are therefore likely to produce inconsistent test results. Using a large data set will minimize the effect of start up and shutdown times. Design the makeup of the dataset to represent the makeup of the data in the intended production environment. For example, if the data set in the production environment contains many small files on file servers, then the data set for the performance testing should also contain many small files. A representative test data set will more accurately predict the NetBackup performance that you can reasonably expect in a production environment. The type of data can help reveal bottlenecks in the system. Files consisting of non-compressible (random) data cause the tape drive to run at its lower rated speed. As long as the other components of the data transfer path are keeping up, you may identify the tape drive as the bottleneck. On the other hand, files consisting of highly-compressible data can be processed at higher rates by the tape drive when hardware compression is enabled. 
This may result in a higher overall throughput and possibly expose the network as the bottleneck. Many values in NetBackup provide data amounts in kilobytes and rates in kilobytes per second. For greater accuracy, divide by 1024 rather than rounding off to 1000 when you convert from kilobytes to megabytes or from kilobytes per second to megabytes per second.
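The divide-by-1024 advice can be sketched in one line; 513255 KB is used as an example figure.

```shell
# Divide by 1024, not 1000, when converting NetBackup's kilobyte
# figures to megabytes.
kb=513255
awk -v kb="$kb" 'BEGIN { printf "%.2f MB\n", kb / 1024 }'
```

Dividing by 1000 instead would report 513.26 MB, an error of more than 2 percent that compounds when you convert again from megabytes to gigabytes.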
Evaluating performance
There are two primary locations from which to obtain NetBackup data throughput statistics: the NetBackup Activity Monitor and the NetBackup All Log Entries report. The choice of which location to use is determined by the type of NetBackup operation you are measuring: non-multiplexed backup, restore, or multiplexed backup.
You can obtain statistics for all three types of operations from the NetBackup All Log Entries report. You can obtain statistics for non-multiplexed backup or restore operations from the NetBackup Activity Monitor. For multiplexed backup operations, you can obtain the overall statistics from the All Log Entries report after all the individual backup operations which are part of the multiplexed backup are complete. In this case, the statistics available in the Activity Monitor for each of the individual backup operations are relative only to that operation, and do not reflect the actual total data throughput to the tape drive.

There may be small differences between the statistics available from these two locations due to slight differences in rounding techniques between the entries in the Activity Monitor and the entries in the All Log Entries report. For a given type of operation, choose either the Activity Monitor or the All Log Entries report and consistently record your statistics only from that location. In both the Activity Monitor and the All Log Entries report, the data-streaming speed is reported in kilobytes per second. If a backup or restore is repeated, the reported speed can vary between repetitions depending on many factors, including the availability of system resources and system utilization, but the reported speed can be used to assess the performance of the data-streaming process.

The statistics from the NetBackup error logs show the actual amount of time spent reading and writing data to and from tape. This does not include time spent mounting and positioning the tape. Cross-referencing the information from the error logs with data from the bpbkar log on the NetBackup client (showing the end-to-end elapsed time of the entire process) indicates how much time was spent on operations unrelated to reading and writing to and from the tape.

To evaluate performance through the NetBackup Activity Monitor
1 Run the backup or restore job.
2 Open the NetBackup Activity Monitor and verify that the backup or restore job completed successfully. The Status column should contain a zero (0).
3 View the log details for the job by selecting the Actions > Details menu option, or by double-clicking on the entry for the job. Select the Detailed Status tab.
4 Obtain the NetBackup performance statistics from the following fields in the Activity Monitor:
Start Time/End Time: These fields show the time window during which the backup or restore job took place.
Elapsed Time: This field shows the total elapsed time from when the job was initiated to job completion and can be used as an indication of total wall clock time for the operation.

KB per Second: This is the data throughput rate.

Kilobytes: Compare this value to the amount of data backed up. Although it should be comparable, the NetBackup data amount will be slightly higher because of administrative information, known as metadata, saved for the backed up data. For example, if you display properties for a directory containing 500 files, each 1 megabyte in size, the directory shows a size of 500 megabytes, or 524,288,000 bytes, which is equal to 512,000 kilobytes. The NetBackup report may show 513,255 kilobytes written, reporting 1255 kilobytes more than the file size of the directory. This is true for a flat directory. Subdirectory structures may diverge due to the way the operating system tracks used and available space on the disk. Also, be aware that the operating system may be reporting how much space was allocated for the files in question, not just how much data is actually there. For example, if the allocation block size is 1 kilobyte, 1000 1-byte files will report a total size of 1 megabyte, even though only 1 kilobyte of data exists. The greater the number of files, the larger this discrepancy may become.
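The 500-file example works out as follows; the reported figure is the one quoted above, and the arithmetic simply recovers the metadata overhead.

```shell
# 500 files of 1 MB each should show 512,000 KB of file data; the extra
# kilobytes in the NetBackup report are catalog metadata.
files=500
file_kb=1024                       # 1 MB per file, expressed in KB
expected_kb=$((files * file_kb))   # 512000 KB of file data
reported_kb=513255                 # figure from the NetBackup report
echo "file data: ${expected_kb} KB, metadata overhead: $((reported_kb - expected_kb)) KB"
```

The overhead of roughly 2.5 KB per file is small here, but as the text notes, it grows with the file count, so small-file workloads show a larger gap between disk usage and reported kilobytes.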
To evaluate performance using the All Log Entries report
1 Run the backup or restore job.
2 Run the All Log Entries report from the NetBackup Reports node in the NetBackup Administration Console. Be sure that the Date/Time Range that you select covers the time period during which the job was run.
3 Verify that the job completed successfully by searching for an entry such as "the requested operation was successfully completed" for a backup, or "successfully read (restore) backup id..." for a restore.
4 Obtain the NetBackup performance statistics from the following entries in the report.
Note: The messages shown here vary according to the locale setting of the master server.
Entry: started backup job for client <name>, policy <name>, schedule <name> on storage unit <name>
Statistic: The Date and Time fields for this entry show the time at which the backup job started.

Entry: successfully wrote backup id <name>, copy <number>, <number> Kbytes
Statistic: For a multiplexed backup, this entry shows the size of the individual backup job, and the Date and Time fields show the time at which the job finished writing to the storage device. The overall statistics for the multiplexed backup group, including the data throughput rate to the storage device, are found in a subsequent entry.

Entry: successfully wrote <number> of <number> multiplexed backups, total Kbytes <number> at Kbytes/sec
Statistic: For multiplexed backups, this entry shows the overall statistics for the multiplexed backup group, including the data throughput rate.

Entry: successfully wrote backup id <name>, copy <number>, fragment <number>, <number> Kbytes at <number> Kbytes/sec
Statistic: For non-multiplexed backups, this entry combines the information in the previous two multiplexed-backup entries into one entry showing the size of the backup job, the data throughput rate, and the time, in the Date and Time fields, at which the job finished writing to the storage device.

Entry: the requested operation was successfully completed
Statistic: The Date and Time fields for this entry show the time at which the backup job completed. This value is later than the successfully wrote entry above because it includes extra processing time at the end of the job for tasks such as NetBackup image validation.

Entry: begin reading backup id <name>, (restore), copy <number>, fragment <number> from media id <name> on drive index <number>
Statistic: The Date and Time fields for this entry show the time at which the restore job started reading from the storage device. (The latter part of the entry is not shown for restores from disk, as it does not apply.)

Entry: successfully read (restore) backup id <name>, copy <number>, fragment <number>, <number> Kbytes at <number> Kbytes/sec
Statistic: For a multiplexed restore (generally speaking, all restores from tape are multiplexed restores, as non-multiplexed restores require additional action from the user), this entry shows the size of the individual restore job, and the Date and Time fields show the time at which the job finished reading from the storage device. The overall statistics for the multiplexed restore group, including the data throughput rate, are found in a subsequent entry.

Entry: successfully restored <number> of <number> requests <name>, read total of <number> Kbytes at <number> Kbytes/sec
Statistic: For multiplexed restores, this entry shows the overall statistics for the multiplexed restore group, including the data throughput rate.

Entry: successfully read (restore) backup id media <number>, copy <number>, fragment <number>, <number> Kbytes at <number> Kbytes/sec
Statistic: For non-multiplexed restores (generally speaking, only restores from disk are treated as non-multiplexed restores), this entry combines the information from the previous two multiplexed-restore entries into one entry showing the size of the restore job, the data throughput rate, and the time, in the Date and Time fields, at which the job finished reading from the storage device.
Additional information
The NetBackup All Log Entries report will also have entries similar to those described above for other NetBackup operations such as image duplication operations used to create additional copies of a backup image. Those entries have a very similar format and may be useful for analyzing the performance of NetBackup for those operations. The bptm debug log file for tape backups (or bpdm log file for disk backups) will contain the entries that are in the All Log Entries report, as well as additional detail about the operation that may be useful for performance analysis. One example of this additional detail is the intermediate data throughput rate message for multiplexed backups, as shown below:
... intermediate after number successful, number Kbytes at number Kbytes/sec
This message is generated whenever an individual backup job completes that is part of a multiplexed backup group. In the debug log file for a multiplexed backup group consisting of three individual backup jobs, for example, there could be two intermediate status lines, then the final (overall) throughput rate. For a backup operation, the bpbkar debug log file will also contain additional detail about the operation that may be useful for performance analysis. Keep in mind, however, that writing the debug log files during the NetBackup operation introduces some overhead that would not normally be present in a production environment. Factor that additional overhead into any calculations done on data captures while debug log files are in use. The information in the All Logs report is also found in /usr/openv/netbackup/db/error (UNIX) or install_path\NetBackup\db\error (Windows). See the NetBackup Troubleshooting Guide to learn how to set up NetBackup to write these debug log files.
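Pulling the intermediate and final throughput lines out of a bptm log is a one-line grep. The sample log lines below only imitate the message formats quoted in this section (timestamps, PIDs, and figures are made up); real legacy logs live under /usr/openv/netbackup/logs/bptm on UNIX.

```shell
# Sketch: count throughput lines in a bptm-style log. The sample lines
# imitate the quoted message formats; real logs are under
# /usr/openv/netbackup/logs/bptm (UNIX).
cat > /tmp/bptm.log.sample <<'EOF'
00:01:10 [1234] <2> write_data: ... intermediate after 1 successful, 204800 Kbytes at 10240 Kbytes/sec
00:02:20 [1234] <2> write_data: ... intermediate after 2 successful, 409600 Kbytes at 10035 Kbytes/sec
00:03:30 [1234] <2> write_data: successfully wrote 3 of 3 multiplexed backups, total Kbytes 614400 at 9980 Kbytes/sec
EOF
grep -c 'Kbytes/sec' /tmp/bptm.log.sample
```

For a three-way multiplexed group like this, you would expect two intermediate lines followed by the final overall rate, matching the description above.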
Where X:\ is the path being backed up.
4 Check the time it took NetBackup to move the data from the client disk:
UNIX: The start time is the first PrintFile entry in the bpbkar log, the end time is the entry Client completed sending data for backup, and the amount of data is given in the entry Total Size.
Windows: Check the bpbkar log for the entry Elapsed time.
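As a rough cross-check of the disk read speed that bpbkar reports, a generic timed read independent of NetBackup can be useful. The file location, size, and block size below are arbitrary test values; on a real client, read an existing data set rather than a freshly written file, since a just-written file may still be in the page cache and read unrealistically fast.

```shell
# Rough raw-read check of client disk speed, independent of NetBackup.
# Path, size, and block size are arbitrary illustrative values.
TESTFILE=/tmp/readtest.dat
dd if=/dev/zero of="$TESTFILE" bs=64k count=160 2>/dev/null  # ~10 MB sample
time dd if="$TESTFILE" of=/dev/null bs=64k 2>/dev/null
```

Delete /tmp/readtest.dat when finished. If this raw read is much faster than the bpbkar throughput, the bottleneck is likely downstream of the disk.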
To measure disk I/O using the bpdm_dev_null touch file (UNIX only)

For UNIX systems, the procedure below can be useful as a follow-on to the bpbkar procedure (above). If the bpbkar procedure shows that the disk read performance is not the bottleneck and does not help isolate the problem, then the bpdm_dev_null procedure described below may be helpful. If the bpdm_dev_null procedure shows poor performance, the bottleneck is somewhere in the data transfer between the bpbkar process on the client and the bpdm process on the server. The problem may involve the network, or shared memory (such as not enough buffers, or buffers that are too small). To change shared memory settings, see Shared memory (number and size of data buffers) on page 111.

Caution: If not used correctly, the following procedure can lead to data loss. Touching the bpdm_dev_null file redirects all disk backups to /dev/null, not just those backups using the storage unit created by this procedure. You should disable active production policies for the duration of this test and remove the bpdm_dev_null touch file as soon as this test is complete.

1 Enter the following:
touch /usr/openv/netbackup/bpdm_dev_null
Note: The bpdm_dev_null file redirects any backup that uses a disk storage unit to /dev/null.
2 Create a new disk storage unit, using /tmp or some other directory as the image directory path.
3 Create a policy that uses the new disk storage unit.
4 Run a backup using this policy. NetBackup will create a file in the storage unit directory as if this were a real backup to disk. This degenerate image file will be zero bytes long. To remove the zero-length file and clear the NetBackup catalog of a backup that cannot be restored, run the bpexpdate command, where backupid is the name of the file residing in the storage unit directory.
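Given the caution above, the cleanup step deserves emphasis: until the touch file is removed, every disk backup continues to be discarded. A sketch, using the default UNIX path:

```shell
# When testing is finished, remove the touch file so disk backups stop
# being redirected to /dev/null, then reactivate any deactivated policies.
rm -f /usr/openv/netbackup/bpdm_dev_null
```

Verifying the file is gone before reactivating production policies is a cheap safeguard against silently discarded backups.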
The Performance Monitor organizes information by object, counter, and instance. An object is a system resource category, such as a processor or physical disk. Properties of an object are counters. Counters for the Processor object include %Processor Time, which is the default counter, and Interrupts/sec. Duplicate counters are handled via instances. For example, to monitor the %Processor Time of a specific CPU on a multiple CPU system, the Processor object is selected, then the %Processor Time counter for that object is selected, followed by the specific CPU instance for the counter. When you use the Performance Monitor, you can view data in real time format or collect the data in a log for future analysis. Specific components to evaluate include CPU load, memory use, and disk load. Note: It is recommended that a remote host be used for monitoring of the test host, to reduce load that might otherwise skew results.
Committed Bytes. Committed Bytes displays the size of virtual memory that has been committed, as opposed to reserved. Committed memory must have disk storage available or must not require the disk storage because the main memory is large enough. If the number of Committed Bytes approaches or exceeds the amount of physical memory, you may encounter issues with page swapping. Page Faults/sec. Page Faults/sec is a count of the page faults in the processor. A page fault occurs when a process refers to a virtual memory page that is not in its Working Set in main memory. A high Page Fault rate may indicate insufficient memory.
To disable these counters and cancel disk monitoring
1 From a command prompt, type: diskperf -n
2 Reboot the system.
When you monitor disk performance, use the %Disk Time counter for the PhysicalDisk object to track the percentage of elapsed time that the selected disk drive is busy servicing read or write requests.
Also monitor the Avg. Disk Queue Length counter and watch for values greater than 1 that last for more than one second. Values greater than 1 for more than a second indicate that multiple processes are waiting for the disk to service their requests. Several techniques may be used to increase disk performance, including:
Check the fragmentation level of the data. A highly fragmented disk limits throughput levels. Use a disk maintenance utility to defragment the disk.

Consider adding additional disks to the system to increase performance. If multiple processes are attempting to log data simultaneously, dividing the data among multiple physical disks may help.

Determine if the data transfer involves a compressed disk. Using Windows compression to automatically compress the data on the drive adds overhead to disk read and write operations, adversely affecting the performance of NetBackup. Use Windows compression only if it is needed to avoid a disk full condition.

Consider converting to a system based on a Redundant Array of Inexpensive Disks (RAID). Though more expensive, RAID devices generally offer greater throughput and, depending on the RAID level employed, improved reliability.

Determine what type of controller technology is being used to drive the disk. Consider if a different system would yield better results. See Table 1-2, Drive controller data transfer rates, on page 21 for throughput rates for common controllers.
Chapter
Overview on page 100
The data transfer path on page 100
Basic tuning suggestions for the data path on page 101
NetBackup client performance on page 104
NetBackup network performance on page 106
NetBackup server performance on page 111
NetBackup storage device performance on page 137
Overview
This chapter contains information on ways to optimize NetBackup. This chapter is not intended to provide tuning advice for particular systems. If you would like help fine-tuning your NetBackup installation, please contact Symantec Consulting Services. Before examining the factors that affect backup performance, please note that an important first step is to ensure that your system meets NetBackup's recommended minimum requirements. Refer to the NetBackup Installation Guide and NetBackup Release Notes for information about these requirements. Additionally, Symantec recommends that you have the most recent NetBackup software patch installed. Many performance issues can be traced to hardware or other environmental issues. A basic understanding of the entire data transfer path is essential in determining the maximum obtainable performance in your environment. Poor performance is often the result of poor planning, which can be based on unrealistic expectations of any particular component of the data transfer path.
Tuning the NetBackup data transfer path

Basic tuning suggestions for the data path
Tuning suggestions:
Use multiplexing. Multiplexing is a NetBackup option that lets you write multiple data streams from several clients at once to a single tape drive or several tape drives. Multiplexing can be used to improve the backup performance of slow clients, multiple slow networks, and many small backups (such as incremental backups). Multiplexing reduces the time each job spends waiting for a device to become available, thereby making the best use of the transfer rate of your storage devices. See Restore of a multiplexed image on page 132. Refer also to the NetBackup Administrator's Guide, Volume II, for more information about using multiplexing.

Consider striping a disk volume across drives. A striped set of disks will pull data from all disk drives concurrently, allowing faster data transfers between disk drives and tape drives.

Maximize the use of your backup windows. You can configure all your incremental backups to happen at the same time every day and stagger the execution of your full backups across multiple days. Large systems can be backed up over the weekend while smaller systems are spread over the week. You can even start full backups earlier than the incremental backups; they might finish before the incremental backups and give you back all or most of your backup window to finish the incremental backups.
Convert large clients to SAN Clients or SAN Media Servers. A SAN Client is a client that is backed up over a SAN connection to a media server rather than over a LAN. SAN Client technology is suitable for large databases and application servers where large data files can be rapidly read from disk and streamed across the SAN connection. SAN Client is not suitable for file servers where the disk read speed is relatively slow. See Best practices: SAN Client on page 76.

Use dedicated private networks to decrease backup times and network traffic. If you configure one or more networks dedicated to backups, you can reduce the time it takes to back up the systems on those networks and reduce or eliminate traffic on your enterprise networks. In addition, you can convert to faster technologies and even back up your systems at any time without affecting the enterprise network's performance (assuming that users do not mind the system loads while backups take place).

Avoid a concentration of servers on one network. If you have a concentration of large servers that you back up over the same general network, you might want to convert some of them into media servers or attach them to private backup networks. Doing either will decrease backup times and reduce network traffic for your other backups.

Use dedicated backup servers to perform your backups. When selecting a server for performing your backups, use a dedicated system just for performing backups. A server that shares the load of running several applications unrelated to backups can severely affect your performance and maintenance windows.

Consider the requirements of backing up your catalog. Remember that the NetBackup catalog needs to be backed up. To facilitate NetBackup catalog recovery, it is highly recommended that the master server have access to a dedicated tape drive, either standalone or within a robotic library.

Try to level the backup load.
You can use multiple drives to reduce backup times; however, since you may not be able to split data streams evenly, you may need to experiment with the configuration of the streams or the configuration of the NetBackup policies to spread the load across multiple drives.

Bandwidth limiting

The bandwidth limiting feature lets you restrict the network bandwidth consumed by one or more NetBackup clients on a network. The bandwidth setting appears under Host Properties > Master Servers > Properties. The
actual limiting occurs on the client side of the backup connection. This feature only restricts bandwidth during backups. Restores are unaffected. When a backup starts, NetBackup reads the bandwidth limit configuration and then determines the appropriate bandwidth value and passes it to the client. As the number of active backups increases or decreases on a subnet, NetBackup dynamically adjusts the bandwidth limiting on that subnet. If additional backups are started, the NetBackup server instructs the other NetBackup clients running on that subnet to decrease their bandwidth setting. Similarly, bandwidth per client is increased if the number of clients decreases. Changes to the bandwidth value occur on a periodic basis rather than as backups stop and start. This characteristic can reduce the number of bandwidth value changes.
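A bandwidth limit can be expressed as a LIMIT_BANDWIDTH configuration entry. The sketch below caps clients in an illustrative address range at 500 kilobytes per second; the addresses and the 500 KB/sec figure are assumptions, and the entry is written to a scratch file here, whereas on a real master server it belongs in the NetBackup configuration (/usr/openv/netbackup/bp.conf on UNIX).

```shell
# Sketch of a bandwidth-limiting entry: clients with addresses between
# 10.0.0.2 and 10.0.0.254 (hypothetical range) capped at 500 KB/sec.
# Written to a scratch file here; the real file is
# /usr/openv/netbackup/bp.conf on a UNIX master server.
cat > /tmp/bp.conf.fragment <<'EOF'
LIMIT_BANDWIDTH = 10.0.0.2 10.0.0.254 500
EOF
grep LIMIT_BANDWIDTH /tmp/bp.conf.fragment
```

Because the limit is shared dynamically among active clients in the range, a single entry like this governs the whole subnet rather than each client individually.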
Load balancing NetBackup provides ways to balance loads between servers, clients, policies, and devices. Note that these settings may interact with each other: compensating for one issue can cause another. The best approach is to use the defaults unless you anticipate or encounter an issue.
Adjust the backup load on the server. Change the Limit jobs per policy attribute for one or more of the policies that the server is backing up. For example, decreasing Limit jobs per policy reduces the load on a server on a specific subnetwork. Reconfiguring policies or schedules to use storage units on other servers also reduces the load. Another possibility is to use bandwidth limiting on one or more clients.

Adjust the backup load on the server during specific time periods only. Reconfigure schedules that execute during the time periods of interest, so they use storage units on servers that can handle the load (assuming you are using media servers).

Adjust the backup load on the clients. Change the Maximum jobs per client global attribute. For example, increasing Maximum jobs per client increases the number of concurrent jobs that any one client can process and therefore increases the load.

Reduce the time to back up clients. Increase the number of jobs that clients can perform concurrently, or use multiplexing. Another possibility is to increase the number of jobs that the server can perform concurrently for the policy or policies that are backing up the clients.

Give preference to a policy.
104 Tuning the NetBackup data transfer path NetBackup client performance
Increase the Limit jobs per policy attribute value for the preferred policy relative to other policies. Alternatively, increase the priority for the policy.
Adjust the load between fast and slow networks. Increase the values of Limit jobs per policy and Maximum jobs per client for the policies and clients on a faster network, and decrease these values for slower networks. Another solution is to use bandwidth limiting.

Limit the backup load produced by one or more clients. Use bandwidth limiting to reduce the bandwidth used by the clients.

Maximize the use of devices. Use multiplexing. Also, allow as many concurrent jobs per storage unit, policy, and client as possible without causing server, client, or network performance issues.

Prevent backups from monopolizing devices. Limit the number of devices that NetBackup can use concurrently for each policy, or limit the number of drives per storage unit. Another approach is to exclude some of your devices from Media Manager control.
Disk fragmentation. Fragmentation is a condition in which data is scattered around the disk in non-contiguous blocks. This condition severely impacts the data transfer rate from the disk. Fragmentation can be repaired using hard disk management utility software offered by a variety of vendors.

Number of disks. Consider adding disks to the system to increase performance. If multiple processes attempt to log data simultaneously, dividing the data among multiple physical disks may help.

Disk arrays. Consider converting to a system based on a Redundant Array of Inexpensive Disks (RAID). Though more expensive, RAID devices generally offer greater throughput and, depending on the RAID level employed, improved reliability.

SAN client. For critical data that requires high bandwidth for backups, consider the SAN client feature. Refer to the following documents for more information:
NetBackup Shared Storage Guide

SAN Client Deployment: Best Practices and Performance Metrics tech note, at: http://entsupport.symantec.com/docs/293110
The type of controller technology being used to drive the disk. Consider whether a different system would yield better results.

Virus scanning. If virus scanning is turned on for the system, it may severely impact the performance of the NetBackup client during a backup or restore operation, especially on systems such as large Windows file servers. You may wish to disable virus scanning during backup or restore operations to avoid the impact on performance.

NetBackup notify scripts. The bpstart_notify.bat and bpend_notify.bat scripts are very useful in certain situations, such as shutting down a running application to back up its data. However, these scripts must be written with care to avoid unnecessary lengthy delays at the start or end of the backup job. If the scripts are not performing tasks essential to the backup operation, you may want to remove them.

NetBackup software location. If the data being backed up is located on the same physical disk drive as the NetBackup installation, performance may be adversely affected, especially if NetBackup debug log files are being used. If they are, the extent of the degradation is greatly influenced by the NetBackup verbose setting for the debug logs. If possible, install NetBackup on a separate physical disk drive to avoid disk drive contention.

Snapshots (hardware or software). If snapshots need to be taken before the actual backup of data, the time needed to take the snapshot affects the overall performance.

Job tracker. If the NetBackup Client Job Tracker is running on the client, NetBackup gathers an estimate of the data to be backed up before the start of a backup job. Gathering this estimate affects the startup time, and therefore the data throughput rate, because no data is being written to the NetBackup server during the estimation phase. You may wish to avoid running the NetBackup Client Job Tracker to avoid this delay.
Determining the theoretical performance of the NetBackup client software. You can use the NetBackup client command bpbkar (UNIX) or bpbkar32 (Windows) to determine the speed at which the NetBackup client can read the data to be backed up from the disk drive. This may eliminate data read speed as a possible performance bottleneck. For the procedure, see Measuring performance independent of tape or disk output on page 94.
106 Tuning the NetBackup data transfer path NetBackup network performance
Network interface cards (NICs) for NetBackup servers and clients must be set to full-duplex. Both ends of each network cable (the NIC and the switch port) must be set to the same speed and mode (both at full duplex). Otherwise, link-down conditions, excessive or late collisions, and errors result. If auto-negotiate is being used, make sure that both ends of the connection are set to the same mode and speed. The higher the speed, the better. In addition to NICs and switches, all routers must be set to full duplex.
Consult the operating system documentation for instructions on how to determine and change the NIC settings. Note: Using AUTOSENSE may cause network problems and performance issues.
Network load
There are two key considerations to monitor when you evaluate remote backup performance:
The amount of network traffic
The amount of time that network traffic is high
Small bursts of high network traffic for short durations will have some negative impact on the data throughput rate. However, if the network traffic remains consistently high for a significant amount of time during the operation, the network component of the NetBackup data transfer path will very likely be the bottleneck. Always try to schedule backups during times when network traffic is low. If your network is heavily loaded, you may wish to implement a secondary network which can be dedicated to backup and restore traffic.
If both files exist, the NET_BUFFER_SZ file will specify the network buffer size for backups, and the NET_BUFFER_SZ_REST file will specify the network buffer size for restores. Because local backup or restore jobs on the media server do not send data over the network, this parameter has no effect on those operations. It is used only by the NetBackup media server processes which read from or write to the network, specifically, the bptm or bpdm processes. It is not used by any other NetBackup processes on a master server, media server, or client. This parameter is the counterpart on the media server to the Communications Buffer Size parameter on the client, which is described below. The network buffer sizes are not required to be the same on all of your NetBackup systems for NetBackup to function properly; however, setting the Network Buffer Size parameter on the media server and the Communications Buffer Size parameter on the client (see below) to the same value has significantly improved the throughput of the network component of the NetBackup data transfer path in some installations. Similarly, the network buffer size does not have a direct relationship with the NetBackup data buffer size (described under Shared memory (number and size of data buffers) on page 111). They are separately tunable parameters. However, setting the network buffer size to a substantially larger value than the data buffer has achieved the best performance in many NetBackup installations.
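As a concrete illustration of the file-based tuning described above, the two files can be written as follows. This is only a sketch: it targets a scratch directory (NBU_DIR) instead of the real /usr/openv/netbackup so nothing live is modified, and the byte values 262144 and 131072 are illustrative, not recommendations.

```shell
# Hedged sketch: NBU_DIR stands in for /usr/openv/netbackup so a real
# NetBackup installation is not touched; values are examples only.
NBU_DIR="${NBU_DIR:-/tmp/netbackup-demo}"
mkdir -p "$NBU_DIR"

# Network buffer size (bytes) used by bptm/bpdm for backups.
echo 262144 > "$NBU_DIR/NET_BUFFER_SZ"

# Optional separate network buffer size (bytes) for restores.
echo 131072 > "$NBU_DIR/NET_BUFFER_SZ_REST"

cat "$NBU_DIR/NET_BUFFER_SZ" "$NBU_DIR/NET_BUFFER_SZ_REST"
```

On a real media server the files would live directly in /usr/openv/netbackup (UNIX); each file contains a single integer in bytes.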
Changing this value can affect backup and restore operations on the media servers. Test backups and restores to ensure that the change you make does not negatively impact performance.
buffer sizes are not required to be the same on all of your NetBackup systems for NetBackup to function properly. However, setting the Network Buffer Size parameter on the media server (see above) and the Communications Buffer Size parameter on the client to the same value achieves the best performance in some NetBackup installations.

To set the communications buffer size parameter on UNIX clients:
Create the /usr/openv/netbackup/NET_BUFFER_SZ file. As with the media server, it should contain a single integer specifying the communications buffer size. Generally, performance is better when the value in the NET_BUFFER_SZ file on the client matches the value in the NET_BUFFER_SZ file on the media server.

Note: The NET_BUFFER_SZ_REST file is not used on the client. The value in the NET_BUFFER_SZ file is used for both backups and restores.

To set the communications buffer size parameter on Windows clients:
1 From Host Properties in the NetBackup Administration Console, expand Clients and open the Client Properties > Windows Client > Client Settings dialog for the Windows client on which the parameter is to be changed.
2 Enter the desired value in the Communications buffer field. This parameter is specified in kilobytes. The default value is 32. An extra kilobyte is added internally for backup operations; therefore, the default network buffer size for backups is 33792 bytes. In some NetBackup installations this default is too small; increasing the value to 128 improves performance in those installations.

Because local backup jobs on the media server do not send data over the network, this parameter has no effect on those local operations. It is used only by the NetBackup client processes that write to the network, specifically the bpbkar32 process. It is not used by any other NetBackup for Windows processes on a master server, media server, or client.
If you modify the NetBackup buffer settings, test the performance of restores with the new settings.
shared memory during a backup. Touching the file NOSHM in the /usr/openv/netbackup (UNIX) directory or the install_path\NetBackup (Windows) directory causes the client and server to use socket communications rather than shared memory to exchange the backup data. (Touching a file means changing the file's modification and access times.) The file name is NOSHM, with no extension. Each time a backup runs, NetBackup checks for the existence of this file, so no services need to be stopped and started for it to take effect.

One example of when it might be necessary to use NOSHM is when the master or media server hosts another application that uses a large amount of shared memory, such as Oracle. NOSHM is also useful for testing, both as a workaround while solving a shared memory issue and to verify that an issue is caused by shared memory.

Note: NOSHM only affects backups when it is applied to a system with a directly-attached storage unit. NOSHM forces a local backup to run as though it were a remote backup. A local backup is a backup of a client that has a directly-attached storage unit, such as a client that happens to be a master or media server. A remote backup passes the data across a network connection from the client to a master or media server's storage unit.

A local backup normally has one or more bpbkar processes that read from the disk and write into shared memory, and a bptm process that reads from shared memory and writes to the tape. A remote backup has one or more bptm (child) processes that read from a socket connection to bpbkar and write into shared memory, and a bptm (parent) process that reads from shared memory and writes to the tape. NOSHM forces the remote backup model even when the client and the media server are the same system. For a local backup without NOSHM, shared memory is used between bptm and bpbkar.
Whether the backup is remote or local, and whether NOSHM exists or not, shared memory is always used between bptm (parent) and bptm (child). Note: NOSHM does not affect the shared memory used by the bptm process to buffer data being written to tape. bptm uses shared memory for any backup, local or otherwise.
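Because NOSHM is checked at the start of each backup, enabling or disabling it amounts to creating or removing an empty file. A minimal sketch, again pointed at a scratch directory rather than the real install directory:

```shell
# Hedged sketch: NBU_DIR stands in for /usr/openv/netbackup (UNIX) or
# install_path\NetBackup (Windows); no services need to be restarted.
NBU_DIR="${NBU_DIR:-/tmp/netbackup-demo}"
mkdir -p "$NBU_DIR"

# Create (touch) NOSHM: subsequent backups use sockets, not shared memory.
touch "$NBU_DIR/NOSHM"

# To go back to shared memory, simply remove the file:
# rm "$NBU_DIR/NOSHM"
```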
performance. This can be achieved by configuring a unique hostname for the server for each network interface and setting up bp.conf entries for these hostnames. For example, suppose the server is configured with three network interfaces. Each of the network interfaces connects to one or more NetBackup clients. The following configuration allows NetBackup to use all three network interfaces:
In the server's bp.conf file, add one entry for each network interface:
SERVER=server-neta SERVER=server-netb SERVER=server-netc
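A sketch of adding those entries idempotently, so re-running it does not duplicate lines. It operates on a scratch copy of bp.conf, and the server-neta/netb/netc hostnames are the hypothetical names from the example above.

```shell
# Hedged sketch: BPCONF points at a scratch file, not the real
# /usr/openv/netbackup/bp.conf.
BPCONF="${BPCONF:-/tmp/netbackup-demo/bp.conf}"
mkdir -p "$(dirname "$BPCONF")"

for host in server-neta server-netb server-netc; do
    # Append a SERVER entry only if an identical line is not already present.
    grep -qx "SERVER=$host" "$BPCONF" 2>/dev/null || echo "SERVER=$host" >> "$BPCONF"
done

cat "$BPCONF"
```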
It is okay for a client to have an entry for a server that is not currently on the same network.
Shared memory (number and size of data buffers)
Parent/child delay values
Using NetBackup wait and delay counters
Fragment size and NetBackup restores
Other restore performance issues
112 Tuning the NetBackup data transfer path NetBackup server performance
Note: Restores use the same buffer size that was used to back up the images being restored.
Two tables list the defaults by NetBackup operation (backup, restore, duplicate, NDMP backup) and platform: the first gives the default number of shared data buffers (12 or 30, depending on the operation and platform); the second (Table 8-2) gives the default size of each shared data buffer: 64K for tape, 256K for disk, and, for restores, the same size that was used for the backup.
On Windows, a single tape I/O operation is performed for each shared data buffer. Therefore, this size must not exceed the maximum block size for the tape device or operating system. For Windows systems, the maximum block size is generally 64K, although in some cases customers are using a larger value successfully. For this reason, the terms tape block size and shared data buffer size are synonymous in this context.
UNIX
For disk:
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK
Windows
For disk:
install_path\NetBackup\db\config\NUMBER_DATA_BUFFERS_DISK
The NUMBER_DATA_BUFFERS file also applies to remote NDMP backups, but does not apply to local NDMP backups or to NDMP three-way backups. See Note on shared memory and NetBackup for NDMP on page 116. The NUMBER_DATA_BUFFERS_RESTORE file is only needed for restores from tape, not from disk.

These files contain a single integer specifying the number of shared data buffers NetBackup will use. For backups (in the NUMBER_DATA_BUFFERS and NUMBER_DATA_BUFFERS_DISK files), the integer's value must be a power of 2. If the NUMBER_DATA_BUFFERS file exists, its contents determine the number of shared data buffers to be used for multiplexed and non-multiplexed backups. NUMBER_DATA_BUFFERS_DISK allows a different value for backup to disk instead of tape. If NUMBER_DATA_BUFFERS exists but NUMBER_DATA_BUFFERS_DISK does not, NUMBER_DATA_BUFFERS applies to all backups. If both files exist, NUMBER_DATA_BUFFERS applies to tape backups and NUMBER_DATA_BUFFERS_DISK applies to disk backups. If only NUMBER_DATA_BUFFERS_DISK is present, it applies to disk backups only.

If the NUMBER_DATA_BUFFERS_RESTORE file exists, its contents determine the number of shared data buffers to be used for multiplexed restores from tape. The NetBackup daemons do not have to be restarted for the new values to take effect. Each time a new job starts, bptm checks the configuration file and adjusts its behavior.
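The power-of-2 requirement for the backup buffer-count files is easy to enforce with a small wrapper. A sketch, writing to a scratch config directory; the counts 32 and 64 are illustrative values, not recommendations.

```shell
# Hedged sketch: CFG_DIR stands in for /usr/openv/netbackup/db/config.
CFG_DIR="${CFG_DIR:-/tmp/netbackup-demo/db/config}"
mkdir -p "$CFG_DIR"

# A positive integer n is a power of 2 exactly when n & (n - 1) == 0.
is_power_of_two() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

write_buffer_count() {   # write_buffer_count FILE COUNT
    if is_power_of_two "$2"; then
        echo "$2" > "$CFG_DIR/$1"
    else
        echo "error: $2 is not a power of 2" >&2
        return 1
    fi
}

write_buffer_count NUMBER_DATA_BUFFERS 32        # tape backups
write_buffer_count NUMBER_DATA_BUFFERS_DISK 64   # disk backups
```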
UNIX
For tape:
/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
For disk:
/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK
Windows
For tape:
install_path\NetBackup\db\config\SIZE_DATA_BUFFERS
For disk:
install_path\NetBackup\db\config\SIZE_DATA_BUFFERS_DISK
The SIZE_DATA_BUFFERS_DISK file also affects NDMP to disk backups. This file contains a single integer specifying the size of each shared data buffer in bytes. The integer must be a multiple of 1024 (a multiple of 32 kilobytes is recommended); it represents the size of one tape or disk buffer in bytes. For example, to use a shared data buffer size of 64 kilobytes, the file would contain the integer 65536. The NetBackup daemons do not have to be restarted for the parameter values to take effect. Each time a new job starts, bptm checks the configuration file and adjusts its behavior. Analyze the buffer usage by checking the bptm debug log before and after altering the size of buffer parameters.

Table 8-3 lists the valid values (absolute byte values to be entered in SIZE_DATA_BUFFERS or SIZE_DATA_BUFFERS_NDMP):
32768, 65536, 98304, 131072, 163840, 196608, 229376, 262144
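The multiple-of-1024 rule can be checked the same way before writing SIZE_DATA_BUFFERS. A sketch with scratch paths; 65536 (64K) is one of the Table 8-3 values.

```shell
# Hedged sketch: CFG_DIR stands in for /usr/openv/netbackup/db/config.
CFG_DIR="${CFG_DIR:-/tmp/netbackup-demo/db/config}"
mkdir -p "$CFG_DIR"

SIZE=$(( 64 * 1024 ))   # 65536 bytes, i.e. a 64K shared data buffer

if [ $(( SIZE % 1024 )) -eq 0 ]; then
    echo "$SIZE" > "$CFG_DIR/SIZE_DATA_BUFFERS"
    echo "$SIZE" > "$CFG_DIR/SIZE_DATA_BUFFERS_DISK"
else
    echo "error: $SIZE is not a multiple of 1024" >&2
fi
```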
IMPORTANT: Because the data buffer size equals the tape I/O size, the value specified in SIZE_DATA_BUFFERS must not exceed the maximum tape I/O size supported by the tape drive or operating system. This is usually 256 or 128 Kilobytes. Check your operating system and hardware documentation for the maximum values. Take into consideration the total system resources and the entire network. The Maximum Transmission Unit (MTU) for the LAN network may also have to be changed. NetBackup expects the value for NET_BUFFER_SZ and SIZE_DATA_BUFFERS to be in bytes, so in order to use 32k, use 32768 (32 x 1024). Note: Some Windows tape devices are not able to write with block sizes higher than 65536 (64 Kilobytes). Backups created on a UNIX media server with SIZE_DATA_BUFFERS set to more than 65536 cannot be read by some Windows media servers. This means that the Windows media server would not be able to import or restore any images from media that were written with SIZE_DATA_BUFFERS greater than 65536.
Note: The size of the shared data buffers used for a restore operation is determined by the size of the shared data buffers in use at the time the backup was written. This file is not used by restores.
The effect of SIZE_DATA_BUFFERS_NDMP depends on the type of backup, as shown in Table 8-5.

Table 8-5: NetBackup for NDMP and size of data buffers

Local NDMP backup or NDMP three-way backup: NetBackup does not use shared memory (no data buffers). You can use SIZE_DATA_BUFFERS_NDMP to change the size of the records that are written to tape. Use SIZE_DATA_BUFFERS_DISK to change the record size for NDMP disk backup.

Remote NDMP backup: NetBackup uses shared memory. You can use SIZE_DATA_BUFFERS_NDMP to change the size of the memory buffers and the size of the records that are written to tape. Use SIZE_DATA_BUFFERS_DISK to change the buffer size and record size for NDMP disk backup.
The following is a brief description of NDMP three-way backup and remote NDMP backup:
In an NDMP three-way backup, the backup is written to an NDMP storage unit on a different NAS filer. In remote NDMP backup, the backup is written to a NetBackup Media Manager-type storage device.
More information is available on these backup types. See the NetBackup for NDMP Administrator's Guide.
If large amounts of memory are to be allocated, the kernel may require additional tuning to allow enough shared memory to be available for NetBackup's requirements. For more information, see Kernel tuning (UNIX) on page 170.

Note: AIX media servers do not need shared memory tuning, because AIX uses dynamic memory allocation.

Be cautious when you change these parameters. Make your changes carefully, monitoring for performance changes with each modification. For example, increasing the tape buffer size can cause some backups to run slower, and restore issues have been seen in some cases. After any changes, be sure to include restores as part of your validation testing.
For example, a multiplexed restore records the shared memory configuration in use in a debug log line such as:

15:26:01 [21544] <2> mpx_setup_restore_shm: using 12 data buffers, buffer size is 65536
When you change these settings, take into consideration the total system resources and the entire network. The Maximum Transmission Unit (MTU) for the local area network (LAN) may also have to be changed.
UNIX
/usr/openv/netbackup/db/config/PARENT_DELAY
/usr/openv/netbackup/db/config/CHILD_DELAY
Windows
<install_path>\NetBackup\db\config\PARENT_DELAY
<install_path>\NetBackup\db\config\CHILD_DELAY
These files contain a single integer specifying the value in milliseconds to be used for the delay corresponding to the name of the file. For example, to use a parent delay of 50 milliseconds, the PARENT_DELAY file would contain the integer 50. See Using NetBackup wait and delay counters below for more information about how to determine if you should change these values.
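For example, a 50-millisecond parent delay and a 20-millisecond child delay could be written as follows. A sketch with scratch paths; the values are examples only, and the wait and delay counter analysis described later should guide any real change.

```shell
# Hedged sketch: CFG_DIR stands in for the real db/config directory.
CFG_DIR="${CFG_DIR:-/tmp/netbackup-demo/db/config}"
mkdir -p "$CFG_DIR"

echo 50 > "$CFG_DIR/PARENT_DELAY"   # delay (milliseconds) for the parent process
echo 20 > "$CFG_DIR/CHILD_DELAY"    # delay (milliseconds) for the child process
```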
Figure: data flow through the shared buffers. For a remote backup, data moves from the NetBackup client across the network to a producer process, which writes it into the shared buffers; a consumer process reads from the shared buffers and writes to tape or disk.
Local clients
When the NetBackup media server and the NetBackup client are part of the same system, the NetBackup client is referred to as a local client.
Backup of local client
For a local client, the bpbkar (UNIX) or bpbkar32 (Windows) process reads data from the disk during a backup and places it in the shared buffers. The bptm process reads the data from the shared buffers and writes it to tape or disk.

Restore of local client
During a restore of a local client, the bptm process reads data from the tape or disk and places it in the shared buffers. The tar (UNIX) or tar32 (Windows) process reads the data from the shared buffers and writes it to disk.
Remote clients
When the NetBackup media server and the NetBackup client are part of two different systems, the NetBackup client is referred to as a remote client.
Backup of remote client
The bpbkar (UNIX) or bpbkar32 (Windows) process on the remote client reads data from the disk and writes it to the network. A child bptm process on the media server receives the data from the network and places it in the shared buffers. The parent bptm process on the media server reads the data from the shared buffers and writes it to tape or disk.

Restore of remote client
During the restore of a remote client, the parent bptm process reads data from the tape or disk and places it into the shared buffers. The child bptm process reads the data from the shared buffers and writes it to the network. The tar (UNIX) or tar32 (Windows) process on the remote client receives the data from the network and writes it to disk.
while a data consumer needs a full buffer. The following chart maps processes to their roles during backup and restore operations:

Operation        Data Producer                         Data Consumer
Local Backup     bpbkar (UNIX) or bpbkar32 (Windows)   bptm
Remote Backup    bptm (child)                          bptm (parent)
Local Restore    bptm                                  tar (UNIX) or tar32 (Windows)
Remote Restore   bptm (parent)                         bptm (child)
If a full buffer is needed by the data consumer but is not available, the data consumer increments the wait and delay counters to indicate that it had to wait for a full buffer. After a delay, the data consumer checks again for a full buffer. If one is still not available, the data consumer increments the Delay counter to indicate that it had to delay again while waiting for a full buffer. The data consumer repeats the delay and full-buffer check steps until a full buffer is available. This sequence is summarized in the following algorithm:

while (Buffer_Is_Not_Full) {
    ++Wait_Counter;
    while (Buffer_Is_Not_Full) {
        ++Delay_Counter;
        delay (DELAY_DURATION);
    }
}

If an empty buffer is needed by the data producer but is not available, the data producer increments the wait and delay counters to indicate that it had to wait for an empty buffer. After a delay, the data producer checks again for an empty buffer. If one is still not available, the data producer increments the Delay counter to indicate that it had to delay again while waiting for an empty buffer. The data producer repeats the delay and empty-buffer check steps until an empty buffer is available. The algorithm for a data producer has a similar structure:

while (Buffer_Is_Not_Empty) {
    ++Wait_Counter;
    while (Buffer_Is_Not_Empty) {
        ++Delay_Counter;
        delay (DELAY_DURATION);
    }
}

Analysis of the wait and delay counter values indicates which process, producer or consumer, has had to wait most often and for how long. There are four basic wait and delay counter relationships:
Data Producer >> Data Consumer. The data producer has substantially larger wait and delay counter values than the data consumer. The data consumer is unable to receive data fast enough to keep the data producer busy. Investigate ways to improve the performance of the data consumer. For a backup operation, check whether the data buffer size is appropriate for the tape or disk drive being used (see below). If the data consumer still has a substantially large value in this case, try increasing the number of shared data buffers to improve performance (see below).

Data Producer = Data Consumer (large value). The data producer and the data consumer have very similar wait and delay counter values, but those values are relatively large. This may indicate that the data producer and data consumer are regularly attempting to use the same shared data buffer. Try increasing the number of shared data buffers to improve performance (see below).

Data Producer = Data Consumer (small value). The data producer and the data consumer have very similar wait and delay counter values, but those values are relatively small. This indicates a good balance between the data producer and data consumer, which should yield good performance from the NetBackup server component of the NetBackup data transfer path.

Data Producer << Data Consumer. The data producer has substantially smaller wait and delay counter values than the data consumer. The data producer is unable to deliver data fast enough to keep the data consumer busy. Investigate ways to improve the performance of the data producer. For a restore operation, check whether the data buffer size (see below) is appropriate for the tape or disk drive being used. If the data producer still has a relatively large value in this case, try increasing the number of shared data buffers to improve performance. See Changing the number of shared data buffers on page 113.
The bullets above describe the four basic relationships possible. Of primary concern is the relationship and the size of the values. Information on determining substantial versus trivial values appears on the following pages.
The relationship of these values only provides a starting point in the analysis. Additional investigative work may be needed to positively identify the cause of a bottleneck within the NetBackup data transfer path.
Windows
install_path\NetBackup\logs\bpbkar install_path\NetBackup\logs\bptm
2 Execute your backup, then look at the log for the data producer (bpbkar on UNIX or bpbkar32 on Windows) process in:
UNIX
/usr/openv/netbackup/logs/bpbkar
Windows
install_path\NetBackup\logs\bpbkar
The line you are looking for should be similar to the following, and will have a timestamp corresponding to the completion time of the backup:
... waited 224 times for empty buffer, delayed 254 times
In this example the Wait counter value is 224 and the Delay counter value is 254. 3 Look at the log for the data consumer (bptm) process in: UNIX
/usr/openv/netbackup/logs/bptm
Windows
install_path\NetBackup\logs\bptm
The line you are looking for should be similar to the following, and will have a timestamp corresponding to the completion time of the backup:
... waited for full buffer 1 times, delayed 22 times
In this example, the Wait counter value is 1 and the Delay counter value is 22. To determine wait and delay counter values for a remote client backup: 1 Activate debug logging by creating this directory on the media server: UNIX
/usr/openv/netbackup/logs/bptm
Windows
install_path\NetBackup\logs\bptm
2 Execute your backup.
3 Look at the log for the bptm process in:
UNIX
/usr/openv/netbackup/logs/bptm
Windows
install_path\NetBackup\Logs\bptm
4 Delays associated with the data producer (bptm child) process appear as follows:
... waited for empty buffer 22 times, delayed 151 times, ...
In this example, the Wait counter value is 22 and the Delay counter value is 151. 5 Delays associated with the data consumer (bptm parent) process will appear as:
... waited for full buffer 12 times, delayed 69 times
In this example the Wait counter value is 12, and the Delay counter value is 69. To determine wait and delay counter values for a local client restore: 1 Activate logging by creating the two directories on the NetBackup media server: UNIX
/usr/openv/netbackup/logs/bptm /usr/openv/netbackup/logs/tar
Windows
install_path\NetBackup\logs\bptm install_path\NetBackup\logs\tar
2 Execute your restore, then look at the log for the data consumer (tar or tar32) process in the tar log directory created above.
The line you are looking for should be similar to the following, and will have a timestamp corresponding to the completion time of the restore:
... waited for full buffer 27 times, delayed 79 times
In this example, the Wait counter value is 27, and the Delay counter value is 79. 3 Look at the log for the data producer (bptm) process in the bptm log directory created above. The line you are looking for should be similar to the following, and will have a timestamp corresponding to the completion time of the restore:
... waited for empty buffer 1 times, delayed 68 times
In this example, the Wait counter value is 1 and the delay counter value is 68. To determine wait and delay counter values for a remote client restore: 1 Activate debug logging by creating the following directory on the media server: UNIX
/usr/openv/netbackup/logs/bptm
Windows
install_path\NetBackup\logs\bptm
2 Execute your restore.
3 Look at the log for bptm in the bptm log directory created above.
4 Delays associated with the data consumer (bptm child) process appear as follows:
... waited for full buffer 36 times, delayed 139 times
In this example, the Wait counter value is 36 and the Delay counter value is 139. 5 Delays associated with the data producer (bptm parent) process will appear as follows:
... waited for empty buffer 95 times, delayed 513 times
In this example the Wait counter value is 95 and the Delay counter value is 513.
Note: When you run multiple tests, you can rename the current log file. NetBackup will automatically create a new log file, which prevents you from erroneously reading the wrong set of values. Deleting the debug log file will not stop NetBackup from generating the debug logs. You must delete the entire directory. For example, to stop bptm logging, you must delete the bptm subdirectory. NetBackup will automatically generate debug logs at the specified verbose setting whenever the directory is detected.
Data buffer size. The size of each shared data buffer can be found on a line similar to:
... io_init: using 65536 data buffer size
Number of data buffers. The number of shared data buffers may be found on a line similar to:
... io_init: using 16 data buffers
Parent/child delay values. The values in use for the duration of the parent and child delays can be found on a line similar to:
... io_init: child delay = 10, parent delay = 15 (milliseconds)
NetBackup Media Server Network Buffer Size. The values in use for the Network Buffer Size parameter on the media server can be found on lines similar to these in debug log files: The receive network buffer is used by the bptm child process to read from the network during a remote backup.
...setting receive network buffer to 263168 bytes
The send network buffer is used by the bptm child process to write to the network during a remote restore.
...setting send network buffer to 131072 bytes
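The wait and delay counters quoted in these procedures can be extracted from log lines mechanically rather than by eye. The sketch below runs sed over sample lines copied from the text; against a real system you would feed it the bptm, bpbkar, or tar log files themselves. This is an assumption-laden sketch: the exact log wording can vary by release, so treat the patterns as a starting point.

```shell
# Hedged sketch: prints "<wait> <delay>" for the two log-line shapes shown
# in the text. Patterns may need adjusting for other NetBackup releases.
parse_counters() {
    sed -n \
      -e 's/.*waited \([0-9][0-9]*\) times for empty buffer, delayed \([0-9][0-9]*\) times.*/\1 \2/p' \
      -e 's/.*waited for [a-z]* buffer \([0-9][0-9]*\) times, delayed \([0-9][0-9]*\) times.*/\1 \2/p'
}

# Sample lines from the text:
echo "... waited 224 times for empty buffer, delayed 254 times" | parse_counters  # -> 224 254
echo "... waited for full buffer 1 times, delayed 22 times"     | parse_counters  # -> 1 22
```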
See NetBackup media server network buffer size on page 107 for more information about the Network Buffer Size parameter on the media server.

Suppose you want to analyze a local backup with a 30-minute data transfer duration, baselined at 5 megabytes/second, and a total data transfer of 9,000 megabytes. Because a local backup is involved, if you refer to Roles of processes during backup and restore operations on page 120, you can
127
determine that bpbkar (UNIX) or bpbkar32 (Windows) is the data producer and bptm is the data consumer. You would next want to determine the wait and delay values for bpbkar (or bpbkar32) and bptm by following the procedures described in Determining wait and delay counter values on page 123. For this example, suppose those values were: Process
bpbkar (UNIX) bpbkar32 (Windows) bptm 95 105
Wait
29364
Delay
58033
Using these values, you can determine that the bpbkar (or bpbkar32) process is being forced to wait by a bptm process which cannot move data out of the shared buffer fast enough. Next, you can determine time lost due to delays by multiplying the Delay counter value by the parent or child delay value, whichever applies. In this example, the bpbkar (or bpbkar32) process uses the child delay value, while the bptm process uses the parent delay value. (The defaults for these values are 10 for child delay and 15 for parent delay.) The values are specified in milliseconds. See Parent/child delay values on page 118 for more information on how to modify these values. Use the following equations to determine the amount of time lost due to these delays:
bpbkar (UNIX) / bpbkar32 (Windows) = 58033 delays X 0.020 seconds = 1160 seconds = 19 minutes 20 seconds
bptm = 105 delays X 0.030 seconds = 3 seconds
This is useful in determining that the delay duration for the bpbkar (or bpbkar32) process is significant. If this delay were entirely removed, the resulting transfer time of 10:40 (total transfer time of 30 minutes minus delay of 19 minutes and 20 seconds) would indicate a throughput value of 14 Megabytes/sec, nearly a threefold increase. This type of performance increase would warrant expending effort to investigate how the tape or disk drive performance can be improved.
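The arithmetic above can be scripted. The following is a minimal sketch (not a NetBackup tool); the counter values and the 0.020/0.030-second delay intervals are the ones used in this example.

```python
# Estimate time lost to delays and the throughput gained by removing them.
# Values are taken from the example above.

def time_lost(delay_count, delay_interval_seconds):
    """Seconds a process spent sleeping in its delay loop."""
    return delay_count * delay_interval_seconds

bpbkar_lost = time_lost(58033, 0.020)   # child delay applies to bpbkar
bptm_lost = time_lost(105, 0.030)       # parent delay applies to bptm

total_transfer_seconds = 30 * 60        # 30-minute backup
data_megabytes = 9000

# Projected throughput if the dominant (bpbkar) delay were removed entirely.
projected = data_megabytes / (total_transfer_seconds - bpbkar_lost)

print(f"bpbkar time lost: {bpbkar_lost:.0f} seconds")
print(f"bptm time lost:   {bptm_lost:.2f} seconds")
print(f"projected throughput: {projected:.1f} MB/sec")
```

The projected figure (about 14 MB/sec) matches the nearly threefold increase described above.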
The number of delays should be interpreted within the context of how much data was moved. As the amount of data moved increases, the significance threshold for counter values increases as well. Again, using the example of a total of 9,000 Megabytes of data being transferred, assume a 64-Kilobytes buffer size. You can determine the total number of buffers to be transferred using the following equation:
Number_Kbytes = 9,000 X 1,024 = 9,216,000 Kilobytes
Number_Slots = 9,216,000 / 64 = 144,000
The Wait counter value can now be expressed as a percentage by dividing it by the total number of buffers transferred:
bpbkar (UNIX) / bpbkar32 (Windows) = 29364 / 144,000 = 20.39%
bptm = 95 / 144,000 = 0.07%
In this example, in the 20 percent of cases where the bpbkar (or bpbkar32) process needed an empty shared data buffer, that shared data buffer has not yet been emptied by the bptm process. A value this large indicates a serious issue, and additional investigation would be warranted to determine why the data consumer (bptm) is having issues keeping up. In contrast, the delays experienced by bptm are insignificant for the amount of data transferred. You can also view the Delay and Wait counters as a ratio:
bpbkar (UNIX) / bpbkar32 (Windows) = 58033 / 29364 = 1.98
In this example, on average the bpbkar (or bpbkar32) process delayed twice for each wait condition encountered. If this ratio is large, consider increasing the parent or child delay value (whichever applies) to avoid the overhead of checking too often for a shared data buffer in the correct state. Conversely, if this ratio is close to 1, consider reducing the applicable delay value so the process checks more often, which may increase data throughput. Keep in mind
that the parent and child delay values are rarely changed in most NetBackup installations. The preceding information explains how to determine if the values for wait and delay counters are substantial enough for concern. The wait and delay counters are related to the size of data transfer. A value of 1,000 may be extreme when only 1 Megabyte of data is being moved. The same value may indicate a well-tuned system when gigabytes of data are being moved. The final analysis must determine how these counters affect performance by considering such factors as how much time is being lost and what percentage of time a process is being forced to delay.
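The buffer-count and percentage arithmetic above can also be scripted. This is a minimal sketch using the figures from this example (9,000 Megabytes moved through 64-Kilobyte buffers):

```python
# Express wait counters as a percentage of the buffers transferred.

data_kbytes = 9000 * 1024                            # 9,216,000 KB
buffer_kbytes = 64
buffers_transferred = data_kbytes // buffer_kbytes   # 144,000 buffers

def wait_percent(wait_count):
    """Fraction of buffer hand-offs on which the process had to wait."""
    return 100.0 * wait_count / buffers_transferred

print(f"{buffers_transferred} buffers transferred")
print(f"bpbkar waits: {wait_percent(29364):.2f}%")   # serious: ~20%
print(f"bptm waits:   {wait_percent(95):.2f}%")      # insignificant
```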
bptm read waits

The bptm debug log contains messages such as:
...waited for full buffer 1681 times, delayed 12296 times
The first number in the message is the number of times bptm waited for a full buffer: that is, the number of times bptm write operations waited for data from the source. If the technique described in Determining wait and delay counter values on page 123 shows that the Wait counter indicates a performance issue, changing the number of buffers will not help, but adding multiplexing may.
bptm write waits

The bptm debug log contains messages such as:
...waited for empty buffer 1883 times, delayed 14645 times
The first number in the message is the number of times bptm waited for an empty buffer: that is, the number of times data arrived from the source faster than it could be written to tape or disk. If the technique described in Determining wait and delay counter values on page 123 shows that the Wait counter indicates a performance issue, reduce the multiplexing factor if you are using multiplexing. Adding more buffers may also help.
bptm delays

The bptm debug log contains messages such as:
...waited for empty buffer 1883 times, delayed 14645 times
The second number in the message is the number of times bptm waited for an available buffer. If the technique described in Determining wait and delay counter values on page 123 shows that the Delay counter indicates a performance issue, investigate further. Each delay interval is 30 ms.
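Extracting these counters from a log can be automated. The following sketch matches only the message text quoted above; real bptm debug log lines also carry timestamps and other fields around this text, so treat the pattern as an assumption to adapt.

```python
import re

# Pull wait/delay counters out of bptm debug log lines of the form
# "...waited for full buffer 1681 times, delayed 12296 times".
PATTERN = re.compile(
    r"waited for (full|empty) buffer (\d+) times, delayed (\d+) times")

def parse_bptm_counters(log_text):
    """Return a list of (buffer_kind, wait_count, delay_count) tuples."""
    results = []
    for kind, waits, delays in PATTERN.findall(log_text):
        results.append((kind, int(waits), int(delays)))
    return results

sample = ("...waited for full buffer 1681 times, delayed 12296 times\n"
          "...waited for empty buffer 1883 times, delayed 14645 times\n")

for kind, waits, delays in parse_bptm_counters(sample):
    side = ("waiting for data from the source" if kind == "full"
            else "waiting for a buffer to be drained to media")
    print(f"{side}: waited {waits} times, delayed {delays} times")
```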
Fragment size and NetBackup restores
The Reduce fragment size to setting on the Storage Unit dialog limits the largest fragment size of the image. By limiting the size of the fragment, the size of the largest read during restore is minimized, reducing restore time. This is especially important when restoring a small number of individual files rather than entire directories or file systems. For many sites, a fragment size of approximately 10 gigabytes will result in good performance for both backup and restore. When choosing a fragment size, consider the following:
- Larger fragment sizes usually favor backup performance, especially when backing up large amounts of data. Smaller fragments can slow down large backups: each time a new fragment is created, the backup stream is interrupted.
- Larger fragment sizes do not hinder performance when restoring large amounts of data. But when restoring a few individual files, larger fragments may slow down the restore.
- Larger fragment sizes do not hinder performance when restoring from non-multiplexed backups. For multiplexed backups, larger fragments may slow down the restore. In multiplexed backups, blocks from several images can be mixed together within a single fragment. During a restore, NetBackup positions to the nearest fragment and reads the data from there until it comes to the desired file. Splitting multiplexed backups into smaller fragments can improve restore performance.
- During restores, newer, faster devices can handle large fragments well. Slower devices, especially those that do not use fast locate block positioning, restore individual files faster when the fragment size is smaller. (In some cases, SCSI fast tape positioning can improve restore performance.)
Note: Unless you have particular reasons for creating smaller fragments (such as when restoring a few individual files, restoring from multiplexed backups, or restoring from older equipment), larger fragment sizes are likely to yield better overall performance.
After that, for every subsequent file to be restored, bptm determines where that file is, relative to the current position. If it is faster for bptm to position to that spot rather than to read all the data in between (and if fast locate is available), bptm uses positioning to get to the next file instead of reading all the data in between. If fast-locate is not available, bptm can read the data as quickly as it can position with MTFSR (forward space record). Therefore, fragment sizes for non-multiplexed restores matter if fast-locate is NOT available. In general, given smaller fragments, a restore reads less extraneous data. You can set the maximum fragment size for the storage unit on the Storage Unit dialog in the NetBackup Administration Console (Reduce fragment size to).
Some examples may help illustrate when smaller fragments do and do not help restores.
Example 1:
Assume that you are backing up four streams to a multiplexed tape, that each stream is a single 1-gigabyte file, and that the default maximum fragment size of 1 TB has been specified. The resulting backup image logically looks like the following, where TM denotes a tape mark, or file mark, that indicates the start of a fragment:

TM <4 gigabytes data> TM

When restoring any one of the 1-gigabyte files, the restore positions to the TM and then must read all 4 gigabytes to get the 1-gigabyte file. If you set the maximum fragment size to 1 gigabyte:

TM <1 gigabyte data> TM <1 gigabyte data> TM <1 gigabyte data> TM <1 gigabyte data> TM

this does not help: the restore still has to read all four fragments to pull out the 1 gigabyte of the file being restored.
Example 2:
This is the same as Example 1, but assume that the four streams are backing up 1 gigabyte each of /home or C:\. With the maximum fragment size (Reduce fragment size to) set to the default of 1 TB (and assuming all streams perform about the same), you again end up with:

TM <4 gigabytes data> TM

Restoring /home/file1 or C:\file1 and /home/file2 or C:\file2 from one of the streams has to read as much of the 4 gigabytes as necessary to restore all the data. But if you set Reduce fragment size to to 1 gigabyte, the image looks like this:

TM <1 gigabyte data> TM <1 gigabyte data> TM <1 gigabyte data> TM <1 gigabyte data> TM

In this case, /home/file1 or C:\file1 starts in the second fragment, and bptm positions to the second fragment to start the restore of /home/file1 or C:\file1 (this saves reading 1 gigabyte so far). After /home/file1 is done, if /home/file2 or C:\file2 is in the third or fourth fragment, the restore can position to the beginning of that fragment before it starts reading as it looks for the data. These examples illustrate that whether fragmentation benefits a restore depends on what the data is, what is being restored, and where in the image the data is. In Example 2, reducing the fragment size from 1 gigabyte to half a gigabyte (512 Megabytes) increases the chance that the restore can locate by skipping instead of reading when restoring relatively small amounts of an image.
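The effect in Example 2 can be illustrated with a simplified model: with fast locate, a restore positions to the start of the fragment containing the file, so everything before that fragment is skipped instead of read. This is a sketch under simplifying assumptions (uniform fragments, byte offsets known), not how bptm actually computes positions.

```python
GB = 1024 ** 3

def bytes_skipped(file_offset, fragment_size):
    """Bytes before the containing fragment that need not be read."""
    fragment_index = file_offset // fragment_size
    return fragment_index * fragment_size

# Suppose the wanted file starts 1.5 GB into a 4 GB image.
offset = int(1.5 * GB)

for frag in (1024 * GB, 1 * GB, GB // 2):   # 1 TB, 1 GB, 512 MB fragments
    print(f"fragment {frag // (1024 * 1024)} MB: "
          f"skip {bytes_skipped(offset, frag) / GB:.1f} GB")
```

With a 1-TB fragment nothing is skipped; 1-GB fragments skip 1 GB; 512-MB fragments skip the full 1.5 GB before the file, matching the discussion above.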
NUMBER_DATA_BUFFERS_RESTORE setting
This parameter can help keep other NetBackup processes busy while a multiplexed tape is positioned during a restore. Increasing this value causes NetBackup buffers to occupy more physical RAM. This parameter only applies to multiplexed restores. For more information on this parameter, see Shared memory (number and size of data buffers) on page 111.
Windows
install_directory\bin\admincmd\bpimage -create_image_list -client client_name
where client_name is the name of the client with many small backup images. The command creates image list files in the following directory:

UNIX
/usr/openv/netbackup/db/images/client_name
Windows
install_path\NetBackup\db\images\client_name
Do not edit these files, because they contain offsets and byte counts that are used for seeking to and reading the image information. Note: These files increase the size of the client directory.
where value is the number that provides the best performance for the system. Some experimentation may be needed, because the optimum value varies from system to system. A suggested starting value is 20. In any case, the value must not exceed 500 ms, as this may break TCP/IP. Once the optimum value for the system is found, place the command in a script under the /etc/rc2.d directory so that it is executed at boot time.
Example (Oracle):
Suppose that MPX_RESTORE_DELAY is not set in the bp.conf file, so its value is the default of 30 seconds. Suppose also that you initiate a restore from an Oracle RMAN backup that was backed up using 4 channels or 4 streams, and you use the same number of channels to restore. RMAN passes NetBackup a specific data request, telling NetBackup what information it needs to start and complete the restore. The first request is passed and received by NetBackup in 29 seconds, causing the MPX_RESTORE_DELAY timer to be reset. The next request is passed and received by NetBackup in 22 seconds, so again the timer is reset. The third request is received 25 seconds later, resetting the timer a third time, but the fourth request is received 31 seconds after the third. Since the fourth request was not received within the restore delay interval, NetBackup only starts three of the four restores. Instead of reading from the tape once, NetBackup queues
the fourth restore request until the previous three requests are completed. Since all of the multiplexed images are on the same tape, NetBackup mounts, rewinds, and reads the entire tape again to collect the multiplexed images for the fourth restore request. Note that in addition to NetBackup's reading the tape twice, RMAN waits to receive all the necessary header information before it begins the restore. If MPX_RESTORE_DELAY had been larger than 30 seconds, NetBackup would have received all four restore requests within the restore delay window and collected all the necessary header information with one pass of the tape. Oracle would have started the restore after this one tape pass, improving the restore performance significantly. MPX_RESTORE_DELAY needs to be set with caution, because it can decrease performance if its value is set too high. Suppose, for instance, that MPX_RESTORE_DELAY is set to 1800 seconds. When the final associated restore request arrives, NetBackup resets the request delay timer as it did with the previous requests. NetBackup then must wait for the entire 1800-second interval before it can start the restore. Therefore, try to set the value of MPX_RESTORE_DELAY so it is neither too high nor too low.
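The timer behavior in this example can be modeled with a short sketch. It is an illustration of the batching logic described above, not NetBackup code: the timer is reset on each request, and requests arriving within the window are grouped into one tape pass.

```python
# Model the MPX_RESTORE_DELAY timer: a new batch (tape pass) starts
# whenever the gap since the previous request exceeds the delay window.

def batch_requests(inter_arrival_seconds, mpx_restore_delay):
    """Group request indexes into batches served by one tape pass each."""
    batches = [[0]]                       # the first request opens a batch
    for i, gap in enumerate(inter_arrival_seconds, start=1):
        if gap <= mpx_restore_delay:
            batches[-1].append(i)         # timer reset; same tape pass
        else:
            batches.append([i])           # window expired; new tape pass
    return batches

# Gaps between the four RMAN requests in the example: 22, 25, 31 seconds.
print(batch_requests([22, 25, 31], 30))   # [[0, 1, 2], [3]]: two tape passes
print(batch_requests([22, 25, 31], 35))   # [[0, 1, 2, 3]]: one tape pass
```

With the default 30-second window, the fourth request falls into a second batch, which is exactly why the tape is read twice in the example; a slightly larger window batches all four into one pass.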
Media positioning
When a backup or restore is performed, the storage device must position the tape so that the data is over the read/write head. Depending on the location of the data and the overall performance of the media device, this can take a significant amount of time. When you conduct performance analysis with media containing multiple images, it is important to account for the time lag that occurs before the data transfer starts.
138 Tuning the NetBackup data transfer path NetBackup storage device performance
Tape streaming
If a tape device is being used at its most efficient speed, it is said to be streaming the data onto the tape. Generally speaking, if a tape device is streaming, the media rarely has to stop and restart. Instead, the media constantly spins within the tape drive. If the tape device is not being used at its most efficient speed, it may continually start and stop the media from spinning. This behavior is the opposite of tape streaming and usually results in a poor data throughput rate.
Data compression
Most tape devices support some form of data compression within the tape device itself. Compressible data (such as text files) yields a higher data throughput rate than non-compressible data, if the tape device supports hardware data compression. Tape devices typically come with two performance rates: maximum throughput and nominal throughput. Maximum throughput is based on how fast compressible data can be written to the tape drive when hardware compression is enabled in the drive. Nominal throughput refers to rates achievable with non-compressible data. Note: Tape drive data compression cannot be set by NetBackup. Follow the instructions provided with your OS and tape drive to be sure data compression is set correctly. In general, tape drive data compression is preferable to client (software) compression such as that available in NetBackup. Client compression may be desirable in some cases, such as for reducing the amount of data transmitted across the network for a remote client backup. See Tape versus client compression on page 145 for more information.
Chapter

This chapter includes the following topics:

- Multiplexing and multi-streaming on page 140
- Resource allocation on page 143
- Encryption on page 144
- Compression on page 144
- Using encryption and compression on page 146
- NetBackup Java on page 146
- Vault on page 146
- Fast recovery with bare metal restore on page 146
- Backing up many small files on page 146
- NetBackup Operations Manager (NOM) on page 149
Figure 9-1: Multiplexing diagram

Multi-streaming writes multiple data streams, each to its own tape drive, unless multiplexing is used.

Figure 9-2: Multi-streaming diagram
Here are some things to consider with regard to multiplexing:

- Experiment with different multiplexing factors to find one where the tape drive is just streaming, that is, where the writes just fill the maximum bandwidth of your drive. This is the optimal multiplexing factor. For instance, if you determine that you can get 5 Megabytes/second from each of multiple concurrent read streams, use a multiplexing factor of two to get the maximum throughput to a DLT7000 (that is, 10 Megabytes/second).
- Use a higher multiplexing factor for incremental backups. Use a lower multiplexing factor for local backups.
- Expect the duplication of a multiplexed tape to take longer if it is demultiplexed (unless Preserve Multiplexing is specified on the duplication). Without Preserve Multiplexing, the duplication may take longer because multiple read passes of the source tape must be made. Using Preserve Multiplexing, however, may affect the restore time (see the next item).
- When you duplicate a multiplexed backup, demultiplex it. Demultiplexing the backups when they are duplicated significantly reduces recovery time.
- Do not use multi-streaming on single mount points. Multi-streaming takes advantage of the ability to stream data from several devices at once, which permits backups to take advantage of Read Ahead on a spindle or set of spindles in RAID environments. Multi-streaming from a single mount point encourages head thrashing and may result in degraded performance. Only conduct multi-streamed backups against single mount points if they are mirrored (RAID 0). However, even this is likely to result in degraded performance.
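The streaming calculation in the first consideration above can be sketched as follows. The DLT7000 numbers come from this guide; the second set of drive/stream rates is a made-up illustration, and the result is only a starting point to verify by experiment.

```python
import math

# Starting multiplexing factor: enough concurrent streams to fill the
# drive's bandwidth (drive rate / per-stream rate, rounded up).

def multiplexing_factor(drive_mb_per_sec, stream_mb_per_sec):
    return math.ceil(drive_mb_per_sec / stream_mb_per_sec)

print(multiplexing_factor(10, 5))    # DLT7000 example above: factor 2
print(multiplexing_factor(80, 12))   # illustrative faster drive: factor 7
```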
Multiplexing

To use multiplexing effectively, you must understand its implications for restore times. Multiplexing may decrease overall backup time when you back up large numbers of clients over slow networks, but it does so at the cost of recovery time. Restores from multiplexed tapes must pass over all nonapplicable data, which increases restore times. When recovery is required, demultiplexing causes delays in the restore process, because NetBackup must do more tape searching to accomplish the restore. Test restores to determine the impact of multiplexing on restore performance. A smaller maximum fragment size when multiplexing may also help restore performance. See Fragment size and NetBackup restores on page 130. When you initially set up a new environment, keep the multiplexing factor low. A multiplexing factor of four or less typically does not greatly impact the speed of restores, depending on the type of drive and the type of system. If the backups do not finish within their assigned window, multiplexing can be increased to meet the window. However, increasing the multiplexing factor provides diminishing returns as the number of multiplexed clients increases. The optimum multiplexing factor is the number of clients needed to keep the buffers full for a single tape drive. Set the multiplexing factor to four and do not multi-stream. Run benchmarks in this environment. Then, if needed, change the values involved until both the backup and restore window parameters are met.

Multi-streaming
The NEW_STREAM directive is useful for fine-tuning streams so that no disk subsystem is under-utilized or over-utilized.
Resource allocation
The following adjustments can be made to improve resource allocation.
Sharing reservations
For image duplication or synthetic backup, NetBackup reserves all the required source (read) media exclusively. No other job can use the media while it is reserved. If two or more duplication or synthetic jobs that require a common set of source media start at the same time, one job starts and the others wait until the first job terminates. To improve performance, NetBackup 6.5 allows shared reservations. With shared reservations, multiple jobs can reserve the same media (though only one job can use it at a time). In other words, the second job does not have to wait for the first job to terminate; it can access the media as soon as the first job is done with it.

To enable the sharing of reservations, create the following file.

On Windows:

install_path\Veritas\NetBackup\db\config\RB_USE_SHARED_RESERVATIONS

A related touch file, RB_DISABLE_REAL_UNLOADS_ON_DEMAND, can be created in the following locations:

On UNIX:

/usr/openv/netbackup/db/config/RB_DISABLE_REAL_UNLOADS_ON_DEMAND

On Windows:

install_path\Veritas\NetBackup\db\config\RB_DISABLE_REAL_UNLOADS_ON_DEMAND
Encryption
When the NetBackup client encryption option is enabled, your backups may run slower. How much slower depends on the throttle point in your backup path. If the network is the bottleneck, encryption should not hinder performance; otherwise, encryption may slow down the backup. If you are multi-streaming encrypted backups on a client with multiple CPUs, try to define one fewer stream than the number of CPUs. For example, if the client has four CPUs, define three or fewer streams for the backup. This approach can minimize CPU contention.
Compression
Two types of compression can be used with NetBackup, client compression (configured in the NetBackup policy) and tape drive compression (handled by the device hardware).
The decision to use data compression should be based on the compressibility of the data itself. Note the following basic levels of compressibility, in descending order:
- Plain text: usually the most compressible type of data.
- Executable code: may compress somewhat, but not as much as plain text.
- Already compressed data: often, no further compression is possible.
- Encrypted data: may expand in size if compression is applied. See Using encryption and compression on page 146.
Tape drive compression is almost always preferable to client compression. Compression is CPU intensive, and tape drives have built-in hardware to perform compression. Avoid using both tape compression and client compression: compressing data that is already compressed can actually increase the amount of backed-up data. Only in rare cases is it beneficial to use client (software) compression. Those cases usually include the following characteristics:
The client data is highly compressible. The client has abundant CPU resources.
You need to minimize the data sent across the network between the client and server. In other cases, however, NetBackup client compression should be turned off, and the hardware should handle the compression.
Client compression reduces the amount of data sent over the network, but increases CPU usage on the client. On UNIX: The NetBackup client configuration setting MEGABYTES_OF_MEMORY may help client performance. This option sets the amount of memory available for compression. It is undesirable to compress files that are already compressed. If the client data is being compressed twice, refer to the NetBackup configuration option COMPRESS_SUFFIX. You can use this option to exclude files with certain suffixes from compression. Edit this setting through bpsetconfig. See the NetBackup Administrators Guide, Volume II.
NetBackup Java
For performance improvement, refer to the following sections in the NetBackup Administrators Guide for UNIX and Linux, Volume I: Configuring the NetBackup-Java Administration Console, and the subsection NetBackup-Java Performance Improvement Hints. In addition, the NetBackup Release Notes may contain information about NetBackup Java performance.
Vault
Refer to the Best Practices chapter of the NetBackup Vault Administrators Guide.
- Use the FlashBackup (or FlashBackup-Windows) policy type. This is a feature of NetBackup Snapshot Client. FlashBackup is described in the NetBackup Snapshot Client Administrators Guide. See FlashBackup on page 148 of this Tuning guide for a related tuning issue.
- On Windows, make sure virus scans are turned off (this may double performance).
- Snap a mirror (such as with the FlashSnap method in Snapshot Client) and back that up as a raw partition. This does not allow individual file restore from tape.
- Turn off or reduce logging. The NetBackup logging facility can affect the performance of backup and recovery processing. Logging is usually enabled only to troubleshoot a NetBackup problem, so that any performance impact is short-term. The impact depends on the amount of logging used and the verbosity level set.
- Make sure the NetBackup buffer size is the same size on both the servers and clients.
- Consider upgrading NIC drivers as new releases appear.
- On Windows, run the following bpbkar throughput test on the client:
  C:\Veritas\Netbackup\bin\bpbkar32 -nocont > NUL 2>
  For example:
C:\Veritas\Netbackup\bin\bpbkar32 -nocont c:\ > NUL 2> temp.f
- When initially configuring the Windows server, optimize TCP/IP throughput rather than shared file access.
- Always boost background performance on Windows rather than foreground performance.
- Turn off the NetBackup Client Job Tracker if the client is a system server.
- Regularly review the patch announcements for every server OS. Install patches that affect TCP/IP functions, such as correcting out-of-sequence delivery of packets.
FlashBackup
If using Snapshot Client FlashBackup with a copy-on-write snapshot method
If you are using the FlashBackup feature of Snapshot Client with a copy-on-write method such as nbu_snap, assign the snapshot cache device to a separate hard drive. This will improve performance by reducing disk contention and potential head thrashing due to the writing of data to maintain the snapshot.
Enter the desired values in the FBU_READBLKS file, as follows. On the first line of the file, enter an integer read buffer size in bytes for full backups, optionally followed by the read buffer size in bytes for incremental backups. By default, the raw partition is read in 131072-byte (128 KB) increments during full backups and in 32768-byte (32 KB) increments during incremental backups. If you change both values, separate them with a space. For example, to set the full backup read buffer to 256 KB and the incremental read buffer to 64 KB, enter the following on the first line of the file:
262144 65536
You can use the second line of the file to set the tape record write size, also in bytes. The default is the same size as the read buffer. The first entry on
the second line sets the full backup write buffer size, the second value sets the incremental backup write buffer size. Note: Resizing the read buffer for incremental backups can result in a faster backup in some cases, and a slower backup in others. The result depends on such factors as the location of the data to be read, the size of the data to be read relative to the size of the read buffer, and the read characteristics of the storage device and the I/O stack. Experimentation may be necessary to achieve the best setting.
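Generating the file can be scripted. The sketch below writes FBU_READBLKS in the current directory for illustration; the real touch file lives in your NetBackup installation's configuration area, so adjust the path to match your environment.

```python
# Write the two-line FBU_READBLKS file described above: line 1 holds the
# full and incremental read buffer sizes (bytes), line 2 the optional
# write (tape record) sizes.

def write_fbu_readblks(path, full_read, incr_read,
                       full_write=None, incr_write=None):
    lines = [f"{full_read} {incr_read}"]
    if full_write is not None and incr_write is not None:
        lines.append(f"{full_write} {incr_write}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# 256-KB full and 64-KB incremental read buffers, as in the example above.
write_fbu_readblks("FBU_READBLKS", 256 * 1024, 64 * 1024)
```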
On Solaris servers, open the /opt/VRTSnom/bin/nomsrvctl file and change the value of the MAX_HEAP parameter. To increase the heap size to 2048 MB, edit the parameter as follows:
MAX_HEAP=-Xmx2048m
Save the nomsrvctl file. Then stop and restart the NOM processes, as follows:
/opt/VRTSnom/bin/NOMAdmin -stop_service /opt/VRTSnom/bin/NOMAdmin -start_service
On Windows servers, open the Registry Editor and go to the following location:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\VRTSnomSrvr\Parameters

Increase the JVM Option Number 0 value. For example, enter -Xmx2048m to increase the heap size to 2048 MB. Then stop and restart the NOM services, as follows:
install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -stop_service install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -start_service
On Solaris servers, run the webgui command to change the heap size. For example, to change the heap size to 512 MB, run the following:
/opt/VRTSweb/bin/webgui maxheap 512
should be changed to
-n NOM_nom-sol6 -x tcpip(BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gp 8192 -ct+ -gd DBA -gk DBA -gl DBA -ti 60 -c 512M -cs -ud -m
Note: This example replaced -c 25M -ch 500M -cl 25M with -c 512M -cs in the nomserver.conf file to increase the cache size to 512 MB. In the same manner, to increase the cache size to 1 GB, replace -c 25M -ch 500M -cl 25M with -c 1G -cs. The -ch and -cl server options set the maximum and minimum cache size, respectively. The -cs option logs the cache size changes for the database server.

2 Save the nomserver.conf file.

3 Stop and restart the NOM processes, as follows:
/opt/VRTSnom/bin/NOMAdmin -stop_service /opt/VRTSnom/bin/NOMAdmin -start_service
The logs for the cache size changes are stored in:
/opt/VRTSnom/logs/nomdbsrv.log.
To set the cache size using the -c server option on Windows servers

1 Open the install_path\db\conf\server.conf file. For example, to increase the cache size to 512 MB, add -c 512M -cs to the content of the server.conf file, as shown in the following example:
-n NOM_PUNENOM -x tcpip(BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786) -o "install_path\db\log\server.log" -m
should be changed to
-n NOM_PUNENOM -x tcpip(BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786) -o "install_path\db\log\server.log" -c 512M -cs -m
Note: The -cs option logs the cache size changes for the database server. In the same manner, to increase the cache size to 1 GB, add -c 1G -cs to the content of the server.conf file.

2 Stop and restart the NOM services, as follows:
install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -stop_service install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -start_service
The logs for the cache size changes are stored in install_path\db\log\server.log.
Symantec also recommends that you not store the database files on the hard disk that contains your operating system files. Use the following procedures to move the NOM database and log files to a different hard disk. The first two procedures are for moving the NOM database files on Windows or Solaris; the last two procedures are for moving the database log files.

To move the NOM database to a different hard disk on Windows

1 Stop all the NOM services by entering the following command:

install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin -stop_service

2 Open the databases.conf file with a text editor from the following directory:

install_path\NetBackup Operations Manager\db\conf
These paths specify the default location of the NOM primary database and the alerts database respectively. Enter the new path of the database files in the databases.conf file. If you want to move the database files to a location such as E:\NOM located on a different hard disk, change the content of databases.conf file to the following:
"E:\NOM\vxpmdb.db" "E:\NOM\vxam.db"
Ensure that you specify each path in double quotes. Also, the directories in the specified path should not contain any special characters such as %, ~, !, @, $, &, ^, or #. For example, do not specify a path such as E:\NOM%.
3 Save the databases.conf file.
4 Copy the database files to the new location. Copy vxpmdb.db and vxam.db from install_path\NetBackup Operations Manager\db\data to a location such as E:\NOM on another hard disk.
You should run and monitor NOM for a certain period after moving the database. If NOM works as expected, you can delete vxpmdb.db and vxam.db from install_path\NetBackup Operations Manager\db\data.
5 Restart all the NOM services:
install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -start_service
To move the NOM database to a different hard disk on Solaris
1 Stop all NOM processes by entering the following command:
/opt/VRTSnom/bin/NOMAdmin -stop_service
2 Move the existing database directory aside:
mv /opt/VRTSnom/db/data /opt/VRTSnom/db/olddata
3 To move the database to a location such as /usr1/mydata on a different hard disk, create a symbolic link from /opt/VRTSnom/db/data to /usr1/mydata. To do this, enter the following command:
ln -s /usr1/mydata /opt/VRTSnom/db/data
4 Copy the primary and alerts databases to the new location. Enter the following commands to copy the database files:
cp /opt/VRTSnom/db/olddata/vxpmdb.db /usr1/mydata
cp /opt/VRTSnom/db/olddata/vxam.db /usr1/mydata
You should run and monitor NOM for a certain period after moving the database. If NOM works as expected, you can delete vxpmdb.db and vxam.db from /opt/VRTSnom/db/olddata.
5 Restart all NOM processes:
/opt/VRTSnom/bin/NOMAdmin -start_service
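Taken together, the Solaris procedure amounts to the following command sequence. This is a sketch only: it assumes the default /opt/VRTSnom install location and the example target directory /usr1/mydata from the steps above, with the symbolic link created at /opt/VRTSnom/db/data (the location NOM reads) pointing to the new directory.

```shell
# Sketch of the full Solaris database move (run as root; the paths are
# the examples used in the procedure above).
/opt/VRTSnom/bin/NOMAdmin -stop_service
mv /opt/VRTSnom/db/data /opt/VRTSnom/db/olddata
ln -s /usr1/mydata /opt/VRTSnom/db/data
cp /opt/VRTSnom/db/olddata/vxpmdb.db /usr1/mydata
cp /opt/VRTSnom/db/olddata/vxam.db /usr1/mydata
/opt/VRTSnom/bin/NOMAdmin -start_service
```

After running and monitoring NOM for a period, the olddata copies can be deleted as described above.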
To move the database log files to a different hard disk on Windows
1 Stop all NOM services:
install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin -stop_service
2 Navigate to the following location:
install_path\NetBackup Operations Manager\db\WIN32
Run the command that sets the new location of the database log file, where directory_path is the path where you want to store the database logs and database_path is the path where your database is located. This command moves the log file associated with the NOM primary database to the new directory (directory_path). It is recommended to use vxpmdb.log or vxam.log as the name of the log file.
3 Restart all NOM services:
install_path\NetBackup Operations Manager\bin\admincmd\NOMAdmin.bat -start_service
To move the database log files to a different hard disk on Solaris
1 Stop all the NOM processes:
/opt/VRTSnom/bin/NOMAdmin -stop_service
Run the command that sets the new location of the database log file, where directory_path is the path where you want to store your database log file and database_path is the path where the NOM database is located. This command moves the log file associated with the NOM primary database to the new directory (directory_path). It is recommended to use vxpmdb.log or vxam.log as the name of the log file.
4 Restart all NOM processes:
/opt/VRTSnom/bin/NOMAdmin -start_service
Solaris:
/opt/VRTSnom/bin/NOMAdmin
The directory location (directory_name) must be the same in both commands.
3 For NOM 6.5.1, defragment the database:
NOMAdmin -defrag
Chapter 10
Tuning disk I/O performance
Hardware performance hierarchy on page 158
Hardware configuration examples on page 165
Tuning software for better performance on page 166
Note: The critical factors in performance are not software-based. They are hardware selection and configuration. Hardware has roughly four times the weight that software has in determining performance.
Figure 10-1 Performance hierarchy in a NetBackup media server, from host memory (Level 5) through PCI bridges, buses, and cards (Level 4), fibre channel connections (Level 3), and array RAID controllers (Level 2), down to shelf adaptors and drives (Level 1)
In general, all data going to or coming from disk must pass through host memory. In the following diagram, a dashed line shows the path that the data takes through a media server. Figure 10-2 Data stream in NetBackup media server to arrays
The data moves up through the ethernet PCI card at the far right. The card sends the data across the PCI bus and through the PCI bridge into host memory. NetBackup then writes this data to the appropriate location. In a disk example, the data passes through one or more PCI bridges, over one or more PCI buses, through one or more PCI cards, across one or more fibre channels, and so on. Sending data through more than one PCI card increases bandwidth by breaking up the data into large chunks and sending a group of chunks at the same time to multiple destinations. For example, a write of 1 MB could be split into 2 chunks going to 2 different arrays at the same time. If the path to each array is x bandwidth, the aggregate bandwidth will be approximately 2x. Each level in the Performance Hierarchy diagram represents the transitions over which data will flow. These transitions have bandwidth limits. Between each level there are elements that can affect performance as well.
Level 1 bandwidth potential is determined by the technology used. For FC-AL, the arbitrated loop could be either 1-gigabit or 2-gigabit fibre channel. An arbitrated loop is a shared-access topology, which means that only 2 entities on the loop can be communicating at one time. For example, one disk drive and the shelf adaptor can communicate. So even though a single disk drive might be capable of 2-gigabit bursts of data transfers, there is no aggregation of this bandwidth: multiple drives cannot communicate with the shelf adaptor at the same time to yield multiples of the individual drive bandwidth.
Larger disk arrays will have more than one internal FC-AL. Shelves may even support 2 FC-AL so that there will be two paths between the RAID controller and every shelf, which provides for redundancy and load balancing.
While this diagram shows a single point-to-point connection between an array and the host, a real-world use more typically includes a SAN fabric (having one or more fibre channel switches). The logical result is the same, in that either is a data path between the array and the host. When these paths are not arbitrated loops (for example, if they were fabric fibre channel), they do not have the shared-access topology limitations. That is, if two arrays are connected to a fibre channel switch and the host has a single fibre channel connection to the switch, the arrays can be communicating at the same time (the switch does the coordination with the host fibre channel connection). However, this does not aggregate bandwidth, since the host is still limited to a single fibre channel connection. Fibre channel is generally 1 or 2 gigabit (both arbitrated loop and fabric topology). Faster speeds are coming on the market. A general rule-of-thumb when considering protocol overhead is that one can divide the gigabit rate by 10 to get an approximate megabyte-per-second bandwidth. So, 1-gigabit fibre channel can theoretically achieve approximately 100 MB/second and 2-gigabit fibre channel can theoretically achieve approximately 200 MB/second. Fibre channel is also similar to traditional LANs, in that a given interface can support multiple connection rates. That is, a 2-gigabit fibre channel port will also connect to devices that only support 1 gigabit.
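The rule of thumb can be sanity-checked with shell arithmetic (illustrative only; the factor of 10 folds encoding and protocol overhead into a single divisor):

```shell
# Approximate MB/second for a fibre channel link: gigabit rate / 10.
fc_gigabits=2
mb_per_sec=$(( fc_gigabits * 1000 / 10 ))
echo "${fc_gigabits}-gigabit fibre channel: about ${mb_per_sec} MB/second"
```

So a 2-gigabit link yields the approximately 200 MB/second quoted above.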
A typical host will support 2 or more PCI buses, with each bus supporting 1 or more PCI cards. A bus has a topology similar to FC-AL in that only 2 endpoints can be communicating at the same time. That is, if there are 4 cards plugged into a PCI bus, only one of them can be communicating with the host at a given instant. Multiple PCI buses are implemented to allow multiple data paths to be communicating at the same time, resulting in aggregate bandwidth gains. PCI buses have 2 key factors involved in bandwidth potential: the width of the bus - 32 or 64 bits, and the clock or cycle time of the bus (in Mhz). As a rule of thumb, a 32-bit bus can transfer 4 bytes per clock and a 64-bit bus can transfer 8 bytes per clock. Most modern PCI buses support both 64-bit and 32-bit cards. Currently PCI buses are available in 4 clock rates:
33 Mhz
66 Mhz
100 Mhz (sometimes referred to as PCI-X)
133 Mhz (sometimes referred to as PCI-X)
PCI cards also come in different clock rate capabilities. Backward-compatibility is very common; for example, a bus rated at 100 Mhz will support 100, 66, and 33 Mhz cards. Likewise, a 64-bit bus will support both 32-bit and 64-bit cards. They can also be mixed; for example, a 100-Mhz 64-bit bus can support any mix of clock and width that are at or below those values.
Note: In a shared-access topology, a slow card can negatively impact the performance of other fast cards on the same bus, because the bus adjusts to the right clock and width for each transfer. One moment it could be doing 100 Mhz 64 bit to card #2, and at another moment 33 Mhz 32 bit to card #3. Since the transfer to card #3 is so much slower, it takes longer to complete, and the time that is lost might otherwise have been used for moving data faster with card #2. Also remember that a PCI bus cannot move data in both directions at once: while it is doing a transfer in one direction, it cannot move data in the other direction, even for another card. Real-world bandwidth is generally around 80% of the theoretical maximum (clock * width). Following are rough estimates for bandwidths that can be expected:
64 bit/33 Mhz = approximately 200 MB/second
64 bit/66 Mhz = approximately 400 MB/second
64 bit/100 Mhz = approximately 600 MB/second
64 bit/133 Mhz = approximately 800 MB/second
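These estimates follow directly from clock rate times bus width, scaled by the 80% real-world factor. A quick sketch with the fastest bus listed:

```shell
# 64-bit (8-byte) bus at 133 Mhz: theoretical MB/second = clock * width,
# then take roughly 80% for real-world throughput.
clock_mhz=133
width_bytes=8
theoretical=$(( clock_mhz * width_bytes ))   # 1064 MB/second theoretical
realworld=$(( theoretical * 80 / 100 ))      # about 850 MB/second
echo "64 bit/${clock_mhz} Mhz: approximately ${realworld} MB/second"
```

which rounds to the approximately 800 MB/second listed above.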
A drive has sequential access bandwidth and average latency times for seek and rotational delays. Drives perform optimally when doing sequential I/O to disk. Non-sequential I/O forces movement of the disk head (that is, seek and rotational latency).
This movement is a huge overhead compared to the amount of data transferred. So the more non-sequential I/O done, the slower it will get. Reading and/or writing more than one stream at a time will result in a mix of short bursts of sequential I/O with seek and rotational latency in between, which will significantly degrade overall throughput. Because different drive types have different seek and rotational latency specifications, the type of drive selected has a large effect on how much the degradation will be. From best to worst, such drives are fibre channel, SCSI, and SATA, with SATA drives usually twice the latency of fibre channel. However, SATA drives have about 80% the sequential performance that fibre channel drives do.
A RAID controller has cache memory of varying sizes. The controller also does the parity calculations for RAID-5. Better controllers have this calculation (called XOR) in hardware which makes it faster. If there is no hardware-assisted calculation, the controller processor must perform it, and controller processors are not usually high performance. A PCI card can be limited either by the speed supported for the port(s) or the clock rate to the PCI bus. A PCI bridge is usually not an issue because it is sized to handle whatever PCI buses are attached to it.
Memory can be a limit if there is other intensive non-I/O activity in the system. Note that the Performance hierarchy diagram on page 158 does not include the host CPU(s). While CPU performance obviously contributes to all performance, it is generally not the bottleneck in modern systems for I/O-intensive workloads, because very little work is done at that level. The CPU must execute a read operation and a write operation, but those operations do not take up much bandwidth. An exception is when older gigabit ethernet card(s) are involved, because the CPU has to do more of the overhead of network transfers.
Example 1
A general hardware configuration could have dual 2-gigabit fibre channel ports on a single PCI card. In such a case, the following is true:
Potential bandwidth is approximately 400 MB/second.
For maximum performance, the card must be plugged into at least a 66 Mhz PCI slot.
No other cards on that bus should need to transfer data at the same time, because that single card will saturate the PCI bus.
Putting 2 of these cards (4 ports total) onto the same bus and expecting them to aggregate to 800 MB/second will never work unless the bus and cards are 133 Mhz.
Example 2
The following more detailed example shows a pyramid of bandwidth potentials with aggregation capabilities at some points. Suppose you have the following hardware:
1x 66 Mhz quad 1-gigabit ethernet card
4x 66 Mhz 2-gigabit fibre channel cards
4x disk arrays with 1-gigabit fibre channel ports
1x Sun V880 server (2x 33 Mhz PCI buses and 1x 66 Mhz PCI bus)
In this case, for maximum backup and restore throughput with clients on the network, the following is one way to assemble the hardware so that no constraints limit throughput.
The quad 1-gigabit ethernet card can do approximately 400 MB/second throughput at 66 Mhz. It requires at least a 66 Mhz bus, because putting it in a 33 Mhz bus would limit throughput to approximately 200 MB/second. It will completely saturate the 66 Mhz bus, so do not put any other cards on that bus that need significant I/O at the same time.
Since the disk arrays have only 1-gigabit fibre channel ports, the fibre channel cards will degrade to 1 gigabit each.
166 Tuning disk I/O performance Tuning software for better performance
Each card can therefore move approximately 100 MB/second. With four cards, the total is approximately 400 MB/second. However, no single PCI bus is available that can support that 400 MB/second, since the 66-Mhz bus is already taken by the ethernet card. There are two 33-Mhz buses, which can each support approximately 200 MB/second. Therefore, you can put 2 of the fibre channel cards on each of the 2 buses.
This configuration can move approximately 400 MB/second for backup or restore. Real-world results of a configuration like this show approximately 350 MB/second.
Figure 10-3 Full example configuration: host memory (Level 5) connects through PCI bridges and buses (Level 4) to three PCI cards; two cards connect over fibre channel (Level 3) to the RAID controllers of Array 1 and Array 2 (Level 2), each with shelf adaptors and drives (Level 1), while the third card serves tape, Ethernet, or another non-disk device
Each shelf in the disk array has 9 drives because it uses a RAID 5 group of 8+1 (that is, 8 data disks + 1 parity disk). The RAID controller in the array uses a stripe unit size when performing I/O to these drives. Suppose that you know the stripe unit size to be 64KB. This means that when writing a full stripe (8+1) it will write 64KB to each drive. The amount of non-parity data is 8 * 64KB, or 512KB. So, internal to the array, the optimal I/O size is 512KB. This means that crossing Level 3 to the host PCI card should perform I/O at 512KB. The diagram shows two separate RAID arrays on two separate PCI buses. You want both to be performing I/O transfers at the same time. If each is optimal at 512K, the two arrays together are optimal at 1MB.
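The optimal I/O sizes above fall out of the stripe geometry; a quick check with the numbers from the text:

```shell
# RAID-5 8+1: a full stripe writes the 64 KB stripe unit to each of the
# 8 data disks, so 512 KB of non-parity data per array. Two arrays striped
# together on the host (RAID-0) are optimal at 1 MB.
stripe_unit_kb=64
data_disks=8
arrays=2
per_array_kb=$(( stripe_unit_kb * data_disks ))
total_kb=$(( per_array_kb * arrays ))
echo "optimal I/O: ${per_array_kb} KB per array, ${total_kb} KB overall"
```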
You can implement software RAID-0 to make the two independent arrays look like one logical device. RAID-0 is a plain stripe with no parity. Parity protects against drive failure, and this configuration already has RAID-5 parity protecting the drives inside the array. The software RAID-0 is configured for a stripe unit size of 512KB (the I/O size of each unit) and a stripe width of 2 (1 for each of the arrays). Since 1MB is the optimum I/O size for the volume (the RAID-0 entity on the host), that size is used throughout the rest of the I/O stack.
If possible, configure the file system mounted over the volume for 1MB. The application performing I/O to the file system also uses an I/O size of 1MB. In NetBackup, I/O sizes are set in the configuration touch file .../db/config/SIZE_DATA_BUFFERS_DISK. See Changing the size of shared data buffers on page 114.
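On a UNIX media server, the touch file contains only the buffer size in bytes. The following sketch is illustrative rather than a verbatim NetBackup procedure; NB_INSTALL is an assumed stand-in for the install path (by default /usr/openv/netbackup):

```shell
# Write a 1 MB (1048576-byte) disk buffer size into the touch file.
# NB_INSTALL is an assumed stand-in for the NetBackup install path.
conf_dir="${NB_INSTALL:-$(mktemp -d)}/db/config"
mkdir -p "$conf_dir"
echo 1048576 > "$conf_dir/SIZE_DATA_BUFFERS_DISK"
```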
Chapter 11
OS-related tuning factors
Kernel tuning (UNIX) on page 170
Kernel parameters on Solaris 8 and 9 on page 170
Kernel parameters on Solaris 10 on page 172
Message queue and shared memory parameters on HP-UX on page 174
Kernel parameters on Linux on page 176
Adjusting data buffer size (Windows) on page 176
Other Windows issues on page 178
Message queues
set msgsys:msginfo_msgmax = maximum message size
set msgsys:msginfo_msgmnb = maximum length of a message queue in bytes. The length of the message queue is the sum of the lengths of all the messages in the queue.
set msgsys:msginfo_msgmni = number of message queue identifiers
set msgsys:msginfo_msgtql = maximum number of outstanding messages system-wide that are waiting to be read across all message queues.
Semaphores
set semsys:seminfo_semmap = number of entries in semaphore map
set semsys:seminfo_semmni = maximum number of semaphore identifiers system-wide
set semsys:seminfo_semmns = number of semaphores system-wide
set semsys:seminfo_semmnu = maximum number of undo structures in system
set semsys:seminfo_semmsl = maximum number of semaphores per id
set semsys:seminfo_semopm = maximum number of operations per semop call
set semsys:seminfo_semume = maximum number of undo entries per process
Shared memory
set shmsys:shminfo_shmmin = minimum shared memory segment size
set shmsys:shminfo_shmmax = maximum shared memory segment size
set shmsys:shminfo_shmseg = maximum number of shared memory segments that can be attached to a given process at one time
set shmsys:shminfo_shmmni = maximum number of shared memory identifiers that the system will support
The ipcs -a command displays system resources and their allocation. It is useful when a process is hanging or sleeping, to see whether resources are available for the process to use.
Example:
This is an example of tuning the kernel parameters for NetBackup master servers and media servers, for a Solaris 8 or 9 system. Symantec provides this information only to assist in kernel tuning for NetBackup. See Kernel parameters on Solaris 10 on page 172 for Solaris 10. These are recommended minimum values. If /etc/system already contains any of these entries, use the larger of the existing setting and the setting provided here. Before modifying /etc/system, use the command /usr/sbin/sysdef -i to view the current kernel parameters. After you have changed the settings in /etc/system, reboot the system to allow the changed settings to take effect. After rebooting, the sysdef command will display the new settings.
*BEGIN NetBackup with the following recommended minimum settings in a Solaris /etc/system file
*Message queues
set msgsys:msginfo_msgmnb=65536
set msgsys:msginfo_msgmni=256
set msgsys:msginfo_msgssz=16
set msgsys:msginfo_msgtql=512
*Semaphores
set semsys:seminfo_semmap=64
set semsys:seminfo_semmni=1024
set semsys:seminfo_semmns=1024
set semsys:seminfo_semmnu=1024
set semsys:seminfo_semmsl=300
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=64
*Shared memory
set shmsys:shminfo_shmmax=<one half of system memory>
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=220
set shmsys:shminfo_shmseg=100
*END NetBackup recommended minimum settings
Socket parameters on Solaris 8 and 9
The TCP_TIME_WAIT_INTERVAL parameter sets the amount of time to wait after a TCP socket is closed before it can be used again. This is the time that a TCP connection remains in the kernel's table after the connection has been closed. The default value for most systems is 240000, which is 4 minutes (240 seconds) in milliseconds. If your server is slow because it handles many connections, check the current value for TCP_TIME_WAIT_INTERVAL and consider reducing it to 10000 (10 seconds). Use the following command to check the current value:
ndd -get /dev/tcp tcp_time_wait_interval
To reduce the value to 10 seconds, use the following command (note that an ndd setting does not persist across reboots unless it is also applied from a startup script):
ndd -set /dev/tcp tcp_time_wait_interval 10000
Force load parameters on Solaris 8 and 9
When system memory gets low, Solaris unloads unused drivers from memory and reloads drivers as needed. Tape drivers are a frequent candidate for unloading, since they tend to be less heavily used than disk drivers. Depending on the timing of these unload and reload events for the st (Sun), sg (Symantec), and Fibre Channel drivers, various issues may result. These issues can range from devices disappearing from a SCSI bus to system panics. Symantec recommends adding the following forceload statements to the /etc/system file. These statements prevent the st and sg drivers from being unloaded from memory:
forceload: drv/st
forceload: drv/sg
Other statements may be necessary for various Fibre Channel drivers, such as the following example for JNI:
forceload: drv/fcaw
For information on tuning these system resources, see Chapter 6, Resource Controls (Overview), in the Sun System Administration Guide: Solaris Containers-Resource Management and Solaris Zones. For further assistance with Solaris parameters, refer to the Solaris Tunable Parameters Reference Manual, available at: http://docs.sun.com/app/docs/doc/819-2724?q=Solaris+Tunable+Parameters The following sections of the Solaris Tunable Parameters Reference Manual may be of particular interest:
What's New in Solaris System Tuning in the Solaris 10 Release?
System V Message Queues
System V Semaphores
System V Shared Memory
Disable tcp_fusion
Symantec recommends that tcp_fusion be disabled. With tcp_fusion enabled, NetBackup performance may be degraded and processes such as bptm may pause intermittently. Use one of the following methods to disable tcp_fusion.
Use the modular debugger (mdb)
This method does not require a system reboot. Note, however, that tcp_fusion will be re-enabled at the next reboot, so you must follow these steps each time the system is rebooted.
Caution: System operation could be disrupted if this procedure is not followed carefully.
1 When no backup or restore jobs are active, run the following command:
echo 'do_tcp_fusion/W 0' | mdb -kw
Use the /etc/system file
This is a simpler procedure, but it requires the system to be rebooted. Add the following line to the /etc/system file:
set ip:do_tcp_fusion = 0
Once the /etc/system file is updated, you must restart the system before the change takes effect.
Message queue and shared memory parameters on HP-UX
Note on shmmax
Note the following:
shmmax = NetBackup shared memory allocation = (SIZE_DATA_BUFFERS * NUMBER_DATA_BUFFERS) * number of drives * MPX per drive
SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS are also discussed under Recommended shared memory settings on page 117.
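For example, with a 256 KB buffer size, 16 buffers, 2 drives, and multiplexing of 4 (all illustrative values, not recommendations), the formula gives:

```shell
# shmmax = (SIZE_DATA_BUFFERS * NUMBER_DATA_BUFFERS) * drives * MPX per drive
size_data_buffers=262144     # 256 KB
number_data_buffers=16
drives=2
mpx_per_drive=4
shmmax=$(( size_data_buffers * number_data_buffers * drives * mpx_per_drive ))
echo "shmmax must be at least ${shmmax} bytes"
```

which works out to 33554432 bytes (32 MB) for these values.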
3 Key in the new value.
4 Repeat the above steps for all the desired parameters. Once all the values have been changed, select Actions > Process New Kernel. This brings up a warning stating that a reboot is required to move the values into place. After the reboot, the sysdef command can be used to confirm that the correct value is in place.
Caution: Any changes to the kernel will require a reboot in order to move the new kernel into place. Do not make changes to the parameters unless a system reboot can be performed, or the changes will not be saved.
For further information on setting boot options for st, see /usr/src/linux*/drivers/scsi/README.st, subsection BOOT TIME.
* MaximumSGList
Windows includes enhanced scatter/gather list support for doing very large SCSI I/O transfers. Windows supports up to 256 scatter/gather segments of 4096 bytes each, allowing transfers up to 1048576 bytes.
Note: The OEMSETUP.INF file has been updated to automatically update the registry to support 65 scatter/gather segments. Normally, no additional changes are necessary, as this typically results in the best overall performance.
To change the data buffer size, do the following:
1 Click Start > Run and open the REGEDT32 program.
2 Select HKEY_LOCAL_MACHINE and follow the tree structure down to the QLogic driver as follows: HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > Ql2200 > Parameters > Device.
3 Double-click MaximumSGList:REG_DWORD:0x21.
4 Enter a value from 16 to 255 (0x10 hex to 0xFF). A value of 255 (0xFF) enables the maximum 1-megabyte transfer size. Setting a value higher than 255 reverts to the default of 64-kilobyte transfers. The default value is 33 (0x21).
5 Click OK.
6 Exit the Registry Editor, then shut down and reboot the system.
The main definition is the so-called SGList (scatter/gather list). This parameter sets the number of pages that can be either scattered or gathered (that is, read or written) in one DMA transfer. For the QLA2200, you set the parameter MaximumSGList to 0xFF (or just to 0x40 for 256 KB) and can then set 256 KB buffer sizes for NetBackup. Use extreme caution when attempting to modify this registry value, and always contact the vendor of the SCSI/fiber card first to ascertain the maximum value that the particular card can support. The same should be possible for other HBAs as well, especially fiber cards. The default for JNI fiber cards using driver version 1.16 is actually 0x80 (512 KB or 128 pages). The default for the Emulex LP8000 is 0x81 (513 KB or 129 pages). Note that for this approach to work, the HBA has to install its own SCSI miniport driver. If it does not, transfers are limited to 64 kilobytes; this applies to legacy cards like old SCSI cards. In conclusion, the built-in limit on Windows is 1024 kilobytes, unless you are using the default Microsoft miniport driver for legacy cards. The limitations are all to do with the HBA drivers and the limits of the physical devices attached to them. For example, Quantum DLT7000 drives work best with 128-kilobyte buffers and StorageTek 9840 drives with 256-kilobyte buffers. If these values are increased too far, the result could be damage to either the HBA or the tape drives or any devices in between (fiber bridges and switches, for example).
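The registry change described above could also be captured in a .reg file for repeatable deployment. This is a sketch only: the key path assumes the QLogic QLA2200 driver named in the procedure, and 0xff selects the maximum 1-megabyte transfer size; confirm the safe maximum with your HBA vendor before applying it.

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ql2200\Parameters\Device]
"MaximumSGList"=dword:000000ff
```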
Troubleshoot NetBackup's use of configuration files on Windows systems.
If you create a configuration file on a Windows system for NetBackup's use (on UNIX systems, such files are called touch files), the file name must match the file name that NetBackup is expecting. In particular, make sure the file name does not have an extension, such as .txt. If, for instance, you create a file called NOexpire to prevent the expiration of backup images, this file will not produce the desired effect if its name is NOexpire.txt.
Note also: the file must use a supported type of encoding, such as ANSI. Unicode encoding is not supported; if the file is in Unicode, it will not produce the desired effect. To check the encoding type, open the file using a tool that displays the current encoding, such as Notepad. Select File > Save As and check the options in the Encoding field. ANSI encoding will work properly.
Disable antivirus software when running file system backups on Windows 2000 or Windows XP.
Antivirus applications scan all files backed up by NetBackup, and thus load down the client's CPU and slow its backups. As a workaround, in the Backup, Archive, and Restore interface, on the NetBackup Client Properties dialog, General tab, clear the checkbox next to Perform incrementals based on archive bit.
Appendix
Additional resources
This chapter lists additional sources of information.
Storage Mountain, previously called Backup Central, is a resource for all backup-related issues. It is located at http://www.storagemountain.com. The following article discusses how and why to design a scalable data installation: High-Availability SANs, Richard Lyford, FC Focus Magazine, April 30, 2002.
Iperf, for measuring TCP and UDP bandwidth: http://dast.nlanr.net/Projects/Iperf1.1.1/index.htm
Bonnie, for measuring the performance of UNIX file system operations: http://www.textuality.com/bonnie
Bonnie++, which extends the capabilities of Bonnie: http://www.coker.com.au/bonnie++/readme.html
Tiobench, for testing I/O performance with multiple running threads: http://sourceforge.net/projects/tiobench/
You can find Veritas NetBackup news groups at: http://forums.veritas.com. Search on the keyword NetBackup to find threads relevant to NetBackup. The email list Veritas-bu discusses backup-related products such as NetBackup. Archives for Veritas-bu are located at: http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
Index
Symbols
/dev/null 94
/dev/rmt/2cbn 144
/dev/rmt/2n 144
/etc/rc2.d 135
/etc/system 170
/proc/sys (Linux) 176
/user/openv/netbackup/db/config/SIZE_DATA_BUFFERS 115
/user/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK 115
/usr/openv/netbackup 110
/usr/openv/netbackup/bin/admincmd/bpimage 134
/usr/openv/netbackup/db/error 93
/usr/openv/netbackup/db/images 135
/usr/openv/netbackup/db/media 68
/usr/openv/netbackup/logs 94
/usr/sbin 69
/usr/sbin/devfsadm 70
/usr/sbin/modinfo 69
/usr/sbin/modload 70
/usr/sbin/modunload 69
/usr/sbin/ndd 135
/usr/sbin/tapes 69
All Log Entries report 89, 91, 93 ALL_LOCAL_DRIVES 36, 60 alphabetical order, storage units 53 ANSI encoding 178 antivirus software 178 arbitrated loop 160 archive bit 178 archiving catalog 58 array RAID controller 160 arrays 104 ATA 21 ATM card 37 auto-negotiate 106 AUTOSENSE 106 available_media report 72 available_media script 66, 72 Avg. Disk Queue Length counter 98
B
backup catalog 61 disk or tape 72 environment, dedicated or shared 72 large catalog 58 load adjusting 103 load leveling 102 monopolizing devices 104 user-directed 87 window 101 Backup Central 179 Backup Tries parameter 68 balancing load 103 bandwidth 163 bandwidth limiting 102 Bare Metal Restore (BMR) 146 best practices 77, 78, 80, 82 Bonnie 179 Bonnie++ 179 boot options (Linux) 176 bottlenecks 89, 106 freeware tools for detecting 179
Numerics
10000BaseT 22 1000BaseT 22 100BaseT 22 100BaseT cards 37
A
Activity Monitor 90 adjusting backup load 103 data buffer size (Windows) 176 error threshold 67 network communications buffer 107 read buffer size 148
bp.conf file 68, 111 bpbkar 105, 120, 123 bpbkar log 93, 94 bpbkar32 105, 120 bpdm log 93 bpdm_dev_null 95 bpend_notify.bat 105 bpimage 134 bpmount -i 60 bprd 50, 53 bpsetconfig 145 bpstart_notify.bat 105 bptm 120, 123, 126, 129, 131 bptm log 68, 93 buffers 107, 176 and FlashBackup 148 changing 113 changing Windows buffers 177 default number of 112 default size of 112 for network communications 108 shared 111 tape 111 testing 118 wait and delay 119 Windows 176 bus 66
C
cache device (snapshot) 148 calculate actual data transfer rate required 20 length of backups 18 network transfer rate 22 number of robotic tape slots needed 30 number of tape drives needed 20 number of tapes needed 28 shared memory 113 space needed for NBDB database 26, 27 catalog 134 archiving 58 backup requirements 102 backups not finishing 58 backups, guidelines 57 compression 58, 59 large backups 58 managing 57 Checkpoint Restart 134 child delay values 118
CHILD_DELAY file 119 cleaning robots 79 tape drives 78 tapes 72 client compression 145 tuning performance 104 variables 88 Client Job Tracker 147 clock or cycle time 162 Committed Bytes 97 common system resources 95 communications buffer 108, 109 process 120 Communications Buffer Size parameter 108, 109 COMPRESS_SUFFIX option 145 compression 98, 144 and encryption 146 catalog 58, 59 how to enable 144 tape vs client 145 configuration files (Windows) 178 configuration guidelines 60 CONNECT_OPTIONS 51 controller 98 copy-on-write snapshot 148 counters 119 algorithm 121 determining values of 123 in Windows performance 96 wait and delay 119 CPU 94, 96 and performance 164 load, monitoring 94 utilization 51 CPUs needed per media server component 37 critical policies 61 cumulative-incremental backup 18 custom reports available media 72 cycle time 162
D
daily_messages log 61 data buffer overview 111 size 107, 176
data compression 138 Data Consumer 122 data path through server 159 data producer 121, 122 data recovery, planning for 80, 81 data stream and tape efficiency 137 data throughput 88 statistics 89 data transfer path 89, 100 basic tuning 101 data transfer rate for drive controllers 21 for tape drives 19 required 20 data variables 88 database backups 140 restores 136 databases list of pre-6.0 databases 28 DB2 restores 136 Deactivate command 87 dedicated backup servers 102 dedicated private networks 102 delay buffer 119 in starting jobs 48 values, parent/child 118 de-multiplexing 101 designing master server 33 media server 37 Detailed Status tab 90 devfsadmd daemon 69 device names 144 reconfiguration 69 devlinks 69 disable TapeAlert 79 disaster recovery 80, 81 testing 52 disk full 98 increase performance 98 load, monitoring 97 performance, issues affecting 157 speed, measuring 94 staging 55 versus tape 72
Disk Queue Length counter 98 disk speed, measuring 95 Disk Time counter 97 disk-based storage 72 diskperf command 97 disks, adding 104 down drive 67, 68 drive controllers 21 drive selection 70 drive_error_threshold 67, 68 drives, number per network connection 66 drvconfig 69
E

email list (Veritas-bu) 180
EMM 50, 59, 66, 70, 72, 78
EMM database
    derived from pre-6.0 databases 28
EMM server
    calculating space needed for 26, 27
EMM transaction log 27
encoding, file 178
encryption 144
    and compression 146
    and multi-streaming 144
error logs 68, 90
error threshold value 66
Ethernet connection 158
evaluating components 94, 95
evaluating performance
    Activity Monitor 90
    All Log Entries report 91
    encryption 144
    NetBackup clients 104
    NetBackup servers 111
    network 87, 106
    overview 86
exclude lists 60

F

factors
    in choosing disk vs tape 73
    in job scheduling 50
failover, storage unit groups 54
fast-locate 132
FBU_READBLKS 148
FC-AL 160, 162
fibre channel 159, 161
    arbitrated loop 160
file encoding 178
file ID on vxlogview 61
file system space 55
files
    backing up many small 146
    Windows configuration 178
firewall settings 51
FlashBackup 147, 148
Flexible Disk Option 77
force load parameters (Solaris) 172
forward space filemark 131
fragment size 130, 132
    considerations in choosing 131
fragmentation 104
    level 98
freeware tools 179
freeze media 66, 67, 68
frequency-based tape cleaning 78
frozen volume 67
full backup 73
full backup and tape vs disk 74
full duplex 106

G

Gigabit Ethernet cards 38
Gigabit Fibre Channel 21
globDB 28
goodies directory 72
groups of storage units 53

H

hardware
    components and performance 163
    configuration examples 165
    elements affecting performance 158
    performance considerations 163
heap size, adjusting
    NOM server 149
    NOM web server 150
hierarchy, disk 158
host memory 159
host name resolution 88
hot catalog backup 57, 61

I

I/O operations
    scaling 166
IMAGE_FILES 135
IMAGE_INFO 135
IMAGE_LIST 135
improving performance, see tuning
include lists 60
increase disk performance 98
incremental backups 73, 101, 141
index performance 134
info on tuning 179
insufficient memory 97
interfaces, multiple 110
ipcs -a command 171
Iperf 179
iSCSI 21
iSCSI bus 66

J

Java interface 39, 146
job
    delays 49
    scheduling 48, 50
    scheduling, limiting factors 50
Job Tracker 105
jobs queued 48, 49
JVM Option Number 0 (NOM) 149

K

kernel tuning
    Linux 176
    Solaris 170

L

larger buffer (FlashBackup) 148
largest fragment size 131
latency 163
legacy logs 61
leveling load among backup components 103
library-based tape cleaning 80
Limit jobs per policy attribute 48, 103
limiting fragment size 131
link down 106
Linux, kernel tunable parameters 176
load
    leveling 102, 103
    monitoring 96
    parameters (Solaris) 172
local backups 141
Log Sense page 79
logging 87
logs 61, 68, 90, 123, 147
    managing 61
    viewing 61
long-term storage 73
ltidevs 28
LTO drives 28, 38
M
mailing lists 180
managing
    logs 61
    the catalog 57
Manual Backup command 87
master server
    CPU utilization 51
    designing 33
    determining number of 35
    splitting 59
MAX_HEAP setting (NOM) 149
Maximum concurrent write drives 48
Maximum jobs per client 48
Maximum Jobs Per Client attribute 103
Maximum streams per drive 48
maximum throughput rate 138
Maximum Transmission Unit (MTU) 118
MaximumSGList 177
MDS 70
measuring
    disk read speed 94, 95
    NetBackup performance 86
media
    catalog 67
    error threshold 67
    not available 66
    pools 72
    positioning 137
    threshold for errors 66
media and device selection logic 70
media errors database 28
Media List report 66
media manager
    drive selection 70
Media multiplexing setting 48
media server 38
    designing 37
    factors in sizing 39
    not available 49
    number needed 38
    number supported by a master 36
media_error_threshold 67, 68
mediaDB 28
MEGABYTES_OF_MEMORY 145
memory 159, 163, 164
    amount required 38, 113
    insufficient 97
    monitoring use of 94, 97
    shared 111
merging master servers 59
message queue 170
message queue parameters
    HP-UX 174
migration 77
Mode Select page 79
Mode Sense 79
modload command 70
modunload command 69
monitoring data variables 88
MPX_RESTORE_DELAY option 136
MTFSF/MTFSR 131
multiple drives, storage unit 48
multiple interfaces 110
multiple small files, backing up 146
multiplexed backups
    and fragment size 131
    database backups 136
multiplexed image, restoring 132
multiplexing 73, 101, 140
    and memory required 38
    effects of 142
    schedule 48
    set too high 135
    when to use 140
multi-streaming 140
    NEW_STREAM directive 142
    when to use 140
N
namespace.chksum 28
naming conventions 82
    policies 83
    storage units 83
NBDB database 26, 27
NBDB.log 58
nbemmcmd command 67
nbjm and job delays 49
nbpem 53
nbpem and job delays 48
nbu_snap 148
ndd 135
NET_BUFFER_SZ 107, 108, 109, 116
NET_BUFFER_SZ_REST 107
NetBackup
    capacity planning 11
    catalog 134
    job scheduling 48
    news groups 180
    restores 130
    scheduler 86
NetBackup Client Job Tracker 147
NetBackup Java console 146
NetBackup Operations Manager, see NOM
NetBackup Relational Database 59
NetBackup relational database files 58
NetBackup Vault 146
network
    bandwidth limiting 102
    buffer size 107
    communications buffer 108
    connection options 51
    connections 106
    interface cards (NICs) 106
    load 106
    multiple interfaces 110
    performance 87
    private, dedicated 102
    tape drives and 66
    traffic 106
    transfer rate 22
    tuning 106
    tuning and servers 102
    variables 87
Network Buffer Size parameter 108, 126
NEW_STREAM directive 143
news groups 180
no media available 66
NO_TAPEALERT touch file 79
NOexpire touch file 53, 178
NOM 38, 52, 72
    adjusting
        server heap size 149
        Sybase cache size 150
        web server heap size 150
    database 40
    defragmenting databases 154
    for monitoring jobs 52
    purging data 154
    sizing 39
    store database and logs on separate disk 151
nominal throughput rate 138
nomserver.conf file 150
nomsrvctl file (NOM) 149
none pool 72
non-multiplexed restores 132
no-rewind option 144
NOSHM file 109
Notepad, checking file encoding 178
notify scripts 105
NUMBER_DATA_BUFFERS 113, 117, 118, 175
NUMBER_DATA_BUFFERS_DISK 113
NUMBER_DATA_BUFFERS_RESTORE 114, 134
O
OEMSETUP.INF file 177
offload work to additional master 59
on-demand tape cleaning, see TapeAlert
online (hot) catalog backup 61
Oracle 136
    restores 136
order of using storage units 53
out-of-sequence delivery of packets 147
P
packets 147
Page Faults 97
parent/child delay values 118
PARENT_DELAY file 118
patches 147
PCI bridge 159, 163, 164
PCI bus 159, 162, 163
PCI card 159, 164
performance
    and CPU 164
    and disk hardware 157
    and hardware issues 163
    see also tuning
    strategies and considerations 101
performance evaluation 86
    Activity Monitor 90
    All Log Entries report 91
    monitoring CPU 96
    monitoring disk load 97
    monitoring memory use 94, 97
    system components 94, 95
PhysicalDisk object 97
policies
    critical 61
    guidelines 60
    naming conventions 83
Policy Update Interval 48
poolDB 28
pooling conventions 72
position error 68
Process Queue Length 96
Processor Time 96
Q

queued jobs 48, 49

R

RAID 98, 104
    controller 160, 164
rate of data transfer 17
raw partition backup 148
read buffer size
    adjusting 148
    and FlashBackup 148
reconfigure devices 69
recovering data, planning for 80, 81
recovery time 73
Reduce fragment size setting 131
REGEDT32 177
registry 177
reload st driver without rebooting 69
report 93
    All Log Entries 91
    media 72
resizing read buffer (FlashBackup) 149
restore
    and network 135
    in mixed environment 135
    multiplexed image 132
    of database 136
    performance of 134
RESTORE_RETRIES for restores 68
retention period 73
RMAN 137
robot cleaning 79

S

SAN Client 102
SAN client 76, 104
SAN fabric 161
SAN Media Server 102
sar command 94
SATA 21, 22
Scatter/Gather list 177
schedule naming, best practices 83
scheduling 48, 86
    delays 48
    disaster recovery 52
    limiting factors 50
scratch pool 72
SCSI bus 66
SCSI/FC connection 137
SDLT drives 29, 38
search performance 134
semaphore (Solaris) 170
Serial ATA (SATA) 160
server
    data path through 159
    splitting master from EMM 60
    tuning 111
    variables 86
SGList 177
shared data buffers 111
    changing 113
    default number of 112
    default size of 112
shared memory 109, 111
    amount required 113
    parameters, HP-UX 174
    recommended settings 117
    Solaris parameters 170
    testing 118
shared-access topology 160, 163
shelf 160
SIZE_DATA_BUFFERS 116, 117, 118, 175, 176
SIZE_DATA_BUFFERS_DISK 115
small files, backup of 146
SMART diagnostic standard 79
snap mirror 147
snapshot cache device 148
Snapshot Client 147, 148
snapshots 105
socket
    communications 109
    parameters (Solaris) 172
software
    compression (client) 145
    tuning 166
Solaris
    clients and FlashBackup read buffer 148
    kernel tuning 170
splitting master server 59
SSOhosts 28
st driver
    reloading 69
staging, disk 55, 73
Start Window 86
STK drives 29
storage device performance 137
Storage Mountain 179
storage unit 53, 104
    groups 53
    naming conventions 83
    not available 49
Storage Unit dialog 131
storage_units database 28
streaming (tape drive) 73, 138
striped volumes (VxVM) 148
striping
    block size 148
    volumes on disks 101
stunit_groups 28
suspended volume 67
switches 161
Sybase cache size, adjusting 150
synthetic backups 108
System Administration Manager (SAM) 175
system resources 95
system variables, controlling 86
T
Take checkpoints setting 134
tape
    block size 113
    buffers 111
    cleaning 72, 78
    compression 145
    efficiency 137
    full, frozen, suspended 72
    number of tapes needed for backups 28
    position error 68
    streaming 73, 138
    versus disk 72
tape connectivity 66
    reload st driver 69
tape drive 137
    cleaning 78
    number needed 20
    number per network connection 66
    technologies 77
    technology needed 18
    transfer rates 19
    types 38
tape library
    number of tape slots needed 30
TapeAlert 78
tape-based storage 72
tar 121
tar32 121
TCP/IP 147
tcp_deferred_ack_interval 135
testing conditions 86
threshold
    error, adjusting 67
    for media errors 66
throughput 89
time to data 73
Tiobench 179
tools (freeware) 179
topology (hardware) 162
touch files 53, 110
    encoding 178
traffic on network 106
transaction log 27
transaction log file 58
transfer rate
    drive controllers 21
    for backups 17
    network 22
    required 20
    tape drives 19
True Image Restore option 146
tuning
    additional info 179
    basic suggestions 101
    buffer sizes 107, 108
    client performance 104
    data transfer path, overview 100
    device performance 137
    FlashBackup read buffer 148
    Linux kernel 176
    network performance 106
    restore performance 130, 135
    search performance 134
    server performance 111
    software 166
    Solaris kernel 170
U
Ultra-3 SCSI 21
Ultra320 SCSI 22
Unicode encoding 178
unified logging, viewing 61
user-directed backup 87
V
Vault 146
verbosity level 147
Veritas-bu email list 180
viewing logs 61
virus scans 105, 147
Vision Online 179
vmstat 94
volDB 28
volume
    frozen 67
    pools 72
    suspended 67
vxlogview 61
    file ID 61
VxVM striped volumes 148
W
wait/delay counters 119, 120, 123
    analyzing problems 126
    correcting problems 129
    for local client backup 123
    for local client restore 124
    for remote client backup 124
    for remote client restore 125
wear of tape drives 137
webgui command 150
Wide Ultra 2 SCSI 21
wild cards in file lists 60
Windows Performance Monitor 95
Working Set, in memory 97