August 2011
Oracle Database High Availability Best Practices 11g Release 2 (11.2) E10803-01 Copyright 2005, 2011, Oracle and/or its affiliates. All rights reserved. Primary Authors: Lawrence To, Viv Schupmann, Thomas Van Raalte, Virginia Beecher Contributors: Andrew Babb, Janet Blowney, Larry Carpenter, Timothy Chien, Jay Davison, Senad Dizdar, Ray Dutcher, Mahesh Girkar, Stephan Haisley, Holger Kalinowski, Nitin Karkhanis, Frank Kobylanski, Rene Kundersma, Joydip Kundu, Barb Lundhild, Roderick Manalac, Pat McElroy, Robert McGuirk, Joe Meeks, Markus Michalewicz, Valarie Moore, Michael Nowak, Darryl Presley, Michael T. Smith, Vinay Srihari, Lawrence To, Douglas Utzig, James Viscusi, Vern Wagman, Steve Wertheimer, Shari Yamaguchi This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. 
As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065. This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. 
Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.
Contents
Preface
Audience
Documentation Accessibility
Related Documents
Conventions
4.2.2 Use Oracle Restart for Oracle ASM Instances (non-clustered Oracle Database)
4.3 Oracle ASM Strategic Best Practices
4.3.1 Use a Simple Disk and Disk Group Configuration
4.3.2 Use Redundancy to Protect from Disk Failure
4.3.3 Oracle ASM in the Grid Infrastructure Home
4.3.4 Ensure Disks in the Same Disk Group Have the Same Characteristics
4.3.5 Use Failure Groups When Using Oracle ASM Redundancy
4.3.6 Use Intelligent Data Placement
4.3.7 Use Oracle ACFS to Manage Files Outside the Database
4.4 Oracle ASM Configuration Best Practices
4.4.1 Use Disk Multipathing Software to Protect from Path Failure
4.4.2 Use Automatic Memory Management with MEMORY_TARGET Parameter
4.4.3 Set the PROCESSES Initialization Parameter
4.4.4 Use Disk Labels
4.4.5 Set the DISK_REPAIR_TIME Disk Group Attribute Appropriately
4.4.6 Use ASMLib On Supported Platforms
4.4.7 Disable Variable Sized Extents
4.5 Oracle ASM Operational Best Practices
4.5.1 Use SYSASM for Oracle ASM Authentication
4.5.2 Set Rebalance to the Maximum Limit that Does Not Affect Service Levels
4.5.3 Use a Single Command to Mount Multiple Disk Groups
4.5.4 Use a Single Command to Add or Remove Storage
4.5.5 Check Disk Groups for Imbalance
4.5.6 Proactively Mine Vendor Logs for Disk Errors
4.5.7 Use ASMCMD Utility to Ease Manageability of Oracle ASM
4.5.8 Use Oracle ASM Configuration Assistant (ASMCA)
4.6 Use Oracle Storage Grid
4.6.1 Oracle Storage Grid Best Practices for Unplanned Outages
4.6.2 Oracle Storage Grid Best Practices for Planned Maintenance
Use Automatic Undo Management
Use Locally Managed Tablespaces
Use Automatic Segment Space Management
Use Temporary Tablespaces and Specify a Default Temporary Tablespace
Use Resumable Space Allocation
Use Database Resource Manager
8.2 Determine Protection Mode and Data Guard Transport
8.2.1 Use Redo Transport Services Best Practices
8.2.2 Assess Performance with Proposed Network Configuration
8.3 General Data Guard Configuration Best Practices
8.3.1 Use Oracle Data Guard Broker with Oracle Data Guard
8.3.2 Use Recovery Manager to Create Standby Databases
8.3.3 Use Flashback Database for Reinstatement After Failover
8.3.4 Use FORCE LOGGING Mode
8.3.5 Use a Simple, Robust Archiving Strategy and Configuration
8.3.6 Use Standby Redo Logs and Configure Size Appropriately
8.3.7 Use Data Guard Transport and Network Configuration Best Practices
8.3.8 Use Data Guard Redo Apply Best Practices
8.3.9 Implement Multiple Standby Databases
8.4 Oracle Data Guard Role Transition Best Practices
8.4.1 Oracle Data Guard Switchovers Best Practices
8.4.2 Oracle Data Guard Failovers Best Practices
8.5 Use Oracle Active Data Guard Best Practices
8.6 Use Snapshot Standby Database Best Practices
8.7 Assessing Data Guard Performance
ACFS Snapshots
Oracle Sun ZFS Storage Appliance Snapshots
Tape Backups
Glossary Index
List of Figures
4-1 Allocating Entire Disks
4-2 Partitioning Each Disk
12-1 Enterprise Manager Home Page
12-2 Setting Notification Rules for Availability
12-3 Setting Notification Rules for Metrics
12-4 Database Home Page
12-5 Database Home Page with Targets Showing Policy Violations
12-6 Database Targets Policy Trend Overview Page
12-7 Shows Compliance Tab with Policy Violations
12-8 Monitoring a Primary Database in the High Availability Console
12-9 Monitoring the Standby Database in the High Availability Console
12-10 Monitoring the Cluster in the High Availability Console Showing Services
12-11 Maximum Availability Architecture (MAA) Advisor Page in Enterprise Manager
13-1 Network Routes Before Site Failover
13-2 Network Routes After Site Failover
13-3 Enterprise Manager Reports Disk Failures
13-4 Enterprise Manager Reports Oracle ASM Disk Groups Status
13-5 Enterprise Manager Reports Pending REBAL Operation
13-6 Partitioned Two-Node Oracle RAC Database
13-7 Oracle RAC Instance Failover in a Partitioned Database
13-8 Nonpartitioned Oracle RAC Instances
13-9 Fast-Start Failover and the Observer Are Successfully Enabled
13-10 Reinstating the Original Primary Database After a Fast-Start Failover
14-1 Using a Transient Logical Standby Database for Database Rolling Upgrade
14-2 Database Object Reorganization Using Oracle Enterprise Manager
List of Tables
2-1 Tradeoffs for Different Test and QA Environments
8-1 Requirements and Data Guard Deployment Options
8-2 Archiving Recommendations
8-3 Parallel Recovery Coordinator Wait Events
8-4 Parallel Recovery Slave Wait Events
8-5 Comparing Fast-Start Failover and Manual Failover
8-6 Minimum Recommended Settings for FastStartFailoverThreshold
9-1 Backup and Recovery Summary
9-2 Sample Situations that Require Database Backup
9-3 Comparing Backup to Disk Options
12-1 Recommendations for Monitoring Space
12-2 Recommendations for Monitoring the Alert Log
12-3 Recommendations for Monitoring Processing Capacity
12-4 Recommendations for Performance Related Metrics
12-5 Recommendations for Setting Data Guard Metrics
13-1 Recovery Times and Steps for Unscheduled Outages on the Primary Site
13-2 Recovery Steps for Unscheduled Outages on the Secondary Site
13-3 Types of Oracle ASM Failures and Recommended Repair
13-4 Recovery Options for Data Area Disk Group Failure
13-5 Recovery Options for Fast Recovery Area Disk Group Failure
13-6 Flashback Solutions for Different Outages
13-7 Summary of Flashback Features
13-8 Additional Processing When Restarting or Rejoining a Node or Instance
13-9 Restoration and Connection Failback
13-10 SQL Statements for Starting Standby Databases
13-11 SQL Statements to Start Redo Apply and SQL Apply
13-12 Queries to Determine RESETLOGS SCN and Current SCN OPEN RESETLOGS
13-13 SCN on Standby Database is Behind RESETLOGS SCN on the Primary Database
13-14 SCN on the Standby is Ahead of Resetlogs SCN on the Primary Database
13-15 Re-Creating the Primary and Standby Databases
14-1 Solutions for Scheduled Outages on the Primary Site
14-2 Managing Scheduled Outages on the Secondary Site
14-3 Database Upgrade Options
14-4 Platform and Location Migration Options
Preface
This book provides high availability best practices for configuring and maintaining your Oracle Database system and network components.
Audience
This book is intended for chief information technology officers and architects, as well as administrators that perform the following database, system, network, and application tasks:
Plan data centers
Implement data center policies
Maintain high availability systems
Plan and build high availability solutions
Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.

Access to Oracle Support

Oracle customers have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.
Related Documents
For more information, see the Oracle database documentation set. These books may be of particular interest:
Oracle Database High Availability Overview
Oracle Data Guard Concepts and Administration and Oracle Data Guard Broker
Oracle Automatic Storage Management Administrator's Guide
Oracle Clusterware Administration and Deployment Guide
Oracle Real Application Clusters Administration and Deployment Guide
Oracle Database Backup and Recovery User's Guide
Oracle Database Administrator's Guide
The Oracle High Availability Best Practice white papers that can be downloaded from the Oracle Technology Network (OTN) at http://www.oracle.com/goto/maa
Conventions
The following text conventions are used in this document:
Convention    Meaning
boldface      Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.
italic        Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.
monospace     Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.
1
By implementing and using Oracle best practices, you can provide high availability for the Oracle database and related technology. This chapter contains the following topics:
Oracle Database High Availability Architecture
Oracle Database High Availability Best Practices
Oracle Maximum Availability Architecture
Reduce the cost of creating an Oracle Database high availability system by following detailed guidelines on configuring your database, storage, application failover, backup and recovery. See Chapter 3, "Overview of Configuration Best Practices" for more information.
Use operational best practices to maintain your system. See Chapter 2, "Operational Prerequisites to Maximizing Availability" for more information.
Detect and quickly recover from unscheduled outages caused by computer failure, storage failure, human error, or data corruption. For more information, see Section 5.1.6, "Protect Against Data Corruption" and Chapter 13, "Recovering from Unscheduled Outages".
Eliminate or reduce downtime that might occur due to scheduled maintenance such as database patches or application upgrades, as described in Chapter 14, "Reducing Downtime for Planned Maintenance".
Oracle Database as described in this book
Oracle Exadata Database Machine and Oracle Exalogic Elastic Cloud
Oracle Fusion Middleware and Oracle WebLogic Server
Oracle Applications (Siebel, Peoplesoft, E-Business Suite)
Oracle Collaboration Suite
Oracle Enterprise Manager
This book, Oracle Database High Availability Best Practices, focuses primarily on high availability best practices for the Oracle Database. Oracle Maximum Availability Architecture (MAA) best practices are also available for other components that you might want to consider. For more information, go to: http://www.oracle.com/goto/MAA
See:
Oracle Fusion Middleware Disaster Recovery Guide for information on Oracle Fusion Middleware high availability
Oracle Fusion Middleware Administrator's Guide for information on backup and recovery for Oracle Fusion Middleware
The goal of MAA is to achieve the optimal HA architecture at the lowest cost and complexity. MAA provides:
Best practices that span the Exadata Database Machine, Oracle Database, Oracle Fusion Middleware, Oracle Applications, Oracle Enterprise Manager, and solutions provided by Oracle Partners.
Accommodates a range of business requirements to make these best practices as widely applicable as possible.
Leverages lower-cost servers and storage.
Uses hardware and operating system independent features and evolves with new Oracle versions and features. The only exception is Exadata MAA which has specific and customized configuration and operating practices for Exadata Database Machine.
Makes high availability best practices as widely applicable as possible considering the various business service level agreements (SLA).
Uses the Oracle Grid Infrastructure with Database Server Grid and Database Storage Grid to provide highly resilient, scalable, and lower cost infrastructures.
Provides the ability to control the length of time to recover from an outage and the amount of acceptable data loss from any outage.
For more information on MAA and documentation on best practices for all components of MAA, visit the MAA Web site at http://www.oracle.com/goto/maa
2
Document the business's cost of downtime, Recovery Time Objectives (RTO or recovery time) and Recovery Point Objectives (RPO or data loss tolerance) for the outages described in Oracle Database High Availability Overview.
Build an outage and solution matrix similar to those shown in Table 13-1, "Recovery Times and Steps for Unscheduled Outages on the Primary Site" and Table 14-1, "Solutions for Scheduled Outages on the Primary Site".
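An outage and solution matrix of this kind can be kept as structured data so that it is easy to check against the documented objectives. The following is a minimal sketch only: the `OutageEntry` type, the `gaps` function, and every outage, solution, and number in it are hypothetical illustrations, not values taken from Table 13-1 or Table 14-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageEntry:
    outage: str               # type of outage, for example "storage failure"
    solution: str             # planned recovery solution for that outage
    est_recovery_min: float   # estimated recovery time, in minutes
    est_data_loss_min: float  # estimated data loss, in minutes of transactions


def gaps(matrix, rto_min, rpo_min):
    """Return the entries whose estimates exceed the documented RTO or RPO."""
    return [
        entry for entry in matrix
        if entry.est_recovery_min > rto_min or entry.est_data_loss_min > rpo_min
    ]


# Hypothetical entries and numbers, for illustration only.
matrix = [
    OutageEntry("computer failure", "Oracle RAC instance failover", 1, 0),
    OutageEntry("storage failure", "restore from backup", 240, 30),
    OutageEntry("site failure", "Data Guard failover", 5, 0),
]

for entry in gaps(matrix, rto_min=15, rpo_min=5):
    print(f"{entry.outage}: '{entry.solution}' does not meet the RTO/RPO")
```

With the sample values above, only the storage failure row is reported, which would suggest revisiting the solution chosen for that outage.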
Install or update your software with the latest certified patch sets available
Configure your software using best practices
Document your choices and configuration
2.6 Provide a Plan to Test and Upgrade for Recommended Patches and Software
Test and upgrade to the software that is recommended. By periodically testing and upgrading to the latest recommended patch and software, you can ensure that the application and system have the latest security and software fixes required to maintain stability and avoid many known issues. Remember to validate all updates and changes on a test system before performing the upgrade on the production system. For more information, see "Oracle Recommended Patches -- Oracle Database" in My Oracle Support Note 756671.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=756671.1
Table 2-1 (Cont.) Tradeoffs for Different Test and QA Environments

Test Environment: Shared System Resource
Benefits and Tradeoffs:
Validate most patches and software changes.
Validate all functional tests.
This environment may be suitable for performance testing if enough system resources can be allocated to mimic production. Typically, however, only a subset of production system resources is available, which compromises performance testing and validation.
Resource management and scheduling is required.

Test Environment: Smaller or Subset of the system resources
Benefits and Tradeoffs:
Validate all patches and software changes.
Validate all functional tests.
No performance testing at production scale.
Limited full-scale high availability evaluations.

Test Environment: Different hardware or platform system resources but same operating system
Benefits and Tradeoffs:
Validate most patches and software changes.
Limited firmware patching test.
Validate all functional tests unless limited by some new hardware feature.
Limited production scale performance tests.
Limited full-scale high availability evaluations.
See Also:
Review the patch or upgrade documentation or any document relevant to that change.
Evaluate the possibility of performing a rolling upgrade if your service-level agreements (SLAs) require zero or minimal downtime.
Evaluate any rolling upgrade opportunities to minimize or eliminate planned downtime.
Evaluate whether the patch qualifies for Standby-First Patching.
Note:
Standby-First Patch allows you to apply a patch initially to a physical standby database while the primary database remains at the previous software release. This applies to certain types of patches only; it does not apply to Oracle patch sets and major release upgrades, for which you use the Data Guard transient logical standby method. Once you are satisfied with the change, perform a switchover to the standby database. The fallback is to switch back if required. Alternatively, you can proceed to the following step and apply the change to your production environment. For more information, see "Oracle Patch Assurance - Data Guard Standby-First Patch Apply" in My Oracle Support Note 1265700.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265700.1
2. Validate the application in a test environment and ensure the change meets or exceeds your functionality, performance, and availability requirements. Automate the procedure and be sure to also document and test a fallback procedure. This requires comparing metrics captured before and after patch application on the test system, and against metrics captured on the production system. Real Application Testing may be used to capture workload on the production system and replay it on the test system. AWR and SQL Performance Analyzer may be used to assess performance improvement or regression resulting from the patch. Validate the new software on a test system that mimics your production environment, and ensure the change meets or exceeds your functionality, performance, and availability requirements. Automate the patch or upgrade procedure and ensure fallback. Being thorough during this step eliminates most critical issues during and after the patch or upgrade. See Section 2.7.1, "Configuring the Test System and QA Environments" for more information about configuring your test system.
3. Optionally, use the Oracle Real Application Testing option that enables you to perform real-world testing of Oracle Database. Oracle Real Application Testing captures production workloads and assesses the impact of system changes before production deployment; thus, Oracle Real Application Testing minimizes the risk of instabilities associated with changes. Oracle GoldenGate can also be used as another logical replica to apply changes. See Section 2.7.1, "Configuring the Test System and QA Environments" for more information about configuring your test system.
4. If applicable, perform final pre-production validation of all changes on a Data Guard standby database before applying them to production. Apply the change in an Oracle Data Guard environment, if applicable. For more information on Data Guard transient logical standby method, see Section 14.2.6, "Database Upgrades".
5. Apply the change in your production environment.

See Also:
Oracle Database Real Application Testing User's Guide
Oracle Data Guard Concepts and Administration for complete information about Converting a Physical Standby Database into a Snapshot Standby Database
Oracle Data Guard Concepts and Administration for more information on Performing a Rolling Upgrade With an Existing Physical Standby Database
Oracle GoldenGate For Windows and UNIX Administrator's Guide for more information about Oracle GoldenGate
The MAA white paper, "Database Rolling Upgrades Made Easy by Using a Data Guard Physical Standby Database", from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
"Oracle Patch Assurance - Data Guard Standby-First Patch Apply" in My Oracle Support Note 1265700.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265700.1
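The comparison of metrics captured before and after a patch, called for in the validation step, can be sketched as a simple threshold check. This is an illustration under stated assumptions, not an Oracle-provided procedure: the function name `find_regressions`, the metric names, the sample values, and the 10 percent tolerance are all hypothetical, and in practice the inputs would come from sources such as AWR reports.

```python
def find_regressions(before, after, tolerance=0.10, higher_is_better=()):
    """Return {metric: relative_change} for metrics that worsened by more than tolerance.

    before / after map metric names to values captured before and after the change.
    Metrics named in higher_is_better (for example, throughput) count as worse
    when they decrease; all other metrics count as worse when they increase.
    """
    regressions = {}
    for name, old in before.items():
        if name not in after or old == 0:
            continue  # skip metrics that cannot be compared
        change = (after[name] - old) / old
        worse_by = -change if name in higher_is_better else change
        if worse_by > tolerance:
            regressions[name] = round(change, 3)
    return regressions


# Hypothetical metric names and values, for illustration only.
before = {"avg_response_ms": 20.0, "txn_per_sec": 500.0, "db_cpu_pct": 40.0}
after = {"avg_response_ms": 26.0, "txn_per_sec": 510.0, "db_cpu_pct": 42.0}

print(find_regressions(before, after, higher_is_better={"txn_per_sec"}))
```

With these sample values, only the response time is flagged, because it increased by 30 percent while the other two metrics stayed within the tolerance.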
threshold. By reacting and executing efficiently, which includes detection and making the decision to fail over, overall downtime can be reduced from hours to minutes. If you use Oracle Data Guard for disaster recovery and data protection, Oracle recommends that you perform periodic switchover operations every quarter or conduct full application and database failover tests. For more information on configuring Oracle Data Guard and role transition best practices, see Chapter 8, "Configuring Oracle Data Guard" and Section 8.4.1, "Oracle Data Guard Switchovers Best Practices."
See:
If using SQL*Plus, see "11.2 Data Guard Physical Standby Switchover Best Practices using SQL*Plus" in My Oracle Support Note 1304939.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1304939.1
If using the Data Guard Broker or Enterprise Manager, see "11.2 Data Guard Physical Standby Switchover Best Practices using the Broker" in My Oracle Support Note 1305019.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1305019.1
2.10 Configure Monitoring and Service Request Infrastructure for High Availability
To maintain your high availability environment, configure a monitoring infrastructure that can detect and react when performance and high availability thresholds are crossed. Also, where available, Oracle can detect failures, dispatch field engineers, and replace failing hardware without customer involvement.
Monitor performance and service statistics
Create performance and high availability thresholds as early warning indicators of system or application problems
See Also:
Chapter 12, "Monitoring for High Availability"
Oracle Enterprise Manager Administrator's Guide for information on detecting and reacting to potential problems and failures
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1185493.1
3
Migrate to Oracle Automatic Storage Management (Oracle ASM)
Migrate a single-instance Oracle Database to Oracle Clusterware and Oracle Real Application Clusters (Oracle RAC)
Create Oracle Data Guard standby databases
Configure backup and recovery
Implement Oracle Active Data Guard
Use the MAA Advisor to implement Oracle's best practices and achieve a high availability architecture
For information on the configuration Best Practices for Oracle Database, see the following chapters:
Chapter 4, "Configuring Storage"
Chapter 5, "Configuring Oracle Database"
Chapter 6, "Configuring Oracle Database with Oracle Clusterware"
Chapter 7, "Configuring Oracle Database with Oracle RAC"
Chapter 8, "Configuring Oracle Data Guard"
Chapter 9, "Configuring Backup and Recovery"
Chapter 10, "Configuring Oracle GoldenGate"
Chapter 11, "Configuring Fast Connection Failover"
See Also: Oracle Enterprise Manager online help system, and the documentation set available at
http://www.oracle.com/technetwork/oem/grid-control/index.html
4
Configuring Storage
This chapter describes best practices for configuring a fault-tolerant storage subsystem that protects data while providing manageability and performance. These practices apply to all Oracle Database high availability architectures described in Oracle Database High Availability Overview. This chapter includes the following sections:
Evaluate Database Performance and Storage Capacity Requirements
Use Automatic Storage Management (Oracle ASM) to Manage Database Files
Oracle ASM Strategic Best Practices
Oracle ASM Configuration Best Practices
Oracle ASM Operational Best Practices
Use Oracle Storage Grid
Average load
Peak load
Application workloads such as batch processing, Online Transaction Processing (OLTP), decision support systems (DSS) and reporting, Extraction, Transformation, and Loading (ETL)
Evaluating Database Performance Requirements
You can gather the necessary statistics by using Automatic Workload Repository (AWR) reports or by querying the GV$SYSSTAT view. Along with understanding the database performance requirements, you must evaluate the performance capabilities of a storage array.
Choosing Storage
When you understand the performance and capacity requirements, choose a storage platform to meet those requirements.
See Also: Oracle Database Performance Tuning Guide for an overview of the Automatic Workload Repository (AWR) and for information on generating AWR reports
4.2 Use Automatic Storage Management (Oracle ASM) to Manage Database Files
Oracle ASM is a vertical integration of both the file system and the volume manager built specifically for Oracle database files. Oracle ASM extends the concept of stripe and mirror everything (SAME) to optimize performance, while removing the need for manual I/O tuning (distributing the data file layout to avoid hot spots). Oracle ASM helps manage a dynamic database environment by letting you grow the database size without shutting down the database to adjust the storage allocation. Oracle ASM also enables low-cost modular storage to deliver higher performance and greater availability by supporting mirroring and striping. Oracle ASM provides data protection against drive and SAN failures, the best possible performance, and extremely flexible configuration and reconfiguration options. Oracle ASM automatically distributes the data across all available drives, and transparently and dynamically redistributes data when storage is added to or removed from the database. Oracle ASM manages all of your database files. You can phase Oracle ASM into your environment by initially supporting only the fast recovery area.
Note:
See:
Oracle Automatic Storage Management Administrator's Guide for information about Oracle ASM
Oracle Database Backup and Recovery User's Guide for information on duplicating a database
The MAA white papers "Migration to Automatic Storage Management (ASM)" and "Best Practices for Creating a Low-Cost Storage Grid for Oracle Databases" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
4.2.2 Use Oracle Restart for Oracle ASM Instances (non-clustered Oracle Database)
Oracle Restart improves the availability of your Oracle database. When you install the Oracle Grid Infrastructure for a standalone server, it includes both Oracle ASM and Oracle Restart. Oracle Restart runs out of the Oracle Grid Infrastructure home, which you install separately from Oracle Database homes. Oracle Restart provides managed startup and restart of a single-instance (non-clustered) Oracle Database, Oracle ASM instance, service, listener, and any other process running on the server. If a service is interrupted after a hardware or software failure, Oracle Restart automatically takes the necessary steps to restart the component. With the Server Control Utility (SRVCTL), you can add a component, such as an Oracle ASM instance, to Oracle Restart and then enable Oracle Restart protection for it. You can also use SRVCTL to remove or disable Oracle Restart protection.
See Also:
Oracle Database Administrator's Guide for more information on Oracle Restart
Oracle Automatic Storage Management Administrator's Guide for more information on using Oracle Restart
data area: contains the active database files and other files depending on the level of Oracle ASM redundancy. If Oracle ASM with high redundancy is used, then the data area can also contain OCR, Voting, spfiles, control files, online redo log files, standby redo log files, broker metadata files, and change tracking files used for RMAN incremental backup. For example (high redundancy):
CREATE DISKGROUP data HIGH REDUNDANCY
  FAILGROUP controller1 DISK
    '/devices/c1data01' NAME c1data01,
    '/devices/c1data02' NAME c1data02
  FAILGROUP controller2 DISK
    '/devices/c2data01' NAME c2data01,
    '/devices/c2data02' NAME c2data02
  FAILGROUP controller3 DISK
    '/devices/c3data01' NAME c3data01,
    '/devices/c3data02' NAME c3data02
  ATTRIBUTE
    'au_size' = '4M',
    'compatible.asm' = '11.2',
    'compatible.rdbms' = '11.2',
    'compatible.advm' = '11.2';
fast recovery area: contains recovery-related files, such as a copy of the current control file, a member of each online redo log file group, archived redo log files, RMAN backups, and flashback log files. For example (normal redundancy):
CREATE DISKGROUP reco NORMAL REDUNDANCY
  FAILGROUP controller1 DISK
    '/devices/c1reco01' NAME c1reco01,
    '/devices/c1reco02' NAME c1reco02
  FAILGROUP controller2 DISK
    '/devices/c2reco01' NAME c2reco01,
    '/devices/c2reco02' NAME c2reco02
  ATTRIBUTE
    'au_size' = '4M',
    'compatible.asm' = '11.2',
    'compatible.rdbms' = '11.2',
    'compatible.advm' = '11.2';
Note 1: If you are using ASMLib in a Linux environment, then create the disks using the ORACLEASM CREATEDISK command. ASMLib is a support library for Oracle ASM and is not supported on all platforms. For more information on ASMLib, see Section 4.4.6, "Use ASMLib On Supported Platforms".
For example:
/etc/init.d/oracleasm createdisk lun1 /devices/lun01
Note 2: Oracle recommends using four (4) or more disks in each disk group. Having multiple disks in each disk group spreads the kernel contention that arises from accessing and queuing for the same disk.
To simplify file management, use Oracle Managed Files to control file naming. Enable Oracle Managed Files by setting the following initialization parameters: DB_CREATE_FILE_DEST and DB_CREATE_ONLINE_LOG_DEST_n. For example:
DB_CREATE_FILE_DEST=+DATA
DB_CREATE_ONLINE_LOG_DEST_1=+RECO
You have two options when partitioning disks for Oracle ASM:
Allocate entire disks to the data area and fast recovery area disk groups. Figure 4-1 illustrates allocating entire disks.
Partition each disk into two partitions, one for the data area and another for the fast recovery area. Figure 4-2 illustrates partitioning each disk into two partitions.
Easier management of the disk partitions at the operating system level because each disk is partitioned as just one large partition.
Quicker completion of Oracle ASM rebalance operations following a disk failure because there is only one disk group to rebalance.
Fault isolation, where storage failures only cause the affected disk group to go offline.
Patching isolation, where you can patch disks or firmware for individual disks without impacting every disk.
Less I/O bandwidth, because each disk group is spread over only a subset of the available disks.
Figure 4-2 illustrates the partitioning option where each disk has two partitions. This option requires partitioning each disk into two partitions: a smaller partition on the faster outer portion of each drive for the data area, and a larger partition on the slower inner portion of each drive for the fast recovery area. The ratio for the size of the inner and outer partitions depends on the estimated size of the data area and the fast recovery area.
Figure 4-2 Partitioning Each Disk
More flexibility and easier to manage from a performance and scalability perspective.
Higher I/O bandwidth is available, because both disk groups are spread over all available spindles. This advantage is considerable for the data area disk group for I/O intensive applications.
There is no need to create a separate disk group with special, isolated storage for online redo logs or standby redo logs if you have sufficient I/O capacity.
You can use the slower regions of the disk for the fast recovery area and the faster regions of the disk for data.
A double partner disk failure will result in loss of both disk groups, requiring the use of a standby database or tape backups for recovery. This problem is eliminated when using high redundancy ASM disk groups.
An Oracle ASM rebalance operation following a disk failure is longer, because both disk groups are affected.
See Also:
Oracle Database 2 Day DBA for an Overview of Disks, Disk Groups, and Failure Groups and a description of normal redundancy, high redundancy and external redundancy
Oracle Database Backup and Recovery User's Guide for details about setting up and sizing the fast recovery area
Oracle Automatic Storage Management Administrator's Guide for details about Oracle ASM
Oracle Automatic Storage Management Administrator's Guide for an overview of Oracle Automatic Storage Management
Oracle Automatic Storage Management Administrator's Guide for information about creating disk groups
'/devices/lun1','/devices/lun2','/devices/lun3','/devices/lun4';
See Also:
Oracle Automatic Storage Management Administrator's Guide for information on Oracle ASM Mirroring and Disk Group Redundancy
Oracle Database 2 Day DBA for information on Creating a Disk Group
If every disk is available through every I/O path, as would be the case if using disk multipathing software, then keep each disk in its own failure group. This is the default Oracle ASM behavior if creating a disk group without explicitly defining failure groups.
CREATE DISKGROUP DATA NORMAL REDUNDANCY
  DISK
    '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4',
    '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
For an array with two controllers where every disk is seen through both controllers, create a disk group with each disk in its own failure group:
CREATE DISKGROUP DATA NORMAL REDUNDANCY
  DISK
    '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4',
    '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
If not every disk is available through every I/O path, then define failure groups to protect against the piece of hardware that you are concerned about failing. Here are some examples:
For an array with two controllers where each controller sees only half the drives, create a disk group with two failure groups, one for each controller, to protect against controller failure:
CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP controller1 DISK
    '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4'
  FAILGROUP controller2 DISK
    '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
For a storage network with multiple storage arrays where you want to mirror across the arrays, create a disk group with two failure groups, one for each array, to protect against array failure:
CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP array1 DISK
    '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4'
  FAILGROUP array2 DISK
    '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
When determining the proper size of a disk group that is protected with Oracle ASM redundancy, enough free space must exist in the disk group so that when a disk fails Oracle ASM can automatically reconstruct the contents of the failed drive to other drives in the disk group while the database remains online. The amount of space required to ensure Oracle ASM can restore redundancy following disk failure is in the column REQUIRED_MIRROR_FREE_MB in the V$ASM_DISKGROUP view. The amount of free space that you can use safely in a disk group, taking mirroring into account, and still be able to restore redundancy after a disk failure is in the USABLE_FILE_MB column in the V$ASM_DISKGROUP view. The value of the USABLE_FILE_MB column should always be greater than zero. If USABLE_FILE_MB falls below zero, then add more disks to the disk group.
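The relationship between these columns can be sketched as a small calculation. This is an illustrative Python sketch, not an Oracle utility: it assumes that usable space equals free space minus the space reserved for restoring redundancy, divided by the number of extent copies kept (one for external, two for normal, three for high redundancy). The function names and the sample values are hypothetical.

```python
# Illustrative sketch only; not an Oracle utility.
# Assumed number of extent copies kept for each redundancy type.
MIRROR_COPIES = {"EXTERNAL": 1, "NORMAL": 2, "HIGH": 3}


def usable_file_mb(free_mb, required_mirror_free_mb, redundancy):
    """Space usable for new files while redundancy can still be restored."""
    copies = MIRROR_COPIES[redundancy.upper()]
    return (free_mb - required_mirror_free_mb) / copies


def needs_more_disks(free_mb, required_mirror_free_mb, redundancy):
    """True when the usable space is not greater than zero."""
    return usable_file_mb(free_mb, required_mirror_free_mb, redundancy) <= 0
```

For example, with hypothetical values FREE_MB = 3000 and REQUIRED_MIRROR_FREE_MB = 1000 in a normal redundancy disk group, `usable_file_mb(3000, 1000, "NORMAL")` gives 1000 MB of safely usable space.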
See:
Group
4.3.4 Ensure Disks in the Same Disk Group Have the Same Characteristics
Although ensuring that all disks in the same disk group have the same size and performance characteristics is not required, doing so provides more predictable overall performance and space utilization. When possible, present physical disks (spindles) to Oracle ASM as opposed to Logical Unit Numbers (LUNs) that create a layer of abstraction between the disks and Oracle ASM. If the disks are the same size, then Oracle ASM spreads the files evenly across all of the disks in the disk group. This allocation pattern maintains every disk at the same capacity level and ensures that all of the disks in a disk group have the same I/O load. Because Oracle ASM load balances workload among all of the disks in a disk group, different Oracle ASM disks should not share the same physical drive.
See Also: Oracle Automatic Storage Management Administrator's Guide for complete information about administering Oracle ASM disk groups
Note: If you have purchased a high-end storage array that has redundancy features built in, then you can optionally use those features from the vendor to perform the mirroring protection functions and set the Oracle ASM disk group to external redundancy. Along the same lines, use Oracle ASM normal or high redundancy with low-cost storage and Exadata storage.
See Also: Oracle Automatic Storage Management Administrator's Guide for more information on Oracle ACFS
Note:
For more information about using the combination of ASMLib and Multipath Disks, see "Configuring Oracle ASMLib on Multipath Disks" in My Oracle Support Note 309815.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=309815.1
See Also:
Oracle Automatic Storage Management Administrator's Guide for information on Oracle ASM and Multipathing
For more information, see "Oracle ASM and Multi-Pathing Technologies" in My Oracle Support Note 294869.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=294869.1
where n is the number of database instances connecting to the Oracle ASM instance.
See Also:
Oracle Automatic Storage Management Administrator's Guide for information on Oracle ASM Parameter Setting Recommendations
Oracle Database Administrator's Guide for more information about setting the PROCESSES initialization parameter
Oracle Database Reference for more information about the PROCESSES parameter
Eliminates the need for every Oracle process to open a file descriptor for each Oracle ASM disk, thus improving system resource usage.
Simplifies the management of disk device names, makes the discovery process simpler, and removes the challenge of having disks added to one node and not be known to other nodes in the cluster.
Eliminates the impact when the mappings of disk device names change upon system reboot.
Note:
See Also:
Oracle Database 2 Day + Real Application Clusters Guide for more information on installing ASMLib
Oracle ASMLib Web site at http://www.oracle.com/technetwork/topics/linux/asmlib/index-101839.html
This rule of thumb assumes that read-only tablespaces are not being shared across multiple databases.
See Also: Oracle Automatic Storage Management Administrator's Guide for information about authentication to access Oracle ASM instances
4.5.2 Set Rebalance to the Maximum Limit that Does Not Affect Service Levels
Higher Oracle ASM rebalance power limits make a rebalance operation run faster but can also affect application service levels. Rebalancing takes longer with lower power values, but consumes fewer processing and I/O resources that are shared by other applications, such as the database. After performing planned maintenance, for example adding or removing storage, you must perform a rebalance to spread data across all of the disks. A power limit is associated with the rebalance, and you can set it to specify how many processes perform the rebalance. If you do not want the rebalance to impact applications, then set the power limit lower. If you want the rebalance to finish quickly, then set the power limit higher. To determine the default power limit for rebalances, check the value of the ASM_POWER_LIMIT initialization parameter in the Oracle ASM instance. If the POWER clause is not specified in an ALTER DISKGROUP statement, or when a rebalance runs implicitly because you add or drop a disk, then the rebalance power defaults to the value of the ASM_POWER_LIMIT initialization parameter. You can adjust the value of this parameter dynamically.
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about rebalancing Oracle ASM disk groups
Note: The ALTER DISKGROUP...MOUNT command only works on one node. For cluster installations use the following command:
srvctl start diskgroup -g
See Also: Oracle Automatic Storage Management Administrator's Guide for information about mounting and dismounting disk groups
See Also: Oracle Automatic Storage Management Administrator's Guide for information on Altering Disk Groups
To check for an imbalance on all mounted disk groups, see "Script to Report the Percentage of Imbalance in all Mounted Diskgroups" in My Oracle Support Note 367445.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=367445.1
To check for an imbalance from an I/O perspective, query the statistics in the V$ASM_DISK_IOSTAT view before and after running a large SQL*Plus statement. For example, if you run a large query that performs only read I/O, the READS and BYTES_READ columns should be approximately the same for all disks in the disk group.
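The before-and-after comparison described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the two snapshots are plain dictionaries mapping a disk name to its READS value (how they are collected from V$ASM_DISK_IOSTAT is left out), and the imbalance measure (largest deviation from the mean, as a percentage of the mean) is one hypothetical choice rather than the formula used by the My Oracle Support script.

```python
# Illustrative sketch only; snapshot collection from V$ASM_DISK_IOSTAT is not shown.

def read_deltas(before, after):
    """Per-disk increase in READS between two snapshots (disk name -> READS)."""
    return {disk: after[disk] - before.get(disk, 0) for disk in after}


def imbalance_pct(deltas):
    """Largest deviation from the mean per-disk delta, as a percentage of the mean."""
    values = list(deltas.values())
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return max(abs(v - mean) for v in values) / mean * 100.0
```

A result close to zero means that the read I/O was spread approximately evenly across the disks during the test query; a large value suggests that some disks received noticeably more or less I/O than others.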
As a best practice to create and drop Oracle ASM disk groups, use SQL*Plus, ASMCA, or Oracle Enterprise Manager.
See Also: Oracle Automatic Storage Management Administrator's Guide for more information on ASMCMD Disk Group Management Commands
Oracle ASM and third-party storage using external redundancy.
Oracle ASM and Oracle Exadata or third-party storage using Oracle ASM redundancy.
The Oracle Storage Grid with Exadata seamlessly supports MAA-related technology, improves performance, provides unlimited I/O scalability, is easy to use and manage, and delivers mission-critical availability and reliability to your enterprise.
See Also:
Set the DB_BLOCK_CHECKSUM initialization parameter to TYPICAL (default) or FULL. For more information, see Section 8.3.8.3, "Use DB_BLOCK_CHECKING=OFF and Set DB_BLOCK_CHECKSUM=FULL".
Note:
Oracle Exadata Database Machine also prevents corruptions from being written to disk by incorporating the hardware assisted resilient data (HARD) technology in its software. HARD uses block checking, in which the storage subsystem validates the Oracle block contents, preventing corrupted data from being written to disk. HARD checks in Oracle Exadata operate completely transparently and no parameters need to be set for this purpose at the database or storage tier. For more information see the White Paper "Optimizing Storage and Protecting Data with Oracle Database 11g" at
http://www.oracle.com/us/products/database/database-11g-managing-storage-wp-354099.pdf
Choose Oracle ASM redundancy type (NORMAL or HIGH) based on your desired protection level and capacity requirements
The NORMAL setting stores two copies of Oracle ASM extents, while the HIGH setting stores three copies of Oracle ASM extents. Normal redundancy provides more usable capacity and high redundancy provides more protection.
If a storage component is to be offlined when one or more databases are running, then verify that taking the storage component offline does not impact Oracle ASM disk group and database availability. Before dropping a failure group or offlining a storage component, perform the appropriate checks.
Ensure I/O performance can be sustained after an outage
Ensure that you have enough I/O bandwidth to support your service-level agreement if a failure occurs. For example, a typical case for a Storage Grid with n storage components would be to ensure that n-1 storage components could support the application service levels (for example, to handle a storage component failure).
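The n-1 check above can be sketched as a short calculation. This Python sketch assumes, for illustration only, that every storage component delivers the same bandwidth and that total bandwidth scales linearly with the number of surviving components; the function name and the figures are hypothetical.

```python
# Illustrative sketch only; assumes identical components and linear scaling.

def survives_component_failure(n_components, mbps_per_component,
                               required_mbps, failures=1):
    """True if the components left after the failures still meet the required bandwidth."""
    remaining = n_components - failures
    if remaining <= 0:
        return False
    return remaining * mbps_per_component >= required_mbps
```

With four components of 500 MB/s each, a service level that needs 1500 MB/s is still met after one component fails, whereas a service level that needs 1600 MB/s is not.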
Oracle Automatic Storage Management Administrator's Guide for information on the ASM_POWER_LIMIT initialization parameter
Oracle Automatic Storage Management Administrator's Guide for information on tuning rebalance operations
Oracle Database Reference for more on the ASM_POWER_LIMIT initialization parameter
5
The best practices discussed in this chapter apply to Oracle Database Release 11g High Availability architectures. This chapter describes best practices for configuring all Oracle databases, including single-instance, Oracle RAC databases, Oracle RAC One Node databases, and the primary and standby databases in Oracle Data Guard configurations (for more information on High Availability architectures, see Oracle Database High Availability Overview). Adopt these best practices to reduce or avoid outages, reduce the risk of corruption, and to improve recovery performance. This chapter includes the following sections:
Database Configuration High Availability and Fast Recoverability Best Practices
Recommendations to Improve Manageability
See Also: Oracle Database High Availability Overview for more information on high availability architectures
5.1 Database Configuration High Availability and Fast Recoverability Best Practices
Use the following best practices to reduce recovery time and increase database availability and redundancy:
Set the Database ARCHIVELOG Mode and FORCE LOGGING Mode
Configure the Size of Redo Log Files and Groups Appropriately
Use a Fast Recovery Area
Enable Flashback Database
Set FAST_START_MTTR_TARGET Initialization Parameter
Protect Against Data Corruption
Set DISK_ASYNCH_IO Initialization Parameter
Set LOG_BUFFER Initialization Parameter to a Minimum 8 MB
Use Automatic Shared Memory Management
Disable Parallel Recovery for Instance Recovery
5.1.1 Set the Database ARCHIVELOG Mode and FORCE LOGGING Mode
Running the database in ARCHIVELOG mode and using database FORCE LOGGING mode are prerequisites for database recovery operations. The ARCHIVELOG mode enables online database backup and is necessary to recover the database to a point in time later than what has been restored. Features such as Oracle Data Guard and Flashback Database require that the production database run in ARCHIVELOG mode. If you can isolate data that never needs to be recovered within specific tablespaces, then you can use tablespace level FORCE LOGGING attributes instead of the database FORCE LOGGING mode.
See Also:
Oracle Database Administrator's Guide for more information about controlling archiving mode
Oracle Database Administrator's Guide for information on Specifying FORCE LOGGING Mode
See Reduce Overhead and Redo Volume During ETL Operations in the technical white paper, "Oracle Data Guard: Disaster Recovery for Oracle Exadata Database Machine" from the MAA Best Practices area for Exadata Database Machine at http://www.oracle.com/goto/maa
5.1.2 Configure the Size of Redo Log Files and Groups Appropriately
Use Oracle log multiplexing to create multiple redo log members in each redo group, one in the data area and one in the Fast Recovery Area (unless the redo logs are in an Oracle ASM high redundancy disk group). This protects against a failure involving the redo log, such as a disk or I/O failure for one member, or a user error that accidentally removes a member through an operating system command. If at least one redo log member is available, then the instance can continue to function.
Best Practices for Sizing Redo Log Files and Groups
Use a minimum of three redo log groups: this helps prevent the log writer process (LGWR) from waiting for a group to be available following a log switch.
Make all online redo logs and standby redo logs the same size.
Use redo log size = 4GB or redo log size >= peak redo rate x 20 minutes
Locate redo logs on high performance disks.
Place log files in a high redundancy disk group, or multiplex log files across different normal redundancy disk groups, if using ASM redundancy.
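The rate-based sizing rule above can be worked through with a short calculation. This Python sketch is illustrative only: the peak redo rate is a hypothetical input that you would take from your own workload statistics, and the helper that compares the 4 GB option against the rate-based size reflects one reading of the rule, not a statement of how Oracle intends the two options to be combined.

```python
# Illustrative sketch only; the peak redo rate is a hypothetical input.
LOG_WINDOW_MINUTES = 20        # from the sizing rule above
FIXED_SIZE_MB = 4 * 1024       # the 4 GB option from the sizing rule above


def redo_log_size_from_rate_mb(peak_redo_mb_per_sec):
    """Redo log size (MB) that holds 20 minutes of redo at the peak rate."""
    return peak_redo_mb_per_sec * 60 * LOG_WINDOW_MINUTES


def four_gb_covers_peak(peak_redo_mb_per_sec):
    """True if a 4 GB log is at least as large as the rate-based size."""
    return FIXED_SIZE_MB >= redo_log_size_from_rate_mb(peak_redo_mb_per_sec)
```

For example, at a hypothetical peak redo rate of 2 MB/s the rate-based size is 2400 MB, so a 4 GB log is large enough; at 5 MB/s the rate-based size is 6000 MB, which exceeds 4 GB.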
Note:
See Also:
Chapter 8, "Configuring Oracle Data Guard"
Oracle Database Administrator's Guide for more information about managing redo logs
Oracle Database Administrator's Guide for information on Multiplexing Redo Log Files
Oracle Data Guard Concepts and Administration for more information about online, archived, and standby redo log files
DB_RECOVERY_FILE_DEST: specifies the default location for the fast recovery area.
DB_RECOVERY_FILE_DEST_SIZE: specifies (in bytes) the hard limit on the total space to be used by database recovery files created in the recovery area location.
The Oracle Suggested Backup Strategy described in the Oracle Database 2 Day DBA recommends using the fast recovery area as the primary location for recovery. When the fast recovery area is properly sized, files needed for repair are readily available. The minimum recommended disk limit is the combined size of the database, incremental backups, all archived redo logs that have not been copied to tape, and flashback logs.
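The minimum disk limit described above is a simple sum, sketched here in Python. The component sizes are hypothetical inputs that you would estimate for your own database, and the conversion to bytes (using 1024-based gigabytes) is included only because DB_RECOVERY_FILE_DEST_SIZE is described as a value in bytes.

```python
# Illustrative sketch only; all sizes are hypothetical estimates in GB.

def min_fra_size_gb(database_gb, incremental_backups_gb,
                    archived_logs_not_on_tape_gb, flashback_logs_gb):
    """Combined size of the items that the fast recovery area must hold."""
    return (database_gb + incremental_backups_gb
            + archived_logs_not_on_tape_gb + flashback_logs_gb)


def gb_to_bytes(size_gb):
    """Convert gigabytes (1024-based) to bytes."""
    return size_gb * 1024 ** 3
```

For a hypothetical 500 GB database with 50 GB of incremental backups, 40 GB of archived redo logs not yet copied to tape, and 25 GB of flashback logs, the minimum recommended limit is 615 GB.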
See Also:
Oracle Database Administrator's Guide for information on Specifying a Fast Recovery Area
Oracle Database Backup and Recovery User's Guide for detailed information about sizing the fast recovery area and setting the retention period
Oracle Database 2 Day DBA
Know your application performance baseline before you enable flashback to help determine the overhead and to assess the application workload implications of turning on flashback database.
Ensure the fast recovery area space is sufficient to hold the flashback database flashback logs. For more information on sizing the fast recovery area, see the Oracle Database Backup and Recovery User's Guide. As a general rule of thumb, the volume of flashback log generation is approximately the same order of magnitude as redo log generation. For example, if you intend to set DB_FLASHBACK_RETENTION_TARGET to 24 hours, and if the database generates 20 GB of redo in a day, then allow 20 GB to 30 GB of disk space for the flashback logs. The same rule applies for guaranteed restore points. For example, if the database generates 20 GB of redo every day, and if the guaranteed restore point will be kept for a day, then plan to allocate 20 GB to 30 GB.
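The rule of thumb above can be expressed as a small calculation. In this illustrative Python sketch the lower bound equals the redo generated over the retention period, and the upper bound applies a headroom factor of 1.5; that factor is an assumption inferred from the 20 GB to 30 GB example, not a documented constant.

```python
# Illustrative sketch only; the 1.5 headroom factor is inferred from the
# 20 GB to 30 GB example and is not a documented Oracle constant.

def flashback_space_range_gb(redo_gb_per_day, retention_hours, headroom=1.5):
    """Return (low, high) disk space estimates in GB for flashback logs."""
    base = redo_gb_per_day * retention_hours / 24.0
    return base, base * headroom
```

Calling `flashback_space_range_gb(20, 24)` reproduces the 20 GB to 30 GB range from the example; halving the retention period to 12 hours halves both bounds.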
An additional method to determine fast recovery area sizing is to enable flashback database and allow the database to run for a short period (2-3 hours). The estimated amount of space required for the fast recovery area can be retrieved by querying V$FLASHBACK_DATABASE_STAT.ESTIMATED_FLASHBACK_SIZE. Note that DB_FLASHBACK_RETENTION_TARGET is a target, and there is no guarantee that you can flash back the database that far. If there is space pressure in the fast recovery area where the flashback logs are stored, then the oldest flashback logs may be deleted. For a detailed explanation of the fast recovery area deletion rules, see the Oracle Database Backup and Recovery User's Guide. To guarantee a flashback point-in-time, you must use guaranteed restore points.
Set the Oracle Enterprise Manager monitoring metric, "Recovery Area Free Space (%)" for proactive alerts of space issues with the fast recovery area.
Ensure there is sufficient I/O bandwidth to the fast recovery area. Insufficient I/O bandwidth with flashback database on is usually indicated by a high occurrence of the "FLASHBACK BUF FREE BY RVWR" wait event in an Automatic Workload Repository (AWR) report.
Set the LOG_BUFFER initialization parameter to at least 8 MB to give flashback database more buffer space in memory. For large databases with more than a 4GB SGA, you may consider setting LOG_BUFFER to values in the range of 32-64 MB (for more information on LOG_BUFFER and valid values on 32-bit and 64-bit operating systems, see Oracle Database Reference).
Set the parameter PARALLEL_EXECUTION_MESSAGE_SIZE to at least 8192. This improves the media recovery phase of any flashback database operation.
If you have a Data Guard standby database, always set DB_FLASHBACK_RETENTION_TARGET to the same value on the standby database(s) as the primary. Set DB_FLASHBACK_RETENTION_TARGET initialization parameter to the largest value prescribed by any of the following conditions that apply:
To leverage flashback database to reinstate your failed primary database after Data Guard failover, for most cases set DB_FLASHBACK_RETENTION_TARGET to a minimum of 60 (mins) to enable reinstatement of a failed primary. Consider cases where there are multiple outages, for example, first a network outage, followed later by a primary database outage, that may result in a transport lag between primary and standby database at failover time. For such cases set DB_FLASHBACK_RETENTION_TARGET to a value equal to the sum of 60 (mins) plus the maximum transport lag that you want to accommodate. This ensures that the failed primary database can be flashed back to an SCN that precedes the SCN at which the standby became primary - a requirement for primary reinstatement.
If using Flashback Database for fast point in time recovery from user error or logical corruptions, set DB_FLASHBACK_RETENTION_TARGET to a value equal to the farthest time in the past that you want to be able to recover to.
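Choosing the largest value among the applicable conditions can be sketched as follows. This Python sketch is illustrative only: the function and argument names are hypothetical, all values are in minutes, and the two conditions modeled are the ones listed above (reinstatement after failover, and point in time recovery from user error or logical corruption).

```python
# Illustrative sketch only; all values are in minutes.
REINSTATE_MINUTES = 60  # minimum suggested above for primary reinstatement


def flashback_retention_target_minutes(reinstate_after_failover=False,
                                       max_transport_lag_minutes=0,
                                       farthest_recovery_minutes=0):
    """Largest retention target prescribed by the conditions that apply."""
    candidates = [farthest_recovery_minutes]
    if reinstate_after_failover:
        # 60 minutes plus the maximum transport lag to accommodate.
        candidates.append(REINSTATE_MINUTES + max_transport_lag_minutes)
    return max(candidates)
```

For example, reinstatement with a 30 minute maximum transport lag gives 90 minutes, but if you also want to recover from user errors up to a day old (1440 minutes), then the larger value of 1440 applies.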
Review Oracle Database Backup and Recovery User's Guide for information on Configuring the Fast Recovery Area. To monitor the progress of a flashback database operation you can query the V$SESSION_LONGOPS view. An example query to monitor progress is:
select * from v$session_longops where opname like 'Flashback%';
For repetitive tests where you must flashback to the same point, use Flashback database guaranteed restore points (GRP) instead of enabling flashback database to minimize space utilization.
In general, the performance effect of enabling Flashback Database is minimal. In 11.2.0.2 there are significant performance enhancements to nearly eliminate any overhead when you first enable flashback database, and during batch direct loads.
For more information, see "Flashback Database Best Practices & Performance" in My Oracle Support Note 565535.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=565535.1
See Also:
Oracle Database Backup and Recovery User's Guide for more information about guaranteed restore points and Flashback Database
Oracle Database Backup and Recovery User's Guide for information on configuring the environment for optimal Flashback Database performance
Oracle Database Backup and Recovery User's Guide for information on configuring Oracle Flashback Database and Restore Points
Oracle Data Guard Concepts and Administration for information on Using Flashback Database After a Role Transition
Oracle Data Guard Concepts and Administration for information on Converting a Failed Primary Into a Standby Database Using Flashback Database
Oracle Database 2 Day + Performance Tuning Guide for information on Gathering Database Statistics Using the Automatic Workload Repository (AWR)
The MAA white paper "Active Data Guard 11g Best Practices (includes best practices for Redo Apply)" for more information on media recovery best practices from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Initially, set the FAST_START_MTTR_TARGET initialization parameter to 300 (seconds) or to the value required for your expected recovery time objective (RTO). Outage testing for cases such as node or instance failures during peak loads is recommended.
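A minimal sketch of this setting, assuming the instance uses a server parameter file and that the initial value of 300 seconds suits your RTO:

```sql
-- Initial value from the recommendation above; replace 300 with your RTO in seconds.
ALTER SYSTEM SET FAST_START_MTTR_TARGET=300 SCOPE=BOTH SID='*';

-- During outage testing, compare the effective target with the current estimate.
SELECT target_mttr, estimated_mttr
FROM   v$instance_recovery;
```

Checking ESTIMATED_MTTR under peak load gives an indication of whether the configured target is realistic for the workload.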
See Also:
Oracle Database Performance Tuning Guide for information on Tuning Instance Recovery Performance: Fast-Start Fault Recovery
Oracle Database Backup and Recovery User's Guide for more information about Fast-Start Fault Recovery
The MAA white paper "Optimizing Availability During Unplanned Outages Using Oracle Clusterware and Oracle RAC" for more best practices from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
For more information see the "Preventing, Detecting, and Repairing Block Corruption: Database 11g" MAA white paper from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
For more information, see "Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration" in My Oracle Support Note 1302539.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1302539.1
Use Oracle Data Guard with physical standby databases to prevent widespread block corruption. Oracle Data Guard is the best solution for protecting Oracle data against data loss and corruption, and lost writes. For more information, see Section 8.3, "General Data Guard Configuration Best Practices".
Set the Oracle Database block-corruption initialization parameters on the Data Guard primary and standby databases:
On the Primary database set...      On the Standby databases set...
DB_BLOCK_CHECKSUM=FULL              DB_BLOCK_CHECKSUM=FULL
DB_LOST_WRITE_PROTECT=TYPICAL       DB_LOST_WRITE_PROTECT=TYPICAL
DB_BLOCK_CHECKING=FULL              DB_BLOCK_CHECKING=OFF
Note:
Performance overhead is incurred on every block change, therefore performance testing is of particular importance when setting the DB_BLOCK_CHECKING parameter. A thorough performance assessment is recommended when changing these settings.
For more information, see Section 8.3.8, "Use Data Guard Redo Apply Best Practices".
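The block-corruption parameter settings for the primary and standby databases can be sketched as follows. The sketch assumes each database uses a server parameter file and that the performance effect has been tested beforehand, as the note advises.

```sql
-- On the primary database
ALTER SYSTEM SET DB_BLOCK_CHECKSUM=FULL SCOPE=BOTH SID='*';
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT=TYPICAL SCOPE=BOTH SID='*';
ALTER SYSTEM SET DB_BLOCK_CHECKING=FULL SCOPE=BOTH SID='*';

-- On each standby database
ALTER SYSTEM SET DB_BLOCK_CHECKSUM=FULL SCOPE=BOTH SID='*';
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT=TYPICAL SCOPE=BOTH SID='*';
ALTER SYSTEM SET DB_BLOCK_CHECKING=OFF SCOPE=BOTH SID='*';
```

If a role transition is likely, review these values afterward, because the recommended DB_BLOCK_CHECKING value differs between the primary and standby roles.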
Use Oracle Automatic Storage Management (Oracle ASM) to provide disk mirroring to protect against disk failures. For more information, see Section 4.2, "Use Automatic Storage Management (Oracle ASM) to Manage Database Files".
Use Oracle ASM HIGH REDUNDANCY for optimal corruption repair. Using Oracle ASM redundancy for disk groups provides mirrored extents that can be used by the database if an I/O error or corruption is encountered. For continued protection, Oracle ASM redundancy provides the ability to move an extent to a different area on a disk if an I/O error occurs. The Oracle ASM redundancy mechanism is useful if you have bad sectors returning media sense errors. For more information, see Section 4.3.2, "Use Redundancy to Protect from Disk Failure".
Use the Oracle Active Data Guard option for automatic block repair. For more information about Active Data Guard, see Section 8.5, "Use Oracle Active Data Guard Best Practices".
Configure and use Data Recovery Advisor to automatically diagnose data failures. For more information, see Section 5.2.2, "Use Data Recovery Adviser to Detect, Analyze and Repair Data Failures".
Enable Flashback Technologies for fast point-in-time recovery from logical corruptions most often caused by human error and for fast reinstatement of a primary database following failover. For more information, see Section 5.1.4, "Enable Flashback Database".
Implement a backup and recovery strategy with Recovery Manager (RMAN) and periodically use the RMAN BACKUP VALIDATE CHECK LOGICAL... scan to detect corruptions. For more information, see Chapter 9, "Configuring Backup and Recovery."
Use RMAN and Oracle Secure Backup for additional block checks during backup and restore operations.
Use Enterprise Manager to monitor the availability of all discovered targets and detect errors and alerts. You can also review all targets in a single view from the HA Console. For more information about Enterprise Manager, see Chapter 12, "Monitoring for High Availability".
Query the V$DATABASE_BLOCK_CORRUPTION view, which is automatically updated when block corruption is detected or repaired.
Configure Data Recovery Advisor to automatically diagnose data failures, determine and present appropriate repair options, and perform repair operations at the user's request. See Section 5.2.2, "Use Data Recovery Adviser to Detect, Analyze and Repair Data Failures" for more information.
Note: Data Recovery Advisor integrates with the Oracle Enterprise Manager Support Workbench (Support Workbench), the Health Monitor, and RMAN.
Use Data Guard to detect physical corruptions and to detect lost writes. Data Guard can detect physical corruptions when the apply process stops due to a corrupted block in the redo stream or when it detects a lost write. Use Enterprise Manager to manage and monitor your Data Guard configuration. By taking advantage of Automatic Block Media Recovery, a corrupt block found on either a primary database or a physical standby database can be fixed automatically when the Active Data Guard option is used. For more information on Automatic Block Media Recovery, see Section 13.2.6.2, "Use Active Data Guard".
Use SQL*Plus to detect data file corruptions and interblock corruptions. Issue the ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE statement from SQL*Plus. After determining the corruptions, the table can be re-created or another action can be taken.
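The two SQL-based detection methods described above can be sketched as follows. The table name hr.employees is a hypothetical placeholder for whichever table you want to validate.

```sql
-- hr.employees is a hypothetical table; CASCADE also validates its indexes.
ANALYZE TABLE hr.employees VALIDATE STRUCTURE CASCADE;

-- List the corrupt blocks that the database has recorded so far.
SELECT file#, block#, blocks, corruption_change#, corruption_type
FROM   v$database_block_corruption
ORDER  BY file#, block#;
```

An empty result from the second query means only that no corruption has been recorded yet. It does not show that every block has been checked.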
See Also:
Section 13.2.6, "Recovering from Data Corruption"
Oracle Data Guard Concepts and Administration for more information on Oracle Active Data Guard option and the Automatic Block Repair feature
Oracle Database Backup and Recovery User's Guide for information on Performing Block Media Recovery
To explicitly enable asynchronous I/O, set the DISK_ASYNCH_IO initialization parameter to TRUE:
ALTER SYSTEM SET DISK_ASYNCH_IO=TRUE SCOPE=SPFILE SID='*';
Note that if you are using Oracle ASM, it performs I/O asynchronously by default.
See Also:
Oracle Database Reference for more information on the DISK_ASYNCH_IO initialization parameter
Oracle Database Performance Tuning Guide for information on Configuring and Using the Redo Log Buffer
Oracle Database Reference for more information on LOG_BUFFER and valid values on 32-bit and 64-bit operating systems
For more information on using a buffer hit rate histogram for determining optimal size for log buffer to support redo transport, see "View X$LOGBUF_READHIST and In-Memory Log Buffer Hit Rate Histogram" in My Oracle Support Note 951152.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=951152.1
Section 5.1.5, "Set FAST_START_MTTR_TARGET Initialization Parameter"
Oracle Database Reference for more information on the RECOVERY_PARALLELISM parameter
Use Data Recovery Adviser to Detect, Analyze and Repair Data Failures
Use Automatic Performance Tuning Features
Use a Server Parameter File
Use Automatic Undo Management
Use Locally Managed Tablespaces
Use Automatic Segment Space Management
Use Temporary Tablespaces and Specify a Default Temporary Tablespace
Use Resumable Space Allocation
Use Database Resource Manager
Chapter 6, "Configuring Oracle Database with Oracle Clusterware"
Oracle Database Administrator's Guide for information on configuring Oracle Restart
5.2.2 Use Data Recovery Adviser to Detect, Analyze and Repair Data Failures
Use Data Recovery Advisor to quickly diagnose data failures, determine and present appropriate repair options, and execute repairs at the user's request. In this context, a data failure is a corruption or loss of persistent data on disk. By providing a centralized tool for automated data repair, Data Recovery Advisor improves the manageability and reliability of an Oracle database and thus helps reduce the Mean Time To Recover (MTTR). Data Recovery Advisor can diagnose failures based on symptoms, such as:
Components that are not accessible because they do not exist, do not have the correct access permissions, are taken offline, and so on
Physical corruptions such as block checksum failures, invalid block header field values, and so on
Logical corruptions caused by software bugs
Incompatibility failures caused by an incorrect version of a component
I/O failures such as a limit on the number of open files exceeded, channels inaccessible, network or I/O errors, and so on
Configuration errors such as an incorrect initialization parameter value that prevents the opening of the database
If failures are diagnosed, then they are recorded in the Automatic Diagnostic Repository (ADR). Data Recovery Advisor intelligently determines recovery strategies by:
Generating repair advice and repairing failures only after failures have been detected by the database and stored in ADR
Aggregating failures for efficient recovery
Presenting only feasible recovery options
Indicating any data loss for each option
Typically, Data Recovery Advisor presents both automated and manual repair options. If appropriate, you can choose to have Data Recovery Advisor automatically perform a repair, verify the repair success, and close the relevant repaired failures.
Note:
In the current release, Data Recovery Advisor only supports single-instance databases. Oracle RAC databases are not supported. See Oracle Database Backup and Recovery User's Guide for more information on Data Recovery Advisor supported database configurations.
See Also:
Section 13.2.6, "Recovering from Data Corruption" for more information on using Data Recovery Advisor
Oracle Database Backup and Recovery User's Guide for information on diagnosing and repairing failures with Data Recovery Advisor
Automatic Workload Repository (AWR)
Automatic Database Diagnostic Monitor (ADDM)
SQL Tuning Advisor
SQL Access Advisor
Active Session History Reports (ASH)
When using Automatic Workload Repository (AWR), consider the following best practices:
Create a baseline of performance data to be used for comparison purposes should problems arise. This baseline should be representative of the peak load on the system.
Set the AWR automatic snapshot interval to 10-20 minutes to capture performance peaks during stress testing or to diagnose performance issues. Under usual workloads a 60-minute interval is sufficient.
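These two AWR practices can be sketched with the DBMS_WORKLOAD_REPOSITORY package. The 15-minute interval, the snapshot IDs 100 and 104, and the baseline name are illustrative assumptions. Choose snapshot IDs that bracket a representative peak-load period on your own system.

```sql
-- Shorten the snapshot interval (in minutes) for stress testing or diagnosis.
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 15);
END;
/

-- Find the snapshot IDs that bracket the peak-load period.
SELECT snap_id, begin_interval_time, end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id;

-- Preserve that period as a baseline; the IDs and name below are hypothetical.
BEGIN
  DBMS_WORKLOAD_REPOSITORY.CREATE_BASELINE(
    start_snap_id => 100,
    end_snap_id   => 104,
    baseline_name => 'peak_load_baseline');
END;
/
```

After testing, the interval can be returned to 60 minutes with the same MODIFY_SNAPSHOT_SETTINGS call.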
See Also: Oracle Database Performance Tuning Guide for more information on Managing the Automatic Workload Repository
Oracle Database Administrator's Guide for information about managing initialization parameters with an SPFILE
Oracle Real Application Clusters Administration and Deployment Guide for information on initialization parameters with Real Application Clusters
Oracle Data Guard Broker for information on other prerequisites for using the broker
UNDO_RETENTION Specify the desired time in seconds to retain undo data. Set this parameter to the same value on all instances.
Advanced object recovery features, such as Flashback Query, Flashback Version Query, Flashback Transaction Query, and Flashback Table, require automatic undo management. The success of these features depends on the availability of undo information to view data as of a previous point in time. By default, Oracle Database automatically tunes undo retention by collecting database usage statistics and estimating undo capacity needs. Unless you enable retention guarantee for the undo tablespace (by specifying the RETENTION GUARANTEE clause on either the CREATE DATABASE or the CREATE UNDO TABLESPACE statement), Oracle Database may reduce the undo retention below the specified UNDO_RETENTION value.
Note:
By default, ongoing transactions can overwrite undo data even if the UNDO_RETENTION parameter setting specifies that the undo data should be maintained. To guarantee that unexpired undo data is not overwritten, you must enable RETENTION GUARANTEE for the undo tablespace.
If there is a requirement to use Flashback technology features, the best practice recommendation is to enable RETENTION GUARANTEE for the undo tablespace and set a value for UNDO_RETENTION based on the following guidelines:
1. Establish how long it would take to detect when erroneous transactions have been carried out. Multiply this value by two.
2. Use the Undo Advisor to compute the minimum undo tablespace size based on setting UNDO_RETENTION to the value recommended in step 1.
3. If the undo tablespace has the AUTOEXTEND option disabled, allocate enough space as determined in step 2 or reduce the value of the UNDO_RETENTION parameter.
4. If the undo tablespace has the AUTOEXTEND option enabled, make sure there is sufficient disk space available to extend the datafiles to the size determined in step 2. Make sure the autoextend MAXSIZE value you specified is large enough.
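As a sketch of these guidelines, assume that erroneous transactions are typically detected within one hour, so that step 1 yields 2 x 3600 = 7200 seconds. Both that detection time and the undo tablespace name undotbs1 are hypothetical. In an Oracle RAC database, each instance has its own undo tablespace, so the ALTER TABLESPACE statement would be repeated for each one.

```sql
-- Step 1: assumed detection time of 1 hour, doubled = 7200 seconds.
ALTER SYSTEM SET UNDO_RETENTION=7200 SCOPE=BOTH SID='*';

-- Enable retention guarantee; undotbs1 is a hypothetical undo tablespace name.
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;

-- Confirm the setting: RETENTION shows GUARANTEE or NOGUARANTEE for undo tablespaces.
SELECT tablespace_name, retention
FROM   dba_tablespaces
WHERE  contents = 'UNDO';
```

Sizing the tablespace (steps 2 through 4) remains a task for the Undo Advisor. This sketch covers only the parameter and the guarantee clause.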
With the RETENTION GUARANTEE option, if the tablespace is configured with less space than the transaction throughput requires, then the following sequence of events occurs:
1. If you have an autoextensible file, then the file automatically grows to accommodate the retained undo data.
2. A warning alert reports the disk is at 85% full.
3. A critical alert reports the disk is at 97% full.
4. Transactions receive an out-of-space error.
See Also:
Oracle Database 2 Day DBA for information about computing the minimum undo tablespace size using the Undo Advisor
Oracle Database Administrator's Guide for more information about the UNDO_RETENTION setting and the size of the undo tablespace
performance tuning related to space management. It facilitates management of free space within objects such as tables or indexes, improves space utilization, and provides significantly better performance and scalability with simplified administration. The automatic segment space management feature is enabled by default for all tablespaces created using default attributes.
See Also: Oracle Database Administrator's Guide for more information on automatic segment space management
Using the default temporary tablespace ensures that all disk sorting occurs in a temporary tablespace and that other tablespaces are not mistakenly used for sorting.
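A minimal sketch of assigning a default temporary tablespace, assuming a temporary tablespace named temp already exists (the name is hypothetical):

```sql
-- temp is a hypothetical, already existing temporary tablespace.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp;

-- Verify which tablespace is now the database default.
SELECT property_value
FROM   database_properties
WHERE  property_name = 'DEFAULT_TEMP_TABLESPACE';
```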
See Also: Oracle Database Administrator's Guide for more information about managing tablespaces
database is available but users are not getting the level of performance they need, then availability and service level objectives are not being met. Application performance, to a large extent, is affected by how resources are distributed among the applications that access the database. The main goal of the Resource Manager is to give the Oracle Database server more control over resource management decisions, thus circumventing problems resulting from inefficient operating system management and operating system resource managers. When you use the Resource Manager, consider the following best practices:
Use Enterprise Manager to manage resource plans.
When you test with the Resource Manager, ensure there is sufficient load on the system to make CPU resources scarce.
See Also:
Oracle Database Administrator's Guide for more information about Oracle Database Resource Manager
For information on configuring and troubleshooting Database Resource Manager, see "Resource Manager Training (11.2 features included)" in My Oracle Support Note 1119407.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1119407.1
6 Configuring Oracle Database with Oracle Clusterware
Oracle Clusterware enables servers to communicate with each other, so that they appear to function as a collective unit. Oracle Clusterware provides the infrastructure necessary to run Oracle Real Application Clusters (Oracle RAC) and Oracle RAC One Node. The Grid Infrastructure is the software that provides the infrastructure for an enterprise grid architecture. In a cluster, this software includes Oracle Clusterware and Oracle ASM. For a standalone server, the Grid Infrastructure includes Oracle Restart and Oracle ASM. Oracle Database 11g Release 2 combines these infrastructure products into one software installation called the Grid Infrastructure home. This chapter includes the following sections:
About Oracle Clusterware Best Practices
Oracle Clusterware Configuration Best Practices
Oracle Clusterware Operational Best Practices
Before installing Oracle RAC or Oracle RAC One Node, you must first install the Oracle Grid Infrastructure, which consists of Oracle Clusterware and Oracle ASM. After Oracle Clusterware and Oracle ASM are operational, you can use OUI to install the Oracle Database software with the Oracle RAC components.
Oracle Clusterware includes a high availability framework that provides an infrastructure to manage any application. Oracle Clusterware ensures the applications it manages start when the system starts and monitors the applications to make sure they are always available. If a process fails, then Oracle Clusterware attempts to restart the process using agent programs (agents). Oracle Clusterware provides built-in agents so that you can use shell or batch scripts to protect and manage an application. Oracle Clusterware also provides preconfigured agents for some applications (for example, for Oracle TimesTen In-Memory Database). If a node in the cluster fails, then you can program processes that normally run on the failed node to restart on another node. The monitoring frequency, starting, and stopping of the applications and the application dependencies are configurable.
Oracle RAC One Node is a single instance of an Oracle Real Application Clusters (Oracle RAC) database that runs on one node in a cluster. For information on working with Oracle RAC One Node, see Section 7.2, "Configuring Oracle Database with Oracle RAC One Node".
See Also:
Oracle Clusterware Administration and Deployment Guide for information on Making Applications Highly Available Using Oracle Clusterware
Oracle Database 2 Day + Real Application Clusters Guide for more information on installing Oracle Clusterware
Client-side load balancing and connection load balancing. For more information, see Section 6.2.6, "Use Client-Side and Server-Side Load Balancing".
Single Client Access Name (SCAN)
Fast Application Notification (FAN)
Fast Connection Failover (FCF) (ideally used by FAN-enabled clients)
Tip: For more information on using and configuring Fast Connection Failover (FCF), see Chapter 11, "Configuring Fast Connection Failover"
6.1.1.1 Services
The use of Fast Application Notification (FAN) requires the use of Services. Services de-couple any hardwired mapping between a connection request and an Oracle RAC instance. Services are an entity defined for an Oracle RAC database that allows the workload for an Oracle RAC database to be managed. Services divide the entire workload executing in the Oracle Database into mutually disjoint classes. Each service represents a workload with common attributes, service level thresholds, and priorities. The grouping is based on attributes of the work that might include the application being invoked, the application function to be invoked, the priority of execution for the application function, the job class to be managed, or the data range used in the application function or job class.
To manage workloads or a group of applications, you can define services that you assign to a particular application or to a subset of an application's operations. You can also group work by type under services. For example, online users can use one service, while batch processing can use another and reporting can use yet another service to connect to the database. Oracle recommends that all users who share a service have the same service level requirements. You can define specific characteristics for services and each service can represent a separate unit of work. There are many options that you can take advantage of when using services. Although you do not have to implement these options, using them helps optimize application performance.
Oracle Clusterware Administration and Deployment Guide for more information on Single Client Access Name (SCAN)
Oracle Database Net Services Administrator's Guide for more information on EZConnect
Use the Cluster Verification Utility (CVU)
Use a Local Home for Oracle Database and Oracle Clusterware with Oracle ASM
Ensure Services are Highly Available
Client Configuration and FAN Best Practices
Connect to Database Using Services and Single Client Access Name (SCAN)
Use Client-Side and Server-Side Load Balancing
Mirror Oracle Cluster Registry (OCR) and Configure Multiple Voting Disks with Oracle ASM
Use Company Wide Cluster Time Management
Verify That Oracle Clusterware, Oracle RAC, and Oracle ASM Use the Same Interconnect Network
Use Redundant Interconnect with Highly Available IP (HAIP)
Configure Failure Isolation with Intelligent Management Platform Interface (IPMI)
Oracle Clusterware Administration and Deployment Guide for information on the Cluster Verification Utility
6.2.2 Use a Local Home for Oracle Database and Oracle Clusterware with Oracle ASM
All rolling patch features require that the software home being patched is local, not shared. The software must be physically present in a local file system on each node in the cluster, not on a shared cluster file system. The reason for this requirement is that if a shared cluster file system is used, patching the software on one node affects all of the nodes, and would require that you shut down all components using the software on all nodes. Using a local file system allows software to be patched on one node without affecting the software on any other nodes. Note the following when you install Oracle Grid Infrastructure and configure Oracle Clusterware:
Oracle RAC databases require shared storage for the database files.
Configure Oracle Cluster Registry (OCR) and voting files to use Oracle ASM. For more information, see Section 6.2.7, "Mirror Oracle Cluster Registry (OCR) and Configure Multiple Voting Disks with Oracle ASM".
Oracle recommends that you install Oracle Database on local homes, rather than using a shared home on shared storage.
It is recommended to not use a shared file system for the Oracle Database Home (using a shared home prevents you from doing rolling upgrades, as all software running from that shared location must be stopped before it can be patched).
Oracle Clusterware and Oracle ASM are both installed in one home on a non-shared file system called the Grid Infrastructure home (Grid_home).
See Also: Oracle Database 2 Day + Real Application Clusters Guide for more information on installing Oracle ASM
instantaneously reconnect and continue working. Oracle Clusterware handles this responsibility and it is of utmost importance during unplanned outages. Even though you can rely on Oracle Clusterware to start the service during planned maintenance as well, it is safer to ensure that the service is available on an alternate instance by manually starting an alternate preferred instance ahead of time. Manually starting an alternate instance eliminates the single point of failure with a single preferred instance and you have the luxury to do this because it is a planned activity. Add at least a second preferred instance to the service definition and start the service before the planned maintenance. You can then stop the service on the instance where maintenance is being performed with the assurance that another service member is available. Adding one or more preferred instances does not have to be a permanent change. You can revert it back to the original service definition after performing the planned maintenance. Manually relocating a service rather than changing the service profile is advantageous in cases such as the following:
If you are using Oracle XA, then manual service relocation is advantageous. Although the XA specification allows a transaction branch to be suspended and resumed by the TM, if connection load balancing is used then a resumed connection could land on a different Oracle instance from the one on which the transaction branch started, and there is a performance implication if a single distributed transaction spans multiple database instances.
If an application is not designed to work properly with multiple service members, then application errors or performance issues can arise.
As with all configuration changes, you should test the effect of a service with multiple members to assess its viability and impact in a test environment before implementing the change in your production environment.
See Also: The Technical Article, "XA and Oracle controlled Distributed Transactions" on the Oracle Real Application Clusters Web site at http://www.oracle.com/technetwork/database/clustering/overview/index.html
The client is configured to receive FAN notifications and is properly configured for run time connection load balancing and Fast Connection Failover.
Oracle Clusterware stops services on the instance to be brought down or relocates services to an alternate instance.
Oracle Clusterware returns a Service-Member-Down event.
A client that is configured to receive FAN notifications receives a notification for a Service-Member-Down event and moves connections to other instances offering the service.
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for an Introduction to Automatic Workload Management.
Detailed information about client failover best practices in an Oracle RAC environment is available in the "Automatic Workload Management with Oracle Real Application Clusters 11g" Technical Article on the Oracle Technology Network at http://www.oracle.com/technetwork/database/clustering/overview/index.html
6.2.5 Connect to Database Using Services and Single Client Access Name (SCAN)
With Oracle Database 11g, application workloads can be defined as services so that they can be automatically or manually managed and controlled. For manually managed services, DBAs control which processing resources are allocated to each service during both normal operations and in response to failures. Performance metrics are tracked by service, and thresholds are set to automatically generate alerts should these thresholds be crossed. CPU resource allocations and resource consumption controls are managed for services using Database Resource Manager. Oracle tools and facilities such as Job Scheduler, Parallel Query, and Oracle Streams Advanced Queuing also use services to manage their workloads.
With Oracle Database 11g, you can define rules to automatically allocate processing resources to services. In Oracle Database 11g, Oracle RAC instances can be allocated to process individual services or multiple services, as needed. These allocation rules can be modified dynamically to meet changing business needs. For example, you could modify these rules at quarter end to ensure that there are enough processing resources to complete critical financial functions on time. You can also define rules so that when instances running critical services fail, the workload is automatically shifted to instances running less critical workloads. You can create and administer services with the SRVCTL utility or with Oracle Enterprise Manager.
You should make application connections to the database by using a VIP address (preferably SCAN) in combination with a service to achieve the greatest degree of availability and manageability. A VIP address is an alternate public address that client connections use instead of the standard public IP address.
If a node fails, then the node's VIP address fails over to another node but there is no listener listening on that VIP, so a client that attempts to connect to the VIP address receives a connection refused error (ORA-12541) instead of waiting for long TCP connect timeout messages. This error causes the client to quickly move on to the next address in the address list and establish a valid database connection. New client connections can initially try to connect to a failed-over VIP, but when there is no listener running on that VIP the "no listener" error message is returned to the clients. The clients traverse to the next address in the address list that has a non-failed-over VIP with a listener running on it.
The Single Client Access Name (SCAN) is a fully qualified name (hostname+domain) that is configured to resolve to all three of the VIP addresses allocated for the SCAN. The addresses resolve using Round Robin DNS either on the DNS server, or within the cluster in a GNS configuration. SCAN listeners can run on any node in the cluster, and multiple SCAN listeners can run on one node. In Oracle Database 11g Release 2 and later, instances register with SCAN listeners as remote listeners by default.
SCANs are cluster resources. SCAN VIPs and SCAN listeners run on random cluster nodes. SCANs provide location independence for the databases, so that client configuration does not have to depend on which nodes are running a particular database. For example, if you configure policy managed server pools in a cluster, then the SCAN enables connections to databases in these server pools regardless of which nodes are allocated to the server pool. A SCAN functions like a virtual cluster address. SCANs are resolved to three SCAN VIPs that may run on any node in the cluster. Unlike configurations that use a per-node VIP address as the entry point, clients connecting to the SCAN no longer require updates to how they connect when a virtual IP address is added, changed, or removed from the cluster. The SCAN addresses resolve to the cluster, rather than to a specific node address.
During Oracle Grid Infrastructure installation, SCAN listeners are created for as many IP addresses as there are addresses assigned to resolve to the SCAN. Oracle recommends that the SCAN resolves to three addresses, to provide high availability and scalability. If the SCAN resolves to three addresses, then there are three SCAN listeners created.
Oracle RAC provides failover with the node VIP addresses by configuring multiple listeners on multiple nodes to manage client connection requests for the same database service. If a node fails, then the service connecting to the VIP is relocated transparently to a surviving node. If the client or service is configured with transparent application failover options, then the client is reconnected to one of the surviving nodes.
When a SCAN listener receives a connection request, the SCAN listener checks for the least loaded instance providing the requested service. It then redirects the connection request to the local listener on the node where the least loaded instance is running. Subsequently, the client is given the address of the local listener.
The local listener finally creates the connection to the database instance.
Clients configured to use IP addresses for Oracle Database releases before Oracle Database 11g Release 2 can continue to use their existing connection addresses; using SCANs is not required. In this case, the pre-Database 11g Release 2 client would use a TNS connect descriptor that resolves to the node VIPs of the cluster. When an earlier version of Oracle Database is upgraded, it registers with the SCAN listeners, and clients can start using the SCAN to connect to that database. The database registers with the SCAN listener through the remote listener parameter in the init.ora file. The REMOTE_LISTENER parameter must be set to SCAN:PORT. Do not set it to a TNSNAMES alias with a single address with the SCAN as HOST=SCAN.
Having a single name to access the cluster allows clients to use the EZConnect client and the simple JDBC thin URL to access any database running in the cluster, regardless of which servers in the cluster the database is active on. For example:
sqlplus system/manager@sales1-scan:1521/oltp
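The REMOTE_LISTENER setting described above can be sketched as follows. The sketch reuses the SCAN name sales1-scan and port 1521 from the connect string example, and assumes the instances use a server parameter file. Substitute your own SCAN name and port.

```sql
-- Point all instances at the SCAN listeners using the SCAN:PORT form.
ALTER SYSTEM SET REMOTE_LISTENER='sales1-scan:1521' SCOPE=BOTH SID='*';

-- Ask the instance to register its services with the listeners immediately.
ALTER SYSTEM REGISTER;
```

The value is the SCAN name and port themselves rather than a TNSNAMES alias, which matches the guidance in the text.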
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for more information about automatic workload management
Oracle Real Application Clusters Administration and Deployment Guide for an Overview of Connecting to Oracle Database Using Services and VIP Addresses
Oracle Clusterware Administration and Deployment Guide for more information on Oracle Clusterware Network Configuration Concepts
Oracle Database Net Services Administrator's Guide for more information on EZConnect
Configures and enables server-side load balancing
Sets the remote listener parameter to the SCAN listener (Note: If you do not use DBCA, you should set the REMOTE_LISTENER database parameter to scan_name:scan_port.)
Creates a sample client-side load balancing connection definition in the tnsnames.ora file on the server
Note:
The following features do not work with the default database service. You must create cluster managed services to take advantage of these features. You can only manage the services that you create. Any service created automatically by the database server is managed by the database server.
To further enhance connection load balancing, use the Load Balancing Advisory and define the connection load balancing for each service. The Load Balancing Advisory provides information to applications about the current service levels that the database and its instances are providing. It makes recommendations to applications about where to direct application requests to obtain the best service based on the policy that you have defined for that service. Load Balancing Advisory events are published through ONS.
There are two types of service-level goals for runtime connection load balancing:
SERVICE_TIME: Attempts to direct work requests to instances according to response time. Load Balancing Advisory data is based on elapsed time for work done in the service plus available bandwidth to the service. SERVICE_TIME suits workloads such as internet shopping, where the rate of demand changes. The following example shows how to set the goal to SERVICE_TIME for connections using the online service:
srvctl modify service -d db_unique_name -s online -B SERVICE_TIME -j SHORT
THROUGHPUT: Attempts to direct work requests according to throughput. The load balancing advisory is based on the rate that work is completed in the service plus available bandwidth to the service. An example for the use of THROUGHPUT is for workloads such as batch processes, where the next job starts when the last job completes. The following example shows how to set the goal to THROUGHPUT for connections using the sjob service:
srvctl modify service -d db_unique_name -s sjob -B THROUGHPUT -j LONG
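The two commands above differ only in the service name and in the pair of goal values. The sketch below assembles the same command line from those inputs, using only the options that appear in the examples (-d, -s, -B, -j). The function name is hypothetical, and pairing SERVICE_TIME with SHORT and THROUGHPUT with LONG simply mirrors the two examples; it is an assumption of this sketch, not a rule stated here.

```python
# Connection load balancing goal (-j) paired with each runtime goal (-B),
# mirroring the two examples in the text. The pairing is an assumption.
CLB_GOAL_FOR = {
    "SERVICE_TIME": "SHORT",
    "THROUGHPUT": "LONG",
}


def srvctl_modify_service(db_unique_name, service, goal):
    """Return the srvctl command line that sets the goals for one service."""
    if goal not in CLB_GOAL_FOR:
        raise ValueError(f"goal must be one of {sorted(CLB_GOAL_FOR)}")
    return (
        f"srvctl modify service -d {db_unique_name} -s {service} "
        f"-B {goal} -j {CLB_GOAL_FOR[goal]}"
    )


if __name__ == "__main__":
    print(srvctl_modify_service("db_unique_name", "online", "SERVICE_TIME"))
    print(srvctl_modify_service("db_unique_name", "sjob", "THROUGHPUT"))
```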
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for more information on Connection Load Balancing and Load Balancing Advisory
Oracle Real Application Clusters Administration and Deployment Guide for information on Configuring Your Environment to Use the Load Balancing Advisory
Oracle Database Net Services Administrator's Guide for more information about configuring listeners
Oracle Database Reference for more information about the LOCAL_LISTENER and REMOTE_LISTENER parameters
6.2.7 Mirror Oracle Cluster Registry (OCR) and Configure Multiple Voting Disks with Oracle ASM
Configure Oracle Cluster Registry (OCR) and voting files to use Oracle ASM. Oracle recommends that you mirror OCR and configure multiple voting disks using an Oracle ASM high redundancy disk group when available. The OCR contains important configuration data about cluster resources, so always protect it, for example by using Oracle ASM redundant disk groups. Oracle Clusterware uses the OCR to store and manage information about the components that Oracle Clusterware controls, such as Oracle RAC databases, listeners, virtual IP addresses (VIPs), services, and any applications. To ensure cluster high availability when using a shared cluster file system, as opposed to when using Oracle ASM, Oracle recommends that you define multiple OCR locations. In addition:
You can have up to five OCR locations
Each OCR location must reside on shared storage that is accessible by all of the nodes in the cluster
You can replace a failed OCR location online if it is not the only OCR location
You must update OCR through supported utilities such as Oracle Enterprise Manager, the Server Control Utility (SRVCTL), the OCR configuration utility (OCRCONFIG), or the Database Configuration Assistant (DBCA)
Each OCR location must reside on shared storage that is accessible by all of the nodes in the cluster and the voting disk also must reside on shared storage. For high availability, Oracle recommends that you have multiple voting disks on multiple storage devices across different controllers, where possible. Oracle Clusterware enables multiple voting disks, but you must have an odd number of voting disks, such as three, five, and so on. If you define a single voting disk, then you should use external redundant storage to provide redundancy. Extended Oracle RAC requires a quorum (voting) disk that should be on an arbiter site at a location different from the main sites (data centers). For more information, see Section 7.3.2, "Add a Third Voting Disk to Host the Quorum Disk".
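The recommendation for an odd number of voting disks can be illustrated with simple majority arithmetic. The sketch below assumes that a node must be able to access a strict majority (more than half) of the configured voting disks to remain in the cluster; the function name and the returned keys are hypothetical and exist only for this illustration.

```python
def voting_disk_quorum(total_disks):
    """Return the majority size and tolerated disk failures for a disk count.

    Assumes a node needs access to more than half of the voting disks.
    """
    if total_disks < 1:
        raise ValueError("at least one voting disk is required")
    majority = total_disks // 2 + 1
    return {
        "majority": majority,
        "tolerated_failures": total_disks - majority,
        "odd_count": total_disks % 2 == 1,
    }


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, voting_disk_quorum(n))
```

Under this assumption, a fourth voting disk tolerates no more failures than three do, which is why the count moves from three directly to five.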
See Also:
Oracle Clusterware Administration and Deployment Guide for information on Adding, Replacing, Repairing, and Removing Oracle Cluster Registry Locations
Oracle Real Application Clusters Administration and Deployment Guide for more information about managing OCR and voting disks
Oracle Clusterware Administration and Deployment Guide for information on voting disks and Oracle Cluster Registry requirements
Oracle Database 2 Day + Real Application Clusters Guide for information About Setting the Time on All Nodes
Oracle Clusterware Administration and Deployment Guide for information on Cluster Time Synchronization Service (CTSS) and Cluster Time Management
6.2.9 Verify That Oracle Clusterware, Oracle RAC, and Oracle ASM Use the Same Interconnect Network
For efficient network detection and failover and optimal performance, Oracle Clusterware, Oracle RAC, and Oracle ASM should use the same dedicated interconnect subnet so that they share the same view of connections and accessibility. Perform the following steps to verify the interconnect subnet:
1. To verify the interconnect subnet for Oracle RAC and for Oracle ASM instances, do either of the following:
View instance information in the alert log to verify the interconnect subnet.
2.
See: Oracle Database Administrator's Guide for information on Viewing the Alert Log
See Also:
Oracle Clusterware Administration and Deployment Guide for information on Redundant Interconnect Usage with HAIP
Oracle Grid Infrastructure Installation Guide for your platform for information about defining interfaces
For more information, see "11gR2 Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip" in My Oracle Support Note 1210883.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1210883.1
6.2.11 Configure Failure Isolation with Intelligent Platform Management Interface (IPMI)
Failure isolation is a process by which a failed node is isolated from the rest of the cluster to prevent the failed node from corrupting data. The ideal fencing involves an external mechanism capable of restarting a problem node without cooperation either from Oracle Clusterware or from the operating system running on that node. To provide this capability, Oracle Clusterware 11g Release 2 (11.2) and later support the Intelligent Platform Management Interface (IPMI) specification, an industry-standard management protocol. Typically, you configure failure isolation using IPMI during Grid Infrastructure installation, when you are provided with the option of configuring IPMI from the Failure Isolation Support screen. If you do not configure IPMI during installation, then you can configure it after installation using the Oracle Clusterware Control utility (CRSCTL).
See:
Oracle Clusterware Administration and Deployment Guide for information about IPMI and for information on Configuring IPMI for Failure Isolation
from the application work when possible. For example, if a patch requires that a SQL script is run after all nodes have been patched, it is a best practice to run this script on the last node receiving the patch before allowing the application to start using that node. This technique ensures that the SQL script has full use of the operating system resources on the node and is less likely to affect the application. An example is the CATCPU.SQL script, which must be run after the CPU patch is installed on all nodes.
In addition to using the automatically created OCR backup files, you can use the -manualbackup option on the ocrconfig command to perform a manual backup on demand. For example, you can perform a manual backup before and after you make changes to the OCR, such as adding or deleting nodes from your environment, modifying Oracle Clusterware resources, or creating a database. The ocrconfig -manualbackup command exports the OCR content to a file. You can then back up the export files created by ocrconfig as a part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.
See Also:
Oracle Clusterware Administration and Deployment Guide for more information about backing up the OCR
7
Configuring Oracle Database with Oracle RAC
Configuring Oracle Database with Oracle RAC One Node
Configuring Oracle Database with Oracle RAC on Extended Clusters
See Also: Oracle Real Application Clusters Administration and Deployment Guide
In a single-instance environment, you can set the FAST_START_MTTR_TARGET initialization parameter to the number of seconds that crash recovery should take. Note that crash recovery time includes the time to start up, mount, recover, and open the database. Oracle provides several ways to help you understand the MTTR target your system is currently achieving and what your potential MTTR target could be, given the I/O capacity.
See Also: The MAA white paper "Best Practices for Optimizing Availability During Unplanned Outages Using Oracle Clusterware and Oracle Real Application Clusters" for more information from the MAA Best Practices area for Oracle Database at
http://www.oracle.com/goto/maa
See Also: Oracle Database VLDB and Partitioning Guide for information on Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL
The white paper about extended clusters on the Oracle Real Application Clusters Web site at http://www.oracle.com/technetwork/database/clustering/overview/index.html
Oracle Database High Availability Overview for a high-level overview, benefits, and a configuration example
7.3.1 Spread the Workload Evenly Across the Sites in the Extended Cluster
A typical Oracle RAC architecture is designed primarily as a scalability and availability solution that resides in a single data center. To build and deploy an Oracle RAC extended cluster, the nodes in the cluster are separated by greater distances. When configuring an Oracle RAC database for an extended cluster environment, you must:
Configure one set of nodes at Site A and another set of nodes at Site B.
Spread the cluster workload evenly across both sites to avoid introducing additional contention and latency into the design. For example, avoid client/server application workloads that run across sites, such that the client component is in site A and the server component is in site B.
Oracle Clusterware supports NFS, iSCSI, Direct Attached Storage (DAS), Storage Area Network (SAN) storage, and Network Attached Storage (NAS). If your system does not support NFS, use an alternative. For example, on Windows systems you can use iSCSI.
See Also: For more information, see the Technical Article "Using standard NFS to support a third voting file for extended cluster configurations" at
http://www.oracle.com/technetwork/database/clustering/overview/index.html
Distances less than 10 km can be deployed using normal network cables. Distances equal to or more than 10 km require Dense Wavelength Division Multiplexing (DWDM) links. Distances from 10 to 50 km require storage area network (SAN) buffer credits to minimize the performance impact due to the distance. Otherwise, the performance degradation due to the distance can be significant.
For distances greater than 50 km, there are not yet enough proof points to indicate the effect of deployments. More testing is needed to identify what types of workloads could be supported and what effect the chosen distance would have on performance.
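The distance guidance above can be summarized as a small lookup. This is a minimal sketch that restates the thresholds from the text (10 km and 50 km); the function name and the wording of the returned notes are hypothetical.

```python
def extended_cluster_guidance(distance_km):
    """Return the connectivity notes that apply to a given site separation."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    notes = []
    if distance_km < 10:
        notes.append("normal network cables")
    else:
        notes.append("DWDM links required")
    if 10 <= distance_km <= 50:
        notes.append("SAN buffer credits required")
    if distance_km > 50:
        notes.append("not enough proof points; test the workload")
    return notes


if __name__ == "__main__":
    for km in (5, 10, 35, 80):
        print(km, extended_cluster_guidance(km))
```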
7.3.4 Use Host-Based Storage Mirroring with Oracle ASM Normal or High Redundancy
Use host-based mirroring with Oracle ASM normal or high redundancy configured disk groups so that a storage array failure does not affect the application and database availability. Oracle recommends host-based mirroring using Oracle ASM to internally mirror across the two storage arrays. Implementing mirroring with Oracle ASM provides an active/active storage environment in which system write I/Os are propagated to both sets of disks, making the disks appear as a single set of disks that is independent of location. Do not use array-based mirroring because only one storage site is active, which makes the architecture vulnerable to this single point of failure and longer recovery times. The Oracle ASM volume manager provides flexible host-based mirroring redundancy options. You can choose to use external redundancy to defer the mirroring protection function to the hardware RAID storage subsystem. The Oracle ASM normal and high-redundancy options allow two-way and three-way mirroring, respectively.
Note:
Array-based mirroring can be used in an Oracle RAC extended cluster. However, with this approach the two mirror sites are in an active-passive configuration, so a complete outage results if one site fails, and service becomes available again only when the remaining mirror site is brought up. For this reason, array-based mirroring is not recommended from an HA perspective. To work with two active sites, use host-based mirroring.
Beginning with Oracle Database Release 11g, Oracle ASM includes a preferred read capability that ensures that a read I/O accesses the local storage instead of unnecessarily reading from a remote failure group. When you configure Oracle ASM failure groups in extended clusters, you can specify that a particular node reads from a failure group extent that is closest to the node, even if it is a secondary extent. This is especially useful in extended clusters where remote nodes have asymmetric access for performance, thus leading to better usage and lower network loading. The ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter value is a comma-delimited list of strings that specifies the failure groups that should be preferentially read by the given instance. This parameter is instance specific, and it is generally used only for clustered Oracle ASM instances. Its value can be different on different nodes. For example:
diskgroup_name1.failure_group_name1, ...
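Because the parameter value is instance specific, each site typically lists its own local failure group for every disk group. The sketch below builds the comma-delimited value in the diskgroup_name.failure_group_name form shown above. The function names, the disk group names DATA and FRA, and the failure group names SITEA and SITEB are hypothetical, and the sketch assumes that each disk group has one failure group per site, named after the site.

```python
def preferred_read_value(pairs):
    """Join (diskgroup, failure_group) pairs into the comma-delimited value."""
    return ",".join(f"{diskgroup}.{failure_group}" for diskgroup, failure_group in pairs)


def preferred_read_by_site(diskgroups, sites):
    """Map each site to the value its Oracle ASM instances would use.

    Assumes each disk group has one failure group per site, named after the site.
    """
    return {
        site: preferred_read_value((dg, site) for dg in diskgroups)
        for site in sites
    }


if __name__ == "__main__":
    for site, value in preferred_read_by_site(["DATA", "FRA"], ["SITEA", "SITEB"]).items():
        print(site, value)
```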
See Also:
Section 4.2, "Use Automatic Storage Management (Oracle ASM) to Manage Database Files"
Oracle Automatic Storage Management Administrator's Guide for information about configuring preferred read failure groups with the ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter
Network, storage, and management costs increase.
Write performance incurs the overhead of network latency. Test the workload performance to assess impact of the overhead.
Because this is a single database without Oracle Data Guard, there is no protection from data corruption or data failures.
The Oracle release, the operating system, and the clusterware used for an extended cluster all factor into the viability of extended clusters.
When choosing to mirror data between sites:
Host-based mirroring requires a clustered logical volume manager to allow active/active mirrors and thus a primary/primary site configuration. Oracle recommends using Oracle ASM as the clustered logical volume manager.
Array-based mirroring allows active/passive mirrors and thus a primary/secondary configuration.
Extended clusters need additional destructive testing, covering:
Site failure
Communication failure
For full disaster recovery, complement the extended cluster with a remote Data Guard standby database, because this architecture:
Maintains an independent physical replica of the primary database
Protects against regional disasters
Protects against data corruption and other potential failures
Provides options for performing rolling database upgrades and patch set upgrades
8
The proper configuration of Oracle Data Guard is essential to ensuring that all standby databases work properly and perform their roles within the necessary service levels after switchovers and failovers. The best practices for Oracle Data Guard build on the best practices described in Chapter 5, "Configuring Oracle Database." This chapter includes the following sections:
Oracle Data Guard Configuration Best Practices
Determine Protection Mode and Data Guard Transport
General Data Guard Configuration Best Practices
Oracle Data Guard Role Transition Best Practices
Use Oracle Active Data Guard Best Practices
Use Snapshot Standby Database Best Practices
Assessing Data Guard Performance
http://www.oracle.com/technetwork/middleware/goldengate/overview/index.html
Table 8-1 provides a summary of the Data Guard deployment options that are appropriate, depending on your requirements. Two or more options may be used in combination to address multiple requirements. This chapter also presents the best practices for implementing each option.
Table 8-1 Requirements and Data Guard Deployment Options
Requirement: Zero data loss protection and availability for Oracle Database
Data Guard Deployment Options: Data Guard Maximum Protection or Maximum Availability (SYNC transport) and Redo Apply (physical standby)
Requirement: Near-zero data loss (single-digit seconds) and availability for Oracle Database
Data Guard Deployment Options: Data Guard Maximum Performance (ASYNC transport) and Redo Apply
Requirement: Multi-site protection, including topology with local zero data loss standby for HA and remote asynchronous standby for geographic disaster recovery for Oracle Database
Requirement: Fastest possible database failover
Data Guard Deployment Options: Data Guard Fast-Start Failover with Oracle Data Guard broker for automatic failure detection and database failover. Automatic failover of accompanying client applications to the new production database is implemented using Oracle Fast Application Notification (FAN) and Oracle Client Failover Best Practices. For more information, see the MAA white paper "Client Failover Best Practices for Data Guard 11g Release 2" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Requirement: Offload read-only queries and/or fast incremental backups to a synchronized standby database. Use the standby database to automatically repair corrupt blocks, transparent to the application and user
Data Guard Deployment Options: Active Data Guard. Active Data Guard can be purchased in either of the following ways: (1) standalone as an option license for Oracle Database Enterprise Edition, or (2) included with an Oracle GoldenGate license.
Requirement: Pre-production testing
Data Guard Deployment Options: Snapshot Standby. A snapshot standby is a physical standby database that is temporarily open read/write for test and other read/write activity independent of primary database transactions. A snapshot standby is easily converted back into a synchronized standby database when testing is complete. Snapshot Standby is an included feature of Data Guard Redo Apply and is an ideal complement for Oracle Real Application Testing.
Requirement: Planned maintenance: certain platform migrations such as Windows to Linux, data center moves, patching and upgrading system software and/or Oracle Database
Data Guard Deployment Options: Data Guard switchover, planned role transition, using Redo Apply. Redo Apply and Standby-First Patch Apply for qualifying patches from 11.2.0.1 onward. SQL Apply and Data Guard Database Rolling Upgrades (10.1 onward). Data Guard Transient Logical Standby (Upgrades Made Easy) from 11.1.0.7 onward. For more information, see the MAA white paper, "Database Rolling Upgrades Made Easy by Using a Data Guard Physical Standby Database", from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Requirement: Data Protection for data residing outside of the Oracle Database
Data Guard Deployment Options: When practical, move operating system file system data into Oracle Database using Oracle Database File System (DBFS). Data Guard protects DBFS data in the same manner as any other Oracle data. Data that must remain in operating system files can be protected using Oracle ASM Cluster File System (Oracle ACFS) or storage mirroring, in conjunction with Data Guard.
Note:
Standby-First Patch allows you to apply a patch initially to a physical standby database while the primary database remains at the previous software release. This applies to certain types of patches only; it does not apply to Oracle patch sets and major release upgrades, for which you use the Data Guard transient logical standby method. Once you are satisfied with the change, perform a switchover to the standby database. If required, the fallback is to switch back. For more information, see "Oracle Patch Assurance - Data Guard Standby-First Patch Apply" in My Oracle Support Note 1265700.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265700.1
See Also:
Oracle Database High Availability Overview for a description of the high availability solutions and benefits provided by Oracle Data Guard and standby databases
Oracle Data Guard Concepts and Administration provides complete information about Oracle Data Guard
Oracle Data Guard Broker for information on the DGMGRL command-line interface
recommended. While both modes use Oracle Data Guard synchronous redo transport by default, there are differences in the rule sets that govern behavior at failover time, as described below. Oracle Data Guard synchronous redo transport, however, can impact primary database performance if round-trip network latency between the primary and standby databases is too great (latency is a function of distance and of how "clean" the network is). Testing this is easy, because a DBA can change protection modes and transport methods dynamically. If latency is too great, then use Oracle Data Guard Maximum Performance. Maximum Performance uses Oracle Data Guard asynchronous transport services and does not have any impact on primary database performance regardless of network latency. In an environment with sufficient bandwidth to accommodate redo volume, data loss potential is measured in single-digit seconds when using Maximum Performance. To determine the appropriate data protection mode for your application, consult Oracle Data Guard Concepts and Administration. Best practices for the protection mode:
Maximum Protection mode guarantees that no data loss will occur if the primary database fails, even in the case of multiple failures (for example, the network between the primary and standby fails, and then at a later time, the primary fails). This is enforced by never signaling commit success for a primary database transaction until at least one synchronous Data Guard standby has acknowledged that redo has been hardened to disk. Without such an acknowledgment the primary database will stall and eventually shut down rather than allow unprotected transactions to commit. To maintain availability in cases where the primary database is operational but the standby database is not, the best practice is to always have a minimum of two synchronous standby databases in a Maximum Protection configuration. Primary database availability is not impacted if it receives acknowledgment from at least one synchronous standby database.
Maximum Availability mode guarantees that no data loss will occur in cases where the primary database experiences the first failure to impact the configuration. Unlike the previous protection mode, Maximum Availability will wait a maximum of NET_TIMEOUT seconds for an acknowledgment from a standby database, after which it will signal commit success to the application and move to the next transaction. Primary database availability (thus the name of the protection mode) is not impacted by an inability to communicate with the standby (for example, due to standby or network outages). Oracle Data Guard will continue to ping the standby and automatically re-establish connection and resynchronize the standby database when possible, but during the period when primary and standby have diverged there will be data loss should a second failure impact the primary database. For this reason, it is a best practice to monitor protection level (simple to do using Enterprise Manager Grid Control) and quickly resolve any disruption in communication between primary and standby before a second failure can occur.
Maximum Performance mode (the default mode) provides the highest level of data protection that is possible without affecting the performance or the availability of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log at the primary database (the same behavior as if there were no standby database). Oracle Data Guard transmits redo to the standby database directly from the primary log buffer asynchronous to the local online redo log write. There is never any wait for standby acknowledgment. Similar to Maximum Availability, it is a best practice to monitor protection level (simple to do using Enterprise Manager Grid Control) and quickly resolve any disruption in communication between primary and standby before a second failure can occur.
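The commit behavior of the three protection modes described above can be condensed into a small decision function. This is a simplified model for illustration only, not Data Guard code: the function name and the outcome strings are hypothetical, and the single standby_acknowledged flag stands in for whether a synchronous standby confirmed the redo write in time.

```python
def commit_outcome(protection_mode, standby_acknowledged):
    """Describe what the primary does with a commit in each protection mode."""
    mode = protection_mode.upper()
    if mode == "MAXIMUM PERFORMANCE":
        # The primary never waits for a standby acknowledgment.
        return "commit without waiting for the standby"
    if mode == "MAXIMUM AVAILABILITY":
        if standby_acknowledged:
            return "commit with zero data loss protection"
        # The primary waits at most NET_TIMEOUT seconds, then proceeds.
        return "commit after NET_TIMEOUT; unprotected until resynchronized"
    if mode == "MAXIMUM PROTECTION":
        if standby_acknowledged:
            return "commit with zero data loss protection"
        # No unprotected commit is allowed in this mode.
        return "stall and eventually shut down the primary"
    raise ValueError(f"unknown protection mode: {protection_mode}")


if __name__ == "__main__":
    for mode in ("Maximum Protection", "Maximum Availability", "Maximum Performance"):
        for acked in (True, False):
            print(f"{mode:22} ack={acked!s:5} -> {commit_outcome(mode, acked)}")
```

The model makes the trade-off visible: only Maximum Protection gives up primary availability when no standby acknowledges, which is why a second synchronous standby is the best practice for that mode.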
See Also: Oracle Data Guard Concepts and Administration for information on Data Guard Protection Modes
Use the SYNC redo transport mode for a high degree of synchronization between the primary and standby databases. Use SYNC redo transport for zero data loss protection where performance service levels can tolerate the impact caused by network latency. Use the ASYNC redo transport mode for minimal impact on the primary database, but with a lower degree of synchronization. Use ASYNC redo transport when zero data loss protection is not required or when the performance impact caused by network latency makes it impractical to use SYNC. Optimize network throughput following the best practices described in Section 8.2.2, "Assess Performance with Proposed Network Configuration".
Sufficient bandwidth to accommodate the maximum redo generation rate
If using the SYNC transport, then minimal latency is necessary to reduce the performance impact on the primary database
Multiple network paths for network redundancy
In configurations that use a dedicated network connection, the required bandwidth is determined by the maximum redo rate of the primary database and the efficiency of the network. Depending on the data protection mode, there are other recommended practices and performance considerations. Maximum protection mode and maximum availability mode require SYNC transport. The maximum performance protection mode uses ASYNC redo transport. Use ASYNC redo transport when data loss can be tolerated or when the performance impact caused by network latency makes it impractical to use SYNC (use SYNC redo transport for zero data loss protection). Unlike the ASYNC transport mode, the SYNC transport mode can affect the primary database performance due to the incurred network latency. Distance and network configuration directly influence latency, and high latency can reduce the potential transaction throughput and lengthen response time. The network configuration, number of repeaters, the overhead of protocol conversions, and the number of routers also affect the overall network latency and transaction response time.
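As a rough illustration of how the maximum redo rate and the network efficiency combine, the sketch below converts a redo rate in bytes per second into a required bandwidth in megabits per second. The function name is hypothetical, and the default efficiency factor of 0.7 is an assumption chosen for the example; measure the redo rate and the achievable efficiency in your own environment.

```python
def required_bandwidth_mbps(max_redo_bytes_per_sec, network_efficiency=0.7):
    """Estimate the network bandwidth (Mbps) needed to ship redo at peak rate.

    network_efficiency is the usable fraction of the link (assumed value).
    """
    if max_redo_bytes_per_sec < 0:
        raise ValueError("redo rate must not be negative")
    if not 0 < network_efficiency <= 1:
        raise ValueError("network efficiency must be in the range (0, 1]")
    bits_per_sec = max_redo_bytes_per_sec * 8
    return bits_per_sec / network_efficiency / 1_000_000


if __name__ == "__main__":
    # A peak redo rate of 700,000 bytes per second needs roughly 8 Mbps
    # when 70 percent of the link is usable.
    print(round(required_bandwidth_mbps(700_000), 2))
```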
8.3.1 Use Oracle Data Guard Broker with Oracle Data Guard
Use Oracle Data Guard broker to create, manage, and monitor a Data Guard configuration. You can perform all Data Guard management operations locally or remotely through the Oracle Data Guard broker's easy-to-use interfaces: the Data Guard management pages in Oracle Enterprise Manager, which is the broker's graphical user interface (GUI), and the Data Guard command-line interface called DGMGRL. The broker's interfaces improve usability and centralize management and monitoring of the Data Guard configuration. Available as a feature of the Enterprise Edition and Personal Edition of the Oracle database, the broker is also integrated with the Oracle database and Oracle Enterprise Manager. The benefits of using Oracle Data Guard broker include:
Enhanced disaster protection.
Higher availability and scalability with Oracle Real Application Clusters (Oracle RAC) Databases.
Automated creation of a Data Guard configuration.
Easy configuration of additional standby databases.
Simplified, centralized, and extended management.
Simplified switchover and failover operations.
Fast Application Notification (FAN) after failovers.
Built-in monitoring and alert and control mechanisms.
Transparent to application.
See Also: Oracle Data Guard Broker for more information on the benefits of using Data Guard Broker
You can either create a standby database from backups of your primary database, or create a standby database over the network:
Use the RMAN DUPLICATE TARGET DATABASE FOR STANDBY command to create a standby database from backups of your primary database. You can use any backup copy of the primary database to create the physical standby database if the necessary archived redo log files to completely recover the database are accessible by the server session on the standby host. RMAN restores the most recent datafiles unless you execute the SET UNTIL command.
Use the RMAN FROM ACTIVE DATABASE option to create the standby database over the network if a preexisting database backup is not accessible to the standby system. RMAN copies the data files directly from the primary database to the standby database. The primary database must be mounted or open.
You must choose between active and backup-based duplication. If you do not specify the FROM ACTIVE DATABASE option, then RMAN performs backup-based duplication. Creating a standby database over the network is advantageous because:
You can transfer redo data directly to the remote host over the network without first having to go through the steps of performing a backup on the primary database. (Restoration requires multiple steps, including storing the backup locally on the primary database, transferring the backup over the network, storing the backup locally on the standby database, and then restoring the backup on the standby database.)
With active duplication, you can back up a database (as it is running) from Oracle ASM, restore the backup to a host over the network, and place the files directly into Oracle ASM. Before this feature, restoration required you to back up the primary and copy the backup files to the primary host file system, transfer the backup files over the network, place the backup files on the standby host file system, and then restore the files into Oracle ASM.
See Also:
Oracle Data Guard Concepts and Administration for information on using RMAN to Back Up and Restore Files
Oracle Data Guard Concepts and Administration for information about Creating a Standby Database with Recovery Manager
Oracle Database Backup and Recovery User's Guide
load performance with NOLOGGING operations, then you must ensure that the corresponding physical standby data files are subsequently synchronized. To synchronize the physical standby data files, either apply an incremental backup created from the primary database or replace the affected standby data files with a backup of the primary data files taken after the nologging operation. Before the file transfer, you must stop Redo Apply on the physical standby database. You can enable force logging immediately by issuing an ALTER DATABASE FORCE LOGGING statement. If you specify FORCE LOGGING, then Oracle waits for all ongoing unlogged operations to finish.
See Also:
Oracle Database Administrator's Guide for information on Specifying FORCE LOGGING Mode
Oracle Data Guard Concepts and Administration for information on Enable Forced Logging
Each database uses a fast recovery area. The primary database instances archive remotely to only one apply instance.
Table 82 describes the recommendations for a robust archiving strategy when managing a Data Guard configuration through SQL*Plus. All of the following items are handled automatically when Oracle Data Guard broker is managing a configuration.
Table 82 Archiving Recommendations

Description: Maintaining a standby database requires that you enable and start archiving on the primary database, as follows:
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;

Archiving must also be enabled on the standby database to support role transitions. To enable archiving on the standby database:
SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
Table 82 (Cont.) Archiving Recommendations

Recommendation: Use a consistent log format (LOG_ARCHIVE_FORMAT).
Description: The LOG_ARCHIVE_FORMAT parameter should specify the thread, sequence, and resetlogs ID attributes, and the parameter settings should be consistent across all instances. For example:
LOG_ARCHIVE_FORMAT=arch_%t_%S_%r.arc
Note: If the fast recovery area is used, then this format is ignored.

Recommendation: Perform remote archiving to only one standby instance and node for each Oracle RAC standby database.
Description: All primary database instances archive to one standby destination, using the same net service name. Oracle Net Services connect-time failover is used to automatically switch to the "secondary" standby host when the "primary" standby instance has an outage. If the archives are accessible from all nodes because Oracle ASM or some other shared file system is being used for the fast recovery area, then remote archiving can be spread across the different nodes of an Oracle RAC standby database.

Recommendation: Specify role-based destinations with the VALID_FOR attribute.
Description: The VALID_FOR attribute enables you to configure destination attributes for both the primary and the standby database roles in one server parameter file (SPFILE), so that the Data Guard configuration operates properly after a role transition. This simplifies switchovers and failovers by removing the need to enable and disable the role-specific parameter files after a role transition.
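The LOG_ARCHIVE_FORMAT recommendation above can be checked mechanically. The following is a minimal Python sketch, not an Oracle utility: the function names and the instance names in the usage lines are hypothetical, and it assumes the format elements %t or %T (thread), %s or %S (sequence), and %r (resetlogs ID).

```python
# Illustrative sketch only: helper names are hypothetical, not part of any Oracle tool.
# Assumed format elements: %t/%T = thread, %s/%S = sequence, %r = resetlogs ID.
REQUIRED_ATTRIBUTES = {
    "thread": ("%t", "%T"),
    "sequence": ("%s", "%S"),
    "resetlogs ID": ("%r",),
}


def missing_format_attributes(log_archive_format):
    """Return the names of required attributes absent from the format string."""
    return [
        name
        for name, tokens in REQUIRED_ATTRIBUTES.items()
        if not any(token in log_archive_format for token in tokens)
    ]


def formats_are_consistent(formats_by_instance):
    """Return True when every instance uses the same LOG_ARCHIVE_FORMAT value."""
    return len(set(formats_by_instance.values())) <= 1


# Usage with the format string from the table and two hypothetical instance names.
recommended = "arch_%t_%S_%r.arc"
print(missing_format_attributes(recommended))
print(formats_are_consistent({"SALES1": recommended, "SALES2": recommended}))
```

A check like this could run against the parameter values collected from each instance before a role transition is attempted.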
The following example illustrates the recommended initialization parameters for a primary database communicating to a physical standby database. There are two instances, SALES1 and SALES2, running in maximum protection mode.
*.DB_RECOVERY_FILE_DEST=+RECO
*.LOG_ARCHIVE_DEST_1='SERVICE=SALES_stby SYNC AFFIRM NET_TIMEOUT=30 REOPEN=300 VALID_FOR=(ONLINE_LOGFILES, ALL_ROLES) DB_UNIQUE_NAME=SALES_stby'
*.LOG_ARCHIVE_DEST_STATE_1=ENABLE
The fast recovery area must be accessible to any node within the cluster and use a shared file system technology such as Oracle Automatic Storage Management (Oracle ASM), a cluster file system, a global file system, or a high availability network file system (HA NFS). You can also mount the file system manually to any node within the cluster very quickly. This is necessary for recovery because all archived redo log files must be accessible on all nodes.
On the standby database nodes, recovery from a different node is required when a failure occurs on the node applying redo and the apply service cannot be restarted. In that case, any of the existing standby instances residing on a different node can initiate managed recovery. In the worst case, when the standby archived redo log files are inaccessible, the managed recovery process (MRP) on the different node uses the FAL server to fetch the archived redo log files directly from the primary node.
When configuring hardware vendor shared file system technology, verify the performance and availability implications. Investigate the following issues before adopting this strategy:
Is the shared file system accessible by any node regardless of the number of node failures?
What is the performance impact when implementing a shared file system?
Is there any effect on the interconnect traffic?
(maximum # of logfile groups + 1) * maximum # of threads
For example, if a primary database has two instances (threads) and each thread has three online log groups, then there should be eight standby redo logs ((3 + 1) * 2 = 8). This reduces the likelihood that the logwriter process for the primary instance is blocked because a standby redo log cannot be allocated on the standby database.
The statements in Example 81 create two standby log members for each group, and each member is 1 GB. One member is created in the directory specified by the DB_CREATE_FILE_DEST initialization parameter, and the other member is created in the directory specified by the DB_RECOVERY_FILE_DEST initialization parameter. Because this example assumes that there are three online redo log groups in two threads, the next group is group seven.
Example 81 Create Standby Log Members
SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
  GROUP 7 SIZE 1G,
  GROUP 8 SIZE 1G,
  GROUP 9 SIZE 1G,
  GROUP 10 SIZE 1G;
SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 2
  GROUP 11 SIZE 1G,
  GROUP 12 SIZE 1G,
  GROUP 13 SIZE 1G,
  GROUP 14 SIZE 1G;
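The sizing formula and the group numbering used in the example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an Oracle tool: the function name is hypothetical, and it assumes the online redo log groups are numbered contiguously from 1 so that the first standby group follows the last online group.

```python
from typing import Dict, List


def standby_log_plan(online_groups_per_thread: int, threads: int) -> Dict[int, List[int]]:
    """Map each thread to its standby redo log group numbers.

    Applies (maximum # of logfile groups + 1) * maximum # of threads.
    Assumption: online groups are numbered 1..(groups * threads) with no gaps.
    """
    if online_groups_per_thread < 1 or threads < 1:
        raise ValueError("group and thread counts must be positive")
    per_thread = online_groups_per_thread + 1
    next_group = online_groups_per_thread * threads + 1
    plan: Dict[int, List[int]] = {}
    for thread in range(1, threads + 1):
        plan[thread] = list(range(next_group, next_group + per_thread))
        next_group += per_thread
    return plan


# Two threads with three online groups each, as in the text.
plan = standby_log_plan(3, 2)
total_standby_logs = sum(len(groups) for groups in plan.values())
print(total_standby_logs, plan)
```

For the configuration in the text, the plan yields eight standby redo logs, starting at group 7 for thread 1 and group 11 for thread 2.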
Consider the following additional guidelines when creating standby redo logs:
Create the same number of standby redo logs on both the primary and standby databases.
Create all online redo logs and standby redo logs for both primary and standby databases so that they are the same size.
Create standby redo logs in the first available ASM high redundancy disk group, or ensure that the logs are protected using external storage redundancy.
In an Oracle RAC environment, create standby redo logs on a shared disk.
In an Oracle RAC environment, assign a thread when the standby redo log is created as described in Example 81.
Do not multiplex the standby redo logs.
To check the number and group numbers of the redo logs, query the V$LOG view:
SQL> SELECT * FROM V$LOG;
To check the results of the ALTER DATABASE ADD STANDBY LOGFILE THREAD statements, query the V$STANDBY_LOG view:
SQL> SELECT * FROM V$STANDBY_LOG;
You can also see the members created by querying the V$LOGFILE view:
SQL> SELECT * FROM V$LOGFILE;
See Also: Oracle Data Guard Concepts and Administration for information about Configuring an Oracle Database to Receive Redo Data
8.3.7 Use Data Guard Transport and Network Configuration Best Practices
The best practices for Data Guard transport and network configuration include:
Set the LOG_ARCHIVE_MAX_PROCESSES Parameter
Set the Network Configuration and Highest Network Redo Rates
You can adjust these parameter settings after evaluating and testing the initial settings in your production environment.
See Also: Oracle Database Administrator's Guide for more information on Adjusting the Number of Archiver Processes
8.3.7.2 Set the Network Configuration and Highest Network Redo Rates
The best practices for Data Guard network configuration and redo rates include:
Properly Configure TCP Send / Receive Buffer Sizes
Increase SDU Size
Set TCP.NODELAY to YES
Determine When to Use Redo Transport Compression
Properly Configure TCP Send / Receive Buffer Sizes
To achieve high network throughput, especially for a high-latency, high-bandwidth network, the minimum recommended setting for the sizes of the TCP send and receive socket buffers is the bandwidth-delay product (BDP) of the network link between the primary and standby systems. BDP is the product of the network bandwidth and latency. Settings higher than the BDP may show incremental improvement. For example, in the MAA Linux test lab, simulated high-latency, high-bandwidth networks realized small, incremental increases in throughput when using TCP send and receive socket buffer settings up to three times the BDP.
Socket buffer sizes are set using the Oracle Net parameters RECV_BUF_SIZE and SEND_BUF_SIZE, so that the socket buffer size setting affects only Oracle TCP connections. The operating system may impose limits on the socket buffer size that must be adjusted so Oracle can use larger values. For example, on Linux, the parameters net.core.rmem_max and net.core.wmem_max limit the socket buffer size and must be set larger than RECV_BUF_SIZE and SEND_BUF_SIZE.
Set the send and receive buffer sizes at either the value you calculated or 10 MB, whichever is larger. For example, if bandwidth is 622 Mbits per second and latency is 30 ms, then you would calculate the minimum size for the RECV_BUF_SIZE and SEND_BUF_SIZE parameters as follows: 622,000,000 / 8 x 0.030 = 2,332,500 bytes. Then, multiply the BDP by 3: 2,332,500 x 3 = 6,997,500 bytes. In this example, you would set the Oracle Net parameters as follows:
RECV_BUF_SIZE=6997500
SEND_BUF_SIZE=6997500

Increase SDU Size
With Oracle Net Services it is possible to control data transfer by adjusting the size of the Oracle Net setting for the session data unit (SDU). Oracle internal testing has shown that setting the SDU to its maximum value of 65535 can improve performance for the SYNC transport. You can set SDU on a per connection basis using the SDU parameter in the local naming configuration file (TNSNAMES.ORA) and the listener configuration file (LISTENER.ORA), or you can set the SDU for all Oracle Net connections with the profile parameter DEFAULT_SDU_SIZE in the SQLNET.ORA file. Note that the ASYNC transport uses the new streaming protocol and increasing the SDU size from the default has no performance benefit.
See Also: Oracle Database Net Services Reference for more information on the SDU and DEFAULT_SDU_SIZE parameters
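The socket buffer arithmetic described under "Properly Configure TCP Send / Receive Buffer Sizes" can be sketched as follows. This is a minimal Python illustration, not an Oracle utility: the function names are hypothetical, integer arithmetic is used to avoid floating-point rounding, and the floor_bytes argument is an optional assumption that models the "10 MB, whichever is larger" guidance (taken here as 10 * 1024 * 1024 bytes). It is left at 0 by default so that the result matches the worked example in the text.

```python
def bdp_bytes(bandwidth_bits_per_sec: int, latency_ms: int) -> int:
    """Bandwidth-delay product in bytes, rounded up to a whole byte."""
    bit_milliseconds = bandwidth_bits_per_sec * latency_ms
    # Divide by 8 (bits per byte) and by 1000 (ms per second) in one step.
    return (bit_milliseconds + 7999) // 8000


def socket_buffer_bytes(bdp: int, multiplier: int = 3, floor_bytes: int = 0) -> int:
    """Candidate value for RECV_BUF_SIZE and SEND_BUF_SIZE.

    multiplier=3 mirrors the worked example; floor_bytes is an optional
    lower bound (an assumption for illustration).
    """
    return max(bdp * multiplier, floor_bytes)


# Worked example from the text: 622 Mbits per second, 30 ms latency.
bdp = bdp_bytes(622_000_000, 30)
print(bdp, socket_buffer_bytes(bdp))
```

Passing floor_bytes=10 * 1024 * 1024 returns the floor instead whenever three times the BDP is smaller than it, which is the case for this particular link.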
Set TCP.NODELAY to YES
To preempt delays in buffer flushing in the TCP protocol stack, disable the TCP Nagle algorithm by setting TCP.NODELAY to YES in the SQLNET.ORA file on both the primary and standby systems.
See Also: Oracle Database Net Services Reference for more information about the TCP.NODELAY parameter
Determine When to Use Redo Transport Compression
In Oracle Database 11g Release 2 (11.2.0.2) redo transport compression is no longer limited to compressing redo data only when a redo gap is being resolved. When compression is enabled for a destination, all redo data sent to that destination is compressed. In general, compression is most beneficial when used over low bandwidth networks. As the network bandwidth increases, the benefit is reduced. Compressing redo in a Data Guard environment is beneficial if:
Sufficient CPU resources are available for the compression processing.
The database redo rate is being throttled by a low bandwidth network.
Before enabling compression, assess the available CPU resources and decide if enabling compression is feasible. For complete information about enabling compression, see "Redo Transport Compression in a Data Guard Environment" in My Oracle Support Note 729551.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=729551.1
Maximize I/O Rates on Standby Redo Logs and Archived Redo Logs
Assess Recovery Rate
Use DB_BLOCK_CHECKING=OFF and Set DB_BLOCK_CHECKSUM=FULL
Set DB_CACHE_SIZE to a Value Greater than on the Primary Database
Assess Database Wait Events
Tune I/O Operations
Assess System Resources
See Also:
The MAA white paper "Active Data Guard 11g Best Practices (includes best practices for Redo Apply)" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
8.3.8.1 Maximize I/O Rates on Standby Redo Logs and Archived Redo Logs
Measure read I/O rates on the standby redo logs and archived redo log directories. Concurrent writing of shipped redo on a standby database might reduce the redo read rate due to I/O saturation. The overall recovery rate is always bounded by the rate at which redo can be read; so ensure that the redo read rate surpasses your required recovery rate.
If your ACTIVE APPLY RATE is greater than the maximum redo generation rate at the primary database or twice the average generation rate at the primary database, then no tuning is required; otherwise follow the tuning tips below. The redo generation rate for the primary database can be monitored from Enterprise Manager or extracted from AWR reports under statistic REDO SIZE. If CHECKPOINT TIME PER LOG is greater than ten seconds, then investigate tuning I/O and checkpoints.
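The decision rule above can be written down as a small check. The following Python sketch is illustrative only: the function names are hypothetical, all three rates must be in the same unit, and it reads the rule as "no tuning is required when the apply rate exceeds either the maximum redo generation rate or twice the average rate", which is one interpretation of the sentence.

```python
def redo_apply_needs_tuning(active_apply_rate: float,
                            max_redo_rate: float,
                            avg_redo_rate: float) -> bool:
    """Return True when the apply rate clears neither threshold.

    All rates must use the same unit (for example, KB per second).
    """
    keeps_up = (active_apply_rate > max_redo_rate
                or active_apply_rate > 2 * avg_redo_rate)
    return not keeps_up


def checkpoint_needs_investigation(checkpoint_time_per_log_sec: float) -> bool:
    """Return True when CHECKPOINT TIME PER LOG exceeds ten seconds."""
    return checkpoint_time_per_log_sec > 10


# Hypothetical measurements, all in KB per second.
print(redo_apply_needs_tuning(3500, max_redo_rate=4000, avg_redo_rate=1500))
print(checkpoint_needs_investigation(12.5))
```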
To check for block corruption that was not preventable through the DB_BLOCK_CHECKING parameter, use:
RMAN BACKUP command with the VALIDATE option
DBVERIFY utility
ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE SQL statement
The default setting for DB_BLOCK_CHECKSUM is TYPICAL. Block checksum should always be enabled for both primary and standby databases. It catches most block corruption while incurring negligible overhead.
Set the DB_LOST_WRITE_PROTECT parameter to FULL on the standby database to enable Oracle to detect writes that are lost in the I/O subsystem. The impact on redo apply is very small for OLTP applications and generally less than 5 percent.
See Also:
Table 83 (Cont.) Parallel Recovery Coordinator Wait Events

Wait Name: Datafile init write
Description: The parallel recovery coordinator is waiting for a file resize to finish, as would occur with file auto extend.
Tuning: Tune or increase the I/O bandwidth for the ASM diskgroup where datafiles reside.

Description: The coordinator has sent synchronous control messages to all slaves, and is waiting for all slaves to reply.
Tuning: This is a non-tunable event.
When dealing with recovery slave events, it is important to know how many slaves were started. Divide the wait time for any recovery slave event by the number of slaves. Table 84 describes the parallel recovery slave wait events.
Table 84 Wait Name Parallel recovery slave next change Parallel Recovery Slave Wait Events Description The parallel recovery slave is waiting for a change to be shipped from the coordinator. This is in essence an idle event for the recovery slave. To determine the amount of CPU a recovery slave is using, divide the time spent in this event by the number of slaves started and subtract that value from the total elapsed time. This may be close, because there are some waits involved. A parallel recovery slave (or serial recovery process) is waiting for a batch of synchronous data block reads to complete. Recovery is waiting for checkpointing to complete, and Redo Apply is not applying any changes currently. A parallel recovery slave is waiting for a batched data block I/O. Tuning Tune or increase the I/O bandwidth for the ASM diskgroup where the archive logs or online redo logs reside.
Tune or increase the I/O bandwidth for the ASM diskgroup where datafiles reside. Tune or increase the I/O bandwidth for the ASM diskgroup where datafiles reside. Tune or increase the I/O bandwidth for the ASM diskgroup where datafiles reside.
Checkpoint completed
Recovery read
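The per-slave arithmetic described for recovery slave wait events can be sketched as follows. This is a rough Python illustration with hypothetical function names and made-up sample numbers. As the table notes, the CPU figure is only an approximation because other waits are involved.

```python
def per_slave_wait_seconds(total_wait_seconds: float, slaves_started: int) -> float:
    """Average wait per recovery slave: total event wait time / number of slaves."""
    if slaves_started < 1:
        raise ValueError("at least one recovery slave must have been started")
    return total_wait_seconds / slaves_started


def estimated_slave_cpu_seconds(next_change_wait_seconds: float,
                                slaves_started: int,
                                elapsed_seconds: float) -> float:
    """Approximate CPU time per slave.

    Subtracts the per-slave idle time (the 'Parallel recovery slave next change'
    wait) from the total elapsed time.
    """
    idle = per_slave_wait_seconds(next_change_wait_seconds, slaves_started)
    return elapsed_seconds - idle


# Hypothetical sample: 8 slaves, 2400 s of summed idle wait, 600 s elapsed.
print(estimated_slave_cpu_seconds(2400, 8, 600))
```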
1. If there are I/O bottlenecks or excessive wait I/O operations, then investigate operational or application changes that increased the I/O volume. If the high waits are due to insufficient I/O bandwidth, then add more disks to the relevant Oracle ASM disk group. Verify that this is not a bus or controller bottleneck or any other I/O bottleneck. The read I/O rate from the standby redo log should be greater than the expected recovery rate.
2. Check for excessive swapping or memory paging.
3. Check to ensure the recovery coordinator or MRP is not CPU bound during recovery.
To provide continuous protection following failover
The standby databases in a multiple standby configuration that are not the target of the role transition (these databases are referred to as bystander standby databases) automatically apply redo data received from the new primary database.

To achieve zero data loss protection while also guarding against widespread geographic disasters that extend beyond the limits of synchronous communication
For example, one standby database that receives redo data synchronously is located 200 miles away, and a second standby database that receives redo data asynchronously is located 1,500 miles away from the primary.

To perform rolling database upgrades while maintaining disaster protection throughout the rolling upgrade process

To perform testing and other ad-hoc tasks while maintaining disaster-recovery protection
Use Multiple Standby Databases Best Practices
The Oracle Database High Availability Overview describes how a multiple standby database architecture is virtually identical to that of single standby database architectures. Therefore, the configuration guidelines for implementing multiple standby databases described in this section complement the existing best practices for physical and logical standby databases. When deploying multiple standby databases, use the following best practices:
Use Oracle Data Guard broker (described in Chapter 12, "Monitoring for High Availability") to manage your configuration and perform role transitions. However, if you choose to use SQL*Plus statements, see the MAA white paper "Multiple Standby Databases Best Practices" for best practices from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
If you are using Flashback Database for the sole purpose of reinstating databases following a failover, a DB_FLASHBACK_RETENTION_TARGET of 120 minutes is the minimum recommended value. When you use Flashback Database with fast-start failover to quickly reinstate the original primary as the standby after a failover, instead of re-creating the entire standby database from backups or from the primary database, ensure that the UNDO_RETENTION and DB_FLASHBACK_RETENTION_TARGET initialization parameters are set to a minimum of 120 so that reinstatement is still possible after a prolonged outage. On a standby, the flashback barrier cannot be guaranteed to be published every 30 minutes as it is on a primary. Thus, when enabling Flashback Database on a standby, DB_FLASHBACK_RETENTION_TARGET should be a minimum of 120. Because the primary and standby should match, the same minimum applies to the primary.
Enable supplemental logging in configurations containing logical standby databases. When creating a configuration with both physical and logical standby databases, issue the ALTER DATABASE ADD SUPPLEMENTAL LOG DATA statement to enable supplemental logging in the following situations:
When adding a logical standby database to an existing configuration consisting of all physical standby databases, you must enable supplemental logging on all existing physical standby databases in the configuration.
When adding a physical standby database to an existing configuration that contains a logical standby database, you must enable supplemental logging on the physical standby database when you create it.
As part of the logical standby database creation supplemental logging is automatically enabled on the primary. Enabling supplemental logging is a control file change and therefore the change is not propagated to each physical standby database. Supplemental logging is enabled automatically on a logical standby database when it is first converted from a physical standby database to a logical standby database as part of the dictionary build process. To enable supplemental logging, issue the following SQL*Plus statement when connected to a physical standby database:
SQL> ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY, UNIQUE INDEX) COLUMNS;
If logical standby databases are not configured to perform real-time queries, then consider configuring SQL Apply to delay applying redo data to the logical standby database. By delaying the application of redo, you can minimize the need to manually reinstate the logical standby database after failing over to a physical standby database. To set a time delay, use the DELAY=minutes attribute of the LOG_ARCHIVE_DEST_n initialization parameter.
See Also:
Oracle Database High Availability Overview to learn about the benefits of using multiple standby databases and for implementation examples
Oracle Database High Availability Overview for an overview of multiple standby database architectures
The MAA white paper "Multiple Standby Databases Best Practices" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Using Oracle Enterprise Manager
Using the Oracle Data Guard broker's DGMGRL command-line interface
Issuing SQL statements, as described in Section 14.2.1.3, "How to Perform Data Guard Switchover"
See Also: Oracle Data Guard Broker for information about using Oracle Enterprise Manager or Oracle Data Guard broker's DGMGRL command-line interface to perform database switchover
To optimize switchover processing, perform the following steps before performing a switchover:
Disconnect all sessions possible using the ALTER SYSTEM KILL SESSION SQL*Plus command.
Stop job processing by setting the AQ_TM_PROCESSES parameter to 0.
Cancel any specified apply delay by using the NODELAY keyword to stop and restart log apply services on the standby database.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE NODELAY;
You can view the current delay setting on the primary database by querying the DELAY_MINS column of the V$ARCHIVE_DEST view.
For physical standby databases in an Oracle RAC environment, ensure there is only one instance active for each primary and standby database.
Configure the standby database to use real-time apply and, if possible, ensure the databases are synchronized before the switchover operation. For the fastest switchover, use real-time apply so that redo data is applied to the standby database as soon as it is received and the standby database is synchronized with the primary database before the switchover operation. To enable real-time apply, use the following SQL*Plus statement:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT USING CURRENT LOGFILE;
For a physical standby database, reduce the number of archiver (ARCn) processes to the minimum needed for both remote and local archiving. Additional archiver processes can take additional time to shut down, thereby increasing the overall time it takes to perform a switchover. After the switchover has completed you can reenable the additional archiver processes.
Set the LOG_FILE_NAME_CONVERT initialization parameter to any valid value for the environment, or if it is not needed set the parameter to null. As part of a switchover, the standby database must clear the online redo log files on the standby database before opening as a primary database. The time needed to complete the I/O can significantly increase the overall switchover time. By setting the LOG_FILE_NAME_CONVERT parameter, the standby database can pre-create the online redo logs the first time the MRP process is started. You can also pre-create empty online redo logs by issuing the SQL*Plus ALTER DATABASE CLEAR LOGFILE statement on the standby database.
See Also: Support notes for switchover best practices for Data Guard Physical Standby (11.2.0.2):
If using SQL*Plus, see "11.2 Data Guard Physical Standby Switchover Best Practices using SQL*Plus" in My Oracle Support Note 1304939.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1304939.1
If using the Oracle Data Guard broker or Oracle Enterprise Manager, see "11.2 Data Guard Physical Standby Switchover Best Practices using the Broker" in My Oracle Support Note 1305019.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1305019.1
The MAA white paper "Switchover and Failover Best Practices" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Comparing Fast-Start Failover and Manual Failover
Failover Best Practices (Manual Failover and Fast-Start Failover)
Fast-Start Failover Best Practices
Manual Failover Best Practices
See Also: For a comprehensive review of Oracle Data Guard failover best practices, see:
Oracle Data Guard Broker for information on Switchover and Failover Operations
"Data Guard Fast-Start Failover" MAA white paper from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
"11.2 Data Guard Physical Standby Switchover Best Practices using SQL*Plus" in My Oracle Support Note 1304939.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1304939.1
"11.2 Data Guard Physical Standby Switchover Best Practices using the Broker" in My Oracle Support Note 1305019.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1305019.1
Table 85
Points of Comparison Benefits
Manual Failover Gives you control over exactly when a failover occurs and to which target standby database. A manual failover is user initiated and involves performing a series of steps to convert a standby database into a primary database. A manual failover should be performed due to an unplanned outage such as:
Failover triggers
Database instance failure (or last instance failure in an Oracle RAC configuration). Shutdown abort (or a shutdown abort of the last instance in an Oracle RAC configuration). Specific conditions that are detected through the database health-check mechanism (for example, data files taken offline due to I/O errors). Fast-start failover can be enabled for these conditions (ENABLE FAST_START FAILOVER CONDITION) and ORA errors raised by the Oracle server when they occur. See Oracle Data Guard Broker for a full list of conditions.
Site disaster which results in the primary database becoming unavailable (all instances of an Oracle RAC primary database). User errors that cannot be repaired in a timely fashion. Data failures, which impact the production application.
Both the observer and the standby database lose their network connection to the primary database. Application initiated fast-start failover using the DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure. Use the following tools to perform manual failovers:
Management
Oracle Enterprise Manager The Oracle Data Guard broker command-line interface (DGMGRL) SQL statements
Oracle Enterprise Manager The Oracle Data Guard broker command-line interface (DGMGRL)
See Section 14.2.1.3, "How to Perform Data Guard Switchover". Restoring the original primary database after failover Following a fast-start failover, Oracle Data Guard broker can automatically reconfigure the original primary database as a standby database upon reconnection to the configuration (FastStartFailoverAutoReinstate), or you can delay the reconfiguration to allow diagnostics on the failed primary. Automatic reconfiguration enables Data Guard to restore disaster protection in the configuration quickly and easily, returning the database to a protected state as soon as possible. Oracle Data Guard broker coordinates the role transition on all databases in the configuration. Bystanders that do not require reinstatement are available as viable standby databases to the new primary. Bystanders that require reinstatement are automatically reinstated by the observer. Oracle Data Guard broker automatically publishes FAN/AQ (Advanced Queuing) and FAN/ONS (Oracle Notification Service) notifications after a failover. Clients that are also configured for Fast Connection Failover can use these notifications to connect to the new primary database. You can also use the DB_ROLE_CHANGE system event to help user applications locate services on the primary database. (These events are also available for manual failovers performed by the broker. See Oracle Data Guard Broker.)
See Section 13.2.2.3, "Best Practices for Performing Manual Failover". After manual failover, you must reinstate the original primary database as a standby database to restore fault tolerance.
A benefit of using Oracle Data Guard broker is that it provides the status of bystander databases and indicates whether a database must be reinstated. Status information is not readily available when using SQL*Plus statements to manage failover. See Section 13.3.2, "Restoring a Standby Database After a Failover". Oracle Data Guard broker automatically publishes FAN/AQ (Advanced Queuing) and FAN/ONS (Oracle Notification Service) notifications after a failover. Clients that are also configured for Fast Connection Failover can use these notifications to connect to the new primary database. You can also use the DB_ROLE_CHANGE system event to help user applications locate services on the primary database. (These events are also available for manual failovers performed by the broker. See Oracle Data Guard Broker.)
Application failover
Enable Flashback Database to reinstate the failed primary databases after a failover operation has completed. Flashback Database facilitates fast point-in-time recovery, if needed.
Use real-time apply with Flashback Database to apply redo data to the standby database as soon as it is received, and to quickly rewind the database should user error or logical corruption be detected.
Consider configuring multiple standby databases to maintain data protection following a failover.
Set the LOG_FILE_NAME_CONVERT parameter. As part of a failover, the standby database must clear its online redo logs before opening as the primary database. The time needed to complete this I/O can add significantly to the overall failover time. By setting the LOG_FILE_NAME_CONVERT parameter, the standby pre-creates the online redo logs the first time the MRP process is started. You can also pre-create empty online redo logs by issuing the SQL*Plus ALTER DATABASE CLEAR LOGFILE statement on the standby database.
Use fast-start failover. The MAA tests running Oracle Database 11g show that failovers performed using Oracle Data Guard broker and fast-start failover offer a significant improvement in availability. For more information, see Section 8.4.2.3, "Fast-Start Failover Best Practices".
For physical standby databases, do the following:
When transitioning from read-only mode to Redo Apply (recovery) mode, restart the database.
Go directly to the OPEN state from the MOUNTED state instead of restarting the standby database (as required in releases prior to 11g Release 2).
To optimize media recovery for Redo Apply, see the MAA white paper "Oracle Data Guard Redo Apply and Media Recovery" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Run the fast-start failover observer process on a host that is not located in the same data center as the primary or standby database. Ideally, you should run the observer on a system that is equally distant from the primary and standby databases. The observer should connect to the primary and standby databases using the same network as any end-user client. If the designated observer fails, Oracle Enterprise Manager can detect it and automatically restart the observer. If the observer cannot run at a third site, then you should install the observer on the same network as the application. If a third, independent location is not available, then locate the observer in the standby data center on a separate host and isolate the observer as much as possible from failures affecting the standby database.
Make the observer highly available by using Oracle Enterprise Manager to configure the original primary database to be automatically reinstated as a standby database when a connection to the database is reestablished. Also, Oracle Enterprise Manager enables you to define an alternate host on which to restart the observer. After the failover completes, the original primary database is automatically reinstated as a standby database when a connection to it is reestablished, if you set the FastStartFailoverAutoReinstate configuration property to TRUE.
Set the value of the FastStartFailoverThreshold property according to your configuration characteristics, as described in Table 86.
Table 8-6 Minimum Recommended Settings for FastStartFailoverThreshold

Configuration | Minimum Recommended Setting
Single-instance primary, low latency, and a reliable network | 15 seconds
Single-instance primary and a high latency network over WAN | 30 seconds
Oracle RAC primary | Oracle RAC miscount + reconfiguration time + 30 seconds
Test your configuration using the settings shown in Table 8-6 to ensure that the fast-start failover threshold is not so aggressive that it induces false failovers, or so high that it does not meet your failover requirements.
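The minimum settings in the table above can be expressed as a small calculation. The following Python sketch is illustrative only: the function name, the configuration labels, and the sample Oracle RAC values (a miscount of 30 seconds and a reconfiguration time of 20 seconds) are assumptions made for this example and are not values taken from this guide. Measure the real values in your own cluster before choosing a threshold.

```python
from typing import Optional


def min_fsfo_threshold(config: str,
                       rac_miscount_s: Optional[int] = None,
                       rac_reconfig_s: Optional[int] = None) -> int:
    """Return the minimum recommended FastStartFailoverThreshold in seconds.

    The config labels are hypothetical names for the three table rows.
    """
    if config == "single_low_latency":
        # Single-instance primary, low latency, reliable network
        return 15
    if config == "single_wan":
        # Single-instance primary, high latency network over WAN
        return 30
    if config == "rac":
        # Oracle RAC primary: miscount + reconfiguration time + 30 seconds
        if rac_miscount_s is None or rac_reconfig_s is None:
            raise ValueError("RAC requires miscount and reconfiguration time")
        return rac_miscount_s + rac_reconfig_s + 30
    raise ValueError(f"unknown configuration: {config}")


# Hypothetical RAC inputs: 30 s miscount, 20 s reconfiguration time
print(min_fsfo_threshold("rac", rac_miscount_s=30, rac_reconfig_s=20))  # 80
```

The result is a lower bound only. Testing, as described above, determines whether a higher value is needed to avoid false failovers.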
- Site disaster that results in the primary database becoming unavailable
- User errors that cannot be repaired in a timely fashion
- Data failures, to include widespread corruption, which affects the production application
Use the following manual failover best practices in addition to the generic best practices listed in Section 8.4.2.2, "Failover Best Practices (Manual Failover and Fast-Start Failover)":
Reinstate the original primary database as a standby database to restore fault tolerance to your environment. The standby database can be quickly reinstated by using Flashback Database. See Section 13.3.2, "Restoring a Standby Database After a Failover."
See Also: For physical standby databases see the MAA white paper "Oracle Data Guard Redo Apply and Media Recovery" from the MAA Best Practices area for Oracle Database at
http://www.oracle.com/goto/maa
A physical standby database can be open for read-only access while Redo Apply is active if a license for the Oracle Active Data Guard option has been purchased. This capability, known as real-time query, also provides the ability to have block-change tracking on the standby database, thus allowing incremental backups to be performed on the standby.
The following list summarizes the best practices for deploying real-time query:
- Ensure Active Data Guard is enabled. The easiest and best way to view the status of Oracle Active Data Guard is on the Data Guard overview page through Oracle Enterprise Manager. Alternatively, query the V$DATABASE view on the standby database and confirm that the open mode is "READ ONLY WITH APPLY":
SQL> SELECT open_mode FROM V$DATABASE;

OPEN_MODE
--------------------
READ ONLY WITH APPLY
- Use real-time apply on the standby database so that changes are applied as soon as the redo data is received. Oracle Data Guard broker automatically enables real-time apply when the configuration is created. If you use the SQL*Plus command line to create your configuration, then enable real-time apply by issuing the following statement:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE
- Enable Flashback Database on the standby database to minimize downtime for logical corruptions.
- Monitor standby performance by using Standby Statspack. For complete details about installing and using Standby Statspack, see "Installing and Using Standby Statspack in 11g" in My Oracle Support Note 454848.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=454848.1
- When you deploy real-time query to offload queries from a primary database to a physical standby database, monitor the apply lag to ensure that it is within acceptable limits. See Oracle Data Guard Concepts and Administration for information on Monitoring Apply Lag in a Real-time Query Environment.
- Create an Oracle Data Guard broker configuration to simplify management and to enable automatic apply instance failover on an Oracle RAC standby database.
See Also: The "Active Data Guard 11g Best Practices (includes best practices for Redo Apply)" white paper available from the MAA Best Practices area for Oracle Database at
http://www.oracle.com/goto/maa
- Recover all available redo data
- Create a guaranteed restore point
- Activate the standby database as a primary database
- Open the database as a snapshot standby database
To convert the snapshot standby back to a physical standby, issue the ALTER DATABASE CONVERT TO PHYSICAL STANDBY statement. This command causes the physical standby database to be flashed back to the guaranteed restore point that was created before the ALTER DATABASE CONVERT TO SNAPSHOT STANDBY statement was issued. Then, you must perform the following actions:
1. Restart the physical standby database
2. Restart Redo Apply on the physical standby database
Follow these best practices when creating and managing snapshot standby databases:
- Use the Oracle Data Guard broker to manage your Oracle Data Guard configuration, because it simplifies the management of snapshot standby databases. The broker will automatically convert a snapshot standby database into a physical standby database as part of a failover operation. Without the broker, this conversion must be manually performed before initiating a failover.
- Create multiple standby databases if your business requires a fast recovery time objective (RTO).
- Ensure the physical standby database that you convert to a snapshot standby is caught up with the primary database, or has a minimal apply lag. See Section 8.3.8, "Use Data Guard Redo Apply Best Practices" for information about tuning media recovery.
- Configure a fast recovery area and ensure there is sufficient I/O bandwidth available. This is necessary because snapshot standby databases use guaranteed restore points.
See Also:
Oracle Data Guard Concepts and Administration for complete information about creating a snapshot standby database
- Physical reads per transaction
- Physical writes per transaction
- CPU usage per transaction
- Redo generated per transaction

- Redo generated per second or redo rate
- User commits per second or transactions per second
- Database time per second
- Response time per transaction
- SQL service response time
If the application profile has changed between the two scenarios, then this is not a fair comparison. Repeat the test or tune the database or system with the general principles outlined in the Oracle Database Performance Tuning Guide. If the application profile is similar and you observe application performance changes on the primary database because of a decrease in throughput or an increase in response time, then assess these common problem areas:
- CPU utilization
  If you are experiencing high load (excessive CPU usage of over 90%, paging and swapping), then tune the system before proceeding with Data Guard. Use the V$OSSTAT view or the V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.
- Higher I/O wait events
  If you are experiencing higher I/O waits from the log writer or database writer processes, then the slower I/O affects throughput and response time. To observe the I/O effects, look at the historical data of the following wait events:
  - Log file parallel writes
  - Log file sequential reads
  - Log file parallel reads
  - Data file parallel writes
  - Data file sequential reads parallel writes
With SYNC transport, commits take more time because of the need to guarantee that the redo data is available on the standby database before foreground processes get an acknowledgment from the log writer (LGWR) background process that the commit has completed. A LGWR process commit includes the following wait events:
- Log File Parallel Write (local write for the LGWR process)
- LGWR wait on SENDREQ
  This wait event includes:
  - Time to put the packet into the network
  - Time to send the packet to the standby database
  - RFS write or standby write to the standby redo log, which includes the RFS I/O wait event plus additional overhead for checksums
  - Time to send a network acknowledgment back to the primary database (for example, single trip latency time)
Longer commit times for the LGWR process can cause longer response time and lower throughput, especially for small time-sensitive transactions. However, you may obtain sufficient gains by tuning the log writer local write (Log File Parallel Write wait event) or the different components that comprise the LGWR wait on SENDREQ wait event. To tune the disk write I/O (Log File Parallel Write or the RFS I/O), add more spindles or increase the I/O bandwidth. To reduce the network time:
- Tune the Oracle Net send and receive buffer sizes
- Set SDU=65535 (for more information, see Section 8.3.7.2, "Set the Network Configuration and Highest Network Redo Rates")
- Increase the network bandwidth if there is saturation
- Possibly find a closer site to reduce the network latency
With ASYNC transport, the LGWR process never waits for the network server processes to return before writing a COMMIT record to the current log file. However, if a network server process has fallen behind and the redo to be shipped has been flushed from the log buffer, then the network server process reads from the online redo logs. This causes more I/O contention and possibly longer wait times for the log writer process writes (Log File Parallel Write). If I/O bandwidth and sufficient spindles are not allocated, then the log file parallel writes and log file sequential reads increase, which may affect throughput and response time. In most cases, adding sufficient spindles reduces the I/O latency.
Note:
To enable most of the statistical gathering and advisors, ensure the STATISTICS_LEVEL initialization parameter is set to TYPICAL (recommended) or ALL.
See Also:
- Oracle Database Performance Tuning Guide for general performance tuning and troubleshooting best practices
- Oracle Database Performance Tuning Guide for Overview of the Automatic Workload Repository (AWR) and on Generating Automatic Workload Repository Reports
- The MAA white paper "Data Guard Redo Transport & Network Best Practices" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
9
A data protection plan is not complete without a sound backup and recovery strategy to protect against system and storage failures. Oracle delivers a comprehensive data protection suite for backup and recovery of Oracle database and unstructured, application files. The primary focus of this chapter is best practice configuration for backup and recovery for the Oracle database. File system data protection offerings are introduced along with pointers on where to find more information. This chapter includes the following sections:
- Oracle Database Backup and Recovery Products and Features
- Backup and Recovery Configuration and Administration Best Practices
- Backup to Disk Best Practices
- Backup to Tape Best Practices
- Backup and Recovery Operations and Maintenance Best Practices
- Backup Files Outside the Database
Table 9-1 provides a quick reference summary of the Oracle backup and recovery suite.
Table 9-1 Backup and Recovery Summary

Technology | Recommended use with Oracle Database | Recommended use with File System Data | Comments
Recovery Manager (RMAN) | Yes | No | Native backup utility for the Oracle database
Oracle Secure Backup | Yes | Yes | Tape backup management software
Oracle Secure Backup Cloud Module | Yes | No | Backup to Amazon S3 storage
 | | No | Logical error correction leveraging undo data of the Oracle database
 | | No | Continuous Data Protection (CDP) leveraging flashback logs
Automatic Clustered File System (ACFS) Snapshots | No | Yes | Read-only or read/write copy-on-write version of the file system
ZFS Snapshots (for database clones such as dev/test) | Yes | Yes | Read-only or read/write copy-on-write version of the database for testing and development
ZFS Snapshots for backup/restore | No | Yes | Read-only or read/write copy-on-write version of the file system for testing, development and backup
Situations that require Database Backup
- Setting Up the Initial Data Guard Environment
- Recovering from Data Failures Using File or Block Media Recovery
- Resolving a Double Failure
See Also:
- Oracle Data Guard Concepts and Administration for information about creating a standby database
- Oracle Database Backup and Recovery User's Guide for information about the DUPLICATE command
- Online database backups without placing tablespaces in backup mode
- Efficient block-level incremental backups
- Data block integrity checks during backup and restore operations
- Test backups and restores without actually performing the operation
- Synchronize a physical standby database with the primary database
RMAN automates backup and recovery. User-managed methods require you to:

- Locate backups for each data file
- Copy backups to the correct place using operating system commands
- Choose which logs to apply

RMAN fully automates these backup and recovery tasks. There are also capabilities of Oracle backup and recovery that are only available when using RMAN, such as automated tablespace point-in-time recovery and block media recovery.
See Also:
- Oracle Database Backup and Recovery User's Guide for information on performing block media recovery
- Oracle Database Backup and Recovery User's Guide for information on performing RMAN Tablespace Point-in-Time Recovery (TSPITR)
- Oracle Data Guard Concepts and Administration for information on Using RMAN Incremental Backups to Roll Forward a Physical Standby Database
Oracle database through the Oracle Secure Backup built-in integration with Recovery Manager (RMAN)
- File system data protection: For UNIX, Windows, and Linux servers
- Network Attached Storage (NAS) data protection leveraging the Network Data Management Protocol (NDMP)
Oracle Secure Backup is integrated with RMAN providing the media management layer (MML) for Oracle database tape backup and restore operations. The tight integration between these two products delivers high-performance Oracle database tape backup. Specific performance optimizations between RMAN and Oracle Secure Backup that reduce tape consumption and improve backup performance are:
- Unused block compression: Eliminates the time and space usage needed to back up unused blocks
- Backup undo optimization: Eliminates the time and space usage needed to back up undo that is not required to recover the current backup
You can manage the Oracle Secure Backup environment using the command line, the Oracle Secure Backup Web tool, and Oracle Enterprise Manager. Using the combination of RMAN and Oracle Secure Backup provides an end-to-end tape backup solution, eliminating the need for third-party backup software.
See Also:
- Oracle Database Backup and Recovery User's Guide for more information on unused block compression and backup undo optimization
- The Oracle Secure Backup platform and NAS and tape device compatibility matrixes: http://www.oracle.com/technetwork/database/secure-backup/learnmore/index.html
Restore points protect against logical failures at risky points during database maintenance. Creating a normal restore point assigns a restore point name to a specific point in time or SCN, that is a snapshot of the data as of that time. Normal restore points are available with Flashback Table, Flashback Database, and all RMAN recovery-related operations. Guaranteed restore points are recommended for database-wide maintenance such as database or application upgrades, or running batch processes. Guaranteed restore points are integrated with Flashback Database and enforce the retention of all flashback logs required for flashing back to the guaranteed restore point. After maintenance activities complete and the results are verified, you should delete guaranteed restore points that are no longer needed to reclaim flashback log space.
See Also: Oracle Database Backup and Recovery User's Guide for more information about using restore points and guaranteed restore points with a Flashback Database
- Criticality of the data: The Recovery Point Objective (RPO) determines how much data your business can acceptably lose in the event of a failure. The more critical the data, the lower the RPO and the more frequently data should be backed up. If you are going to back up certain tablespaces more often than others, with the goal of getting better RPO for those tablespaces, then you also need to plan for doing TSPITR as part of your recovery strategy. This requires considerably more planning and practice than DBPITR, because you need to make sure that the tablespaces you plan to TSPITR are self-contained.
- Estimated repair time: The Recovery Time Objective (RTO) determines the acceptable amount of time needed for recovery. Repair time is dictated by restore time plus recovery time. The lower the RTO, the higher the frequency of backups, that is, backups are more current, thereby reducing recovery time.
- Volume of changed data: The rate of database change affects how often data is backed up:
  - For read-only data, perform backups frequently enough to adhere to retention policies.
  - For frequently changing data, perform backups more often to reduce the RTO.
To simplify database backup and recovery, the Oracle Suggested Backup Strategy uses the fast recovery area, incremental backups, and incrementally updated backup features. After the initial image copy backup to the FRA, only the changed blocks are captured in the incremental backups thereafter and subsequently applied to the image copy, thereby updating the copy to the most current incremental backup time (that is, incrementally updating the backup).
See Also:
- Oracle Database Backup and Recovery User's Guide for information on Performing RMAN Tablespace Point-in-Time Recovery (TSPITR)
- Oracle Database Backup and Recovery User's Guide for information on Performing Database Point-in-Time Recovery
- Oracle Database 2 Day DBA for information on Using the Oracle Suggested Backup Strategy
A backup retention policy is a rule set regarding which backups must be retained, on disk or other backup media, to meet recovery and other requirements. It may be safe to delete a specific backup because it has been superseded by more recent backups or because it has been stored on tape. You may also have to retain a specific backup on disk for other reasons such as archival or regulatory requirements. A backup that is no longer needed to satisfy the backup retention policy is said to be obsolete. Base your backup retention policy on redundancy or on a recovery window:
- In a redundancy-based retention policy, specify a number n such that you always keep at least n distinct backups of each file in your database.
- In a recovery window-based retention policy, specify a time interval in the past, for example, one week or one month, and keep all backups required to let you perform point-in-time recovery to any point during that window.
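The difference between the two policy types can be illustrated with a simplified model. The following Python sketch is not RMAN's algorithm: it considers only the full backups of a single file, ignores incremental backups and archived redo logs, and uses hypothetical function names and dates chosen for this example.

```python
from datetime import datetime, timedelta


def obsolete_by_redundancy(backup_times, n):
    """Keep the n most recent backups; everything older is obsolete."""
    newest_first = sorted(backup_times, reverse=True)
    return sorted(newest_first[n:])


def obsolete_by_recovery_window(backup_times, now, window):
    """Keep every backup taken inside the window, plus the newest backup
    taken at or before the start of the window, because that backup is
    the starting point for recovery to the earliest time in the window."""
    window_start = now - window
    before_window = sorted(t for t in backup_times if t <= window_start)
    # All pre-window backups except the newest one are obsolete.
    return before_window[:-1]


# Hypothetical weekly backups taken on January 1, 8, 15, and 22
backups = [datetime(2011, 1, day) for day in (1, 8, 15, 22)]

# Redundancy 3: only the January 1 backup is obsolete.
print(obsolete_by_redundancy(backups, 3))

# Recovery window of 7 days evaluated on January 23: the window starts on
# January 16, so the January 15 backup is still needed and the January 1
# and January 8 backups are obsolete.
print(obsolete_by_recovery_window(backups, datetime(2011, 1, 23),
                                  timedelta(days=7)))
```

Note that under the recovery window policy the January 15 backup is retained even though it is older than the window, because no later backup can be used to recover to January 16.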
Keeping Archival Backups
Some businesses must retain some backups for much longer than their day-to-day backup retention policy. RMAN allows for this with the Long-term Archival Backup feature. Rather than becoming obsolete according to the database's backup retention policy, archival backups either never become obsolete or become obsolete when their time limit expires.
You can use the RMAN BACKUP command with the KEEP option to retain backups for longer than your ordinary retention policy. This option specifies the backup as an archival backup, which is a self-contained backup that is exempt from the configured retention policy. This allows you to retain certain backups for much longer than usual, when needed for such reasons as satisfying statutory retention requirements. The KEEP FOREVER option requires a recovery catalog because the backup records eventually age out of the control file. Without a recovery catalog, the records of backups that you retain for much longer than usual can be lost from the database control file. Only the archived redo log files required to make an archival backup consistent are retained. For more information on the RMAN recovery catalog, see Section 9.2.2, "Use an RMAN Recovery Catalog".
See Also: Oracle Database Backup and Recovery User's Guide for information on Archival Backups for Long-Term Storage
- Storing backup information for a longer retention period than can feasibly be stored in the control file. If the control file is too small to hold additional backup metadata, then existing backup information is overwritten, making it difficult to restore and recover using those backups.
- Storing metadata for multiple databases.
- Offloading backups to a physical standby database and using those backups to restore and recover the primary database. Similarly, you can back up a tablespace on a primary database and restore and recover it on a physical standby database. Note that backups of logical standby databases are not usable at the primary database.
See Also: Oracle Database Backup and Recovery User's Guide for more information on RMAN repository and the recovery catalog
See Also:
- Oracle Database Backup and Recovery User's Guide for information on Enabling and Disabling Block Change Tracking
- Oracle Database Backup and Recovery Basics for more information about Block Change Tracking
9.2.5 Enable Autobackup for the Control File and Server Parameter File
You should configure RMAN to automatically back up the control file and the server parameter file (SPFILE) whenever the database structure metadata in the control file changes or when a backup record is added. The control file autobackup option enables RMAN to recover the database even if the current control file, catalog, and SPFILE are lost. Enable the RMAN autobackup feature with the CONFIGURE CONTROLFILE AUTOBACKUP ON statement. You should enable autobackup for both the primary and standby databases. For example, after connecting to the primary database, as the target database, and the recovery catalog, issue the following command:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
See Also:
- Oracle Database Backup and Recovery User's Guide for information on Configuring Control File and Server Parameter File Autobackups
- Oracle Data Guard Concepts and Administration for more information on RMAN Configurations at a Standby Database Where Backups are Performed
Backups of logical standby databases are not usable on the primary database.
See Also: Oracle Data Guard Concepts and Administration for information on using RMAN to back up and restore files
9.2.7 Set UNDO Retention for Flashback Query and Flashback Table Needs
To ensure a database is enabled to use Flashback Query, Flashback Versions Query, and Flashback Transaction Query, implement the following:
- Set the UNDO_MANAGEMENT initialization parameter to AUTO. This ensures the database is using an undo tablespace.
- Set the UNDO_RETENTION initialization parameter to a value that allows UNDO to be kept for a length of time that allows success of your longest query back in time or to recover from human errors.
- Set the RETENTION GUARANTEE clause for the undo tablespace to guarantee that unexpired undo will not be overwritten.
Flashback Table also relies on the undo data to recover the tables. Enabling Automatic Undo Management is recommended, and the UNDO_RETENTION parameter must be set to a period for which Flashback Table is needed. If a given table does not contain the required data after a Flashback Table operation, it can be flashed back further, flashed forward, or returned to its original state, if there is sufficient UNDO data.
See Also:
- Oracle Database Advanced Application Developer's Guide for information on Flashback
- Oracle Database Advanced Application Developer's Guide for information on Flashback Query
- Overall backup time
- Impact to resource consumption
- Space used by the backup
- Recovery time
Table 9-3 compares different backup alternatives against the different priorities you might have. Using Table 9-3 as a guide, you can choose the best backup approach for your specific business requirements. You might want to minimize backup space while sacrificing recovery time. Alternatively, you might choose to place a higher priority on recovery and backup times while space is not an issue.
Table 9-3 Comparing Backup to Disk Options

Backup Option | Overall Backup Time (1: Fastest, 5: Slowest) | Impact on Resource Consumption (1: Lowest, 5: Highest) | Space Used by Backup (1: Least, 5: Most) | Recovery Time (1: Fastest, 5: Slowest)
Data file image copy | 5 | 5 | 5 | 1 No restore (switch to fast recovery area copy)
Full or level 0 backup set | 4 | 4 | 3 | 3
Differential incremental backup set (level 1); applied to previous level 0 and level 1 backups during recovery | 1 | 1 | 1 | 5
Cumulative incremental backup set (level 1); applied to previous level 0 backup during recovery | | | |
Incrementally updated backup (level 1); incremental applied to image copy after backup | 3 | 1 | 5 | 1 No restore (switch to fast recovery area copy)
Backup to Disk: Best Practices for Optimizing Recovery Times
If restore time is your primary concern, then perform either a database copy or an incremental backup with immediate apply of the incremental to the copy. These are the only options that provide an immediately usable backup of the database, which you then need to recover only to the time of the failure using archived redo log files created since the last incremental backup was performed.
Backup to Disk: Best Practices for Minimizing Space Usage
If space usage is your primary concern, then perform an incremental backup with a deferred apply of the incremental to the copy. If you perform a cumulative level 1 incremental backup, then it stores only those blocks that have been changed since the last level 0 backup:

- With a cumulative incremental backup, apply only the last level 1 backup to the level 0 backup.
- With a differential incremental backup, apply all level 1 backups to the level 0 backup.
A cumulative incremental backup usually consumes more space in the fast recovery area than a differential incremental backup.
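The trade-off between the two incremental types can be sketched with a small model. The following Python example is illustrative only and does not reflect how RMAN tracks blocks: each day's changes are represented as a set of hypothetical block numbers, and the function names are assumptions made for this example.

```python
def level1_sizes(daily_changed_blocks, cumulative):
    """Return the number of blocks stored by each successive level 1 backup.

    A cumulative backup stores every block changed since the level 0
    backup. A differential backup stores only the blocks changed since
    the previous backup.
    """
    sizes = []
    changed_since_level0 = set()
    for changed_today in daily_changed_blocks:
        changed_since_level0 |= changed_today
        if cumulative:
            sizes.append(len(changed_since_level0))
        else:
            sizes.append(len(changed_today))
    return sizes


def level1_backups_to_apply(level1_count, cumulative):
    """Return the indexes of the level 1 backups applied to the level 0."""
    if level1_count == 0:
        return []
    if cumulative:
        return [level1_count - 1]          # only the last level 1 backup
    return list(range(level1_count))       # every level 1 backup, in order


# Hypothetical changed block numbers for three days after the level 0 backup
days = [{1, 2, 3}, {3, 4}, {4, 5, 6}]

print(level1_sizes(days, cumulative=False))   # [3, 2, 3]
print(level1_sizes(days, cumulative=True))    # [3, 4, 6]
print(level1_backups_to_apply(3, cumulative=False))  # [0, 1, 2]
print(level1_backups_to_apply(3, cumulative=True))   # [2]
```

In this model the cumulative backups store 13 blocks in total against 8 for the differential backups, but recovery applies one backup instead of three.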
Backup to Disk: Best Practices for Minimizing System Resource Consumption (I/O and CPU)
If system resource consumption is your primary concern, then an incremental backup with Block Change Tracking enabled consumes the least amount of resources on the database.
Example
For many applications, only a small percentage of the entire database is changed each day even if the transaction rate is very high. In many cases, applications repeatedly modify the same set of blocks, so the total unique, changed block set is small. For example, a database contains about 600 GB of user data, not including temp files and redo logs. Every 24 hours, approximately 2.5% of the database is changed, which is approximately 15 GB of data. In this example, MAA testing recorded the following results:
- Level 0 backup takes 180 minutes, including READs from the data area and WRITEs to the fast recovery area
- Level 1 backup takes 20 minutes, including READs from the data area and WRITEs to the fast recovery area
- Rolling forward and merging an existing image copy in the fast recovery area with a newly created incremental backup takes only 45 minutes, including READs and WRITEs from the fast recovery area.
In this example, the level 0 backup (image copy) takes 180 minutes, the same amount of time it takes to perform a full backup set. Subsequent backups are level 1 (incremental) and take 20 minutes each, so the potential impact on the data area is reduced. Each level 1 backup is then applied to the existing level 0 backup, which takes 45 minutes. This process does not perform I/O to the data area, so there is no impact on it (assuming the fast recovery area and data area use separate storage). The total time to create the incremental backup and apply it to the existing level 0 backup is 65 minutes (20+45). The result is the same whether you use incrementally updated backups or full backup sets: a full backup of the database is created. The incremental approach takes 115 minutes less (64% less) than simply creating a full backup set. In addition, the I/O impact is lower, particularly against the data area, which should have a less detrimental effect on production database performance. Thus, for this example, when you compare always taking full backups with starting with a level 0 image copy, performing only incremental backups, and then rolling forward the copy, the net savings are:
- 115 minutes or 64% time savings to create a complete backup
- Reduced I/O on the database during backups
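The figures in this example can be rechecked with a few lines of arithmetic. The following Python sketch uses only the numbers quoted in the example; the variable names are chosen for illustration.

```python
# Figures quoted in the example
db_size_gb = 600          # user data, excluding temp files and redo logs
daily_change_pct = 2.5    # share of the database changed every 24 hours
level0_min = 180          # level 0 (or full backup set) duration
level1_min = 20           # level 1 incremental duration
merge_min = 45            # roll forward of the image copy

# Volume of changed data per day
changed_gb = db_size_gb * daily_change_pct / 100

# Time to produce an up-to-date full backup with the incremental approach
incremental_total_min = level1_min + merge_min

# Savings compared with taking a new full backup each time
saved_min = level0_min - incremental_total_min
saved_pct = round(100 * saved_min / level0_min)

print(changed_gb)             # 15.0
print(incremental_total_min)  # 65
print(saved_min)              # 115
print(saved_pct)              # 64
```

Substituting your own measured durations for the three timing values gives a first estimate of the savings for a different database.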
See Also: Oracle Database Backup and Recovery User's Guide for more information on backing up the database
2. Define your Oracle Secure Backup Administrative Server in Oracle Enterprise Manager Database Control, enabling the Oracle Secure Backup domain to be managed through Oracle Enterprise Manager.
3. Pre-authorize an Oracle Secure Backup user for use with RMAN, allowing RMAN backup and restore operations to be performed without having to explicitly log in to Oracle Secure Backup.
4. Set up media policies in Oracle Secure Backup to be used for RMAN backups.
If you use Oracle Secure Backup or tape-side compression, do not also use RMAN compression.
See Also: Oracle Secure Backup Administrator's Guide for more information on using Recovery Manager with Oracle Secure Backup
9.4.2 Define Oracle Secure Backup Media Policies for Tape Backups
Once backup data stored on tape is no longer needed, its lifecycle is complete and the tape media can be reused. Management requirements during a tape's lifecycle (retention period) may include duplication and vaulting across multiple storage locations. Oracle Secure Backup provides effective media lifecycle management through user-defined media policies, including:
Media lifecycle management may be as simple as defining appropriate retention settings or more complex to include tape duplication with the original and duplicate(s) having different retention periods and vaulting requirements.

Oracle Secure Backup media families, often referred to as tape pools, provide the media lifecycle management foundation. The best practice recommendation is to leverage content-managed media families which use defined RMAN retention parameters associated with the database to determine when the tape may be reused (effectively an expired tape). A specific expiration date is not associated with content-managed tapes as is done with time-managed. The expiration or recycling of these tapes is based on the attribute associated with the backup images on the tape. All backup images written to content-managed tapes automatically have an associated "content-manages reuse" attribute.

Since the recycling of content-managed tapes adheres to user-defined RMAN retention settings, RMAN instructs Oracle Secure Backup when to change the backup image attribute to "deleted". The RMAN DELETE OBSOLETE command communicates which backup pieces (images) are no longer required to meet the user-defined RMAN retention periods. Once Oracle Secure Backup receives this communication, the backup image attribute is changed to "deleted". The actual backup image is not deleted but the attribute is updated within the Oracle Secure Backup catalog. Once all backup images on tape have a deleted attribute, Oracle Secure Backup considers the tape eligible for reuse, similar to that of an expired time-managed tape.

Oracle Secure Backup provides policy-based media management for RMAN backup operations through user-defined Database Backup Storage Selectors. One Database Backup Storage Selector (SSEL) may apply to multiple databases or multiple SSELs may be associated with a single database.
For example, you would create two SSELs for a database when you use RMAN duplexing and each copy should be written to a different media family. The SSEL contains the following information:
- Database name / ID or applicable to all databases
- Hostname or applicable to all hosts
- Content: archive logs, full, incremental, autobackup or applicable to all
- RMAN copy number (applicable when RMAN duplexing is configured)
- Media family name
- Name(s) of devices to which operations are restricted (if no device restrictions are configured, Oracle Secure Backup uses any available device)
- Wait time (duration) for available tape resources
- Encryption setting
Oracle Secure Backup automatically uses the storage selections defined within an SSEL without further user intervention. To override the storage selections for one-time backup operations or other exceptions, define alternate media management parameters in the RMAN backup script. For more information, see: http://www.oracle.com/technetwork/database/secure-backup/documentation/index.html
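The recycling rule for content-managed tapes described earlier in this section (a tape becomes eligible for reuse only once every backup image on it carries the "deleted" attribute) can be sketched as a small model. This is an illustration only, not the Oracle Secure Backup catalog or its API; the class names, field names, and the apply_delete_obsolete function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BackupImage:
    piece_name: str
    deleted: bool = False  # hypothetical stand-in for the catalog's "deleted" attribute


@dataclass
class Tape:
    volume_id: str
    images: list = field(default_factory=list)

    def eligible_for_reuse(self):
        # Reusable only when every backup image on the tape is marked deleted.
        return all(image.deleted for image in self.images)


def apply_delete_obsolete(tapes, obsolete_pieces):
    """Mark the named pieces as deleted and return the reusable volume IDs.

    Only the attribute changes; the image data itself is left untouched,
    mirroring the behavior described in the text.
    """
    obsolete = set(obsolete_pieces)
    for tape in tapes:
        for image in tape.images:
            if image.piece_name in obsolete:
                image.deleted = True
    return [tape.volume_id for tape in tapes if tape.eligible_for_reuse()]
```

In this model a tape holding two images stays ineligible until both pieces have been reported obsolete, which is why a single long-retention backup piece can keep an otherwise obsolete tape out of circulation.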
Saves tape consumption by creating an optimized backup of the Fast Recovery Area (FRA), thereby eliminating unnecessary backup of files already protected on tape
Enables RMAN to use better restore intelligence, restoring from disk first and then from tape as necessary; otherwise, RMAN would restore from the most recent backup regardless of media type
Reduces I/O on the production database because the FRA uses a separate disk group
Upon restoration, RMAN automatically selects the most appropriate backup to restore from disk or tape. If the required backup is on tape, RMAN restores or recovers the database directly from tape media through integration with Oracle Secure Backup. Because RMAN has intimate knowledge of which files are necessary for recovery, restoration from disk or tape is an automated process. While it is possible to back up the FRA or other RMAN disk backups to tape outside of RMAN by performing a file system backup of the disk area using the media management software, it is not recommended. If RMAN is not aware of the tape backup, then restoration is an error-prone, manual process:
1. The DBA must determine what files are needed for the restoration.
2. The media manager administrator then restores the designated files from tape backups to a disk location.
3. Once the files are on disk, the DBA initiates an RMAN restore or recovery from the disk location.
The combination of RMAN and Oracle Secure Backup provides an integrated Oracle database tape backup solution.
See Also:
Oracle Database 2 Day DBA for more information on the Fast Recovery Area
The RESTORE DATABASE PREVIEW command provides a list of the tapes needed for restoration that are offsite.
The RESTORE DATABASE PREVIEW RECALL command initiates a recall operation through Oracle Secure Backup to return the tapes from offsite to the tape device for restoration.
Once the tapes are on-site, you can begin the RMAN restore operation.
See Also:
Oracle Database Backup and Recovery User's Guide for more information on recalling offsite backups with RMAN
Oracle Database Backup and Recovery User's Guide for more information on previewing backups used in restore operations
To detect all types of corruption that are possible to detect, specify the CHECK LOGICAL option.
See Also: Oracle Database Backup and Recovery User's Guide for information on Validating Database Files and Backups
Oracle Database Backup and Recovery User's Guide for information on the DUPLICATE command
Oracle Database Backup and Recovery User's Guide for information on validating backups before restoring them
9.5.4 Backup the RMAN and Oracle Secure Backup Catalogs on a Regular Basis
Include the recovery catalog database in your backup and recovery strategy. If you do not back up the recovery catalog and a disk failure occurs that destroys the recovery catalog database, then you may lose the metadata in the catalog. Without the recovery catalog contents, recovery of your other databases is likely to be more difficult. The Oracle Secure Backup catalog maintains backup metadata, scheduling and configuration details for the backup domain. Just as it's important to protect the RMAN catalog or control file, the Oracle Secure Backup catalog should be backed up on a regular basis.
See Also:
Oracle Database Backup and Recovery User's Guide for information on managing a recovery catalog
Oracle Secure Backup Administrator's Guide for information about the Oracle Secure Backup administrative server backup catalog
administration for file system data could cross organizational areas. The Oracle data protection suite offers a cohesive solution meeting your complete needs for Oracle database and non Oracle database storage.
Oracle Automatic Storage Management Administrator's Guide for more information about Oracle ACFS snapshots
Oracle Automatic Storage Management Administrator's Guide for information on managing Oracle ACFS snapshots with Oracle Enterprise Manager
See Also:
http://wikis.sun.com/display/FishWorks/Documentation for documentation on Oracle's Sun ZFS Storage Appliance
The MAA white paper "Oracle Database Cloning Solution Using Oracle's Sun ZFS Storage Appliance And Oracle Data Guard" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
The MAA white paper "Oracle Database Cloning Using Oracle Recovery Manager and Sun ZFS Storage Appliance" from the MAA Best Practices area for Oracle Databases at http://www.oracle.com/goto/maa
10
Oracle GoldenGate delivers low-impact, real-time data acquisition, distribution, and delivery across both homogeneous and heterogeneous systems. Oracle GoldenGate enables cost-effective and low-impact real-time data integration and continuous availability solutions across a wide variety of use cases. Oracle GoldenGate offers close integration with Oracle technologies and applications, support for additional heterogeneous systems, and improved performance. This chapter includes the following sections:
Oracle GoldenGate Overview
Oracle GoldenGate Configuration Best Practices
Oracle GoldenGate Operational Best Practices
Active-Active multi-master configurations used for data availability and to scale performance. An important consideration for such configurations is the ability to manage update conflicts, either by avoiding them or by implementing a process for conflict detection and resolution.
Offloading operational reporting when read/write access to the reporting instance is required.
Near zero downtime (one-way replication) or zero downtime (bi-directional replication) for planned maintenance tasks, including:
Database upgrades
Application upgrades that modify back-end database objects (requires the user to implement transformations to map old and new versions)
Redundant copies of the Oracle database
Supplemental database logging
Replication overhead from capture and apply processes on each database
Conflict detection and resolution
Client notification/redirection in the event of a database outage
Meeting all prerequisites for logical replication
In contrast, Oracle RAC provides High Availability and scalable performance by balancing workload across multiple servers having shared access to a single copy of the Oracle database. Oracle RAC includes integrated methods of client notification and redirection and has none of the additional processing or prerequisites required by logical replication. For these reasons Oracle RAC is the preferred method for implementing High Availability and scaling performance for the Oracle Database. Oracle GoldenGate is often used in conjunction with Oracle RAC for maintenance or migrations, or as a method for distributing subsets of a source database across remote geographic locations to provide local read/write access or to create a local read/write replica for offloading reporting applications (when such applications require read/write access to the database).
10.1.3 Oracle GoldenGate and Oracle Data Guard/Oracle Active Data Guard
Oracle GoldenGate is Oracle's strategic logical replication product. Oracle Data Guard is Oracle's strategic physical replication product focused on data protection and data availability, and is the standard MAA recommendation for such purposes because of the advantages it offers over logical replication. Oracle Data Guard is also commonly used in place of storage-remote mirroring or host-based mirroring solutions for disaster protection. Oracle Data Guard also minimizes planned downtime by supporting database rolling upgrades, select migrations (for example, Windows to Linux), data center moves, and other types of planned maintenance. Oracle Active Data Guard, an extension to Oracle Data Guard, is the simplest, fastest, most efficient method of maintaining a synchronized physical replica of a source database open read-only for offloading read-only workload and/or backups. For a detailed discussion of Data Guard advantages for data protection, see the Product Technical Brief, "Oracle Active Data Guard and Oracle GoldenGate" available from the GoldenGate link at http://www.oracle.com/technetwork/database/features/availability/index.html
Oracle GoldenGate is often used in Data Guard configurations in a complementary manner. A Data Guard physical standby provides data protection and availability while Oracle GoldenGate capture processes are configured at either the primary database, or at the standby database (using ALO mode), to distribute data to one or more target databases, either Oracle or non-Oracle, to address advanced replication requirements.
Oracle GoldenGate can also be used in place of Oracle Data Guard for requirements where its advantages outweigh Oracle Data Guard's specialized capabilities for data protection, high availability, and disaster recovery, and when all prerequisites for logical replication can be met. Such use-cases include:
When read/write access to the target database is required, either to implement multi-master replication, or to offload packaged reporting applications that require read/write access to the target database
For planned maintenance or migrations in heterogeneous environments not supported by Oracle Data Guard
For additional flexibility during planned maintenance to replicate downward from a new Oracle Database or application version to the previous version, for fast switchback should unforeseen problems be encountered
To maintain availability through application upgrades that modify back-end database objects
Using edition-based redefinition, the old version of the application is in the old edition and the new version of the application is in the new edition, both within the same database; the edition is the isolation mechanism. Data that is represented the same in the old and the new versions of the application is represented only once in table columns used by both versions; only data that is represented differently in the two application versions must exist twice. Synchronization is needed, therefore, only for that typically small proportion of the total data that differs between the two versions. Because a crossedition trigger fires within a transaction, potential conflicts between the old and the new representations are prevented before they can be committed, and there is no need for conflict resolution.
Using Oracle GoldenGate, the old version of the application runs on the original database and the new version of the application runs on a second database; the second database is the isolation mechanism. All data, both that which is represented the same in the old and the new versions of the application and that which is represented differently in the two application versions, must exist twice. Synchronization is needed, therefore, for all the data. The synchronization is implemented using code that intervenes in the replay mechanism for the SQL that is constructed by mining the redo logs. It is, therefore, non-transactional, and conflicts between the old and the new representations cannot be prevented. Rather, conflict resolution must be implemented as an explicit, post-processing step.
Use of a Clustered File System
Improved ASM Log File Read API
Replicat Commit Behavior
Oracle Clusterware Configuration
Additional Oracle GoldenGate Best Practices
Note:
The best practices provided in the following documents apply to all supported Oracle GoldenGate platforms, including Oracle Exadata Database Machine.
See:
See "DBFS Configuration Best Practices for use with Oracle GoldenGate" in My Oracle Support Note 1319042.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1319042.1
Product Technical Brief, "Oracle GoldenGate on Sun Oracle Database Machine" from the GoldenGate link at http://www.oracle.com/technetwork/database/features/availability/index.html
For Oracle Database 11g Release 2 (11.2) and newer Oracle Database releases, configure Replicat to use COMMIT NOWAIT by adding the following to the Replicat parameter file:
SQLEXEC "ALTER SESSION SET COMMIT_WAIT='NOWAIT'";
Starting in GoldenGate v11.1.1.1 COMMIT NOWAIT is the default behavior when using a checkpoint table.
If you are using an Oracle GoldenGate Data pump process (release v11.1.1 or earlier) to transfer the trail files from a source host on the database machine using DBFS (this applies to the Exadata Database Machine and to non-Exadata configurations), see "Oracle GoldenGate Best Practices: Oracle GoldenGate high availability using Oracle Clusterware" in My Oracle Support Note 1313703.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1313703.1
See Also: For more detailed instructions to configure Oracle Clusterware with GoldenGate, see the product technical brief, "Oracle GoldenGate High Availability using Oracle Clusterware" from the GoldenGate link at http://www.oracle.com/technetwork/database/features/availability/index.html
See "Oracle GoldenGate database Schema Profile check script for Oracle Database" in My Oracle Support Note 1296168.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1296168.1
See "Oracle GoldenGate database Complete Database Profile check script (All Schemas)" in My Oracle Support Note 1298562.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1298562.1
The product technical briefs:
Zero-Downtime Database Upgrades Using Oracle GoldenGate
Best Practices for Conflict Detection and Resolution in Active-Active Database Configurations Using Oracle GoldenGate
Oracle GoldenGate For Windows and UNIX Administrator's Guide
Oracle GoldenGate For Windows and UNIX Troubleshooting and Tuning Guide
See "Oracle GoldenGate - Heartbeat process to monitor lag and performance in GoldenGate" in My Oracle Support Note 1299679.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1299679.1
See "Oracle GoldenGate Best Practices: Oracle GoldenGate Veridata" in My Oracle Support Note 1312092.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1312092.1
11
Role-based services
Data Guard broker sending FAN ONS events to JDBC clients
Support for SCAN addresses
While previous versions do not have the above features, it is possible to achieve similar results with manual configuration. For example:
Create triggers that manage stopping and starting a service based on the database role.
Use an external ONS publisher to send FAN events after a failover has occurred.
Create Oracle Net aliases that include all hosts with the potential to become a primary.
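The decision that such a role-change trigger has to make can be sketched as follows. This is a minimal model of the logic only, not trigger code or an Oracle API: the function name, the service names, and the mapping of services to roles are hypothetical, and the role strings are assumed to match the values a database reports for its current role.

```python
def plan_service_actions(database_role, service_roles, running):
    """Return (to_start, to_stop) for the managed services.

    service_roles maps a service name to the database role in which that
    service should be running. Services not listed there are left alone.
    """
    wanted = {name for name, role in service_roles.items() if role == database_role}
    running = set(running)
    to_start = sorted(wanted - running)
    to_stop = sorted(
        name for name in running if name in service_roles and name not in wanted
    )
    return to_start, to_stop


# Hypothetical configuration: one read/write service and one reporting service.
SERVICE_ROLES = {"oltp": "PRIMARY", "reports": "PHYSICAL STANDBY"}
```

After a role transition to primary, a database that was running only the reporting service would start "oltp" and stop "reports" under this mapping.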
The steps for configuring versions earlier than 11.2.0.1 are in the MAA white paper "Client Failover Best Practices for Highly Available Oracle Databases: Oracle Database 10g Release 2" at http://www.oracle.com/goto/maa
Types of Failures
Unplanned failures of an Oracle Database instance fall into the following general categories:
A server failure or other fault that causes the crash of an individual Oracle instance in an Oracle RAC database. To maintain availability, application clients connected to the failed instance must quickly be notified of the failure and immediately establish a new connection to the surviving instances of the Oracle RAC database.
A complete-site failure that results in both the application and database tiers being unavailable. To maintain availability users must be redirected to a secondary site that hosts a redundant application tier and a synchronized copy of the production database.
A partial-site failure where the primary database, a single-instance database, or all nodes in an Oracle RAC database become unavailable but the application tier at the primary site remains intact.
Configure Fast Connection Failover as a best practice to fully benefit from fast instance and database failover and switchover with Oracle RAC and Oracle Data Guard. Fast Connection Failover enables clients, mid-tier applications, or any program that connects directly to a database to fail over quickly and seamlessly to an available database service when a database service becomes unavailable. This chapter includes the following sections:
Configure JDBC and OCI Clients for Failover
Configure Oracle RAC Databases for Failover
Configure the Oracle Data Guard Environment
Client Transition During Switchover Operations
Preventing Login Storms
See Also:
The MAA white paper "Client Failover Best Practices for Highly Available Oracle Databases: Oracle Database 11g Release 2" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
"Application High Availability with Services and FAN" in Oracle Database Administrator's Guide
1. Enable Fast Connection Failover for JDBC clients by setting the DataSource property FastConnectionFailoverEnabled to TRUE.
2. Configure JDBC clients to use a connect descriptor that includes an address list, which in turn includes the SCAN address for each site, and that connects to an existing service.
3. Set the oracle.net.ns.SQLnetDef.TCP_CONNTIMEOUT_STR property in the JDBC client. This property enables the JDBC client to quickly traverse an ADDRESS_LIST in the event of a failure.
4. Configure a remote Oracle Notification Service (ONS) subscription on the JDBC client so that an ONS daemon is not required on the client.
5. By default, the JDBC application randomly picks three hosts from the setONSConfiguration property and creates connections to those three ONS daemons. Change this default so that connections are made to all ONS daemons by setting the following property, when the JDBC application is invoked, to the total number of ONS daemons in the configuration: java -Doracle.ons.maxconnections=4
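Because the value of oracle.ons.maxconnections must equal the total number of ONS daemons, it can help to derive it from the same node list that is passed to setONSConfiguration rather than maintain it by hand. The following is a minimal sketch under stated assumptions: the node list is taken to be a single "nodes=host:port,host:port" string in which every entry has a port, the host names and port are hypothetical, and the helper functions are not part of any Oracle library.

```python
def ons_nodes(ons_configuration):
    """Parse a 'nodes=host:port,host:port' string into (host, port) pairs."""
    key, _, value = ons_configuration.partition("=")
    if key.strip() != "nodes" or not value.strip():
        raise ValueError("expected a 'nodes=host:port,...' string")
    nodes = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        nodes.append((host, int(port)))
    return nodes


def max_connections_flag(ons_configuration):
    """Build the JVM system property that matches the number of ONS daemons."""
    return f"-Doracle.ons.maxconnections={len(ons_nodes(ons_configuration))}"


# Hypothetical two-node primary and two-node standby configuration.
ONS_CONFIG = "nodes=prmy1:6200,prmy2:6200,stby1:6200,stby2:6200"
```

With the four hypothetical daemons above, the derived flag matches the value of 4 used in the example invocation.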
See Also:
Oracle Database Administrator's Guide for more information on Enabling Fast Connection Failover for JDBC Clients
"Application High Availability with Services and FAN" in Oracle Database Administrator's Guide
Fast Connection Failover for OCI Clients
For OCI clients, follow these best practices:
1. Enable Fast Application Notification (FAN) for OCI clients by initializing the environment with the OCI_EVENTS parameter.
2. Link the OCI client applications with the thread library.
3. Set the AQ_HA_NOTIFICATIONS parameter to TRUE and configure the transparent application failover (TAF) attributes for services.
4. Configure an Oracle Net alias that the OCI application uses to connect to the database. The Oracle Net alias should specify both the primary and standby SCAN hostnames. For best performance while creating new connections the Oracle Net alias should have LOAD_BALANCE=OFF for the DESCRIPTION_LIST so that DESCRIPTIONs are tried in an ordered list, top to bottom. With this configuration the second DESCRIPTION is only attempted if all connection attempts to the first DESCRIPTION have failed.
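The shape of the Oracle Net alias described in the last step can be illustrated by generating it from a list of SCAN hostnames. This is a sketch under stated assumptions: the alias name, service name, SCAN hostnames, and port 1521 are hypothetical, and the generated entry contains only the elements named in the text (a DESCRIPTION_LIST with LOAD_BALANCE=OFF and one DESCRIPTION per site). A production alias would normally carry additional timeout and retry settings that are not shown here.

```python
def net_alias(alias, service_name, scan_hosts, port=1521):
    """Render a tnsnames.ora style entry with one DESCRIPTION per SCAN host.

    DESCRIPTIONs are emitted in the order given, so the first host in
    scan_hosts is the one tried first when LOAD_BALANCE is OFF.
    """
    descriptions = []
    for host in scan_hosts:
        descriptions.append(
            "    (DESCRIPTION =\n"
            f"      (ADDRESS = (PROTOCOL = TCP)(HOST = {host})(PORT = {port}))\n"
            f"      (CONNECT_DATA = (SERVICE_NAME = {service_name})))"
        )
    body = "\n".join(descriptions)
    return (
        f"{alias} =\n"
        "  (DESCRIPTION_LIST =\n"
        "    (LOAD_BALANCE = OFF)\n"
        f"{body}\n"
        "  )"
    )


# Hypothetical primary and standby SCAN names, primary listed first.
print(net_alias("SALES", "oltp_svc", ["prmy-scan", "stby-scan"]))
```

Listing the primary SCAN first means the standby DESCRIPTION is reached only after the attempts against the primary site have failed, which is the ordering the step calls for.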
See Also:
Oracle Database Administrator's Guide for more information on Enabling Fast Connection Failover for Oracle Call Interface Clients
Oracle Call Interface Programmer's Guide for more information on Transparent Application Failover in OCI
Section 6.1.1, "Client Configuration and Migration Concepts"
Section 6.2.5, "Connect to Database Using Services and Single Client Access Name (SCAN)"
side callouts to log trouble tickets or page administrators to alert them of a failure. For Up events, when services and instances are started, new connections can be created so the application can immediately take advantage of the extra resources.
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for an Introduction to Automatic Workload Management.
For more information about client failover best practices and details on deploying FAN server side callouts, see the Technical Article, "Automatic Workload Management with Oracle Real Application Clusters 11g Release 2" on the Oracle Technology Network at http://www.oracle.com/technetwork/database/clustering/overview/index.html
http://www.oracle.com/goto/maa
Oracle Database Administrator's Guide for information on Creating and Deleting Database Services with SRVCTL
Oracle Database Administrator's Guide for information About Automatic Startup of Database Services
Oracle Clusterware must be installed and active on the primary and standby sites for both single instance (using Oracle Restart) and Oracle RAC databases. Oracle Data Guard broker coordinates with Oracle Clusterware to properly fail over role-based services to a new primary database after a Data Guard failover has occurred.
See Also: Section 8.3.1, "Use Oracle Data Guard Broker with Oracle Data Guard"
There are no additional considerations for switchovers using Oracle Active Data Guard.
The following steps describe the additional manual switchover steps for Oracle Data Guard 11g Release 2:
1. The primary database is converted to a standby database. This disconnects all sessions and brings the database to the mount state. Oracle Data Guard Broker shuts down any read/write services.
2. Client sessions receive an ORA-3113 and begin going through their retry logic (TAF for OCI and application code logic for JDBC).
3. The standby database is converted to a primary database and any existing sessions are disconnected. Oracle Data Guard Broker shuts down read-only services.
4. Read-only connections receive an ORA-3113 and begin going through their retry logic (TAF for OCI and application code logic for JDBC).
5. As the new primary and the new standby are opened, the respective services are started for each role and clients performing retries now see the services available and connect.
1. Ensure that the proper reconnection logic has been configured (for more information, see Section 11.1, "Configure JDBC and OCI Clients for Failover" and Section 11.2, "Configure Oracle RAC Databases for Failover"). For example, configure TAF and RETRY_COUNT for OCI applications and code retry logic for JDBC applications.
2. Stop the services that the primary application uses and the read-only applications enabled on the standby database.
3. Disconnect or shut down the primary and read-only application sessions.
4. Once the switchover has completed, restart the services used by the primary application and the read-only application.
5. Sessions that were terminated reconnect once the service becomes available as part of the retry mechanism.
Note that FAN is not needed to transition clients during a switchover operation if the application performs retries. FAN is only needed to break clients out of TCP timeout, a state that should only occur during unplanned outages.
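The application-level retry behavior that lets clients ride through a switchover can be sketched as follows. This is a minimal illustration, not Oracle's TAF or JDBC implementation: the function name, its parameters and default values, and the use of Python's built-in ConnectionError to stand in for a disconnect such as ORA-3113 are all assumptions made for the example.

```python
import time


def connect_with_retry(connect, retry_count=20, retry_delay=3.0, sleep=time.sleep):
    """Call connect() until it succeeds or the retries are exhausted.

    connect     -- callable that returns a session or raises ConnectionError
    retry_count -- number of retries after the first failed attempt
    retry_delay -- seconds to wait between attempts
    sleep       -- injectable so the wait can be replaced in tests
    """
    last_error = None
    for attempt in range(retry_count + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < retry_count:
                sleep(retry_delay)
    raise last_error
```

The retry window (retry_count multiplied by retry_delay) should be longer than the time the services are expected to be unavailable during the switchover, otherwise clients give up before the services are restarted under the new roles.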
See Also:
Implement the Connection Rate Limiter
The primary method of controlling login storms is to implement the Connection Rate Limiter feature of the Oracle listener. This feature limits the number of connections that can be processed per second. Slowing down the rate of connections ensures that CPU resources remain available and that the system remains responsive.
Configure Oracle Database for shared server operations
In addition to implementing the Connection Rate Limiter, some applications can control login storms by configuring Oracle Database for shared server operations. By using shared server, the number of processes that must be created at failover time is greatly reduced, thereby avoiding a login storm.
Adjust the maximum number of connections in the mid tier connection pool
If such a capability is available in your application mid tier, try limiting the number of connections by adjusting the maximum number of connections in the mid tier connection pool.
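The effect of rate limiting on a login storm, as described above for the listener's Connection Rate Limiter, can be illustrated with a simple model. This sketch is not the listener's implementation and uses no Oracle parameters; the function name and the numbers are hypothetical. It shows only how a burst of pending connection requests is spread over time when a fixed number is accepted per second.

```python
def drain_login_storm(pending, max_per_second):
    """Return the number of connections accepted in each one-second interval."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    schedule = []
    while pending > 0:
        accepted = min(pending, max_per_second)
        schedule.append(accepted)
        pending -= accepted
    return schedule


# A hypothetical burst of 25 reconnecting sessions limited to 10 per second.
print(drain_login_storm(25, 10))  # prints [10, 10, 5]
```

The trade-off is visible in the length of the returned list: a lower rate protects CPU on the database server but lengthens the time until the last client is reconnected.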
See Also:
Oracle Database Administrator's Guide for more information about configuring and controlling shared server operations
The "Oracle Net Listener Connection Rate Limiter" white paper for information about the Connection Rate Limiter at http://www.oracle.com/technetwork/database/enterprise-edition/oraclenetservices-connectionratelim133050.pdf
The "Best Practices for Optimizing Availability During Unplanned Outages Using Oracle Clusterware and Oracle Real Application Clusters" white paper for information and examples about listener connection rate throttling from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Application support
PeopleSoft PeopleTools version 8.50.09 and higher supports FAN. This enables PeopleSoft applications to automatically fail over database connections to a surviving instance in an Oracle RAC cluster or to a new primary database in an Oracle Data Guard configuration should the database connection be lost. In the event of an Oracle RAC instance failure, primary database failure, or a shutdown/restart of the Oracle Database, PeopleSoft servers and clients continue running and users are not required to log in a second time.
In Oracle WebLogic Server 10.3.4, a single data source implementation has been introduced to support an Oracle RAC cluster. It responds to FAN events to provide Fast Connection Failover (FCF), Runtime Connection Load-Balancing (RCLB), and Oracle RAC instance graceful shutdown. XA affinity is supported at the global transaction ID level. The new feature is called WebLogic Active GridLink for RAC, which is implemented as the GridLink data source within Oracle WebLogic Server.
For applications that do not support FAN events, including a number of applications from Oracle (for example, Siebel and Oracle E-Business Suite), complete all of the steps described in this section for the fastest client failover possible. Even though FAN events cannot be used in such cases, applications can still be configured for efficient failover by using timeouts and application retries. For more information see the MAA white paper "Client Failover Best Practices for Highly Available Oracle Databases: Oracle Database 11g Release 2" at http://www.oracle.com/goto/maa
12
This chapter provides best practices for monitoring your system using Enterprise Manager and for monitoring and maintaining a highly available environment across all tiers of the application stack. This chapter includes the following sections:
Overview of Monitoring and Detection for High Availability
Using Enterprise Manager for System Monitoring
Managing the High Availability Environment with Enterprise Manager
Using Cluster Health Monitor
more information, Section 12.3.3, "Manage Database Availability with the High Availability Console"). Each target type has a default generated home page that displays a summary of relevant details for a specific target. You can group different types of targets by function; that is, as resources that support the same application.
Every target is monitored by an Oracle Management Agent. Every Management Agent runs on a system and is responsible for a set of targets. The targets can be on a system that is different from the one that the Management Agent is on. For example, a Management Agent can monitor a storage array that cannot host an agent natively. When a Management Agent is installed on a host, the host is automatically discovered along with other targets that are on the machine.
Moreover, to help you implement the Maximum Availability Architecture (MAA) best practices, Enterprise Manager provides the MAA Advisor (for more information, see Section 12.3.4, "Configure High Availability Solutions with MAA Advisor"). The MAA Advisor page recommends Oracle solutions for most outage types and describes the benefits of each solution.
In addition to monitoring infrastructure with Enterprise Manager in the Oracle HA environment, Oracle Auto Service Request (ASR) can be used to resolve problems faster by using auto-case generation for Oracle's Sun server and storage systems when specific hardware faults occur. For more information, see "Oracle Auto Service Request" in My Oracle Support Note 1185493.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1185493.1
See Also: Oracle Enterprise Manager Concepts for information on Enterprise Manager Architecture and the Oracle Management Agent
A snapshot of the current availability of all targets. The All Targets Status pie chart gives the administrator an immediate indication of any target that is Available (Up), unavailable (Down), or has lost communication with the console (Unknown).
An overview of how many alerts and problems (for jobs) are known in the entire monitored system. You can display detailed information by clicking the links, or by navigating to the Alerts tab from any Enterprise Manager page.
A view of the severity and total number of policy violations for all managed targets. Drill down to determine the source and type of violation.
All Targets Jobs lists the number of scheduled, running, suspended, and problem (stopped/failed) executions for all Enterprise Manager jobs. Click the number next to the status group to view a list of those jobs.
An overview of what is actually discovered in the system. This list can be shown at the hardware level and the Oracle level.
Alerts are generated by a combination of factors and are defined on specific metrics. A metric is a data point sampled by a Management Agent and sent to the Oracle Management Repository. An alert could be the availability of a component through a simple heartbeat test, or an evaluation of a specific performance measurement such as "disk busy" or percentage of processes waiting for a specific wait event. There are four states that can be checked for any metric: error, warning, critical, and clear. The administrator must make policy decisions such as:
What objects should be monitored (databases, nodes, listeners, or other services)?
What instrumentation should be sampled (such as availability, CPU percent busy)?
How frequently should the metric be sampled?
What should be done when the metric exceeds a predefined threshold?
All of these decisions are predicated on the business needs of the system. For example, all components might be monitored for availability, but some systems might be monitored only during business hours. Systems with specific performance problems can have additional performance tracing enabled to debug a problem.
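The relationship between a sampled metric, its thresholds, and the resulting alert state can be sketched as follows. This is a simplified model and not Enterprise Manager's evaluation logic: the function name is hypothetical, only the warning, critical, and clear states are modeled (the error state is omitted), and the thresholds of 85 and 97 are borrowed from the tablespace space usage recommendation given later in this chapter. The occurrences parameter models the number of consecutive samples that must violate a threshold before an alert is raised.

```python
def current_alert(samples, warning, critical, occurrences=1):
    """Return 'critical', 'warning', or 'clear' for a series of metric samples.

    A severity is reported only when each of the most recent `occurrences`
    samples is greater than the corresponding threshold.
    """
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    recent = samples[-occurrences:]
    if len(recent) < occurrences:
        return "clear"  # not enough samples collected yet
    if all(value > critical for value in recent):
        return "critical"
    if all(value > warning for value in recent):
        return "warning"
    return "clear"


# Tablespace used (%) sampled three times, two consecutive violations required.
print(current_alert([80, 90, 98], warning=85, critical=97, occurrences=2))  # prints warning
```

Requiring more than one occurrence trades a slower alert for fewer false alarms on metrics that spike briefly, which is one of the policy decisions the administrator has to make for each monitored component.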
See Also: Oracle Enterprise Manager Concepts for more information about monitoring and using metrics in Enterprise Manager
Modify each rule for high-value components in the target architecture to suit your availability requirements by using the rules modification wizard. For the database rule, set the metrics in Table 12-1, Table 12-2, and Table 12-3 for each target. The frequency of the monitoring is determined by the service-level agreement (SLA) for each component.
Use Beacon functionality to track the performance of individual applications. A Beacon can be set to perform a user transaction representative of normal application work. Enterprise Manager can then break down the response time of that transaction into its component pieces for analysis. In addition, an alert can be triggered if the execution time of that transaction exceeds a predefined limit.
Add Notification Methods and use them in each Notification Rule. By default, the easiest method for alerting an administrator to a potential problem is to send e-mail. Supplement this notification method by adding a callout to an SNMP trap or operating system script that sends an alert by some method other than e-mail. This avoids problems that might occur if a component of the e-mail system fails. Set additional Notification Methods by using the Setup link at the top of any Enterprise Manager page.
Modify Notification Rules to notify the administrator when there are errors in computing target availability. This might generate a false positive reading on the availability of the component, but it ensures the highest level of notification to system administrators.
See Also:
Oracle Enterprise Manager Concepts for conceptual information about Beacons
Oracle Enterprise Manager Advanced Configuration for information about configuring service tests and Beacons
Figure 12-2 shows the Edit Notification Rule property page for choosing availability states, with the Down option chosen.
Figure 12-2 Setting Notification Rules for Availability
In addition, ensure that the metrics listed in Table 12-1, Table 12-2, and Table 12-3 are added to the database notification rule. Configure those metrics using the Metrics and Policy Settings page, which can be accessed from the Related Links section of the Database Homepage.
Use the metrics shown in Table 12-1 to monitor space management conditions that have the potential to cause a service outage.
Table 12-1 Recommendations for Monitoring Space

Set this database-level metric to check the Available Space Used (%) for each tablespace. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. This metric enables the administrator to choose the threshold percentages that Enterprise Manager tests against, and the number of samples that must occur in error before a message is generated and sent to the administrator. If the percentage of used space is greater than the values specified in the threshold arguments, then a warning or critical alert is generated. The recommended default settings are 85% for a warning and 97% for a critical space usage threshold, but you should adjust these values appropriately, depending on system usage. Also, you can customize this metric to monitor specific tablespaces.
Note: There is an Enterprise Manager Job in the Job Library named DISABLE TABLESPACE USED (%) ALERTS FOR UNDO AND TEMP TABLESPACES. Use this Job to disable alerts for all UNDO and TEMP tablespaces. This job is useful if you encounter too many alerts on TEMP and UNDO tablespaces.

Set this metric to monitor the dump directory destinations. Dump space must be available so that the maximum amount of diagnostic information is saved the first time an error occurs. The recommended default settings are 70% for a warning and 90% for an error, but these should be adjusted depending on system usage. Set this metric in the Dump Area metric group.

This is a database-level metric that is evaluated by the server every 15 minutes or during a file creation, whichever occurs first. The metric is also printed in the alert log. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. The Critical Threshold is set for < 3% and the Warning Threshold is set for < 15%. You cannot customize these thresholds. An alert is returned the first time the alert occurs, and the alert is not cleared until the available space rises above 15%.

By default, this metric monitors the root file system per host. The default warning level is 20% and the critical warning is 5%.

Set this metric to return the percentage of space used on the archive area destination. If the space used is more than the threshold value given in the threshold arguments, then a warning or critical alert is generated. If the database is not running in ARCHIVELOG mode, this metric fails to register. The default warning threshold is 80%, but consider using 70% full to send a warning or 90% for the critical threshold.
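The tablespace recommendation above combines two ideas: warning and critical percentages, and a number of samples that must be in error before a message is sent. The following sketch shows one way that logic could be expressed, using the 85% and 97% defaults quoted in the text. It is an assumption-laden illustration: the function name and the exact "consecutive samples" interpretation are hypothetical and are not taken from Enterprise Manager.

```python
from typing import Iterable, Optional


def evaluate_used_pct(samples: Iterable[float],
                      warning: float = 85.0,
                      critical: float = 97.0,
                      occurrences: int = 1) -> Optional[str]:
    """Return 'critical', 'warning', or None after the last sample.

    A severity is reported only once `occurrences` consecutive samples
    have exceeded its threshold (an assumed reading of "number of samples
    that must occur in error").
    """
    crit_run = 0
    warn_run = 0
    state: Optional[str] = None
    for pct in samples:
        crit_run = crit_run + 1 if pct > critical else 0
        warn_run = warn_run + 1 if pct > warning else 0
        if crit_run >= occurrences:
            state = "critical"
        elif warn_run >= occurrences:
            state = "warning"
        else:
            state = None
    return state
```

Requiring several consecutive samples trades a slower alert for fewer false alarms on short-lived spikes, which is why the thresholds and the sample count are best tuned together.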
In Enterprise Manager 11g the mechanism for monitoring the Database Alert Log is tightly integrated with the Support Workbench, with the benefits of being able to generate packages for each problem or incident reported and quickly upload them to support. As part of integrating with the Support Workbench, errors are categorized into different classes and groups, each served by a separate metric. At the highest level of categorization there are two different classes of errors: incidents and operational errors.
Incidents are errors that are recorded in the database alert log file, which signify that the database being monitored has detected a critical error condition. For example, a critical error condition could be a generic internal error or an access violation.
Operational Errors are errors that are recorded in the database alert log file, which signify that the database being monitored has detected an error that may affect the operation of the database. For example, an operational error could be an indication that the archiver is hung or a media failure.
Configure the metrics that raise alerts for errors reported in the Alert Log as shown in Table 12-2.
Note: For more information on the Alert Log metrics in Table 12-2 and Alert Log Monitoring for 11g database targets in Enterprise Manager, see "Monitoring 11g Database Alert Log Errors in Enterprise Manager" in My Oracle Support Note 949858.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=949858.1
Table 12-2 Recommendations for Monitoring the Alert Log

For each metric, set the Critical threshold to 0 to ensure an alert is raised each time one or more of the associated errors have been reported in the Alert Log since the last time the metric was collected. The error groups, one per metric, are:

ORA-0600
ORA-7445 or ORA-3113
ORA-603
ORA-4030 or ORA-4031
ORA-353, ORA-355, or ORA-356
ORA-1410 or ORA-8103
ORA-4020. Note: This metric does not raise alerts when application level deadlocks (ORA-0060 errors) are reported in the Alert Log.
ORA-604
ORA-29740
ORA-1578, ORA-1157, or ORA-27048
ORA-1242 or ORA-1243 (Media Failure Status)
ORA-700, ORA-255, ORA-239, ORA-601, ORA-602, ORA-240, ORA-494, ORA-3137, ORA-202, ORA-214, ORA-227, ORA-1103, ORA-312, ORA-313, ORA-1110, ORA-1542, ORA-32701, ORA-32703, or ORA-12751
Generic operational errors
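To make the "Critical threshold of 0" rule concrete, the sketch below counts ORA- errors per group in a batch of alert log lines and reports every group whose count exceeds the threshold. It is a simplified stand-in for what a metric collector might do, not the Enterprise Manager implementation. The group names in ERROR_GROUPS are hypothetical labels chosen for illustration, and only a subset of the error numbers listed above is included.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set

# Hypothetical group labels; the error numbers come from the list above.
ERROR_GROUPS: Dict[str, Set[int]] = {
    "generic_internal_error": {600},
    "access_violation": {7445, 3113},
    "session_terminated": {603},
    "out_of_memory": {4030, 4031},
    "redo_log_corruption": {353, 355, 356},
    "data_block_corruption": {1578, 1157, 27048},
    "media_failure": {1242, 1243},
}

# Matches ORA-600 as well as the zero-padded form ORA-00600.
ORA_PATTERN = re.compile(r"ORA-(\d{1,5})")


def count_errors(lines: Iterable[str]) -> Counter:
    """Count alert log errors per group for one collection interval."""
    counts: Counter = Counter()
    for line in lines:
        for match in ORA_PATTERN.finditer(line):
            code = int(match.group(1))
            for group, codes in ERROR_GROUPS.items():
                if code in codes:
                    counts[group] += 1
    return counts


def critical_groups(counts: Counter, critical_threshold: int = 0) -> List[str]:
    """Return the groups whose error count exceeds the critical threshold."""
    return sorted(g for g, n in counts.items() if n > critical_threshold)
```

With the threshold at 0, a single matching error in the interval is enough to raise the alert, which matches the recommendation for every metric in the table.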
Monitor the system to ensure that the processing capacity is not exceeded. The warning and critical thresholds for these metrics should be modified based on the usage pattern of the system, following the recommendations in Table 12-3.

Table 12-3 Recommendations for Monitoring Processing Capacity

Process limit: Set thresholds for this metric to warn if the number of current processes approaches the value of the PROCESSES initialization parameter.
Session limit: Set thresholds for this metric to warn if the instance is approaching the maximum number of concurrent connections allowed by the database.
Figure 12-3 shows the Metric and Policy settings page for setting and editing metrics. The online help contains complete reference information for every metric. To access reference information for a specific metric, use the online help search feature.
See Also:
Oracle Database 2 Day DBA for information about setting up notification rules and metric thresholds
Oracle Database 2 Day DBA for more on Viewing Problems Using the Enterprise Manager Support Workbench
Oracle Enterprise Manager Oracle Database and Database-Related Metric Reference Manual for information about available metrics
12.2.3 Use Database Target Views to Monitor Health, Availability, and Performance
The Database target Home page in Figure 12-4 shows system performance, space usage, and the configuration of important availability components such as the percentage space used in the Fast Recovery Area, and Flashback Database Logging status (the Fast Recovery Area is labeled Flash Recovery Area (%) in Enterprise Manager 11g). You can see the most recent alerts for the target under the Alerts table, as shown in Figure 12-4. You can access further information about alerts by clicking the links in the Message column.
Performance Analysis and Performance Baseline
Many of the metrics for Database targets in Enterprise Manager pertain to performance. A system that is not meeting performance service-level agreements is not meeting High Availability system requirements. While performance problems seldom cause a major system outage, they can still cause an outage to a subset of customers. Outages of this type are commonly referred to as application service brownouts. The primary cause of brownouts is the intermittent or partial failure of one or more infrastructure components. IT managers must be aware of how the infrastructure components are performing (their response time, latency, and availability), and how they are affecting the quality of application service delivered to the end user.
A performance baseline, derived from normal operations that meet the service-level agreement, should determine what constitutes a performance metric alert. Baseline data should be collected from the first day that an application is in production and should include the following:
Application statistics (transaction volumes, response time, Web service times)
Database statistics (transaction rate, redo rate, hit ratios, top 5 wait events, top 5 SQL transactions)
Operating system statistics (CPU, memory, I/O, network)
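As a rough illustration of how a baseline can decide what counts as a performance alert, the sketch below flags a current reading that strays too far from baseline samples collected during normal operation. The standard-deviation rule and the max_sigma default are assumptions made for this example; they are a generic statistical approach and are not the method that Enterprise Manager or AWR baselines use.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(baseline: Sequence[float],
                           current: float,
                           max_sigma: float = 3.0) -> bool:
    """Return True when `current` is more than `max_sigma` standard
    deviations away from the mean of the baseline samples.

    `baseline` needs at least two samples taken while the system was
    meeting its service-level agreement.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > max_sigma * sigma
```

The same check could be applied to any of the statistics listed above, for example a transaction rate or a response time, as long as the baseline samples come from comparable periods.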
You can use Enterprise Manager to capture a baseline snapshot of database performance and create an Automatic Workload Repository (AWR) baseline. Enterprise Manager compares these values against system performance and displays the result on the database Target page. Enterprise Manager can also send alerts if the values deviate too far from the established baseline. See "Use Automatic Performance Tuning Features" on page 5-11 for more information about Automatic Workload Repository.
Set the database notification rule to capture the metrics listed in Table 12-4 for all database targets.
Table 12-4 Recommendations for Performance Related Metrics

I/O Requests (per second) (Level: Instance): This metric represents the total rate of I/O read and write requests for the database. It sends an alert when the number of operations exceeds a user-defined threshold. Use this metric with operating system-level metrics that are also available with Enterprise Manager. Set this metric based on the total I/O throughput available to the system, the number of I/O channels available, network bandwidth (in a SAN environment), the effects of the disk cache if you are using a storage array device, and the maximum I/O rate and number of spindles available to the database.

Database CPU Time (%) (Level: Instance): This metric represents the percentage of database call time that is spent on the CPU. It can be used to detect a change in the operation of a system, for example, a drop in Database CPU time from 50% to 25%. The Consecutive Number of Occurrences Preceding Notification column indicates the consecutive number of times the comparison against thresholds should hold TRUE before an alert is generated. This usage might be normal at peak periods, but it might also be an indication of a runaway process or of a potential resource shortage.

Wait Time (%) (Level: Instance): Excessive idle time indicates that a bottleneck for one or more resources is occurring. Set this instance-level metric based on the system wait time when the application is performing as expected.

This metric reports network traffic that Oracle generates. This metric can indicate a potential network bottleneck. Set this metric based on actual usage during peak periods.

For UNIX-based systems, this metric represents the number of pages paged in (read from disk to resolve fault memory references) per second. This metric checks the number of pages paged in for the CPU(s) specified by the Host CPU(s) parameter, such as cpu_stat0 or * (for all CPUs on the system). For Microsoft Windows, this metric is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation.

Run Queue Length (Level: Host): For UNIX-based systems, the Run Queue Length metrics represent the average number of processes in memory and subject to be run in the last interval (1 minute average, 5 minute average, and 15 minute average). It is recommended to alert when Run Queue Length = # of CPUs. (An alternative way to do this is to monitor the Load Average metric and compare it to Maximum CPU.) This metric is not available on Microsoft Windows.
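The Run Queue Length recommendation (alert when the run queue reaches the number of CPUs, or compare the load average with the CPU count) is easy to prototype outside Enterprise Manager. The sketch below uses the operating system load average as a stand-in for the run queue, which is an approximation. The function names are hypothetical, and os.getloadavg is available only on UNIX-like systems, which is consistent with the note that the metric is not available on Microsoft Windows.

```python
import os
from typing import Optional


def run_queue_alert(load_average: float, cpu_count: int) -> bool:
    """Return True when the load average has reached the number of CPUs."""
    return load_average >= cpu_count


def current_run_queue_alert() -> Optional[bool]:
    """Evaluate the rule on this host using the 1 minute load average.

    Returns None where the load average or CPU count is not available.
    """
    cpus = os.cpu_count()
    if cpus is None or not hasattr(os, "getloadavg"):
        return None
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return run_queue_alert(one_minute, cpus)
```

Using the 5 or 15 minute average instead of the 1 minute average would make the check less sensitive to short bursts.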
See Also:
Oracle Database Performance Tuning Guide for more information about performance monitoring
Oracle Database 2 Day DBA for more information about monitoring and tuning using Enterprise Manager
Check Enterprise Manager Policy Violations Use Enterprise Manager to Manage Oracle Patches and Maintain System Baselines Manage Database Availability with the High Availability Console Configure High Availability Solutions with MAA Advisor
Figure 12-5 Database Home Page with Targets Showing Policy Violations
To see more details on violations, select a link in the Policy Violations area. Figure 12-6 shows the Policy Trend Overview page.
To see Policy Violations, select Violations from the Compliance tab, as shown in Figure 12-7.
See Also: Oracle Enterprise Manager Policy Reference Manual for definitions of existing policies
12.3.2 Use Enterprise Manager to Manage Oracle Patches and Maintain System Baselines
For any monitored system in the application environment, you can use Enterprise Manager to download and manage patches from My Oracle Support at https://support.oracle.com/. You can set up a job to routinely check for patches that are relevant to the user environment. Those patches can be downloaded and stored directly in the Management Repository. Patches can be staged from the Management Repository to multiple systems and applied during maintenance windows. You can examine patch levels for one system and compare them between systems in either a one-to-one or one-to-many relationship. In this case, a system can be identified as a baseline and used to demonstrate maintenance requirements in other systems. This can be done for operating system patches and database patches.
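The one-to-many comparison against a baseline system amounts to a set difference per system. The sketch below shows that idea with made-up data; the function name, the system names, and the patch identifiers are all hypothetical, and the code does not call any Enterprise Manager interface.

```python
from typing import Dict, List, Set


def missing_patches(baseline: Set[str],
                    systems: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each system, list the baseline patches it has not applied.

    Systems that already match or exceed the baseline are omitted.
    """
    report: Dict[str, List[str]] = {}
    for name, applied in systems.items():
        gap = baseline - applied
        if gap:
            report[name] = sorted(gap)
    return report


# Hypothetical patch identifiers for illustration only.
baseline_patches = {"P100", "P200", "P300"}
fleet = {
    "db01": {"P100", "P200", "P300"},
    "db02": {"P100"},
    "db03": {"P100", "P300", "P400"},
}
maintenance_needed = missing_patches(baseline_patches, fleet)
```

Here maintenance_needed lists only db02 and db03, because db01 already matches the baseline; an extra patch such as the one on db03 is ignored.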
See Also:
Oracle Enterprise Manager Administrator's Guide for Software and Server Provisioning and Patching for information on Patching Using My Oracle Support
Oracle Enterprise Manager Administrator's Guide for Software and Server Provisioning and Patching for information on Patching Oracle Database
Section 14.2, "Eliminating or Reducing Downtime for Scheduled Outages"
Display high availability events including events from related targets such as standby databases
View the high availability summary that includes the status of the database
View the last backup status
View the Fast Recovery Area Usage, if configured
If Oracle Data Guard is configured: View the Data Guard summary, set up Data Guard standby databases for any database target, manage switchover and failover of database targets other than the database that contains the Management Repository, and monitor the health of a Data Guard configuration at a glance
If Oracle RAC is configured: View the Oracle RAC Services summary including Top Services
Note:
Oracle Enterprise Manager Database Control uses the name Fast Recovery Area for the renamed Flash Recovery Area. In places, the HA Console and Enterprise Manager use the name Flash Recovery Area. For more information on the Fast Recovery Area, see Section 5.1.3, "Use a Fast Recovery Area".
Figure 12-8 shows the HA Console. This figure shows summary information, details, and historical statistics for the primary database and shows the standby databases for the primary target, various Data Guard standby performance metrics and settings, and the data protection mode.
In Figure 12-8, the Availability Summary shows that the primary database is up and its availability is currently 100%. The Availability Summary also shows Oracle ASM instances status. The Availability Events table shows specific high availability events (alerts). You can click the message to obtain more details (or to suppress the event). To set up, manage, and configure a specific solution area for this database, under Availability Summary, next to MAA Advisor, click Details to go to the Maximum Availability Architecture (MAA) Advisor page (described in more detail in Section 12.3.4, "Configure High Availability Solutions with MAA Advisor").
The Backup/Recovery Summary area displays the Last Backup and Next Backup information. The Fast Recovery Area Usage chart indicates about 83% of the fast recovery area is currently used. The Used (Non-reclaimable) Fast Recovery Area (%) chart shows the usage over the last 2 hours. You can click the chart to display the page with the metric details.
The Data Guard Summary area shows the primary database is running in Maximum Availability mode and has Fast-Start Failover enabled. You can click the link next to Protection Mode to modify the data protection mode. In the Standby Databases table, the physical standby database (north) is caught up with the primary database (the Apply/Transport Lag metrics are showing 0 seconds), and the Used Fast Recovery Area (FRA) is 16.02%. The Primary Database Redo Rate chart shows the redo trend over the past 2 hours. Note that if Data Guard is not configured, the "Switch To" box in the corner of the console is not displayed.
Figure 12-9 shows information similar to Figure 12-8, but for the standby database (north), which is a physical standby database running real-time query. In the Standby Databases table, the Apply/Transport Lag metrics indicate that the physical standby database is caught up with the primary database, and the Used Fast Recovery Area (FRA) is 16%. Note that if Data Guard is not configured, the "Switch To" box in the corner of the console is not displayed.
Figure 12-9 Monitoring the Standby Database in the High Availability Console
Figure 12-10 shows sample values for Services Summary and Services Details areas. These areas show summary and detail information on Oracle RAC Services, including Top Services and problem services.
Figure 12-10 Monitoring the Cluster in the High Availability Console Showing Services
See Also: Oracle Enterprise Manager Concepts for information on Database Management
View recommended Oracle solutions for each outage type (site failures, computer failures, storage failures, human errors, and data corruptions)
View the configuration status and use the links in the Oracle Solution column to go to the Enterprise Manager page where the solution can be configured
Understand the benefits of each solution
Link to the MAA Web site for white papers, documentation, and other information
The MAA Advisor page contains a table that lists the outage type, Oracle solutions for each outage, configuration status, and benefits. The MAA Advisor allows you to view High Availability solutions in the following ways:
Primary Database Recommendations Only: This condensed view shows only the recommended solutions (the default view) for the primary database.
All Solutions: This expanded view shows all configuration recommendations and status for all primary and standby databases in this configuration. It includes an extra column, Target Name:Role, that provides the database name and shows the role (Primary, Physical Standby, or Logical Standby) of the database.
Figure 12-11 shows an example of the MAA Advisor page with the Show All Solutions view selected.
Figure 12-11 Maximum Availability Architecture (MAA) Advisor Page in Enterprise Manager
You can click the link in the Oracle Solution column to go to a page where you can set up, manage, and configure the specific solution area. Once a solution has been configured, click Refresh to update the configuration status. Once the page is refreshed, click Advisor Details on the Console page to see the updated values.
See:
Oracle Clusterware Administration and Deployment Guide for an Overview of Managing Oracle Clusterware Environments and for more information on Cluster Health Monitor (CHM)
Overview of Unscheduled Outages Recovering from Unscheduled Outages Restoring Fault Tolerance
See Also: Chapter 14, "Reducing Downtime for Planned Maintenance" for information about scheduled outages
Your monitoring and high availability infrastructure should provide rapid detection and recovery from downtime. Chapter 12, "Monitoring for High Availability" describes detection, while this chapter focuses on reducing downtime.
Table 13-1 Recovery Times and Steps for Unscheduled Outages on the Primary Site

The table compares four configurations: Oracle Database 11g; Oracle Database 11g with Oracle RAC and Oracle Clusterware; Oracle Database 11g with Data Guard (footnote 1); and Oracle Database 11g MAA.

Outage scope: site failure
Oracle Database 11g: Hours to days
Oracle Database 11g with Oracle RAC and Oracle Clusterware: Hours to days
Oracle Database 11g with Data Guard:
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.1, "Complete Site Failover (Failover to Secondary Site)"
3. Section 13.2.4, "Application Failover"
Oracle Database 11g MAA:
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.1, "Complete Site Failover (Failover to Secondary Site)"
3. Section 13.2.4, "Application Failover"

Outage scope: clusterwide failure
Oracle Database 11g: Not applicable
Oracle Database 11g with Oracle RAC and Oracle Clusterware: Hours to days
1. Restore cluster or restore at least one node.
2. Optionally restore from tape backups if the data is lost or corrupted.
3. Recover database.
Oracle Database 11g with Data Guard: Not applicable
Oracle Database 11g MAA: Seconds to 5 minutes
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.4, "Application Failover"

Outage scope: computer failure (node)
Oracle Database 11g: Minutes to hours (footnote 3)
1. Restart node and restart database with Oracle Restart. See Oracle Database Administrator's Guide.
2. Reconnect users.
Oracle Database 11g with Oracle RAC and Oracle Clusterware: No downtime (footnote 4). Managed automatically by Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)"
Oracle Database 11g with Data Guard: Seconds to 5 minutes (footnote 3)
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.4, "Application Failover"
Oracle Database 11g MAA: No downtime (footnote 4). Managed automatically by Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)"

Outage scope: computer failure (instance)
Oracle Database 11g: Minutes (footnote 3)
Oracle Database 11g with Oracle RAC and Oracle Clusterware: No downtime (footnote 4). Managed automatically by Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)"
Oracle Database 11g with Data Guard: Minutes (footnote 3), or Seconds to 5 minutes (footnote 2)
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.4, "Application Failover"
Oracle Database 11g MAA: No downtime (footnote 4). Managed automatically by Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)"

Outage scope: storage failure
All four configurations: No downtime (footnote 5). Section 13.2.5, "Oracle ASM Recovery After Disk and Storage Failures"

Outage scope: data corruption
Oracle Database 11g: Minutes to hours. Section 13.2.6, "Recovering from Data Corruption"
Oracle Database 11g with Oracle RAC and Oracle Clusterware: Minutes to hours. Section 13.2.6, "Recovering from Data Corruption"
Oracle Database 11g with Data Guard: Possible no downtime with Active Data Guard: Section 13.2.6.2, "Use Active Data Guard". Otherwise, seconds to 5 minutes:
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.4, "Application Failover"
Oracle Database 11g MAA: Possible no downtime with Active Data Guard: Section 13.2.6.2, "Use Active Data Guard". Otherwise, seconds to 5 minutes:
1. Section 13.2.2, "Database Failover with a Standby Database"
2. Section 13.2.4, "Application Failover"

Outage scope: human error
All four configurations: < 30 minutes (footnote 6). Section 13.2.7, "Recovering from Human Error (Recovery with Flashback)"

Outage scope: hang or slow down
Oracle Database 11g with Oracle RAC and Oracle Clusterware: customized and configurable (footnote 7). Section 13.2.4, "Application Failover"
Oracle Database 11g with Data Guard: customized and configurable (footnote 8). Section 13.2.4, "Application Failover"
Oracle Database 11g MAA: customized and configurable (footnotes 7 and 8). Section 13.2.4, "Application Failover"
1 While Data Guard physical replication is the most common data protection and availability solution used for Oracle Database, there are cases where active-active logical replication may be preferred, especially when control over the application makes it possible to implement. You may use Oracle GoldenGate in place of Data Guard for these requirements. See the topic, "Oracle Active Data Guard and Oracle GoldenGate" for additional discussion of the trade-offs between physical and logical replication at http://www.oracle.com/technetwork/database/features/availability/dataguardgoldengate-096557.html
2 Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.
3 Recovery time consists largely of the time it takes to restart the failed system.
4 Database is still available, but the portion of the application connected to the failed system is temporarily affected.
5 Storage failures are prevented by using Oracle ASM with mirroring and its automatic rebalance capability.
6 Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, then it typically only requires seconds to flash back the appropriate transactions, if properly rehearsed. Referential or integrity constraints must be considered.
7 Oracle Enterprise Manager or a customized application heartbeat can be configured to detect application or response time slowdown and react to these SLA breaches. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain threshold expires, Enterprise Manager can alert and possibly restart the database.
8 Oracle Enterprise Manager or a customized application heartbeat can be configured to detect application or response time slowdown and react to these SLA breaches. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain threshold expires, Enterprise Manager can call the Oracle Data Guard DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure to initiate a failover.
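The customized application heartbeat mentioned in the notes above can be sketched as a small state machine: count consecutive response-time breaches and invoke a reaction once a configured number is reached. This is an assumption-based illustration. "After a certain threshold expires" is interpreted here as a run of consecutive slow responses, and on_breach is a hypothetical hook; in a real deployment it might restart the database or request a failover through DBMS_DG.INITIATE_FS_FAILOVER, but no database call is made in this sketch.

```python
from typing import Callable


class ResponseTimeWatch:
    """Track response times and react after repeated SLA breaches."""

    def __init__(self, limit_seconds: float, max_breaches: int,
                 on_breach: Callable[[], None]) -> None:
        self.limit_seconds = limit_seconds
        self.max_breaches = max_breaches
        self.on_breach = on_breach  # hypothetical reaction hook
        self.breaches = 0

    def record(self, response_seconds: float) -> bool:
        """Record one heartbeat; return True if the reaction was triggered."""
        if response_seconds > self.limit_seconds:
            self.breaches += 1
        else:
            # A healthy response resets the run of breaches.
            self.breaches = 0
        if self.breaches >= self.max_breaches:
            self.on_breach()
            self.breaches = 0
            return True
        return False
```

Requiring several consecutive breaches keeps a single slow response from triggering a disruptive action such as a failover.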
See:
Oracle Data Guard Broker for more information on "Application Initiated Fast-Start Failover"
The topic, "Oracle Active Data Guard and Oracle GoldenGate" for additional discussion of the trade-offs between physical and logical replication at http://www.oracle.com/technetwork/database/features/availability/dataguardgoldengate-096557.html
Outages to a system that uses the Active Data Guard option with the standby database can affect applications that are using the standby database for read activity, but such outages do not impact the availability of the primary database (the availability is based on the mode you specify).
Data Guard Maximum Protection, however, has an impact on availability if the primary database does not receive acknowledgment from a standby database running in SYNC transport mode (net_timeout does not apply to Maximum Protection). For this reason, if you are using Maximum Protection you should follow the MAA best practice of deploying two SYNC standby databases, each at its own site. With two standby databases a single standby outage does not impact primary availability or zero data loss protection. If limited system resources make it impractical to deploy two standby databases, then the availability of the primary database can be restored simply by downgrading the data protection mode to Maximum Availability and restarting the primary database. Table 132 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 13.2, "Recovering from Unscheduled Outages".
Table 13-2 Recovery Steps for Unscheduled Outages on the Secondary Site

Recovery Steps for Single-Instance or Oracle RAC Standby Database:
1. Restart node and standby instance when they are available.
2. Restart recovery.
The broker automatically restarts the log apply services.
Note 1: If there is only one standby database and if Maximum Protection is configured, then the primary database shuts down to ensure that there is no data divergence with the standby database (no unprotected data).
Note 2: If this is an Oracle RAC standby database, then there is no effect on primary database availability if you configured the primary database Oracle Net descriptor to use connect-time failover to an available standby instance. If you are using the broker, connect-time failover is configured automatically.

Data corruption: Section 13.3.5, "Restoring Fault Tolerance After a Standby Database Data Failure"
Primary database opens with RESETLOGS because of Flashback Database operations or point-in-time media recovery: Section 13.3.6, "Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs"
See Also:
Section 8.1, "Oracle Data Guard Configuration Best Practices"
Oracle Data Guard Concepts and Administration for information on "Data Guard Protection Modes"
Oracle Data Guard Concepts and Administration for more information on Oracle Active Data Guard option and when Redo Apply can be active while the physical standby database is open
Oracle Data Guard Broker for information on "How the Protection Modes Influence Broker Operations"
Complete Site Failover (Failover to Secondary Site) Database Failover with a Standby Database Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures) Application Failover Oracle ASM Recovery After Disk and Storage Failures Recovering from Data Corruption Recovering from Human Error (Recovery with Flashback) Recovering Databases in a Distributed Environment
Primary site disaster, such as natural disasters or malicious attacks Primary network-connectivity failures Primary site power failures
Use the Data Guard configuration best practices in Section 8.3, "General Data Guard Configuration Best Practices".
Use Data Guard fast-start failover to automatically fail over to the standby database, with a recovery time objective (RTO) of less than 30 seconds (described in Section 8.4.2.3, "Fast-Start Failover Best Practices").
Maintain a running middle-tier application server on the secondary site to avoid the startup time, or redirect existing applications to the new primary database using the Fast Connection Failover best practices described in:
Chapter 11, "Configuring Fast Connection Failover"
The MAA white paper: "Client Failover Best Practices for Data Guard 11g Release 2" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Configure an automatic Domain Name System (DNS) failover procedure. Automatic DNS failover occurs after a primary site is inaccessible: the wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site, and clients are directed automatically on the subsequent reconnect.
The potential for data loss is dependent on the Data Guard protection mode used: Maximum Protection, Maximum Availability, or Maximum Performance.
1. Client requests enter the client tier of the primary site and travel by the WAN traffic manager.
2. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.
3. Requests are forwarded through the active load balancer to the application servers.
4. Requests are sent through another firewall and into the database server tier.
5. The application requests, if required, are routed to an Oracle RAC instance.
6. Responses are sent back to the application and clients by a similar path.
[Figure: network diagram of the primary site and the secondary site (standby components) behind the Internet. Each site shows Tier 1 - Client with routers and a WAN traffic manager, firewalls, and Oracle RAC instances with an Oracle RAC database, connected by hb links.]
Figure 13-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow the same path on the secondary site that they followed on the primary site.
[Figure 13-2: network diagram of the primary site and the secondary site after site failover, showing Tier 1 - Client with routers and WAN traffic managers, firewalls, and Oracle RAC instances with Oracle RAC databases, connected by hb links.]
The following steps describe the effect of a failover or switchover on network traffic:
1. The administrator has failed over or switched over the primary database to the secondary site. This is automatic if you are using Data Guard fast-start failover.
2. The administrator starts the middle-tier application servers on the secondary site, if they are not running.
3. The wide-area traffic manager selection of the secondary site can be automatic for an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover. Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:
   a. The master (primary) DNS server is updated with the zone information, and the change is announced with the DNS NOTIFY announcement. The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the zone information.
      Note: The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information.
   b. Clear affected records from caching DNS servers. A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information. Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:
      BIND 9.3.0: The command rndc flushname name flushes individual entries from the cache.
      BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndc flush.
      BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.
   c. Refresh local DNS service caching. Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.
      Solaris: nscd
      Linux: /etc/init.d/nscd restart
      Microsoft Windows: ipconfig /flushdns
      Apple Mac OS X: lookupd -flushcache
   d. The secondary site load balancer directs traffic to the secondary site middle-tier application server.
   e. The secondary site is ready to take client requests.
Failover also depends on the client's Web browser. Most browser applications cache the DNS entry for a period. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.
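The effect of the TTL on a caching DNS server, and of flushing a cached record after a manual DNS failover, can be illustrated with a small Python model. This is a minimal sketch and not how BIND or any real name server is implemented; the class CachingResolver, the host name app.example.com, and the IP addresses are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy caching DNS server: answers from its cache until the TTL expires."""

    def __init__(self, authoritative: Dict[str, str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._authoritative = authoritative  # stands in for the authoritative servers
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry time)

    def resolve(self, name: str) -> str:
        entry = self._cache.get(name)
        now = self._clock()
        if entry is not None and now < entry[1]:
            return entry[0]  # cached answer, possibly stale
        address = self._authoritative[name]
        self._cache[name] = (address, now + self._ttl)
        return address

    def flush(self, name: str) -> None:
        """Drop one cached record so the next query goes to the authoritative data."""
        self._cache.pop(name, None)


# Hypothetical zone data; fake_now lets the example control time explicitly.
zone = {"app.example.com": "192.0.2.10"}      # primary site load balancer
fake_now = [0.0]
resolver = CachingResolver(zone, ttl_seconds=300, clock=lambda: fake_now[0])

before = resolver.resolve("app.example.com")
zone["app.example.com"] = "198.51.100.10"     # manual DNS failover to the secondary site
stale = resolver.resolve("app.example.com")    # cache still returns the old address
resolver.flush("app.example.com")
after_flush = resolver.resolve("app.example.com")
```

In this model, stale still points at the primary site because the TTL has not expired, while after_flush returns the secondary site address immediately. Without the flush, the new address would appear only after the 300-second TTL ran out.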
primary database and there is no possibility of recovering the primary database in a timely fashion. With Oracle Data Guard, you can automate the failover process using the broker and fast-start failover, or you can perform the failover manually:
Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero-loss or minimum-loss failover (that you configure using the FastStartFailoverLagLimit property) within seconds of an outage being detected. See Section 8.4.2.3, "Fast-Start Failover Best Practices" for configuration best practices.
Manual failover allows for a failover process where decisions are user driven using any of the following methods:
  Oracle Enterprise Manager
  The broker command-line interface (DGMGRL)
  SQL*Plus statements
See Section 13.2.2.3, "Best Practices for Performing Manual Failover".
A database failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site hosts the primary database. You must reinstate the original primary database as a new standby database to restore fault tolerance of the configuration. See Section 13.3.2, "Restoring a Standby Database After a Failover." A failover operation typically occurs in under a minute, and with little or no data loss.
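The distinction between a zero-loss and a minimum-loss fast-start failover can be sketched as a simple eligibility check. The Python below is a simplified model under stated assumptions, not the broker's actual logic: it assumes that a zero-loss failover requires a synchronized standby, that a minimum-loss failover compares the standby lag with a limit in the spirit of FastStartFailoverLagLimit, and that the limit is 30 seconds. The names StandbyStatus and fast_start_failover_allowed are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyStatus:
    protection_mode: str   # "MaxAvailability" or "MaxPerformance" (labels used by this sketch)
    synchronized: bool     # standby has received all redo from the primary
    lag_seconds: float     # how far the standby trails the primary


def fast_start_failover_allowed(status: StandbyStatus,
                                lag_limit_seconds: float = 30.0) -> bool:
    """Simplified check: would an automatic failover stay within the data-loss target?"""
    if status.protection_mode == "MaxAvailability":
        # Zero-loss failover: only when no redo is missing on the standby.
        return status.synchronized
    if status.protection_mode == "MaxPerformance":
        # Minimum-loss failover: the lag must be within the configured limit.
        return status.lag_seconds <= lag_limit_seconds
    raise ValueError(f"unsupported protection mode: {status.protection_mode}")


in_sync = StandbyStatus("MaxAvailability", synchronized=True, lag_seconds=0.0)
small_lag = StandbyStatus("MaxPerformance", synchronized=False, lag_seconds=12.0)
large_lag = StandbyStatus("MaxPerformance", synchronized=False, lag_seconds=95.0)

results = [fast_start_failover_allowed(s) for s in (in_sync, small_lag, large_lag)]
```

Under these assumptions the first two standbys qualify for automatic failover and the third does not, because its lag exceeds the limit and a failover would lose more data than the configured target allows.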
See Also:
Oracle Data Guard Concepts and Administration for a complete description of failover processing
The "Data Guard Fast-Start Failover" and "Data Guard Switchover and Failover" MAA best practice white papers available from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
A site disaster, which results in the primary database becoming unavailable
Damage resulting from user errors that cannot be repaired in a timely fashion
Data failures, which impact the production application
A failover requires that you reinstate the initial primary database as a standby database to restore fault tolerance to your environment. You can quickly reinstate the standby database using Flashback Database provided the original primary database has not been damaged. See Section 13.3.2, "Restoring a Standby Database After a Failover."
There are no procedural best practices to consider when performing a fast-start failover. However, it is important to address all of the configuration best practices described in Section 8.4.2.3, "Fast-Start Failover Best Practices".
See Also: The MAA white paper "Data Guard Switchover and Failover Best Practices" from the MAA Best Practices area for Oracle Database at
http://www.oracle.com/goto/maa
Follow the configuration best practices outlined in Section 8.4.2.4, "Manual Failover Best Practices." Choose from the following methods:
Oracle Enterprise Manager
  See Oracle Data Guard Broker for complete information about how to perform a manual failover using Oracle Enterprise Manager. The procedure is the same for both physical and logical standby databases.
Oracle Data Guard broker command-line interface (DGMGRL)
  See Oracle Data Guard Broker for complete information about how to perform a manual failover using DGMGRL. The procedure is the same for both physical and logical standby databases.
SQL*Plus statements:
  Physical standby database: see "Performing a Failover to a Physical Standby Database" in Oracle Data Guard Concepts and Administration
  Logical standby database: see "Performing a Failover to a Logical Standby Database" in Oracle Data Guard Concepts and Administration
13.2.3 Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)
Oracle RAC recovery is performed automatically when there is a node or instance failure. In regular multi-instance Oracle RAC environments, surviving instances automatically recover the failed instances and potentially aid in the automatic client failover. Recovery times can be bounded by adopting the database and Oracle RAC configuration best practices, which usually leads to instance recovery times of seconds to minutes in very large, busy systems, with no data loss. For Oracle RAC One Node configurations, recovery is expected to take longer than with full Oracle RAC, because a replacement instance must be started before it can perform the instance recovery. For instance or node failures with Oracle RAC and Oracle RAC One Node, use the following recovery methods:
Automatic Instance Recovery for Failed Instances Automatic Service Relocation Oracle Cluster Registry Recovery
Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost.
Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions.
When multiple instances fail, if one instance survives Oracle RAC performs instance recovery for any other instances that fail. If all instances of an Oracle RAC database fail, then on subsequent restart of any instance a crash recovery occurs and all committed transactions are recovered. Data Guard is the recommended solution to survive outages when all instances of a cluster fail.
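The two instance recovery actions, rolling forward the redo of the failed instance and then rolling back transactions that never committed, can be illustrated with a toy Python model. This is a conceptual sketch only: the record layout, the recover function, and the sample keys are hypothetical and do not reflect Oracle's redo or undo formats.

```python
from typing import Dict, List, Optional, Tuple

# Each redo record is (transaction_id, operation, key, value).
# operation is "write" or "commit"; key and value are empty for commits.
RedoRecord = Tuple[str, str, str, str]


def recover(datafiles: Dict[str, str], redo: List[RedoRecord]) -> Dict[str, str]:
    """Return the recovered state: committed changes kept, uncommitted ones undone."""
    state = dict(datafiles)
    undo: List[Tuple[str, str, Optional[str]]] = []  # (txn, key, value before the write)
    committed = set()

    # Phase 1: roll forward every change found in the redo, remembering before-images.
    for txn, op, key, value in redo:
        if op == "commit":
            committed.add(txn)
        else:
            undo.append((txn, key, state.get(key)))
            state[key] = value

    # Phase 2: roll back, newest first, the changes of transactions that never committed.
    for txn, key, previous in reversed(undo):
        if txn not in committed:
            if previous is None:
                state.pop(key, None)
            else:
                state[key] = previous
    return state


# Hypothetical state at the time of the failure: T1 committed, T2 was still active.
datafiles = {"balance:alice": "100"}
redo_log = [
    ("T1", "write", "balance:alice", "150"),
    ("T2", "write", "balance:bob", "40"),
    ("T1", "commit", "", ""),
]
recovered = recover(datafiles, redo_log)
```

In this example the change made by T1 survives recovery, while the write made by T2 is undone because no commit record for T2 exists in the redo.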
Oracle Clusterware automatically moves any services on the failed database instance to another available instance, as configured with DBCA or Enterprise Manager. Oracle Clusterware recognizes when a failure affects a service and automatically fails over the service across the surviving instances supporting the service.
Note:
With Oracle RAC One Node the relocation occurs when another instance on a different node is started and enabled for the appropriate services. Thus, Oracle RAC One Node starts a new instance when an instance fails but the new instance is not a "surviving instance."
A service can be made available on multiple instances by default. In this case, when one of those multiple instances is lost, the clients continue to use the available services across the surviving instances, but there are fewer resources to do the work.
In parallel, Oracle Clusterware attempts to restart and integrate the failed instances and dependent resources back into the system, and Cluster Ready Services (CRS) tries to restart the database instance three times. Clients can "subscribe" to node failure events so that they are notified of instance problems quickly and can set up new connections (Oracle Clusterware does not set up the new connections; the clients do). Notification of failures using fast application notification (FAN) events occurs at various levels within the Oracle Server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), advanced queuing, or FAN callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service is transparent to applications. Restart and recovery after a node shutdown or clusterware restart are done automatically.
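The service relocation behavior described in this section can be sketched as follows. This Python model is illustrative only and is not how Oracle Clusterware is implemented; the Service class, the handle_instance_failure function, the ordered list of fallback instances, and the instance names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Service:
    name: str
    running_on: Set[str]                                 # instances currently offering the service
    available: List[str] = field(default_factory=list)  # fallback instances, in order of preference


def handle_instance_failure(services: Iterable[Service], failed: str,
                            surviving: Set[str]) -> None:
    """Remove the failed instance from each service and relocate the service if needed."""
    for service in services:
        if failed not in service.running_on:
            continue
        service.running_on.discard(failed)
        if service.running_on:
            continue  # service continues on the surviving instances with less capacity
        for candidate in service.available:
            if candidate in surviving:
                service.running_on.add(candidate)  # relocate to the first usable fallback
                break


# Hypothetical two-instance cluster: instance sales1 fails, sales2 survives.
oltp = Service("oltp", running_on={"sales1", "sales2"})
batch = Service("batch", running_on={"sales1"}, available=["sales2"])
handle_instance_failure([oltp, batch], failed="sales1", surviving={"sales2"})
```

After the call, oltp keeps running on sales2 alone, and batch, which had lost its only instance, is relocated to sales2.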
Section 6.3.2, "Regularly Back Up OCR to Tape or Offsite"
Oracle Real Application Clusters Administration and Deployment Guide for information on Administering Storage in Real Application Clusters
Oracle Clusterware Administration and Deployment Guide for information on Restoring Oracle Cluster Registry
Oracle Clusterware Administration and Deployment Guide for information on Restoring Voting Disks
Then, after a certain time threshold expires, Enterprise Manager can call the Oracle Data Guard DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure to initiate a database failover immediately followed by an application failover using FAN notifications and service relocation. FAN notifications and service relocation enable automatic and fast redirection of clients in the event of any failure or planned maintenance that results in an Oracle RAC or Oracle Data Guard failover.
See Also:
Chapter 11, "Configuring Fast Connection Failover" The MAA white paper "Client Failover for Highly Available Oracle Databases" from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
Oracle Data Guard Broker for more information about Application Initiated Fast-Start Failover and the DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure
Table 13-3 (Cont.) Types of Oracle ASM Failures and Recommended Repair

Failure: Oracle ASM disk failure
Description: One or more Oracle ASM disks fail, but all disk groups remain online
Impact: All data remains accessible. This is possible only with normal or high redundancy disk groups
Recommended Repair: Oracle ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space in the remaining disk drives to restore the redundancy or the rebalance may fail with an ORA-15041. For more information, see Section 4.6.2, "Oracle Storage Grid Best Practices for Planned Maintenance". Note: External redundancy disk groups should use mirroring in the storage array to protect from disk failure. Disk failures should not be exposed to Oracle ASM.

Failure: Data area disk group failure
Description: One or more Oracle ASM disks fail, and data area disk group goes offline
Impact: Databases accessing the data area disk group shut down
Recommended Repair: Perform Data Guard failover or local recovery as described in Section 13.2.5.3, "Data Area Disk Group Failure"

Failure: Fast recovery area disk group failure
Description: One or more Oracle ASM disks fail, and the fast recovery area disk group goes offline
Impact: Databases accessing the fast recovery area disk group shut down
Recommended Repair: Perform local recovery or Data Guard failover as described in Section 13.2.5.4, "Fast Recovery Area Disk Group Failure"
If the primary database is an Oracle RAC database, then application failover occurs automatically and clients connected to the database instance reconnect to remaining instances. Thus, the service is provided by other instances in the cluster and processing continues. Recovery typically completes in seconds.
If the primary database is not an Oracle RAC database, then an Oracle ASM instance failure shuts down the entire database. If the configuration uses Oracle Data Guard and fast-start failover is enabled, a database failover is triggered automatically and clients automatically reconnect to the new primary database after the failover completes. The recovery time is the amount of time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then you must recover from this outage by either restarting the Oracle ASM and database instances manually, or by performing a manual Data Guard failover.
If the configuration includes neither Oracle RAC nor Data Guard, then you must manually restart the Oracle ASM instance and database instances. The recovery time depends on how long it takes to perform these tasks.
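The three cases above amount to a small decision procedure, which the following Python sketch restates. The Configuration class and the function name are hypothetical; the returned strings only summarize the text of this section and do not represent output of any Oracle tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    oracle_rac: bool
    data_guard: bool
    fast_start_failover: bool


def asm_instance_failure_response(cfg: Configuration) -> str:
    """Summarize how an Oracle ASM instance failure is handled for a configuration."""
    if cfg.oracle_rac:
        return "automatic: clients reconnect to the remaining instances"
    if cfg.data_guard and cfg.fast_start_failover:
        return "automatic: fast-start failover to the standby database"
    if cfg.data_guard:
        return ("manual: restart the Oracle ASM and database instances, "
                "or perform a manual Data Guard failover")
    return "manual: restart the Oracle ASM instance and database instances"


rac = asm_instance_failure_response(Configuration(True, False, False))
fsfo = asm_instance_failure_response(Configuration(False, True, True))
manual_dg = asm_instance_failure_response(Configuration(False, True, False))
standalone = asm_instance_failure_response(Configuration(False, False, False))
```

Only the first two configurations recover without operator action, which is why the recovery time in the remaining cases depends on how quickly the manual steps are performed.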
External redundancy If an Oracle ASM disk group is configured as an external redundancy type, then a failure of a single disk is handled by the storage array and should not be seen by the Oracle ASM instance. All Oracle ASM and database operations using the disk group continue normally.
However, if the failure of an external redundancy disk group is seen by the Oracle ASM instance, then the Oracle ASM instance takes the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then you can restart Oracle ASM and the database instances and crash recovery occurs after the disk group is brought back online.
Normal or a high-redundancy If an Oracle ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by Oracle ASM and the databases accessing the disk group are not affected. An Oracle ASM instance automatically starts an Oracle ASM rebalance operation to distribute the data of one or more failed disks to the remaining, intact disks of the Oracle ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be remirrored. When the rebalance operation completes successfully, the Oracle ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an Oracle ASM disk group with normal redundancy.
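The availability rule for normal and high redundancy disk groups can be modeled in a few lines of Python: a disk group stays online as long as every extent still has at least one copy on a surviving disk, and extents that lost a copy must be remirrored by the rebalance. This is a conceptual sketch; the extent map, the disk names, and both functions are hypothetical and do not reflect how Oracle ASM stores or tracks extents.

```python
from typing import Dict, List, Set

# Hypothetical normal redundancy layout: each extent lists the disks holding its
# copies (the primary extent first, then its mirror in the other failure group).
EXTENT_COPIES: Dict[str, List[str]] = {
    "extent_1": ["fg1_disk1", "fg2_disk1"],
    "extent_2": ["fg1_disk2", "fg2_disk2"],
    "extent_3": ["fg2_disk1", "fg1_disk2"],
}


def disk_group_stays_online(extent_copies: Dict[str, List[str]],
                            failed_disks: Set[str]) -> bool:
    """True if every extent keeps at least one copy on a disk that has not failed."""
    return all(any(disk not in failed_disks for disk in copies)
               for copies in extent_copies.values())


def extents_to_remirror(extent_copies: Dict[str, List[str]],
                        failed_disks: Set[str]) -> List[str]:
    """Extents that lost a copy but still have one left to rebuild redundancy from."""
    return sorted(
        extent for extent, copies in extent_copies.items()
        if any(disk in failed_disks for disk in copies)
        and any(disk not in failed_disks for disk in copies)
    )


one_failure_group = {"fg1_disk1", "fg1_disk2"}   # both failed disks are in failure group 1
two_failure_groups = {"fg1_disk1", "fg2_disk1"}  # failures span both failure groups

online_after_fg1_loss = disk_group_stays_online(EXTENT_COPIES, one_failure_group)
online_after_cross_loss = disk_group_stays_online(EXTENT_COPIES, two_failure_groups)
```

In this layout, losing both disks of one failure group leaves the disk group online because every extent has a mirror in the other failure group. Losing one disk in each failure group removes both copies of extent_1, so the model reports that the disk group cannot stay online.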
The failure of multiple disks in multiple failure groups where a primary extent and all of its mirrors have been lost causes the disk group to go offline. When Oracle ASM disks fail, use the following recovery methods:
Using Enterprise Manager to Repair Oracle ASM Disk Failure
Using SQL to Add Replacement Disks Back to the Disk Group
13.2.5.2.1 Using Enterprise Manager to Repair Oracle ASM Disk Failure
Figure 13-3 shows Enterprise Manager reporting disk failures. Five of 14 alerts are shown. The five alerts shown are Offline messages for Disk RECO2.
Figure 13-4 shows Enterprise Manager reporting the status of data area disk group DATA, database Data Guard disk group DBFS_DG, and recovery area disk group RECO.
Figure 13-4 Enterprise Manager Reports Oracle ASM Disk Groups Status
Figure 13-5 shows Enterprise Manager reporting a pending REBAL operation on the DATA disk group. The operation is almost done, as shown in % Complete, and the Remaining Time is estimated to be 0 minutes.
Figure 13-5 Enterprise Manager Reports Pending REBAL Operation
13.2.5.2.2 Using SQL to Add Replacement Disks Back to the Disk Group
Perform these steps after one or more failed disks of one specific failure group have been dropped and need to be replaced with new disks:
1. Add the one or more replacement disks to the failed disk group with the following SQL command:
ALTER DISKGROUP disk_group ADD FAILGROUP failure_group DISK 'disk1','disk2',...;
2.
Table 13-4 Recovery Options for Data Area Disk Group Failure

Recovery Option: Data Guard failover (see Section 13.2.5.4, "Fast Recovery Area Disk Group Failure")
Recovery Time Objective (RTO): Five minutes or less
Recovery Point Objective (RPO): Varies depending on the data protection level chosen

Recovery Option: Local Recovery (see "Local Recovery Steps")
Recovery Point Objective (RPO): Zero
If Data Guard is being used and fast-start failover is configured, then an automatic failover occurs when the database shuts down due to the data area disk group going offline. If fast-start failover is not configured, then perform a manual failover. If you decide to perform a Data Guard failover then the recovery time objective (RTO) is expressed in terms of minutes or seconds, depending on the presence of the Data Guard observer process and fast-start failover. However, if a manual failover occurs and not all data is available on the standby site, then data loss might result. After Data Guard failover has completed and the application is available, you must resolve the data area disk group failure. Continue with the following "Local Recovery Steps" procedure to resolve the Oracle ASM disk group failure. The RTO for local recovery only is based on the time required to:
1. Repair and replace the failed storage components
2. Restore and recover the database
Because the loss affects only the data-area disk group, there is no loss of data. All transactions are recorded in the Oracle redo log members that reside in the fast recovery area, so complete media recovery is possible. If you are not using Data Guard, then perform the following local recovery steps. The time required to perform local recovery depends on how long it takes to restore and recover the database. There is no data loss when performing local recovery. Local Recovery Steps Perform these steps after one or more failed disks have been replaced and access to the storage has been restored:
Note:
If you have performed an Oracle Data Guard failover to a new primary database, then you can now use the following procedure to restore and sync the Data Guard setup. Also, see Section 13.3.2, "Restoring a Standby Database After a Failover".
1. Rebuild the Oracle ASM disk group using the new storage location by issuing the following SQL*Plus statement on the Oracle ASM instance:
SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK 'path1','path2',...force;
2. Start the database instance NOMOUNT by issuing the following RMAN command:
RMAN> STARTUP FORCE NOMOUNT;
3. Restore the control file from the surviving copy located in the recovery area:
RMAN> RESTORE CONTROLFILE FROM 'recovery_area_controlfile';
4.
5.
6.
7. If you use block change tracking, then disable and re-enable the block change tracking file using SQL*Plus statements:
SQL> ALTER DATABASE DISABLE BLOCK CHANGE TRACKING;
SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
8.
9. Re-create the log file members on the failed Oracle ASM disk group:
SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
10. Perform an incremental level 0 backup using the following RMAN command:
RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
Recovery Option: Local recovery (see Section 13.2.5.4.1, "Local Recovery for Fast Recovery Area Disk Group Failure")
Recovery Option: Data Guard failover or switchover (see Section 13.2.5.4.2, "Data Guard Role Transition for Fast Recovery Area Disk Group Failure")
Zero
13.2.5.4.1 Local Recovery for Fast Recovery Area Disk Group Failure
If you decide to perform local recovery, then you must perform a fast local restart to start the primary database after removing the controlfile member that is located in the fast recovery area from the init.ora and allocating another disk group as the fast recovery area for archiving. For a fast local restart, perform the following steps on the primary database:
1. Change the CONTROL_FILES initialization parameter to specify only the members in the Data Area:
ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=spfile;
2. Change local archive destinations and the fast recovery area to the local redundant, scalable destination:
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+DATA' SCOPE=spfile;
3.
4. Drop the redo log members that were in the lost disk group. For example, issue the following command:
ALTER DATABASE DROP LOGFILE MEMBER '+RECO/dbm/onlinelog/group_2.258.750768395';
5. If the flashback logs were damaged or lost, it may be necessary to disable and reenable Flashback Database:
ALTER DATABASE FLASHBACK OFF;
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE OPEN;
However, this is a temporary fix until you create a fast recovery area to replace the failed storage components. Oracle recommends using the Local Recovery Steps. For more information, see "Data Guard Role Transition for Fast Recovery Area Disk Group Failure Local Recovery Steps".
13.2.5.4.2 Data Guard Role Transition for Fast Recovery Area Disk Group Failure
If you decide to perform a Data Guard role transition, then the recovery time objective (RTO) can be expressed in terms of seconds or minutes, depending on the presence of the Data Guard observer process and fast-start failover. If the protection level is maximum performance or the standby database is unsynchronized with the primary database, then:
1. Temporarily start the primary database by removing the controlfile member and pointing to a temporary fast recovery area (file system) in the SPFILE.
2. Perform a Data Guard switchover to ensure no data loss.
3. After the switchover has completed and the application is available, resolve the fast recovery area disk group failure.
4. Shut down the affected database and continue by using the instructions in the Local Recovery Steps to resolve the Oracle ASM disk group failure. For more information, see "Data Guard Role Transition for Fast Recovery Area Disk Group Failure Local Recovery Steps".
13.2.5.4.3 Data Guard Role Transition for Fast Recovery Area Disk Group Failure Local Recovery Steps
If you performed an Oracle Data Guard failover to a new primary database, then you cannot use this procedure to reintroduce the original primary database as a standby database. This is because Flashback Database log files that are required as part of reintroducing the database have been lost. You must perform a full reinstatement of the standby database.
1. Replace or get access to storage to use for a fast recovery area
2. Rebuild the Oracle ASM disk group using the storage location by issuing the following SQL*Plus statement:
SQL> CREATE DISKGROUP RECO NORMAL REDUNDANCY DISK 'path1','path2',...force;
3. Start the database instance NOMOUNT using the following RMAN command:
RMAN> STARTUP FORCE NOMOUNT;
4. Restore the control file from the surviving copy located in the data area:
RMAN> RESTORE CONTROLFILE FROM 'data_area_controlfile';
5.
6. If you use Flashback Database, then disable it with the following SQL*Plus statement:
SQL> ALTER DATABASE FLASHBACK OFF;
7.
8.
9. Re-create the log file members on the failed Oracle ASM disk group:
SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
10. Synchronize the control file and the fast recovery area using the following RMAN commands:
RMAN> CATALOG RECOVERY AREA;
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> CROSSCHECK BACKUPSET;
RMAN> CROSSCHECK DATAFILECOPY ALL;
RMAN> LIST EXPIRED type;
RMAN> DELETE EXPIRED type;
In the example, the type variable is a placeholder for both LIST EXPIRED BACKUP and LIST EXPIRED COPY commands, and also for the DELETE EXPIRED BACKUP and DELETE EXPIRED COPY commands. You should run all of these commands now.
11. Assuming that data has been lost, perform a backup:
RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
Data Recovery Advisor is the easiest way to diagnose and repair most problems. For more information, see Section 13.2.6.1, "Use Data Recovery Advisor".
Active Data Guard can automatically repair corrupt data blocks in a primary or standby database. For more information, see Section 13.2.6.2, "Use Active Data Guard".
RMAN block media recovery can repair individual corrupted blocks by retrieving the blocks from good backups. For more information, see Section 13.2.6.3, "Use RMAN and Block Media Recovery".
Data Guard switchover or failover to a standby database. For more information, see Section 13.2.6.2.2, "Extracting Data from a Physical Standby Databases".
Datafile Media Recovery with RMAN. For more information, see Section 13.2.6.3, "Use RMAN and Block Media Recovery".
When encountering lost write corruptions that result in ORA-752 or ORA-600 [3020], follow the guidelines in "Resolving ORA-752 or ORA-600 [3020] During Standby Recovery" in My Oracle Support Note 1265884.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265884.1
Whatever method you use to recover corrupted blocks, you first need to analyze the type and degree of corruption to perform the recovery. Implementing the optimal techniques to prevent and prepare for data corruptions can save time, effort, and stress when dealing with the possible consequences: lost data and downtime. MAA best practices provide a step-by-step process for resolving most corruptions and stray or lost writes, including the following:
1. Use Data Recovery Advisor
2. Use Active Data Guard
3. Use RMAN and Block Media Recovery
4. Perform a Data Guard Role Transition
5. Use RMAN and Datafile Media Recovery
See Also:
The "Preventing, Detecting, and Repairing Block Corruption: Database 11g" MAA white paper from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
"Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration" in My Oracle Support Note 1302539.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1302539.1
Perform block media recovery of data files that have corrupted blocks
Perform point-in-time recovery of the database or selected tablespaces
Rewind the entire database with Flashback Database
Completely restore and recover the database from a backup
Data Recovery Advisor has both a command-line and a GUI interface. The GUI interface is available when you click Perform Recovery with Oracle Enterprise Manager Database Control (Support Workbench); this allows you to use Data Recovery Advisor. Using the RMAN command-line interface, the Data Recovery Advisor commands include: LIST FAILURE, ADVISE FAILURE, REPAIR FAILURE, and CHANGE FAILURE. If Data Recovery Advisor fixes the problem, then there is no need to continue with any further recovery methods. However, continue to periodically check the alert log for any ORA- errors and any corruption warnings on the primary and standby databases. When corruption is detected while the database is operational, corruption errors are recorded as ORA-600 or ORA-01578 in the alert log.
Note:
In the current release, Data Recovery Advisor only supports single-instance databases. Oracle RAC databases are not supported. See Oracle Database Backup and Recovery User's Guide for more information on Data Recovery Advisor supported database configurations.
See Also:
Section 5.2.2, "Use Data Recovery Adviser to Detect, Analyze and Repair Data Failures" Oracle Database 2 Day DBA documentation for details on how to use the GUI interface for Data Recovery Advisor Oracle Database Backup and Recovery User's Guide for more information on Diagnosing and Repairing Failures with Data Recovery Advisor
Oracle Active Data Guard and Automatic Block Repair Extracting Data from a Physical Standby Databases
Alternatively, if the corruption is widespread, you may choose to fail over or switch over to the standby database while you make repairs to the primary database. For more information, see Section 13.2.6.4, "Perform a Data Guard Role Transition".
13.2.6.2.1 Oracle Active Data Guard and Automatic Block Repair
Starting in Oracle Database 11g Release 2 (11.2), the primary database automatically attempts to repair a corrupted block in real time by fetching a good version of the same block from a physical standby database. This capability is referred to as Automatic Block Repair; it allows corrupt data blocks to be repaired as soon as the corruption is detected. Automatic Block Repair reduces the amount of time that data is inaccessible due to block corruption and reduces block recovery time by using up-to-date good blocks in real time, as opposed to retrieving blocks from disk or tape backups, or from Flashback logs. Thus, with Automatic Block Repair you use an Oracle Active Data Guard standby database for automatic repair of data corruptions detected by the primary database. Additionally, if the corruption is discovered on an Active Data Guard physical standby database, the corruption is automatically repaired with a good block from the primary database. Both of these operations are transparent to the applications.
Note:
Automatic Block Repair requires the use of the Oracle Active Data Guard option.
See Also: Oracle Data Guard Concepts and Administration for more information on Oracle Active Data Guard option and the Automatic Block Repair feature
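Because Automatic Block Repair depends on an Oracle Active Data Guard standby, it can be useful to confirm the role and open mode of the standby before relying on the feature. The following query is a minimal sketch to run on the standby database. It assumes, as a detail not stated in this section, that a standby in real-time query mode reports READ ONLY WITH APPLY in Oracle Database 11g Release 2.

```sql
-- Run on the standby database.
-- Expected for an Active Data Guard standby in real-time query mode
-- (assumption, see above): PHYSICAL STANDBY / READ ONLY WITH APPLY.
SELECT database_role, open_mode
FROM   v$database;
```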
13.2.6.2.2 Extracting Data from a Physical Standby Database
You can use a Data Guard physical standby database to repair block corruption that affects an entire data file on the primary database by replacing the corrupted data files with good copies from the standby database. Once the files are restored on the primary database, data file or tablespace recovery makes the data files consistent with the rest of the database.
See Also: Oracle Data Guard Concepts and Administration for information on Recovery from Loss of Datafiles on the Primary Database Using Files On a Standby Database
A small number of blocks require media recovery and you know which blocks need recovery.
Blocks are marked corrupt (you can verify this with the RMAN VALIDATE CHECK LOGICAL command).
The backup file for the corrupted data file is available locally or can be retrieved from a remote location.
Note:
Do not use block media recovery to recover from user errors or software bugs that cause logical corruption where the data blocks are intact.
If a significant portion of the data file is corrupt or if the amount of corruption is unknown, then either use RMAN to restore the file from a backup or switch to an on-disk image copy, or switch over to your Data Guard standby database. When corruption is detected, recover the block through the Oracle Enterprise Manager Restore and Recovery Wizard or directly with RMAN. For example, to recover a specific corrupt block using RMAN block media recovery:
RMAN> RECOVER DATAFILE 7 BLOCK 3;
After a corrupt block is repaired, the row identifying this corrupted block is deleted from the V$DATABASE_BLOCK_CORRUPTION view.
See Also: Oracle Database Backup and Recovery User's Guide for information on RMAN's block media recovery
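To see which blocks are currently recorded as corrupt, and to confirm afterward that a repaired block no longer appears, you can query the V$DATABASE_BLOCK_CORRUPTION view mentioned above. This is a minimal sketch. The column list reflects the view as documented for Oracle Database 11g, and the view is populated by RMAN backup and validation operations, so it only shows corruption that RMAN has already encountered.

```sql
-- Blocks currently recorded as corrupt.
-- Run before and after block media recovery to compare.
SELECT file#, block#, blocks, corruption_change#, corruption_type
FROM   v$database_block_corruption
ORDER  BY file#, block#;
```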
The database is down, or the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.
Recovering locally takes longer than the business service-level agreement or RTO.
See Also: Oracle Data Guard Concepts and Administration for more information about Data Guard failovers and switchovers
Use Data Recovery Advisor
Use Active Data Guard
Use RMAN and Block Media Recovery
Perform a Data Guard Role Transition
Note:
If you do not have a Data Guard physical standby database, then you must use traditional media recovery. With traditional media recovery, a backup copy of one or more files is restored, and then data file, tablespace, or database recovery brings the database back to a consistent state.
Datafile media recovery affects an entire data file or set of data files for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked "media corrupt" and require media recovery, or when an entire file is lost, you must restore and recover the applicable data files.
See Also: Oracle Database Backup and Recovery User's Guide for information on Advanced User-Managed Recovery Scenarios
Erroneous or malicious update, delete, or insert transactions
Erroneous or malicious DROP TABLE statements
Erroneous or malicious batch job or wide-spread application errors
Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See Section 13.2.2, "Database Failover with a Standby Database" to repair these outages.
Note: For information on Flashback Database configuration best practices, see Section 5.1.4, "Enable Flashback Database"
Table 13-6 summarizes the Flashback solutions for outages varying in scope from destroying a row, such as through a bad update, to destroying a whole database (such as by deleting all the underlying files at the operating system level).
Table 13-6 Flashback Solutions for Different Outages
Columns: Outage Scope, Examples of Human Errors, Flashback Solutions, See Also
Flashback Solutions: Flashback Query, Flashback Version Query, Flashback Transaction Query, Flashback Transaction. See Also: Section 13.2.7.2, "Resolving Row and Transaction Inconsistencies"
Outage Scope: Table
Outage Scope: Tablespace or database. Examples of Human Errors: Erroneous batch job affecting many tables or an unknown set of tables; series of database-wide malicious transactions
Flashback Version Query: Physical and logical standby databases
Flashback Transaction Query: Physical and logical standby databases
Flashback Transaction: Physical and logical standby databases
Flashback Drop: Physical standby databases
Flashback Table: Physical and logical standby databases
Flashback Database: Physical and logical standby databases
Flashback Database uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo and multiversion read consistency capabilities. To configure Flashback technologies so that the resources for these solutions are available at a time of failure, see the configuration best practices for the database in Section 5.1, "Database Configuration High Availability and Fast Recoverability Best Practices".
See Also:
Oracle Database Administrator's Guide for information on Recovering Tables Using Oracle Flashback Table
Oracle Database Backup and Recovery User's Guide for information on Using Flashback Database and Restore Points
Oracle Database Concepts for information on Oracle Flashback Technology
In general, the recovery time when using Flashback technologies is equivalent to the time it takes to cause the human error plus the time it takes to detect the human error. Flashback technologies allow recovery up to the point that the human error occurred. Use the following recovery methods:
Resolving Table Inconsistencies
Resolving Row and Transaction Inconsistencies
Resolving Database-Wide Inconsistencies
Resolving One or More Tablespace Inconsistencies
Flashback Table statement to restore a table to a previous point in the database
Flashback Drop statement to recover from an accidental DROP TABLE statement
Flashback Transaction statement to roll back one or more transactions and their dependent transactions, while the database remains online
Flashback Table
Flashback Table provides the ability to quickly recover a table or a set of tables to a specified point in time. In many cases, Flashback Table alleviates the need for more complicated point-in-time recovery operations. For example:
FLASHBACK TABLE orders, order_items TO TIMESTAMP TO_DATE('28-Jun-11 14:00:00','dd-Mon-yy hh24:mi:ss');
This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and a specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
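Flashback Table requires row movement to be enabled on each table being flashed back. This prerequisite is not spelled out in this section. The following is a minimal sketch that reuses the ORDERS and ORDER_ITEMS tables from the example above, and the timestamp is the same illustrative value.

```sql
-- Row movement must be enabled before a table can be flashed back.
ALTER TABLE orders      ENABLE ROW MOVEMENT;
ALTER TABLE order_items ENABLE ROW MOVEMENT;

-- Rewind both tables together so that the relationship between them stays consistent.
FLASHBACK TABLE orders, order_items
  TO TIMESTAMP TO_TIMESTAMP('28-Jun-11 14:00:00', 'dd-Mon-yy hh24:mi:ss');
```

Enabling row movement ahead of time on tables that are likely candidates for Flashback Table avoids an extra DDL step during an outage.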
Flashback Drop
Flashback Drop provides a safety net when dropping objects. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations begin to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users can view the recycle bin and undrop the dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:
FLASHBACK TABLE employees TO BEFORE DROP;
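Before undropping a table, it can help to check what the recycle bin still holds, because objects are removed when space is needed. The following is a minimal sketch. RECYCLEBIN is the standard synonym for the current user's recycle bin view, and employees_restored is a hypothetical name chosen for illustration.

```sql
-- List dropped objects that are still available in the current user's recycle bin.
SELECT object_name, original_name, type, droptime
FROM   recyclebin
ORDER  BY droptime;

-- Undrop the table under a different name, for example when a new
-- EMPLOYEES table has already been created since the drop.
FLASHBACK TABLE employees TO BEFORE DROP RENAME TO employees_restored;
```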
Flashback Transaction
Oracle Flashback Transaction increases availability during logical recovery by easily and quickly backing out a specific transaction or set of transactions and their dependent transactions, while the database remains online. Use the DBMS_FLASHBACK.TRANSACTION_BACKOUT() PL/SQL procedure to roll back a transaction and its dependent transactions. This procedure uses undo data to create and execute the compensating transactions that return the affected data to its pre-transaction state.
See Also:
Oracle Database Advanced Application Developer's Guide for information on Using Flashback Transaction
DBMS_FLASHBACK.TRANSACTION_BACKOUT() in Oracle Database PL/SQL Packages and Types Reference
Flashback Version Query
Flashback Version Query provides a way to view changes made to the database at the row level. Flashback Version Query is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:
SELECT * FROM EMPLOYEES
VERSIONS BETWEEN TIMESTAMP TO_DATE('28-Jun-11 14:00','dd-Mon-YY hh24:mi')
AND TO_DATE('28-Jun-11 15:00','dd-Mon-YY hh24:mi')
WHERE ...
This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. on June 28, 2011. A database administrator can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. Flashback Version Query enables the database administrator to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.
Flashback Transaction Query
Flashback Transaction Query provides a way to view changes made to the database at the transaction level. Flashback Transaction Query is an extension to SQL that enables you to see all changes made by a transaction. For example:
SELECT UNDO_SQL FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = HEXTORAW('000200030000002D');
This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like Flashback Transaction Query, the database administrator and application developer can precisely diagnose and correct logical problems in the database or application.
Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the database administrator that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he instructed his staff to reset the salary to the previous level of $1250. This was completed around 9:15 a.m. The following steps show how to approach the problem.
1. Assess the problem. Fortunately, the HR manager has provided information about the time when the change occurred. You can query the information as it was at 9:00 a.m. using Flashback Query.
SELECT EMPNO, ENAME, SAL
FROM EMP AS OF TIMESTAMP TO_DATE('28-JUN-11 09:00','dd-Mon-yy hh24:mi')
WHERE ENAME = 'WARD';

EMPNO      ENAME      SAL
---------- ---------- ----------
7521       WARD       1875
You can confirm that you have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, you can now use the employee number for subsequent investigation.
2. Query previous rows or versions of the data to acquire transaction information. Although it is possible to restrict the row version information to a specific date or SCN range, you might want to query all the row information that is available for the employee WARD using Flashback Version Query.
SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME, VERSIONS_XID
FROM EMP VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
WHERE EMPNO = 7521
ORDER BY NVL(VERSIONS_STARTSCN,1);

EMPNO ENAME  SAL  VERSIONS_STARTTIME     VERSIONS_ENDTIME       VERSIONS_XID
----- ------ ---- ---------------------- ---------------------- ----------------
7521  WARD   1250 28-JUN-11 08.48.43 AM  28-JUN-11 08.54.49 AM  0006000800000086
7521  WARD   1875 28-JUN-11 08.54.49 AM  28-JUN-11 09.10.09 AM  0009000500000089
7521  WARD   1250 28-JUN-11 09.10.09 AM                         000800050000008B
You can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09. Also, you can see that the ID of the erroneous transaction that increased WARD's salary to $1875 was "0009000500000089".
3. Query the erroneous transaction and the scope of its effect. With the transaction information (VERSIONS_XID pseudocolumn), you can now query the database to determine the scope of the transaction, using Flashback Transaction Query.
SELECT UNDO_SQL
FROM FLASHBACK_TRANSACTION_QUERY
WHERE XID = HEXTORAW('0009000500000089');

UNDO_SQL
----------------------------------------------------------------------------
update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL';
update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ';
update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF';
update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE';
update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB';

6 rows selected.
You can see that WARD's salary was not the only change that occurred in the transaction. Now you can send the information that was changed for the other four employees at the same time as employee WARD, back to the HR manager for review.
4. Determine if the corrective statements should be executed. If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the database administrator can execute the statements individually.
5. Query the FLASHBACK_TRANSACTION_QUERY view for additional transaction information. For example, to determine the user that performed the erroneous update, issue the following query:
SELECT LOGON_USER
FROM FLASHBACK_TRANSACTION_QUERY
WHERE XID = HEXTORAW('0009000500000089');

LOGON_USER
----------------------------
MSMITH
In this example, the query shows that the user MSMITH was responsible for the erroneous transaction.
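After any corrective statements have been executed, one way to check the result is to compare the current salaries with the values from before the erroneous transaction, using Flashback Query in a join. This is a minimal sketch. The 08:50 timestamp is an assumption chosen to fall before the 08:54:49 change shown in step 2, and the undo data for that time must still be available.

```sql
-- Rows returned here still differ from their state at 08:50.
-- An empty result suggests the salaries match their earlier values.
-- Note: rows whose SAL is NULL on either side are not compared by this predicate.
SELECT now_e.empno,
       now_e.ename,
       then_e.sal AS sal_before,
       now_e.sal  AS sal_now
FROM   emp now_e
JOIN   emp AS OF TIMESTAMP
         TO_TIMESTAMP('28-JUN-11 08:50', 'dd-Mon-yy hh24:mi') then_e
       ON then_e.empno = now_e.empno
WHERE  then_e.sal != now_e.sal;
```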
No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a primary database, a physical standby database, or a logical standby database. These steps are recommended for using Flashback Database:
1. Determine the time or the SCN to which to flash back the database.
2. Verify that there is sufficient flashback log information.
SELECT OLDEST_FLASHBACK_SCN, TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH24:MI:SS') FROM V$FLASHBACK_DATABASE_LOG;
3. Flash back the database to a specific time or SCN. (The database must be mounted to perform a Flashback Database.)
FLASHBACK DATABASE TO SCN scn;
or
FLASHBACK DATABASE TO TIMESTAMP TO_DATE date;
4. Open the database in read-only mode to verify that it is in the correct state.
ALTER DATABASE OPEN READ ONLY;
If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mounted to perform a Flashback Database.) If you want to move forward in time, then issue a statement similar to the following:
RECOVER DATABASE UNTIL [TIME date | CHANGE scn];
5. If there are not sufficient flashback logs to flash back to the target time, then use an alternative:
Use Data Guard to recover to the target time if the standby database lags behind the primary database, or flash back to the target time if there are sufficient flashback logs on the standby.
Restore from backups.
After flashing back a database, any dependent database such as a standby database must be flashed back. See Section 13.3, "Restoring Fault Tolerance".
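The steps above can be sketched as a single SQL*Plus session. This is an illustration under stated assumptions, not a definitive procedure. It assumes a single-instance database, the timestamp is a hypothetical target time, and it adds one detail this section does not state: opening the database read/write after Flashback Database requires the RESETLOGS option.

```sql
-- SQL*Plus session sketch, connected AS SYSDBA.
-- Flashback Database requires the database to be mounted, not open.
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- Hypothetical target time; substitute the time or SCN determined in step 1.
FLASHBACK DATABASE TO TIMESTAMP
  TO_TIMESTAMP('2011-06-28 08:50:00', 'YYYY-MM-DD HH24:MI:SS');

-- Inspect the data before committing to this point in time.
ALTER DATABASE OPEN READ ONLY;

-- When the state has been verified, remount and open for read/write.
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE OPEN RESETLOGS;
```

Opening read-only first keeps the option of flashing back further or recovering forward, because no new redo branch is created until the RESETLOGS open.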
Although Flashback Database does not automatically fix a dropped tablespace, you can use it to significantly reduce the downtime. You can flash back the primary database to a point before the tablespace was dropped, then restore a backup of the corresponding data files from the affected tablespace using SET NEWNAME, and recover to a time before the tablespace was dropped.
To recover a logical database to a point different from the rest of the physical database, when multiple logical databases exist in separate tablespaces of one physical database. For example, you maintain logical databases in the Orders and Personnel tablespaces. An incorrect batch job or DML statement corrupts the data in only one tablespace.
To recover data lost after DDL operations that change the structure of tables. You cannot use Flashback Table to rewind a table to before the point of a structural change such as a truncate table operation.
To recover a table after it has been dropped with the PURGE option.
To recover from the logical corruption of a table.
1. Recover the database that requires the recovery operation using time-based recovery. For example, if a database must be recovered because of a media failure, then recover this database first using time-based recovery. Do not recover the other databases at this point.
2. After you have recovered the database and opened it with the RESETLOGS option, search the alert_SID.log of the database for the RESETLOGS message. Your next step depends on the message that you find in the log file, as described in the following table:
If the message returned is ... "RESETLOGS after complete recovery through change nnn"
Then ... Recovery is complete. You have applied all the changes in the database and performed complete recovery. Do not recover any of the other databases in the distributed system because this unnecessarily removes database changes.
If the message instead reports an incomplete recovery
Then ... You have successfully performed an incomplete recovery. Record the change number from the message and proceed to the next step.
3. Recover or flash back all other databases in the distributed database system using change-based recovery, specifying the change number (SCN) that you recorded in Step 2.
Note:
If a database that is participating in distributed transactions fails, in-doubt distributed transactions may exist in the participating databases. If the failed database recovers completely and communications resume between the databases, then the in-doubt transactions are automatically resolved by the Oracle recoverer (RECO) process. If you cannot wait until the failed database becomes available, you can also manually commit or roll back in-doubt transactions.
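The change-based recovery in Step 3 and the manual handling of in-doubt transactions described in the note can be sketched as follows. This is a minimal illustration. The SCN 1234567 and the transaction ID '1.21.17' are placeholders rather than real values, and whether to commit or roll back an in-doubt transaction has to be decided from the state of the other participating databases.

```sql
-- On each of the other databases (mounted, with a suitable backup restored):
-- recover to the change number recorded in Step 2. 1234567 is a placeholder.
RECOVER DATABASE UNTIL CHANGE 1234567;
-- Alternatively, if Flashback Database is enabled and the logs reach back far enough:
-- FLASHBACK DATABASE TO SCN 1234567;
ALTER DATABASE OPEN RESETLOGS;

-- List distributed transactions that are still in doubt.
SELECT local_tran_id, global_tran_id, state, fail_time
FROM   dba_2pc_pending;

-- Manually resolve one of them. '1.21.17' is a placeholder LOCAL_TRAN_ID.
COMMIT FORCE '1.21.17';
-- or: ROLLBACK FORCE '1.21.17';
```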
See Also:
Oracle Database Backup and Recovery User's Guide for more information about performing time-based recovery
Oracle Database Administrator's Guide for information about how to handle in-doubt transactions and about recovery from distributed transaction failures
For an additional methodology for recovering multiple Oracle databases to a consistent state with local and distributed database transactions, see My Oracle Support Note 1096993.1. The participating databases may be involved in distributed or remote transactions or can be completely independent but are required to be "synchronized" for application consistency. Siebel, PeopleSoft, SAP, and other custom applications that include multiple databases are real-world examples that may require global consistency across multiple databases. For more information, see "Recovery for Global Consistency in an Oracle Distributed Database Environment", in My Oracle Support Note 1096993.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1096993.1
For Oracle Database 11g with Oracle RAC or Oracle RAC One Node
Restoring Failed Nodes or Instances in Oracle RAC and Oracle RAC One Node
For Oracle Database 11g with Data Guard and Oracle Database 11g with Oracle RAC and Data Guard - MAA
Restoring a Standby Database After a Failover
Restoring Oracle ASM Disk Groups after a Failure
Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster
Restoring Fault Tolerance After a Standby Database Data Failure
Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs
Restoring Fault Tolerance After Dual Failures
13.3.1 Restoring Failed Nodes or Instances in Oracle RAC and Oracle RAC One Node
Ensuring that application services fail over quickly and automatically in an Oracle RAC cluster, or between primary and secondary sites, is important when planning for both scheduled and unscheduled outages. Similarly, when using Oracle RAC One Node, you need to make sure that applications fail over to the new instance that starts if an Oracle RAC One Node instance fails. To ensure that the environment is restored to full fault tolerance after any errors or issues are corrected, it is also important to understand the steps and processes for restoring failed instances or nodes within an Oracle RAC cluster or databases between sites. Adding a failed node back into the cluster or restarting a failed Oracle RAC instance or Oracle RAC One Node instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, you should also consider:
When to perform these tasks to incur minimal or no effect on the current running environment
Failing back or rebalancing existing connections
After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the Oracle RAC environment at any time. For Oracle RAC One Node, you can also restart a failed instance and go back to running the instance on the original node. Processing to complete the reconfiguration of a node may require additional system resources. Table 13-8 summarizes additional processing that may be required when adding a node.
Table 13-8 Additional Processing When Restarting or Rejoining a Node or Instance
Action: Restarting a node or rejoining a node into a cluster
Additional Resources: When using only Oracle Clusterware, there is no impact when a node joins the cluster. When using vendor clusterware, there may be performance degradation while reconfiguration occurs to add a node back into the cluster. The impact on current applications should be evaluated with a full test workload.
Action: Restarting or rejoining of an Oracle RAC instance
Additional Resources: When you restart an Oracle RAC instance, there might be some potential performance impact while lock reconfiguration takes place. The impact on current applications is usually minimal, but it should be evaluated with a full test workload.
Recovering Service Availability for Oracle RAC
Recovering Service Availability for Oracle RAC One Node
Considerations for Client Connections After Restoring an Oracle RAC Instance
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for more information about restarting an Oracle RAC instance
Your vendor-specified cluster management documentation for detailed steps on how to start and join a node back into a cluster
After a failed node has been brought back into the cluster and its instance has been started, Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance. A particular service might or might not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.
For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.
Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because the service is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.
See Also: Oracle Real Application Clusters Administration and Deployment Guide
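To see where a service is actually running after an instance has been restored, you can query the cluster-wide active services view. This is a minimal sketch. The service name 'hr' is illustrative and follows the HR example above, and the comparison is made case-insensitive because the stored name depends on how the service was created.

```sql
-- Show which instances are currently offering the service.
-- 'hr' is an illustrative service name.
SELECT inst_id, name
FROM   gv$active_services
WHERE  LOWER(name) = 'hr'
ORDER  BY inst_id;
```

In the first scenario above, the result would be expected to list the instance to which the service was moved rather than the restarted preferred instance.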
13.3.1.3 Considerations for Client Connections After Restoring an Oracle RAC Instance
After an Oracle RAC instance has been restored, additional steps might be required, depending on the current resource usage and system performance, the application configuration, and the network load balancing that has been implemented. Existing connections on the surviving Oracle RAC instances, which might have failed over or started as new sessions, are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users might or might not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable response times for the workload. If the surviving Oracle RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it might be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.
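One input to the decision about redistributing connections is how user sessions are currently spread across instances and services. The following query is a minimal sketch of such a check. It counts sessions only, so it should be read together with resource utilization and response time measurements rather than on its own.

```sql
-- Count user sessions per instance and service across the cluster.
SELECT inst_id,
       service_name,
       COUNT(*) AS session_count
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY inst_id, service_name
ORDER  BY service_name, inst_id;
```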
Note:
In Oracle RAC One Node there is only one instance for a database (unless you are migrating). Thus an Oracle RAC One Node configuration does not require you to rethink the strategy for 'rebalancing' the connections as there is only one. Clients using Oracle RAC One Node need to be able to work with FAN and other client and service facilities in order to be informed about the status of services.
Connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the connections are automatically load-balanced over time. An application service can be:
Managed with services running on a subset of Oracle RAC instances
Nonpartitioned so that all services run equally across all nodes
This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For cases where an application is partitioned or has a combination of partitioning and nonpartitioning, you should consider the response time and availability aspects for each service. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads automatically using Oracle Universal Connection Pool (UCP). If you are using UCP, then connections are automatically redistributed to the new node.
Note:
Oracle Universal Connection Pool (UCP) provides fast and automatic detection of connection failures and removes terminated connections for any Java application using Fast Connection Failover and FAN events.
For load-balancing application services across multiple Oracle RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. These features do not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there might be limitations in distinguishing separate application services (which Oracle Net Services understands) and in restoring an instance or a node. For example, when a node or instance is restored and available to start receiving connections, a manual step might be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.
Table 13-9 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or are a combination of both. The actual redistribution of existing connections might or might not be required depending on the resource utilization and response times.
Table 13-9 Restoration and Connection Failback
Application Services: Partitioned
Failback or Restore Existing Connections: Existing sessions are not automatically relocated back to the restored instance. Use the SRVCTL utility to manually start, stop, and relocate services. See Oracle Real Application Clusters Administration and Deployment Guide "Administering Services" for more information.
Failback or Restore New Connections: Automatically routes to the restored instance by using the Oracle Net Services configuration.
Application Services: Nonpartitioned
Failback or Restore Existing Connections: No action is necessary unless the load must be rebalanced, because restoring the instance means that the load there is low. If the load must be rebalanced, then the same problems are encountered as if application services were partitioned.
Failback or Restore New Connections: Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration.
Figure 13-6 shows a two-node partitioned Oracle RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.
[Figure 13-6: Node 1 (Instance 1) and Node 2 (Instance 2), the HR Service and the Sales Service, the shared RAC Database, and heartbeat (hb) links]
Figure 13-7 shows what happens when one Oracle RAC instance fails.
[Figure 13-7: Node 1 (Instance 1) and Node 2 (Instance 2), the shared RAC Database, and heartbeat (hb) links]
If one Oracle RAC instance fails, then the service and existing client connections can be automatically failed over to another Oracle RAC instance. In this example, the HR and Sales services are both supported by the remaining Oracle RAC instance. In addition, you can route new client connections for the Sales service to the instance now supporting this service. After the failed instance has been repaired and restored to the state shown in Figure 13-6 and the Sales service is relocated to the restored instance, you might need to identify and fail back any failed-over clients and any new clients that had connected to the Sales service on the failed-over instance. Client connections that started after the instance has been restored should automatically connect back to the original instance. Therefore, over time, as older connections disconnect and new sessions connect to the Sales service, the client load migrates back to the restored instance. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.
Figure 13-8 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.
[Figure 13-8: nonpartitioned Oracle RAC database; surviving labels are Node 2 (Instance 2), Sales Service, RAC Database, and Heartbeat (hb)]
If one Oracle RAC instance fails, then Oracle Clusterware moves the services that were running on the failed instance. In addition, new client connections are accepted only on the remaining instances that offer the service. After the failed instance has been repaired and restored to the state shown in Figure 13-8, some clients might have to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, rebalancing is necessary only if a single instance is not able to adequately service the requests. Client connections that started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load evenly balances again across all available Oracle RAC instances. Rebalancing the load immediately after restoration depends on the resource usage and application response times.
the original primary database as a standby database. Reinstatement restores high availability to the broker configuration so that, in the event of a failure of the new primary database, another fast-start failover can occur. The reinstated database can act as the fast-start failover target for the new primary database, making a subsequent fast-start failover possible. The standby database is a viable target of a failover when it begins applying redo data received from the new primary database.

The FastStartFailoverAutoReinstate configuration property controls whether the observer automatically reinstates the original primary database after a fast-start failover that was initiated because the primary database was isolated for longer than the number of seconds specified by the FastStartFailoverThreshold property. To prevent automatic reinstatement (for example, to perform diagnostic or repair work after the failover has completed), set the FastStartFailoverAutoReinstate configuration property to FALSE.

To reinstate the original primary database, the database must be started and mounted, but it cannot be opened. The broker reinstates the database as a standby database of the same type (physical or logical) as the original standby database. If the original primary database cannot be reinstated automatically, you can manually reinstate it using either the DGMGRL REINSTATE command or Enterprise Manager. Step-by-step instructions for manual reinstatement are described in Oracle Data Guard Broker.

Standby databases do not have to be re-created if you use the Oracle Flashback Database feature. Flashback Database has the following advantages:

Saves hours of database restoration time
Reduces overall complexity in restoring fault tolerance
Reduces the time that the system is vulnerable because the standby database is re-created more quickly
See Also:
Oracle Data Guard Concepts and Administration for information on Flashing Back a Failed Primary Database into a Physical Standby Database
Oracle Data Guard Concepts and Administration for information on Flashing Back a Failed Primary Database into a Logical Standby Database
Reinstating the Original Primary Database After a Fast-Start Failover
Reinstating a Standby Database Using Enterprise Manager After a Failover
broker simplifies switchovers and failovers by allowing you to invoke them using a single key click in Oracle Enterprise Manager, as shown in Figure 13-9.
Figure 13-9 Fast-Start Failover and the Observer Are Successfully Enabled
Figure 13-10 Reinstating the Original Primary Database After a Fast-Start Failover
13.3.4 Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster
After performing the planned maintenance on the secondary site, the standby database and log apply services must be restarted, and then the Data Guard redo transport services automatically catch up the standby database with the primary database. You can use Enterprise Manager and the broker to monitor the Data Guard state. The following steps are required to restore full fault tolerance after planned downtime on a secondary site or clusterwide outage:
Note:
The following steps can be accomplished manually (as described below) or automatically using Enterprise Manager.
1. Start the standby database

You might have to restore the standby database from local backups, local tape backups, or from the primary site backups if the data in the secondary site has been damaged. Re-create the standby database from the new primary database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration. After the standby database has been reestablished, start the standby database.

Table 13-10 SQL Statements for Starting Standby Databases
SQL Statement
STARTUP MOUNT;
STARTUP;
STARTUP;

Table 13-11
Verify redo transport services on the primary database

You might have to reenable the primary database remote archive destination. Query the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:

SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL
FROM V$ARCHIVE_DEST_STATUS;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE;
ALTER SYSTEM ARCHIVE LOG CURRENT;

Verify redo transport services between the primary and standby databases by checking for errors. Query the V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS views:

SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR
FROM V$ARCHIVE_DEST;
SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';

4.
For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the redo from the archived redo log files:

SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS
FROM V$MANAGED_STANDBY;

For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the redo from the archived redo logs:

SELECT THREAD#, SEQUENCE# SEQ#
FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG
WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE#
ORDER BY NEXT_CHANGE#;
5. Restore primary database protection mode

If you had to change the protection mode of the primary database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the primary database protection mode back to maximum protection, depending on your business requirements.

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
See Also:
Use RMAN and Block Media Recovery (described in Section 13.2.6.3, "Use RMAN and Block Media Recovery")
Use RMAN and Datafile Media Recovery (described in Section 13.2.6.5, "Use RMAN and Datafile Media Recovery")
Re-Create Objects Manually for logical standby databases only (described in Section 14.2.6.7, "Re-Create Objects Manually")
If you had to change the protection mode of the primary database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the primary database protection mode back to maximum protection (depending on your business requirements).
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
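After the protection mode is changed, the setting can be checked on the primary database. The query below is a small sketch using the V$DATABASE view; it assumes you are connected to the primary database with sufficient privileges.

```sql
-- PROTECTION_MODE is the mode configured for the database;
-- PROTECTION_LEVEL is the level of protection currently in effect.
SELECT protection_mode,
       protection_level
FROM   v$database;
```

If the two columns differ, the configured mode is not yet in effect, for example because the standby database has not finished resynchronizing with the primary database.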
13.3.6 Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs
If the primary database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database might require additional maintenance. No additional work is required if the primary database completed recovery with no resetlogs. After opening the primary database with the RESETLOGS option, execute the queries shown in Table 13-12.
Table 13-12 Queries to Determine RESETLOGS SCN and Current SCN OPEN RESETLOGS

Database            Query
Primary             SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATABASE;
Physical standby    SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;
Logical standby     SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
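For a physical standby database, the two values can be compared directly to decide whether the standby is behind or ahead of the primary database's resetlogs SCN. The following is a sketch only: `&flashback_scn` is a SQL*Plus substitution variable introduced here for illustration, and it stands for the value returned by the primary database query.

```sql
-- Step 1 (on the primary): SCN that is 2 less than the RESETLOGS SCN.
SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) AS flashback_scn FROM V$DATABASE;

-- Step 2 (on the physical standby): compare the standby's current SCN with
-- the value from step 1, entered when SQL*Plus prompts for flashback_scn.
SELECT CASE
         WHEN CURRENT_SCN > &flashback_scn THEN 'AHEAD'
         ELSE 'NOT AHEAD'
       END AS standby_position
FROM   V$DATABASE;
```

A result of AHEAD means the standby has applied redo beyond the point at which the primary database was opened with RESETLOGS, so it must be flashed back before Redo Apply can continue.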
Table 13-13 shows the actions you take to restore fault tolerance if the standby database is behind the primary database's resetlogs SCN.
Physical standby
Ensure that the standby database has received an archived redo log file from the primary database. See Also: "Verify redo transport services on the primary database" on page 13-47
2.
Logical standby
Ensure that the standby database has received an archived redo log file from the primary database. See Also: "Verify redo transport services on the primary database" on page 13-47
Table 13-14 shows the actions you take to restore fault tolerance if the standby database is ahead of the primary database's resetlogs SCN.

Table 13-14 SCN on the Standby is Ahead of Resetlogs SCN on the Primary

Database: Physical standby
Action:
1. Ensure that the standby database has received an archived redo log file from the primary database. See Also: "Verify redo transport services on the primary database" on page 13-47
2. Issue the SHUTDOWN IMMEDIATE statement, if necessary.
3. Issue the STARTUP MOUNT statement.
4. Issue the FLASHBACK DATABASE TO SCN flashback_scn statement, where flashback_scn is the SCN returned from the primary database query in Table 13-12. The SCN returned from the primary database query is 2 less than the RESETLOGS_CHANGE#.
5. Restart Redo Apply with or without real-time apply:
With real-time apply:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
Without real-time apply:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
Database: Logical standby
Action:
1. Determine the SCN at the primary database. On the primary database, use the following query to obtain the value of the system change number (SCN) that is 2 SCNs before the RESETLOGS operation occurred on the primary database:
SQL> SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) AS FLASHBACK_SCN FROM V$DATABASE;
2. Determine the target SCN for the flashback operation at the logical standby:
SQL> SELECT DBMS_LOGSTDBY.MAP_PRIMARY_SCN (PRIMARY_SCN => FLASHBACK_SCN)
2> AS TARGET_SCN FROM DUAL;
3. Flash back the logical standby to the TARGET_SCN returned. Issue the following SQL statements to flash back the logical standby database to the specified SCN, and open the logical standby database with the RESETLOGS option:
SQL> SHUTDOWN;
SQL> STARTUP MOUNT EXCLUSIVE;
SQL> FLASHBACK DATABASE TO SCN TARGET_SCN;
SQL> ALTER DATABASE OPEN RESETLOGS;
4. Start SQL Apply:
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
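The flashback steps for both standby types work only if Flashback Database is enabled on the standby and the flashback logs reach back far enough. Before you shut down and flash back a standby database, a check along the following lines can confirm this. It is a sketch that assumes it is run on the standby database.

```sql
-- Confirm that Flashback Database is enabled on the standby.
SELECT flashback_on FROM v$database;

-- The target SCN for FLASHBACK DATABASE must not be lower than
-- OLDEST_FLASHBACK_SCN, the oldest SCN still covered by the flashback logs.
SELECT oldest_flashback_scn,
       oldest_flashback_time
FROM   v$flashback_database_log;
```

If the target SCN is older than OLDEST_FLASHBACK_SCN, the standby cannot be flashed back to it and would have to be re-created instead.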
Available Backups Local backup on primary and standby databases Local backup only on standby database. Tape backups on standby database. Tape backups only
See Also: Oracle Data Guard Concepts and Administration for the steps for creating a standby database after the primary database is re-created
14 Reducing Downtime for Planned Maintenance
This chapter describes scheduled outages and the Oracle operational best practices that can tolerate or manage each outage type and minimize downtime. This chapter contains these topics:
Hardware maintenance, repair, and upgrades
Software upgrades and patching
Application (programmatic) changes, patches, and upgrades
Changes to improve performance and manageability of systems

You can implement many of these tasks while maintaining continuous application availability. The following sections provide best practice recommendations for reducing scheduled outages on the primary and secondary sites:

Managing Scheduled Outages on the Primary Site
Managing Scheduled Outages On the Secondary Site
Table 14-1 Solutions for Scheduled Outages on the Primary Site

Description and Examples: Maintenance performed on the entire site where the current primary database resides; the site is unavailable. Usually known well in advance.
Preferred Oracle Solution: Section 14.2.1, "Site, Hardware, and Software Maintenance Using Database Switchover"
Scheduled power outages
Site maintenance
Regular planned switchovers to test infrastructure
Section 14.2.1, "Site, Hardware, and Software Maintenance Using Database Switchover"
< 5 minutes

Hardware maintenance or system software maintenance that impacts the entire database cluster
Upgrade of the cluster interconnect
Upgrade to the storage tier that requires downtime on the database tier

Hardware maintenance or system software maintenance that impacts a subset of the database cluster
Hardware maintenance or system software maintenance on a database server. The scope of the downtime is restricted to a node of the database cluster.
Oracle RAC service relocation (see Section 14.2.11, "Automatic Workload Management for System Maintenance")
No downtime
Proactive replacement of RAID card battery
Addition of memory or CPU to an existing node in the database tier
Upgrade of a software component such as the operating system
Changes to the configuration parameters for the operating system
Planned Maintenance: Perform patch set, maintenance, or major upgrade to Oracle Grid Infrastructure (includes Oracle Clusterware and Oracle ASM)
Description and Examples: Software maintenance of Grid Infrastructure.
Patch set upgrade Grid from 11g Release 2 (11.2.0.1) to 11g Release 2 (11.2.0.2) Patch Set 1
Maintenance release upgrade from 11g Release 1 to 11g Release 2
Major release upgrade from 10g to 11g
Section 14.2.5, "Grid Infrastructure Maintenance" or Section 14.2.1, "Site, Hardware, and Software Maintenance Using Database Switchover" or Section 14.2.3, "Data Guard Standby-First Patch Apply"
< 5 minutes
See Oracle Database 2 Day + Real Application Clusters Guide and see your platform-specific Oracle Grid Infrastructure Installation Guide for complete details, in the appendix, "How to Upgrade to Oracle Grid Infrastructure"

Planned Maintenance: Perform patch set, maintenance, or major upgrade to Oracle Database
Description and Examples: Software maintenance of Oracle Database.
Patch set upgrade Grid from 11g Release 2 (11.2.0.1) to 11g Release 2 (11.2.0.2) Patch Set 1
Maintenance release upgrade from 11g Release 1 to 11g Release 2
Major release upgrade from 10g to 11g
Oracle Database rolling upgrade with Data Guard SQL Apply (see Section 14.2.6.2, "Upgrading with Data Guard SQL Apply or Transient Logical Standby Database") or Oracle GoldenGate (see Section 14.2.6.3, "Upgrading with Oracle GoldenGate")
Oracle RAC rolling patch upgrade using OPatch (see Section 14.2.4, "Oracle RAC Patches") or Section 14.2.3, "Data Guard Standby-First Patch Apply"
< 5 minutes

Apply Patch Set Update (PSU), Critical Patch Update (CPU), or patch bundle
No downtime
Installation of Patch Set Update 11.2.0.2.3
Installation of 11.2.0.2 Grid Infrastructure Bundle 1
Installation of Critical Patch Update July 2011
Installation of Exadata Database Bundle Patch 8
Planned Maintenance: Apply Oracle interim ("one-off") or diagnostic patch
Description and Examples: Patch Oracle software to fix a specific customer issue.
Oracle RAC rolling patch upgrade using OPatch (see Section 14.2.4, "Oracle RAC Patches") or Section 14.2.3, "Data Guard Standby-First Patch Apply" or
No downtime
No downtime

Changes to the logical structure or the physical organization of Oracle Database objects, primarily to improve performance or manageability. Changes to the data or schema. Using the Oracle Database online redefinition feature enables objects to be available during the reorganization or redefinition.
Online object reorganization with DBMS_REDEFINITION (see Section 14.2.10, "Data Reorganization and Redefinition")
Moving an object to a different tablespace
Converting a table to a partitioned table
Add, modify, or drop one or more columns in a table or cluster

Planned Maintenance: Database storage maintenance
Description and Examples: Maintenance of storage where database files reside.
Preferred Oracle Solution: Online storage maintenance using Oracle ASM (see Section 14.2.5.2, "Storage Maintenance")
Converting to Oracle ASM
Adding or removing storage
Patching or upgrading storage firmware or software

Changing operating system platform of the primary and standby databases.
Changing physical location of the primary database
Moving to the Linux operating system
Moving the primary database from one data center to another

Application changes
Application upgrades
Section 14.2.8, "Edition-Based Redefinition for Online Application Maintenance and Upgrades" or Section 14.2.6.3, "Upgrading with Oracle GoldenGate"
< 5 minutes
If maximum protection database mode is configured and there is only one standby database protecting the primary database, then you must downgrade the protection mode before scheduled outages on the standby instance or database so that there is no downtime on the primary database. If maximum protection database mode is configured and there are multiple standby databases, there is no need to downgrade the protection mode if at least one standby database that is configured with the LGWR SYNC AFFIRM attributes is available, and to which the primary database can transmit redo data.
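Before a scheduled outage on a standby, it can be useful to see which standby destinations currently use synchronous, affirmed redo transport, and then decide whether the protection mode must be downgraded. The statements below are a sketch run on the primary database; whether MAXIMIZE AVAILABILITY or MAXIMIZE PERFORMANCE is the right downgrade target is an assumption that depends on your business requirements.

```sql
-- List active standby destinations with their transport mode and AFFIRM setting.
SELECT dest_id,
       dest_name,
       status,
       transmit_mode,
       affirm
FROM   v$archive_dest
WHERE  target = 'STANDBY'
AND    status <> 'INACTIVE';

-- If the standby about to be taken down is the only synchronous destination,
-- downgrade the protection mode before the outage (example target mode only).
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;
```

If another valid destination with synchronous transport and AFFIRM set remains available, the downgrade step can be skipped.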
When scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time that the standby database lags behind the primary database, which in turn lengthens the time to restore fault tolerance. See Section 8.2, "Determine Protection Mode and Data Guard Transport" for an overview of the Data Guard protection modes. Table 14-2 describes the steps for performing scheduled outages on the secondary site.

Table 14-2 Managing Scheduled Outages on the Secondary Site

Oracle Database 11g with Data Guard
Before the outage: Section 14.1.2, "Managing Scheduled Outages On the Secondary Site"
After the outage: Section 13.3.4, "Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster"
Oracle Database 11g - MAA
Before the outage: Section 14.1.2, "Managing Scheduled Outages On the Secondary Site"
After the outage: Section 13.3.4, "Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster"
Before the outage: Section 14.1.2, "Managing Scheduled Outages On the Secondary Site"

Hardware or non-Oracle database software maintenance on the node that is running the managed recovery process (MRP)
Hardware or non-Oracle database software maintenance on a node that is not running the MRP
Before the outage: Section 14.1.2, "Managing Scheduled Outages On the Secondary Site"
Not applicable
No effect because the primary standby node or instance receives redo logs that are applied with the managed recovery process
After the outage: Restart node and instance, when available
Not applicable
Before the outage: Section 14.1.2, "Managing Scheduled Outages On the Secondary Site" After the outage: Section 13.3.4, "Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster"
Downtime needed for upgrade, but there is no effect on the primary node unless the configuration is in maximum protection database mode
Downtime needed for upgrade, but there is no effect on the primary node unless the configuration is in maximum protection database mode
Site, Hardware, and Software Maintenance Using Database Switchover
Online Patching
Data Guard Standby-First Patch Apply
Oracle RAC Patches
Storage Maintenance
Database Upgrades
Database Platform or Location Migration
Edition-Based Redefinition for Online Application Maintenance and Upgrades
Oracle GoldenGate for Online Application Upgrades
Data Reorganization and Redefinition
Automatic Workload Management for System Maintenance
Before performing any update to your system, Oracle recommends you perform extensive testing.
Scheduled maintenance such as hardware maintenance or firmware patches on the primary host
Resolution of data failures when the primary database is still opened
Testing and validating the secondary resources, as a means to test disaster recovery readiness
When using SQL Apply to perform a rolling upgrade (see Section 14.2.6.2, "Upgrading with Data Guard SQL Apply or Transient Logical Standby Database")

Archived redo log files that are needed for apply are missing
A point-in-time recovery is required
The primary database is not open and cannot be opened

Using Oracle Enterprise Manager, as described in Oracle Data Guard Broker
Using the DGMGRL command-line interface, as described in Oracle Data Guard Broker
Using SQL*Plus:
Role Transitions Involving Physical Standby Databases: See Oracle Data Guard Concepts and Administration for detailed steps on Role Transitions Involving Physical Standby Databases.
Role Transitions Involving Logical Standby Databases: See Oracle Data Guard Concepts and Administration for detailed steps on Role Transitions Involving Logical Standby Databases.

After performing the Data Guard switchover, do the following:

If the database is moved to the secondary site and the application tier is also moved to the secondary site, perform complete site failover. For more information, see Section 13.2.1, "Complete Site Failover (Failover to Secondary Site)."
If only the database is moved to the secondary site, perform application failover. See Section 13.2.4, "Application Failover" for more information.
See the patch README for details on whether a patch is online installable.

During the next scheduled maintenance, when instances can be shut down, roll back all online patches and apply the patches in an offline manner.
Patches that are online installable should be installed in an online manner when the patch needs to be applied urgently and downtime cannot be taken to apply the patch. If instance downtime is acceptable, then apply the patch in an offline manner (as described in the patch README).
Apply the patch to one instance at a time.
When rolling back online patches, ensure all patched instances are included to avoid the dangerous and confusing situation of having different software across instances using the same $ORACLE_HOME.
Assess memory impact on a test system before deploying to production (for example, using the pmap command).
Never remove the $ORACLE_HOME/hpatch directory.
See Also:
Oracle Universal Installer and OPatch User's Guide for Windows and UNIX for information on Patching Oracle Software with OPatch
For the most up-to-date information about online patching, installation and rollback, see "RDBMS Online Patching - Hot Patching" in My Oracle Support Note 761111.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=761111.1
Qualified Oracle patches applied to the database home can be applied and tested on the standby database first. Examples of Oracle database software in this category are:
Exadata Database Bundle Patch
Patch Set Update (PSU)
Critical Patch Update (CPU)
Interim (one-off) patches

Any other Oracle or system software can be applied and tested on the standby database first. Examples of software in this category are:
Oracle patches applied to the grid home
Operating system patches and firmware
Storage patches
Network patches
If the standby database shares infrastructure or server components with the primary database, then you cannot evaluate patches to the shared components in a manner that will reduce risk to the primary database. For example, if you have a standby database running on a cluster separate from the primary database but it shares the same storage grid as the primary database, then you cannot patch the standby storage first without affecting the primary database.

The following are the advantages of Oracle Data Guard Standby-First Patch Apply:

Ability to apply software changes to the physical standby database for recovery, backup, or query validation before role transition, or before application on the primary production database. This mitigates risk and potential downtime on the production database.
Ability to switch over to the targeted database after completing validation with reduced risk and minimum downtime.
Ability to switch back, also known as fallback, if there are any major stability or performance regressions.
Oracle patch sets and major release upgrades do not qualify for Data Guard Standby-First Patch Apply. Use the Data Guard transient logical standby method for patch sets and major releases. For more information, see Section 14.2.6.2, "Upgrading with Data Guard SQL Apply or Transient Logical Standby Database".
See Also:
Oracle Database Upgrade Guide for information on Considerations for Downgrading and Compatibility and the Oracle Database COMPATIBLE parameter
Oracle Automatic Storage Management Administrator's Guide for information on Disk Group Compatibility Attributes
See "Oracle Patch Assurance - Data Guard Standby-First Patch Apply" in My Oracle Support Note 1265700.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265700.1
can be applied in an Oracle RAC rolling manner. Typically, patches that can be installed in a rolling manner include:
Exadata Database Bundle Patches
Patch Set Update (PSU)
Critical Patch Update (CPU)
Interim (one-off) patches
Diagnostic patches
Rolling upgrade of patches is currently available for one-off patches only. Rolling upgrade is not available for patch sets. Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.
14.2.4.1 Best Practices to Minimize Downtime for All Database Patch Upgrades
Use the following recommended practices for all database patch upgrades:
Always confirm with Oracle Support Services that the patch is valid for your problem and for your deployment environment.
Have a plan for applying the patch and a plan for backing out the patch.
Apply the patch to your test environment first and verify that it fixes the problem.
When you plan the elapsed time for applying the patch, include time for starting up and shutting down the other tiers of your technology stack if necessary.
If the patch is not a candidate for Oracle RAC rolling upgrade and you can incur the downtime for applying the patch, go to Section 14.2.6, "Database Upgrades" on page 14-16 to assess whether other solutions are feasible.
If multiple instances share an Oracle home, then all of them are affected by application of a patch. Administrators should verify that this does not cause unintentional side effects. Also, you must shut down all such instances on a node during the patch application. You must take this into account when scheduling a planned outage. As a best practice, only similar applications should share an Oracle home on a node. This provides greater flexibility for patching.

The Oracle inventory on each node is the repository that keeps a central inventory of all Oracle software installed. The inventory is node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle inventory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. Oracle inventory should be backed up after each patch installation to any Oracle software on a specific node. This applies to the Oracle inventory on each node of the cluster.

Use the Oracle Universal Installer to install all Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing Oracle RAC cluster. However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can also be added with this option.
The nature of the Oracle rolling patch upgrade enables it to be applied to only some nodes of the Oracle RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the Oracle RAC deployment is activated. A mixed environment is useful if a patch must be tested before deploying it to all the instances. Applying the patch with the -local option is the recommended way to do this. In the interest of keeping all instances of the Oracle RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the Oracle RAC installation. When instances of an Oracle RAC cluster have similar patch software, services can be migrated among instances without running into the problem a patch might have fixed.
Maintain all patches (including those applied by rolling upgrades) online and do not remove them after they have been applied. Keeping the patches is useful if a patch must be rolled back or applied again. Store the patches in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.
Perform rolling patch upgrades, just like any other patch upgrade, when no other patch upgrade or Oracle installation is being performed on the node. The application of multiple patches is a sequential process, so plan the scheduled outage accordingly. If you must apply multiple patches at the same time but only some patches are eligible for rolling upgrade, then apply all of the patches in a nonrolling manner. This reduces the overall time required to accomplish the patching process. For patches that are not eligible for rolling upgrade, the next best option for Oracle RAC deployments is the MINIMIZE_DOWNTIME option of the APPLY command. Perform the rolling upgrade when system usage is low to ensure minimal disruption of service for the end users.
See Also: Oracle Universal Installer and OPatch User's Guide for Windows and UNIX for more information about the opatch utility
Major release
Maintenance release
Patch set (beginning with 11g Release 2)

The following software installations are performed in-place by the OPatch utility. OPatch installs the software update into an existing ORACLE_HOME by overwriting existing software with updated software from the patch being installed:

Interim patch installation
Bundle patch installation
Patch Set Update (PSU) installation
Critical Patch Update (CPU) installation
Diagnostic patch installation

Advantages of out-of-place patching

Applications remain available while software is upgraded in the new ORACLE_HOME.
The configuration inside the ORACLE_HOME is retained because the cloning procedure involves physically copying the software (examples are files such as LISTENER.ORA, TNSNAMES.ORA, and INITSID.ORA).
It is easier to roll back or test between the original ORACLE_HOME and the patched ORACLE_HOME.
When consolidating, you could have multiple versions of ORACLE_HOME, so this option should better support consolidation.

Considerations for using out-of-place patching

When performing out-of-place patch installation with cloning, you must change any ORACLE_HOME environment variable hard coded in application code and Oracle-specific scripts.
Out-of-place patching requires more disk space than in-place patching.

Out-of-place patching with OPatch

Traditionally, patches installed with OPatch are done in-place, which means that the new code is applied directly over the old code. The disadvantages of in-place patching are:

The application cannot connect to the database while new code is being installed.
If patch rollback is required, the application cannot connect to the database while old code is being reinstalled.
Note: This downside to an in-place database patch set upgrade does not apply when you use Standby-First Patch apply.
For more information, see Section 14.2.3, "Data Guard Standby-First Patch Apply." Software installation performed by OPatch to the Oracle Database software home or the Grid Infrastructure software home can be performed out-of-place by using ORACLE_HOME cloning techniques to copy the software to a new home directory before applying a patch to the new ORACLE_HOME with OPatch. The high-level approach to perform out-of-place patching is:
1. Clone the active ORACLE_HOME to a new ORACLE_HOME.
2. Patch the new ORACLE_HOME.
3. Switch to make the new ORACLE_HOME the active software home. This can be done in a rolling manner one node at a time.
See Also: For details about out-of-place patching, see "Minimal downtime patching via cloning 11gR2 ORACLE_HOME directories on Oracle Database Machine" in My Oracle Support Note 1136544.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1136544.1
Oracle Grid Infrastructure Installation Guide for your platform for complete details, in the Appendix, "How to Upgrade to Oracle Grid Infrastructure"
Migrating to Oracle ASM Storage Adding and Removing Storage Upgrading Oracle ASM Nodes
14.2.5.2.1 Migrating to Oracle ASM Storage

If you have an existing Oracle database that stores database files on a file system or on raw devices, you can migrate some or all of these database files to Oracle ASM. To minimize downtime, use a physical standby database to migrate data to Oracle ASM storage. Use Oracle Recovery Manager (RMAN) or the ASMCMD utility to migrate to Oracle ASM with very little downtime; both allow you to copy individual files into Oracle ASM.
For complete migrations, Oracle Data Guard or Oracle GoldenGate are better alternatives to migrate to Oracle ASM with even less downtime (migration occurs in approximately the same amount of time it takes to perform a switchover).
See Also:
Oracle Database Backup and Recovery User's Guide for information about performing Oracle ASM data migration using RMAN The MAA white paper: "Minimal Downtime Migration to ASM" at
http://www.oracle.com/goto/maa
14.2.5.2.2 Adding and Removing Storage

Disks can be added to and removed from Oracle ASM with no downtime. When disks are added or removed, Oracle ASM automatically starts a rebalance operation to evenly spread the disk group contents over all drives in the disk group. The best practices for adding or removing storage include:
- Make sure your host operating system and storage hardware can support adding and removing storage with no downtime before using Oracle ASM to do so.
- Use a single ALTER DISKGROUP command when adding or removing multiple disk drives. A single command triggers only one rebalance operation, whereas separate drop and add commands trigger two or more. For more information, see Section 4.5.4, "Use a Single Command to Add or Remove Storage". For example, if the storage maintenance is to add drives and remove existing drives, use a single ALTER DISKGROUP command with the DROP DISK clause to remove the existing drives and the ADD DISK clause to add the drives:

  ALTER DISKGROUP data
    DROP DISK diska5
    ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9;

- When dropping disks from a disk group, specify the WAIT option in the REBALANCE clause so the ALTER DISKGROUP statement does not return until the contents of the drives being dropped have been moved to other drives. After the statement completes, the drives can be safely removed from the system. For example:

  ALTER DISKGROUP data
    DROP DISK diska5
    ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9
    REBALANCE WAIT;

- When dropping disks in a normal or high redundancy disk group, ensure there is enough free disk space in the disk group to reconstruct full redundancy.
- Monitor the progress of rebalance operations using Enterprise Manager or by querying V$ASM_OPERATION.
- For long-running rebalance operations that occur during periods of low database activity, increase the rebalance power limit to reduce the rebalance time.
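The monitoring and power-limit practices above can be sketched as follows. This is a minimal illustration run from the Oracle ASM instance; the disk group name data follows the earlier example, and the power value 8 is only an assumed setting, not a recommendation.

```sql
-- Check the progress of any running rebalance.
-- No rows returned means no rebalance operation is currently active.
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;

-- Raise the rebalance power for the disk group during a period of low
-- database activity. The value 8 is illustrative only.
ALTER DISKGROUP data REBALANCE POWER 8;
```

A higher power value shortens the rebalance at the cost of more I/O, which is why the text ties it to periods of low database activity.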
14.2.5.2.3 Upgrading Oracle ASM Nodes

Perform an Oracle ASM rolling upgrade to independently upgrade or patch clustered Oracle ASM nodes without affecting database availability, thus providing greater uptime. You can use Oracle ASM rolling upgrades only to upgrade clustered Oracle ASM instances for environments running Oracle Database 11g or later releases.
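As a rough sketch, an Oracle ASM rolling upgrade is bracketed by two statements issued from an Oracle ASM instance. The version string below is a placeholder for the target release, and in a Grid Infrastructure installation the upgrade tooling may issue the equivalent steps for you, so treat this as an illustration rather than a complete procedure.

```sql
-- Before upgrading the first node: place the Oracle ASM cluster in
-- rolling migration mode. '11.2.0.2.0' is a placeholder target version.
ALTER SYSTEM START ROLLING MIGRATION TO '11.2.0.2.0';

-- ... shut down, upgrade or patch, and restart each node in turn ...

-- After the last node has been upgraded: return the cluster to normal
-- operation.
ALTER SYSTEM STOP ROLLING MIGRATION;
```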
See Also:
Oracle Automatic Storage Management Administrator's Guide for complete information on Using Oracle ASM Rolling Upgrades
Upgrading with Database Upgrade Assistant (DBUA) Upgrading with Data Guard SQL Apply or Transient Logical Standby Database Upgrading with Oracle GoldenGate Upgrading with Transportable Tablespaces
The method you choose to perform database upgrades can vary depending on the following considerations:
- Downtime required to complete the upgrade
- Setup time and effort required before the downtime
- Temporary additional resources necessary (for example, disk space or CPU)
- Complexity of the steps required to complete the upgrade
Table 14-3 lists the methods that you can use for database upgrades, and recommends what method to use for particular cases.

Table 14-3 Database Upgrade Options

Upgrade Method: Upgrading with Database Upgrade Assistant (DBUA)
Use This Method When: Recommended method when the maintenance window is sufficient or when data type constraints prohibit the use of the other methods in this table.

Upgrade Method: Upgrading with Data Guard SQL Apply or Transient Logical Standby Database
Use This Method When: DBUA cannot finish within the maintenance window and the database is not a candidate for Oracle RAC rolling patch upgrade. Use a transient logical standby when the configuration has only a physical standby database.

Upgrade Method: Upgrading with Oracle GoldenGate
Use This Method When: Oracle GoldenGate is already used for complete database replication, or the database version predates Oracle 10g (the minimum version for Oracle Data Guard database rolling upgrades), or additional flexibility for replicating back to the previous version is required (fast fall-back option), or zero downtime upgrades using multi-master replication are required.

Upgrade Method: Upgrading with Transportable Tablespaces
Use This Method When: The database is using data types unsupported by Data Guard SQL Apply or Oracle GoldenGate, and the user schemas are simple.
Regardless of the upgrade method you use, you should follow the guidelines and recommendations provided in the Oracle Database Upgrade Guide and its companion document, "Oracle 11gR2 Upgrade Companion" in My Oracle Support Note 785351.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=785351.1
See Also:
"Oracle Support Lifecycle Advisors" in My Oracle Support Note 250.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=250.1
Oracle Technology Network Oracle Database Upgrade web page at http://www.oracle.com/technetwork/database/upgrade/index.html
DBUA upgrades the database dictionary and all installed components (for example, Java, XDB, and so on) while the database is unavailable for normal user activity. Downtime required for a database upgrade when using DBUA is determined by the time needed to:

- Upgrade all database dictionary objects to the new version
- Restart the database
- Reconnect the clients to the upgraded database
To reduce the amount of downtime required for a database upgrade when using DBUA:

- Remove any database options that are not being used. DBUA upgrades all of the installed database options, whether or not they are required by an application. By reducing the number of options that must be upgraded, you can reduce the overall upgrade time.
- Update data dictionary statistics immediately before the upgrade.
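The two preparation steps above can be sketched with a registry query and a statistics call. This is a minimal example run as a privileged user on the source database; deciding which components are actually unused remains a judgment that the query alone cannot make.

```sql
-- List the installed components so that unused options can be identified
-- and considered for removal before the upgrade.
SELECT comp_id, comp_name, version, status
FROM   dba_registry
ORDER  BY comp_id;

-- Refresh data dictionary statistics shortly before starting the upgrade.
BEGIN
  DBMS_STATS.GATHER_DICTIONARY_STATS;
END;
/
```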
Use DBUA for a database upgrade when the time to perform the upgrade with this method fits within the maintenance window.
See Also:
Oracle Database Upgrade Guide for more information on DBUA and upgrading your Oracle Database software
"Oracle 11gR2 Upgrade Companion" in My Oracle Support Note 785351.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=785351.1
To upgrade from Oracle9i to Oracle Database 11g, see "Oracle 11gR1 Upgrade Companion" in My Oracle Support Note 601807.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=601807.1
14.2.6.2 Upgrading with Data Guard SQL Apply or Transient Logical Standby Database
Use Data Guard SQL Apply or a transient logical standby database to upgrade a database with minimal downtime using a process called a rolling upgrade. Data Guard currently supports homogeneous environments where the primary and standby databases run on the same platform.
See Also: For exceptions that are specific to heterogeneous environments and for other late-breaking information about rolling upgrades with SQL Apply, see "Data Guard Support for Heterogeneous Primary and Physical Standbys in Same Data Guard Configuration" in My Oracle Support Note 413484.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=413484.1

14.2.6.2.1 SQL Apply Rolling Upgrades

Use Data Guard SQL Apply for a rolling database upgrade when a conventional upgrade cannot complete within the maintenance window and the application does not use user-defined types. Oracle Data Guard using SQL Apply is the recommended solution for performing patch set and database upgrades with minimal downtime. Note the following points when deciding if Data Guard SQL Apply is the appropriate method for minimizing downtime during a database upgrade:
- SQL Apply has some data type restrictions (see Oracle Data Guard Concepts and Administration for a list of the restrictions). If the source database uses data types that SQL Apply does not support natively, consider implementing Extended Datatype Support (EDS), which accommodates several more advanced data types.
- You can perform a SQL Apply rolling upgrade for any upgrade, including a major release upgrade, if the source release is Oracle Database 10g Release 1 (10.1.0.3) or higher. Before you begin, review the detailed steps for a SQL Apply rolling upgrade and verify the supported data types in Oracle Data Guard Concepts and Administration.
- If the source database is using a software version not supported by SQL Apply rolling upgrade (earlier than Oracle Database release 10.1.0.3), or if EDS cannot sufficiently resolve SQL Apply data type conflicts, then consider using Database Upgrade Assistant (DBUA), transportable tablespaces, or Oracle GoldenGate.
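One way to check for the data type restrictions mentioned above is to query the DBA_LOGSTDBY_UNSUPPORTED view on the primary database. This is a minimal sketch; an empty result suggests there are no unsupported columns, while any rows returned are candidates for EDS or for a different upgrade method.

```sql
-- List tables and columns whose data types SQL Apply does not support
-- natively in the current release.
SELECT owner, table_name, column_name, data_type
FROM   dba_logstdby_unsupported
ORDER  BY owner, table_name, column_name;
```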
Downtime required for a database upgrade (rolling upgrade) when using Data Guard SQL Apply is determined by the time needed to:

- Perform a Data Guard switchover
- Reconnect the clients to the new database
See Also:
- Oracle Data Guard Concepts and Administration for information on Using SQL Apply to Upgrade the Oracle Database
- The MAA white paper "Database Rolling Upgrade Using Data Guard SQL Apply Oracle Database 11g and 10gR2" at http://www.oracle.com/goto/maa
- For SQL Apply, EDS support starts from Oracle Database Release 10.2.0.4. The EDS implementation differs between Releases 10.2.0.4 through 11.1 and Release 11.2:
  - From 10.2.0.4 to 11.1, you need to build EDS following the examples specified in detail in "Extended Datatype Support (EDS) for SQL Apply" in My Oracle Support Note 559353.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=559353.1
  - In Oracle Database Release 11.2, EDS-related procedures are part of the DBMS_LOGSTDBY package; for more information see "SQL Apply Extended Datatype Support - 11.2" in My Oracle Support Note 949516.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=949516.1
14.2.6.2.2 Transient Logical Standby Database Rolling Upgrade You can use a transient logical standby database to perform a rolling database upgrade using your current physical standby database by temporarily converting it to a logical standby database. Use a transient logical standby when your configuration only has a physical standby database. Performing a rolling upgrade using a transient logical standby is similar to the standard SQL Apply rolling upgrade with the following differences:
- A guaranteed restore point is created on the primary database to flash the database back to a physical standby database after the switchover.
- The conversion of a physical standby database to a logical standby database uses the KEEP IDENTITY clause to retain the same DB_NAME and DBID as that of its primary database.
- The ALTER DATABASE CONVERT TO PHYSICAL STANDBY statement converts the original primary database from a logical standby to a physical standby database.
- The original primary database is actually upgraded through Redo Apply after it is converted from the transient logical standby database role to a physical standby database.
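The statements behind these differences can be sketched as follows. This is an abridged outline rather than the full procedure: the restore point name pre_upgrade is hypothetical, and the shutdown, startup, upgrade, and switchover steps between the statements are omitted.

```sql
-- On the primary database: create a guaranteed restore point before the
-- rolling upgrade begins. The name pre_upgrade is hypothetical.
CREATE RESTORE POINT pre_upgrade GUARANTEE FLASHBACK DATABASE;

-- On the physical standby (with Redo Apply stopped): convert it to a
-- transient logical standby, keeping the primary's DB_NAME and DBID.
ALTER DATABASE RECOVER TO LOGICAL STANDBY KEEP IDENTITY;

-- ... upgrade the logical standby, then switch over to it ...

-- On the original primary (mounted, after the switchover): flash back to
-- the restore point and convert the database to a physical standby.
FLASHBACK DATABASE TO RESTORE POINT pre_upgrade;
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
```

Once Redo Apply is restarted on the converted database, it applies the redo generated by the upgrade, which is how the original primary reaches the new release.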
Figure 14-1 shows the flow of processing that occurs when you perform a rolling upgrade with a transient logical standby database.
Note: To simplify the operation shown in Figure 14-1, a Bourne shell script is available that automates the database rolling upgrade procedure (starting with Oracle Database 11g Release 1). The database rolling upgrade is performed using an existing Data Guard physical standby database and the transient logical standby rolling upgrade process. The Bourne shell script, named physru, is available for download with details in "Oracle 11g Data Guard: Database Rolling Upgrade Shell Script" in My Oracle Support Note 949322.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=949322.1
Figure 14-1 Using a Transient Logical Standby Database for Database Rolling Upgrade (figure not reproduced; its stages include Synchronize and Upgrade)
See Also:
Oracle Data Guard Concepts and Administration for more information on "Performing a Rolling Upgrade With an Existing Physical Standby Database" The MAA white paper, "Database Rolling Upgrades Made Easy by Using a Data Guard Physical Standby Database", which describes the process of automating many of the tasks associated with a database rolling upgrade, available from the MAA Best Practices area for Oracle Database at http://www.oracle.com/goto/maa
- Oracle GoldenGate can upgrade an Oracle Database in rolling fashion from an Oracle Database release prior to Oracle Database 10g (Data Guard database rolling upgrades are supported beginning with Oracle Database 10g).
- Oracle GoldenGate can be configured for one-way replication from a later Oracle Database release to a previous Oracle Database release to enable a fast fall-back option (Oracle Data Guard can only replicate from an earlier database release to a later release). This is useful in cases where you want to operate at the new release for a period and have the option to quickly revert to the previous release should unanticipated issues arise days after production cut-over. By configuring one-way replication from the new release to the previous release, production can be switched to the prior release quickly, without losing data or incurring the time of a downgrade, while the problems are resolved.
- Oracle GoldenGate can be configured for multi-master replication between different Oracle Database releases to facilitate a zero downtime upgrade (Oracle Data Guard is a one-way replication solution). When the new Oracle release is deployed and ready for user connections, new user connections can be directed to the new release while existing user connections at the old release continue to process transactions. As existing user connections terminate, utilization of the Oracle Database operating at the previous release diminishes naturally without users perceiving any downtime. Multi-master replication keeps both databases synchronized during this transitional phase. Once all users have migrated to the new release, simpler one-way replication can maintain synchronization of the previous database release to provide a fast fall-back option as described in the previous bullet item. Note that multi-master replication is not suitable for all applications, because conflict detection and resolution are required.
If you cannot use the procedure described in Section 14.2.6.2, "Upgrading with Data Guard SQL Apply or Transient Logical Standby Database" to upgrade your database and you require zero-to-minimum downtime while performing the database or application upgrade, then configure Oracle GoldenGate to perform a database upgrade with little or no downtime. For more information, see the white paper "Zero-Downtime Database Upgrades Using Oracle GoldenGate" at http://www.oracle.com/technetwork/middleware/goldengate/overview/ggzerodowntimedatabaseupgrades-174928.pdf
See Also: Oracle GoldenGate For Windows and UNIX Administrator's Guide for more information about database upgrades using Oracle GoldenGate
The SYSTEM tablespace cannot be moved with transportable tablespaces. The target database SYSTEM tablespace contents, including user definitions and objects
necessary for the application, must be built manually. Use Data Pump to move the contents of the SYSTEM tablespace.
Downtime required for a database upgrade when using transportable tablespaces is determined by the time needed to:

- Place the source database tablespaces in read-only mode.
- Perform a network import of the transportable metadata.
- If the target database is on a remote system, transfer all data files from the source system to the target system.

However, note that using transportable tablespaces to perform a database upgrade is useful only if you can use the data files in their current location. Using the transportable tablespace method is not recommended if doing so requires that you copy the data files to the target location. The time it takes to transfer the data files can be reduced significantly by using a storage infrastructure that can make the data files available to the target system without physically moving the files, or by using a physical standby database.
You can use the data files in their current location to avoid copying data files as part of the transport process. If the target database is on a different machine, this requires that the storage is accessible to both the source and target systems. DBUA cannot complete within the maintenance window. Oracle GoldenGate or Data Guard SQL Apply cannot be used due to data type restrictions. The Oracle database has a simple schema.
See Also:
Oracle Database Administrator's Guide for an Introduction to Transportable Tablespaces The MAA white paper "Database Upgrade Using Transportable Tablespaces" available at http://www.oracle.com/goto/maa
Simplify: during a migration, simplify your implementation. Most database environments that have evolved through different versions and different DBAs contain old information (and the current DBA might question why something is used in the system). The purpose of simplifying is to make administration easier and more reliable; this simplification leads to a more highly available system. Optimize: during a migration you can optimize your implementation. In many cases the migration involves an updated database version so you have new
features available. While performing a migration you should consider adopting new features and practices. Add the following steps to your migration planning to simplify and optimize:
Consider Your Options and Your Migration Strategy Plan Your Migration Oracle Features for Platform Migration and Upgrades
Update init.ora during migration. Take your existing init.ora file and remove parameters you consider no longer important. For changes that take parameters away from their default setting, justify the changes. For example, you might be able to remove underscore parameters that are set to work around issues found in previous releases (for example to handle an optimizer problem you resolved in a previous release).
Update SQL during migration. Remove SQL hints added in a previous Oracle Database version that were put in place to force the optimizer to generate the desired plan. The optimizer generally creates a good execution plan without the need for hints when provided good statistics.
Simplify or change schema objects during migration. You should consider if there are changes to the schema layout that you can make during a migration. For example, consider the following: Changes in the partitioning scheme for large tables Adoption of newly available compression capabilities, such as Exadata Hybrid Columnar Compression (EHCC) if migrating to Oracle Exadata Database Machine Adoption of Transparent Data Encryption (TDE), especially if migrating to a system that provides cryptographic hardware acceleration
Also, determine if there are objects that should not be migrated, such as excessive use of indexes. If you are going to have altered or fewer schema objects in the database you have to consider whether it is better to migrate the database in its current form, then perform the changes after migration, or be more selective during the migration.
Remove unused tablespaces and data files during migration. You should consider if you can remove unused or unnecessary tablespaces and data files during a migration. Using fewer tablespaces and data files leads to better manageability and performance.
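The clean-up steps above can start from a few dictionary queries. This is a minimal sketch run on the source database; the results are only candidates for review, since a non-default parameter or an empty tablespace may still be there for a valid reason.

```sql
-- Parameters that have been changed from their default values; each one
-- should be justified before it is carried into the migrated init.ora.
SELECT name, value
FROM   v$parameter
WHERE  isdefault = 'FALSE'
ORDER  BY name;

-- Underscore parameters that are explicitly set, often as workarounds for
-- issues in earlier releases.
SELECT name, value
FROM   v$parameter
WHERE  name LIKE '\_%' ESCAPE '\'
ORDER  BY name;

-- Permanent tablespaces that currently contain no segments, as candidates
-- for removal during the migration.
SELECT t.tablespace_name
FROM   dba_tablespaces t
WHERE  t.contents = 'PERMANENT'
AND    NOT EXISTS (SELECT 1
                   FROM   dba_segments s
                   WHERE  s.tablespace_name = t.tablespace_name);
```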
- Consider upgrading the source database to Oracle Database 11g Release 2 as this may improve the migration (in some cases significantly). For example, the parallel capabilities of Data Pump are significantly better in Oracle Database 11g Release 2 than in Oracle Database Release 10.2, so a database export from the source system could be improved and completed faster if the source database is upgraded to Oracle Database 11g Release 2.
- Consider dropping schema objects that are not needed in the source database prior to the migration. This can reduce the amount of data that has to be migrated.
- Determine and consider the business needs and downtime requirements.
- Review the Oracle features for platform migration in Section 14.2.7.3, "Oracle Features for Platform Migration and Upgrades," for the factors that influence the amount of downtime required.
- Consider whether there is a requirement or an opportunity to perform the migration in stages. For example, if there is a large amount of read-only data in the source database, it might be migrated well before the live data migration to reduce downtime.
- Any platform migration exercise should include a significant amount of testing.
Physical Standby Databases for Platform Migration Transportable Database for Platform Migration Oracle GoldenGate for Platform Migration Oracle Data Pump for Platform Migration Transportable Tablespaces for Platform Migration Data Guard Redo Apply (Physical Standby Database) for Location Migration
The method you choose to perform these database maintenance tasks depends on the following considerations:
- Downtime required to complete the maintenance operations
- Setup time and effort required before the downtime
- Amount of temporary additional resources necessary, such as disk space or CPU
- Complexity of the steps required to complete maintenance operations
Table 14-4 summarizes the methods you can use for platform migrations and database upgrades, and recommends which method to use for each operation.

Table 14-4 Platform and Location Migration Options

Recommended Method: Physical Standby Databases for Platform Migration
Alternate Methods:
1. Use Transportable Database for Platform Migration when a cross-platform physical standby database is not available for the platform combination to be migrated.
2. Use Oracle GoldenGate for Platform Migration when transportable database cannot finish within the maintenance window.

Recommended Method: Oracle Data Pump for Platform Migration
Alternate Methods:
1. Use Oracle GoldenGate for Platform Migration when Data Pump cannot finish within the maintenance window.
2. Use Transportable Tablespaces for Platform Migration when the database is using data types unsupported by Oracle GoldenGate.

Recommended Method: Data Guard Redo Apply (Physical Standby Database) for Location Migration
Alternate Methods: None.
Note:
Query the V$TRANSPORTABLE_PLATFORM view to determine the endian format of all platforms. Query the V$DATABASE view to determine the platform ID and platform name of the current system.
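The two lookups in this note can be written as follows. This is a minimal sketch; the second query joins the two views so the endian format of the current system appears next to its platform name.

```sql
-- Endian format of every platform known to this release.
SELECT platform_id, platform_name, endian_format
FROM   v$transportable_platform
ORDER  BY platform_id;

-- Platform ID, platform name, and endian format of the current system.
SELECT d.platform_id, d.platform_name, tp.endian_format
FROM   v$database d
JOIN   v$transportable_platform tp
ON     tp.platform_id = d.platform_id;
```

Running the second query on both the source and the target shows whether the two systems share an endian format, which determines which of the migration methods in this section apply.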
- System upgrades that cannot be upgraded using Oracle RAC rolling upgrades due to system restrictions.
- Migrations to Oracle ASM, to Oracle RAC from a nonclustered environment, to 64-bit systems, to a different platform with the same endian format or to a different platform with the same processor architecture, or to Windows from Linux or to Linux from Windows.
- When you have a primary database with 32-bit Oracle binaries on Linux 32-bit, and a physical standby database with 64-bit Oracle binaries on Linux 64-bit. Such configurations must follow additional procedures during Data Guard role transitions (switchover and failover) as described in My Oracle Support Note 414043.1.
See Also:
See "Data Guard Support for Heterogeneous Primary and Physical Standbys in Same Data Guard Configuration" in My Oracle Support Note 413484.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=413484.1
See "Role Transitions for Data Guard Configurations Using Mixed Oracle Binaries" in My Oracle Support Note 414043.1 at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=414043.1
Transportable database supports moving databases between platforms with the same endian format. Downtime required for a platform migration when using transportable database is determined by the time needed to:

- Place the source database in read-only mode.
- Convert data files. Only files that contain undo segments, or files that contain automatic segment-space management (ASSM) segment headers if converting from or to HP Tru64, require conversion.
- Transfer all data files from the source system to the target system. You can significantly reduce the amount of downtime by using a storage infrastructure that can make the data files available to the target system without physically moving the files.
See Also:
Oracle Database Backup and Recovery User's Guide for more information about cross platform use of transportable database The MAA white paper "Platform Migration Using Transportable Database" at http://www.oracle.com/goto/maa
Oracle GoldenGate does not support user-defined types, such as object types, REF values, varrays, and nested tables. Extra administrative effort may be required to set up and maintain the Oracle GoldenGate environment. Downtime required for a platform migration when using Oracle GoldenGate is determined by the time needed to apply the remaining transactions in the queue and to reconnect clients to the new database.
Downtime required for a platform migration when using Data Pump is determined by the time needed to perform a full database export, transfer the export dump files to the target system, then perform a full database import. Downtime may be reduced by performing the export to storage that is shared between the source and target systems, thus eliminating the need to transfer the export dump files. Data Pump can also load the target database directly from the source database over database links, which is known as network import. In some cases a network import may be faster than the multi-step approach of exporting the database, transferring the dump files, and importing the database.
Use Data Pump when moving a database to a platform with different endian format when the network import time is acceptable.
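A network import can also be started from PL/SQL through the DBMS_DATAPUMP package. The sketch below is run on the target database and assumes that a database link named SOURCE_DB (a hypothetical name) already points at the source database and that the calling user has the privileges needed for a full import.

```sql
DECLARE
  h NUMBER;  -- Data Pump job handle
BEGIN
  -- Open a full-database import that reads directly from the source over
  -- the database link, so no dump files are written or transferred.
  -- SOURCE_DB is a hypothetical database link name.
  h := DBMS_DATAPUMP.OPEN(
         operation   => 'IMPORT',
         job_mode    => 'FULL',
         remote_link => 'SOURCE_DB');

  -- Start the job and detach; the job continues to run in the background.
  DBMS_DATAPUMP.START_JOB(h);
  DBMS_DATAPUMP.DETACH(h);
END;
/
```

Because the job keeps running after the session detaches, its progress has to be checked separately, for example through the DBA_DATAPUMP_JOBS view.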
See Also:
Oracle Database Utilities for more information about Oracle Data Pump Oracle Database Upgrade Guide for more information about upgrading your Oracle Database software
The SYSTEM tablespace cannot be moved with transportable tablespaces. The target database SYSTEM tablespace contents, including user definitions and objects necessary for the clients, must be built manually. Use Data Pump to move the necessary contents of the SYSTEM tablespace. Downtime required for a platform migration or database upgrade when using transportable tablespaces is determined by the time needed to:
- Place the source database tablespaces in read-only mode.
- Perform a network import of the transportable metadata.
- Transfer all data files from the source system to the target system. This time can be reduced significantly by using a storage infrastructure that can make the data files available to the target system without physically moving the files.
- Convert all data files to the new platform format using RMAN.
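The first step in the list above, which marks the start of the downtime, can be sketched as follows. The tablespace names sales_1 and sales_2 are hypothetical, and the metadata import, file transfer, and RMAN conversion steps are not shown.

```sql
-- On the source database: place the tablespaces to be transported in
-- read-only mode. The names sales_1 and sales_2 are hypothetical.
ALTER TABLESPACE sales_1 READ ONLY;
ALTER TABLESPACE sales_2 READ ONLY;

-- Confirm the change; STATUS should show READ ONLY for both tablespaces.
SELECT tablespace_name, status
FROM   dba_tablespaces
WHERE  tablespace_name IN ('SALES_1', 'SALES_2');

-- On the target database, after the tablespaces have been plugged in:
-- return them to read/write mode.
ALTER TABLESPACE sales_1 READ WRITE;
ALTER TABLESPACE sales_2 READ WRITE;
```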
Use transportable tablespaces to migrate to a platform when Oracle Data Pump cannot complete within the maintenance window, and Oracle GoldenGate or Data Guard SQL Apply cannot be used due to data type restrictions.
See Also: Oracle Database Administrator's Guide for more information about transportable tablespaces
14.2.7.9 Data Guard Redo Apply (Physical Standby Database) for Location Migration
You can use Data Guard Redo Apply to change the location of a database to a remote site with minimal downtime by setting up a temporary standby database at a remote location and performing a switchover operation. The downtime required for a location migration when using Data Guard Redo Apply is determined by the time required to perform a switchover operation.
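The switchover itself can be sketched with the following SQL statements, assuming a physical standby at the new location that is already synchronized with the primary. This is an abridged outline; restarting the old primary as a standby, or decommissioning it, is not shown.

```sql
-- On the current primary: confirm that a switchover is possible
-- (typically TO STANDBY or SESSIONS ACTIVE), then change roles.
SELECT switchover_status FROM v$database;
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;

-- On the standby at the new location: confirm readiness (typically
-- TO PRIMARY or SESSIONS ACTIVE), then assume the primary role and open.
SELECT switchover_status FROM v$database;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
ALTER DATABASE OPEN;
```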
See Also: Oracle Data Guard Concepts and Administration for more information on Redo Apply and physical standby databases
Oracle Database Advanced Application Developer's Guide for information on Edition-Based Redefinition Oracle Database Administrator's Guide for information on Managing Editions
Minimize concurrent activity on the table during an online operation. During an online operation, Oracle recommends that users minimize activities on the base table. Database activities should affect less than 10% of the table while an online operation is in progress. The database administrator can also use the Database Resource Manager to minimize the effect of the data reorganization on users by allocating enough resources to the users.
Oracle does not recommend running online operations at peak times or running a batch job that modifies a large amount of data during an online data reorganization. Rebuild indexes online versus dropping an index and then re-creating an index online. Rebuilding an index online requires additional disk space for the new index during the operation, whereas dropping an index and then re-creating an index does not require additional disk space.
Coalesce an index online versus rebuilding an index online. Online index coalesce is an in-place data reorganization operation, hence does not require additional disk space like index rebuild does. Index rebuild requires temporary disk space equal to the size of the index plus sort space during the operation. Index coalesce does not reduce the height of the B-tree. It only tries to reduce the number of leaf blocks. The coalesce operation does not free up space for users but does improve index scan performance. If a user must move an index to a new tablespace, use online index rebuild.
Perform online maintenance of local and global indexes. Oracle Database 11g supports both local and global partitioned indexes with online operations. When tables and indexes are partitioned, this allows administrators to perform maintenance on these objects, one partition at a time, while the other partitions remain online.
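The index operations compared above correspond to the following statements. This is a minimal sketch; the index, partition, and tablespace names are hypothetical.

```sql
-- Rebuild an index online; needs extra disk space for the new copy of
-- the index while the operation runs.
ALTER INDEX sales_idx REBUILD ONLINE;

-- Coalesce an index in place; needs no additional disk space but does
-- not reduce the height of the B-tree.
ALTER INDEX sales_idx COALESCE;

-- Move an index to a new tablespace with an online rebuild.
ALTER INDEX sales_idx REBUILD TABLESPACE new_index_ts ONLINE;

-- Rebuild a single partition of a local index online while the other
-- partitions remain available.
ALTER INDEX sales_local_idx REBUILD PARTITION p_2011_q1 ONLINE;
```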
See Also:
Oracle Database Administrator's Guide for more information about redefining tables online
The Online Reorganization link from the Oracle Database High Availability page at http://www.oracle.com/technetwork/database/features/availability/index.html
Oracle Real Application Clusters Administration and Deployment Guide for information about Automatic Workload Management Oracle Real Application Clusters Administration and Deployment Guide for information about using the crsctl command The MAA white paper: "Optimizing Availability During Planned Maintenance Using Oracle Clusterware and Oracle RAC" at http://www.oracle.com/goto/maa
Glossary
Oracle Active Data Guard option
A physical standby database can be open for read-only access while Redo Apply is active if a license for the Oracle Active Data Guard option has been purchased. This capability, known as real-time query, also provides the ability to have block-change tracking on the standby database, thus allowing incremental backups to be performed on the standby.

clusterwide failure
The whole cluster hosting the Oracle RAC database is unavailable or fails. This includes failures of nodes in the cluster, and any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.

computer failure
An outage that occurs when the system running the database becomes unavailable because it has crashed or is no longer accessible.

data corruption
A corrupt block is a block that has been changed so that it differs from what Oracle Database expects to find. Block corruptions fall under two categories: physical and logical block corruptions.

hang or slow down
Hang or slow down occurs when the database or the application cannot process transactions because of a resource or lock contention. Perceived hang can be caused by lack of system resources.

human error
An outage that occurs when unintentional or malicious actions are committed that cause data in the database to become logically corrupt or unusable. The service level impact of a human error outage can vary significantly depending on the amount and critical nature of the affected data.

logical unit numbers (LUNs)
Three-bit identifiers used on a SCSI bus to distinguish between up to eight devices (logical units) with the same SCSI ID.
lost write
A lost write is another form of data corruption that can occur when an I/O subsystem acknowledges the completion of a block write, while in fact the write did not occur in persistent storage. No error is reported by the I/O subsystem back to Oracle.

MAA environment
The Maximum Availability Architecture (MAA) provides the most comprehensive set of solutions for both unplanned and planned outages because it inherits the capabilities and advantages of both Oracle Database 11g with Oracle RAC and Oracle Database 11g with Data Guard. MAA involves high availability best practices for all Oracle products across the entire technology stack: Oracle Database, Oracle WebLogic Server, Oracle Applications, Oracle Collaboration Suite, and Enterprise Manager.

network server processes
The Data Guard network server processes, also referred to as LNSn processes, on the primary database perform a network send to the RFS process on the standby database. There is one network server process for each destination.

real-time query
If a license for the Oracle Active Data Guard option has been purchased, you can open a physical standby database while Redo Apply continues to apply redo data received from the primary database.

recovery point objective (RPO)
The maximum amount of data an IT-based business process may lose before causing harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, five hours' or two days' worth of data loss.

recovery time objective (RTO)
The maximum amount of time that an IT-based business process can be down before the organization suffers significant material losses. RTO indicates the downtime tolerance of a business process or an organization in general.

site failure
An outage that occurs when an event causes all or a significant portion of an application to stop processing or slow to an unusable service level. A site failure may affect all processing at a data center, or a subset of applications supported by a data center.

snapshot standby database
An updatable standby database that you create from a physical standby database. A snapshot standby database receives and archives redo data from the primary database, but does not apply that redo data while it is open for read/write I/O. Thus, the snapshot standby typically diverges from the primary database over time. Moreover, local updates to the snapshot standby database cause additional divergence. However, a snapshot standby still protects the primary database because it can be converted back into a physical standby database.

storage failure
An outage that occurs when the storage holding some or all of the database contents becomes unavailable because it has shut down or is no longer accessible.
transient logical standby database
A transient logical standby database is a physical standby database that has been temporarily converted into a logical standby database to perform a rolling database upgrade.
Index
A
ACFS snapshot, 9-16 Active Data Guard option assessing database waits, 8-14 Active Session History Reports (ASH), 5-11 Advanced Queuing (AQ), 8-6 alerts Enterprise Manager, 12-3 ALTER DATABASE statement CONVERT TO SNAPSHOT STANDBY, 8-25 specifying a default temporary tablespace, 5-14 ALTER DISKGROUP ALL MOUNT statement, 4-13 ALTER SESSION ENABLE RESUMABLE statement, 5-14 ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE, 8-13 application failover DBMS_DG.INITIATE_FS_FAILOVER, 13-14 in an Oracle Data Guard configuration, 13-13 in an Oracle RAC configuration, 13-13 application workloads database performance requirements for, 4-1 applications defining as services, 6-6 failover, 13-13 Fast Application Notification (FAN), 6-3, 13-14 fast failover, 11-6 login storms, 11-6 monitor response times, 13-13 service brownouts, 12-10 tracking performance with Beacon, 12-4 upgrades, 14-30 Apply Lag metric in Enterprise Manager, 12-12 AQ_HA_NOTIFICATIONS parameter, 11-3 AQ_TM_PROCESSES parameter, 8-18 architecture high availability, 1-1 archival backups keeping, 9-6 ARCHIVELOG mode, 5-1 archiver (ARCn) processes reducing, 8-19 archiving strategy, 8-8 ASM See Oracle Automatic Storage Management (Oracle ASM), 4-2 ASM_DISKGROUPS initialization parameter, 4-13 ASM_DISKSTRING parameter, 4-10 ASM_POWER_LIMIT initialization parameter, 4-13 ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter in extended clusters, 7-5 ASMCA utility storage management, 4-15 ASMCMD command-line utility storage management, 4-14 ASMLib, 4-11 disk labels, 4-11 ASR See Oracle Auto Service Request (ASR) asynchronous disk I/O, 5-8 asynchronous I/O enabling, 7-2 V$IOSTAT_FILE view, 5-8 AUTOBACKUP statement RMAN, 9-8 Automatic Database Diagnostic Monitor (ADDM), 5-11 automatic performance tuning, 5-11 automatic segment space management, 5-13 using, 5-13 Automatic Shared Memory Management, 5-9 Automatic Storage Management (Oracle ASM) redundancy, 4-7 automatic tablespace point-in-time recovery TSPITR, 13-34 
automatic undo management described, 5-12 Automatic Workload Repository (AWR), 5-4, 5-11, 8-26 best practices, 5-11 evaluating performance requirements, 4-1 AWR See Automatic Workload Repository (AWR)
B
backup and recovery best practices, 9-5
checksums calculated during, 5-7 enabling with ARCHIVELOG mode, 5-1 backup files fast recovery area disk group failure, 13-20 backup undo optimization, 9-4 backups automatic, 9-8 comparing options, 9-9 configuring, 9-5 creating and synchronizing, 9-7 determine a retention policy, 9-5 keeping archival (long term), 9-6 OCR, 6-13 performing regularly, 9-15 RMAN recovery catalog, 9-7 Beacons, 12-4 configuring, 12-4 benefits Data Guard broker, 8-6 high availability best practices, 1-1 best practices AWR, 5-11 backup and recovery, 9-5 Data Guard configuration, 8-1 Database Resource Manager, 5-15 failover manual, 8-23 failover (fast-start), 8-22, 13-11 failover (manual), 13-11 fast connection failover configuration, 11-1 high availability, 1-1 Oracle ASM configuration, 4-9 Oracle ASM operational best practices, 4-12 Oracle ASM strategic, 4-3 Oracle Clusterware configuration, 6-3 Oracle Clusterware operations and management, 6-12 Oracle Database configuration, 5-1 Oracle Database operations and management, 5-9 Oracle GoldenGate configuration, 10-4 Oracle RAC configuration, 7-1 Oracle RAC rolling upgrades, 14-11 storage subsystems, 4-1 switchover, 14-7 upgrades, 14-11 BLOCK CHANGE TRACKING clause, 9-7 brownouts, 12-10
C
capacity planning, 6-12 change tracking for incremental backups, 9-7 checkpointing bind Mean Time To Recover (MTTR), 7-1 client connections migrating to and from nodes, 6-2 clients application failover, 13-13 configuring for failover, 11-2
load balancing, 6-8 cluster file system using shared during software patching, 6-4 Cluster Health Monitor (CHM), 12-20 Cluster Ready Services (CRS) described, 13-38 moving services, 13-12 recovering service availability, 13-38 relationship to OCR, 13-13 Cluster Time Synchronization Service (CTSS), 6-10 clustered ASM enabling the storage grid, 4-2 clusters extended, 7-3 clusterwide outage restoring the standby database after, 13-46 COMMIT NOWAIT Oracle GoldenGate, 10-5 COMPATIBLE initialization parameter, 14-9 complete site failover recovery time objective (RTO), 13-6 compression redo transport, 8-12 configuring databases for high availability with the MAA Advisor, 12-19 configuring Oracle Database for shared server, 11-6 connection pools adjusting number of, 11-6 Connection Rate Limiter listener, 11-6 connect-time failover, 13-39 control files in a fast recovery area disk group failure, 13-20 CONTROL_FILES initialization parameter, 13-21 coordinated, time-based, distributed database recovery, 13-35 corruptions checking database files, 9-14 preventing with Data Recovery Advisor, 5-10 CREATE DISKGROUP statement examples, 4-3, 4-6 CRS See Cluster Ready Services (CRS) crsctl command for system maintenance, 14-32 CRSD process OCR backups, 6-13 CTSS time management, 6-10 cumulative incremental backup set, 9-9
D
Dark Fiber Dense Wavelength Division Multiplexing (DWDM), 7-2 data criticality and RPO, 9-5 recovering backups and RTO, 9-5 data area contents, 4-3
disk partitioning, 4-5 data area disk group failure recovery options, 13-19 data corruption detecting, 5-7 protection through Oracle ASM redundancy disk groups, 5-7 solution, 5-6 data failure restoring fault tolerance on standby database, 13-48 Data Guard archiving strategies, 8-8 broker, 8-6 using FAN/AQ, 8-6 database upgrades, 14-18 failover best practices (fast-start), 8-22 best practices (manual), 8-23 recovery for data area disk group failures, 13-19 when to perform, 13-10 log apply services, 8-12 managing targets, 12-16 monitoring, 12-12 multiple standby databases, 8-16 performance, 8-26 platform migration, 14-26 protection against data corruption, 5-6 redo transport services, 8-5 restoring standby databases, 13-43 role transitions, 8-24 snapshot standby databases, 8-25 SQL Apply, 14-18 standby-first patch apply, 14-9 status events in Enterprise Manager, 12-12 switchover, 14-7 Data Pump for platform migration, 14-28 moving the contents of the SYSTEM tablespace, 14-28 Data Recovery Advisor detect and prevent data corruption, 5-10 data retaining backups, 9-5 data type restrictions resolving with Extended Datatype Support (EDS), 14-18 data-area disk group failure See Also Data Guard failover, fast-start failover, local recovery database files management optimizations, 4-2 Oracle ASM integration, 4-2 recovery-related, 4-4 database patch upgrades recommendations, 14-11 Database Resource Manager, 5-14 best practices, 5-15 Database Upgrade Assistant (DBUA), 14-17 database upgrades
with edition-based redefinition, 14-29 with transient logical standby database, 14-19 databases checking files for corruption, 9-14 configuration recommendations, 5-1 configuring with the MAA Advisor, 12-19 evaluating performance requirements, 4-1 migration, 14-27 object reorganization, 14-30 recovery in a distributed environment, 13-35 resolving inconsistencies, 13-33 switching primary and standby roles among, 14-7 upgrades, 14-16 DB_BLOCK_CHECKING initialization parameter, 5-7, 8-13 DB_BLOCK_CHECKSUM initialization parameter, 4-15, 5-7, 8-13 DB_CACHE_SIZE initialization parameter, 8-14 DB_CREATE_FILE_DEST initialization parameter, 8-10 enabling Oracle Managed Files, 4-4 DB_CREATE_ONLINE_LOG_DEST_n initialization parameter location of Oracle managed files, 4-4 DB_FLASHBACK_RETENTION_TARGET parameter, 5-4 DB_KEEP_CACHE_SIZE initialization parameter, 8-14 DB_LOST_WRITE_PROTECT initialization parameter, 5-7, 8-14 DB_RECOVERY_FILE_DEST initialization parameter, 8-10 fast recovery area, 5-3 DB_RECOVERY_FILE_DEST_SIZE initialization parameter limit for fast recovery area, 5-3 DB_RECYCLE_CACHE_SIZE initialization parameter, 8-14 DBCA balancing client connections, 6-8 DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure application failover, 13-14 DBMS_FLASHBACK.TRANSACTION_BACKOUT PL/SQL procedure, 13-30 DBMS_REDEFINITION PL/SQL package, 14-30 DBVERIFY utility, 8-13 decision support systems (DSS) application workload, 4-1 default temporary tablespace specifying, 5-14 DEFAULT TEMPORARY TABLESPACE clause CREATE DATABASE statement, 5-14 DEFAULT_SDU_SIZE sqlnet.ora parameter, 8-12 Dense Wavelength Division Multiplexing (DWDM or Dark Fiber), 7-2 Device Mapper disk multipathing, 4-10 differential incremental backup set, 9-9
DISABLE BLOCK CHANGE TRACKING, 9-7 disabling parallel recovery, 5-9 disk backup methods, 9-9 disk devices ASMLib disk name defaults, 4-11 configuration, 4-3, 4-6, 4-8 disk labels, 4-11 multipathing, 4-10 naming ASM_DISKSTRING parameter, 4-10 ASMLib, 4-11 partitioning for Oracle ASM, 4-5 protecting from failures, 4-6 disk errors mining vendor logs, 4-14 disk failures protection from, 4-6 restoring redundancy after, 4-8 disk groups checking with V$ASM_DISK_IOSTAT view, 4-14 configuration, 4-3 determining size of, 4-8 failure of fast recovery area, 13-20 imbalanced, 4-14 mounting, 4-13 offline after failures, 13-20 SYSASM access to Oracle ASM instances, 4-12 disk multipathing, 4-10 DISK_ASYNCH_IO initialization parameter, 5-8, 8-15 DISK_REPAIR_TIME parameter, 4-11 disks Oracle ASM failures, 13-15, 13-16 distributed databases recovering, 13-35 DNS failover, 13-8 dropped tablespace fix using Flashback Database, 13-34 dropping database objects, 13-29 dual failures restoring, 13-50 DWDM Dense Wavelength Division Multiplexing, 7-2
metrics, 12-3 notification rules, 12-4, 12-11 performance, 12-9 Policy Violations, 12-12 policy violations, 12-12 Support Workbench, 12-6 Enterprise Manager monitoring, 12-1 equations standby redo log files, 8-9 Estimated Failover Time event in Enterprise Manager, 12-12 Exadata Database Machine and MAA best practices, 2-7 HARD, 4-15 extended clusters overview, 7-3 setting the ASM_PREFERRED_READ_FAILURE_GROUPS parameter, 7-5 extents Oracle ASM mirrored, 5-7 external redundancy Oracle ASM disk failures, 13-15 Oracle ASM server-based mirroring, 7-5 EXTERNAL REDUNDANCY clause on the CREATE DISKGROUP statement, 4-6 Extraction, Transformation, and Loading (ETL) application workload, 4-1
F
failovers application, 13-13 comparing manual and fast-start failover, 8-20 complete site, 13-6 defined, 13-9 described, 13-10 effect on network routes, 13-6 Fast Application Notification (FAN), 8-6 Fast Connection Failover, 11-6 manual best practices, 8-23 when to perform, 8-21, 13-10 nondisruptive, 4-10 restoring standby databases after, 13-43 failure detection CRS response, 13-12 failure groups ASM redundancy, 4-8 defining, 4-7 multiple disk failures, 13-20 specifying in an extended cluster, 7-5 failures rebalancing Oracle ASM disks, 13-16 space allocation, 5-14 Fast Application Notification (FAN), 6-3, 13-13 after failovers, 8-6 Fast Connection Failover, 11-6 fast local restart after fast recovery area disk group failure, 13-21 fast recovery area
E
edition-based redefinition, 10-3, 14-29 ENABLE BLOCK CHANGE TRACKING, 9-7 endian format determining, 14-26 Enterprise Manager alerts, 12-3 Beacon application failover, 13-13 Database Targets page, 12-9 High Availability Console (HA Console), 12-16 home page, 12-2 MAA Advisor, 12-19 managing Data Guard targets, 12-16 managing patches, 12-15
backups, 9-13 contents, 4-4 disk group failures, 13-20 disk partitioning, 4-5 local recovery steps, 13-22 using, 5-3 FAST_START_MTTR_TARGET initialization parameter, 5-9, 7-2 controlling instance recovery time, 5-5 setting in a single-instance environment, 7-2 FAST_START_PARALLEL_ROLLBACK initialization parameter determining how many processes are used for transaction recovery, 7-2 fast-start failover comparing to manual failover, 8-20 fast-start fault recovery instance recovery, 5-5 FastStartFailoverAutoReinstate configuration property, 13-44 fault tolerance configuring storage subsystems, 4-1 restoring, 13-36 to 13-50 restoring after OPEN RESETLOGS, 13-48 flash recovery area See fast recovery area, 12-16 Flashback Database, 13-28, 13-33 enabling, 5-3 in Data Guard configurations, 8-7 setting maximum memory, 5-9 Flashback Drop, 13-28, 13-30 flashback logs fast recovery area disk group failure, 13-20 Flashback Query, 13-28, 13-30 Flashback Table, 13-28, 13-29 flashback technology example, 13-31 recovering from user error, 13-27 resolving database-wide inconsistencies, 13-33 resolving tablespace inconsistencies, 13-34 solutions, 13-27 Flashback Transaction, 13-28 DBMS_FLASHBACK.TRANSACTION_BACKOUT PL/SQL procedure, 13-30 Flashback Transaction Query, 13-28, 13-31 Flashback Version Query, 13-28, 13-31 FORCE LOGGING mode, 5-1, 8-7 full data file copy, 9-9 full or level 0 backup set, 9-9
H
HA (Oracle High Availability technologies), 1-2 HARD Hardware Assisted Resilient Data, 4-15 Hardware Assisted Resilient Data (HARD) when using Oracle ASM, 4-9 hardware RAID storage subsystem deferring mirroring to, 7-5 High Availability (HA) Console monitoring databases, 12-16 described, 1-1 restoring after fast-start failover, 13-44 high redundancy Automatic Storage Management (Oracle ASM) disk groups, 4-7 Oracle ASM disk failures, 13-16 Oracle ASM disk groups, 4-3 host bus adapters (HBA) load balancing across, 4-10 HR service scenarios, 13-38 human errors recovery, 13-27
I
imbalanced disk groups checking, 4-14 incremental backups BLOCK CHANGE TRACKING, 9-7 incrementally updated backup, 9-10 initialization parameters primary and physical standby example, 8-9 instance failures recovery, 5-5 single, 13-12 instance recovery controlling with fast-start fault recovery, 5-5 interconnect subnet verifying, 6-11 interim patches, 14-8 I/O operations load balancing, 4-10 tuning, 8-15
K
KEEP IDENTITY clause, 14-19 KEEP option RMAN BACKUP command, 9-6
G
gap resolution compression, 8-12 GoldenGate (Oracle GoldenGate), 10-1 Grid Control (Oracle Grid Control) monitoring, 12-1 guaranteed restore points, 9-5 GV$SYSSTAT view gathering workload statistics, 4-1
L
library ASMLib support for Oracle ASM, 4-11 listener connection rate throttling, 11-6 listeners balancing clients across, 6-8 Connection Rate Limiter, 11-6 load balancing
application services, 13-39 client connections, 6-8 I/O operations, 4-10 through disk multipathing, 4-10 LOAD_BALANCE parameter, 6-8, 11-3 balancing client connections, 6-8 local homes use during rolling patches, 6-4 local recovery after fast recovery area disk group failure, 13-21 for data area disk group failures, 13-19 for fast recovery area disk group failures, 13-22 locally managed tablespaces, 5-13 described, 5-13 log apply services best practices, 8-12 LOG_ARCHIVE_DEST_n initialization parameter, 8-17 LOG_ARCHIVE_FORMAT initialization parameter, 8-9 LOG_ARCHIVE_MAX_PROCESSES initialization parameter, 8-11 setting in a multiple standby environment, 8-11 setting in an Oracle RAC, 8-11 LOG_BUFFER initialization parameter, 5-4, 5-9 LOG_FILE_NAME_CONVERT initialization parameter, 8-19, 8-22 logical standby databases failover, 13-11 switchover, 14-8 upgrades on, 14-18 logical unit numbers (LUNs), 4-6 defined, Glossary-1 login storms controlling with shared server, 11-6 preventing, 11-6 low bandwidth networks compression on, 8-12 low-cost storage subsystems, 4-1 LUNs See Also logical unit numbers (LUNs) See logical unit numbers (LUNs), 4-6
adjusting in the mid tier connection pool, 11-6 maximum performance mode redo transport requirements, 8-5 maximum protection mode initialization parameter example, 8-9 Mean Time To Recover (MTTR) checkpointing, 7-1 reducing with Data Recovery Advisor, 5-10 memory consumption managing with MEMORY_TARGET parameter, 4-10 memory management, 5-9 MEMORY_MAX_TARGET parameter, 4-10 MEMORY_TARGET initialization parameter, 4-10 metrics Enterprise Manager, 12-3 for Data Guard in Enterprise Manager, 12-12 mid tier connection pool adjusting maximum number of connections, 11-6 migrating planning for, 14-23 transportable database, 14-27 migration planning, 14-25 migration strategy scheduled outages, 14-24 minimizing space usage, 9-10 minimizing system resource consumption, 9-10 mining vendor logs for disk errors, 4-14 mirrored extents protection from data corruptions, 5-7 mirroring across storage arrays, 4-7 deferring to RAID storage subsystem, 7-5 monitoring application response time, 13-13 Enterprise Manager, 12-1 Oracle Grid Control, 12-1 rebalance operations, 14-15 mounting disk groups, 4-13 multipathing (disks) path abstraction, 4-10 multiple disk failures, 13-20
M
MAA See Oracle Maximum Availability Architecture (MAA) manageability improving, 5-9 to 5-15 managing scheduled outages, 14-1, 14-5 manual failover best practices, 8-23, 13-11 comparing to fast-start failover, 8-20 when to perform, 8-21, 13-10 Maximum Availability Architecture (MAA) Advisor page, 12-19 maximum availability mode redo transport requirements, 8-5 maximum number of connections
N
net services parameter DEFAULT_SDU_SIZE, 8-12 RECV_BUF_SIZE, 8-11 SEND_BUF_SIZE, 8-11 TCP.NODELAY, 8-12 Network Attached Storage (NAS), 8-15 network detection and failover Oracle Clusterware and Oracle RAC, 6-11 network routes after site failover, 13-7 before site failover, 13-6 network server processes (LNSn), Glossary-2 Network Time Protocol (NTP), 6-10 NOCATALOG Mode creating backups, 9-7
node failures multiple, 13-12 nodes migrating client connections, 6-2 nondisruptive failovers, 4-10 normal redundancy Oracle ASM disk failures, 13-16 NORMAL REDUNDANCY clause on the CREATE DISKGROUP statement, 4-7 notification rules recommended, 12-11 service-level requirement influence on monitoring, 12-4 notifications application failover, 13-13 NTP, 6-10
O
OCI_EVENTS parameter, 11-3 OCR backups of, 6-13 failure of, 13-13 recovering, 13-13 ocrconfig -showbackup command, 6-13 OMF See Oracle Managed Files online patching, 14-8 online redo log files multiplex, 5-2 Online Reorganization and Redefinition, 14-30 Online Transaction Processing (OLTP) application workload, 4-1 opatch command-line utility, 14-10 optimizing recovering times, 9-10 Oracle ACFS snapshot, 9-16 Oracle ASM See Oracle Automatic Storage Management (Oracle ASM), 4-2 Oracle Auto Service Request (ASR), 2-7 Oracle Automatic Storage Management (Oracle ASM) ASM_DISKSTRING parameter, 4-10 ASMLib, 4-11 clustering to enable the storage grid, 4-2 configuring with ASMCA, 4-15 database file management, 4-2 disk device allocation, 4-5 disk failures, 13-15, 13-16 disk group size, 4-8 failure groups, 7-5 failure groups and redundancy, 4-8 handling disk errors, 4-14 HARD-compliant storage, 4-9 imbalanced disk groups, 4-14 managing memory with MEMORY_TARGET parameter, 4-10 managing with ASMCMD, 4-14 migrating databases to and from, 14-14 multiple disk failures, 13-20
Oracle Restart, 4-2 power limit for faster rebalancing, 4-16 REBALANCE POWER, 4-13 rebalancing, 4-13 rebalancing disks after a failure, 13-16 recovery, 13-14 redundancy, 4-7, 5-7 rolling upgrade, 14-15 server-based mirroring, 7-5 SYSASM role, 4-12 using disk labels, 4-11 using normal or high redundancy, 4-7, 7-5 volume manager, 7-5 with disk multipathing software, 4-10 Oracle Cluster Registry (OCR) failure of, 13-13 Oracle Clusterware capacity planning, 6-12 CTSS time management, 6-10 system maintenance, 14-32 verifying the interconnect subnet, 6-11 Oracle Data Guard See Data Guard, 8-1 Oracle Data Pump for platform migration, 14-28 platform migrations, 14-28 Oracle Database 11g configuration recommendations, 5-1 Data Guard, 8-1 extended cluster configurations, 7-3 Oracle RAC configuration recommendations, 7-1 Oracle Enterprise Manager High Availability (HA) Console, 12-16 MAA Advisor page, 12-19 Oracle Flashback Database restoring fault tolerance to configuration, 13-44 Oracle GoldenGate and Oracle RAC, 10-2 best practices, 10-1 configuring, 10-1 database migration, 14-27 for database upgrades, 14-22 overview, 10-1 replicat commit, 10-5 replicat COMMIT NOWAIT, 10-5 upgrades using, 14-22 with Oracle Data Guard, 10-2 Oracle Grid Control home page, 12-2 monitoring, 12-1 Oracle High Availability technologies, 1-2 Oracle Managed Files (OMF) database file management, 4-4 disk and disk group configuration, 4-4 fast recovery area, 5-3 Oracle Management Agent, 12-2 monitoring targets, 12-2 Oracle Maximum Availability Architecture (MAA) defined, Glossary-2 described, 1-2
Web site, 1-3 Oracle Notification Service (ONS) after failovers, 8-6 Oracle RAC rolling patch upgrades, 14-10 Oracle Real Application Clusters (Oracle RAC) adding disks to nodes, 4-11 application failover, 13-13 configuration, 7-1 extended clusters, 7-3 network detection and failover, 6-11 preparing for switchovers, 8-19 recovery from unscheduled outages, 13-11 restoring failed nodes or instances, 13-37 rolling upgrade, 14-10 rolling upgrades, 14-10 setting LOG_ARCHIVE_MAX_PROCESSES initialization parameter, 8-11 system maintenance, 14-32 using redundant dedicated connections, 7-2 verifying the interconnect subnet, 6-11 voting disk, 6-10, 7-4, 13-13 Oracle Restart, 4-2, 5-10 Oracle Secure Backup, 9-3 OCR backups, 6-13 Oracle Storage Grid, 4-15 Oracle Streams and Oracle GoldenGate, 10-2 Oracle Sun SFS Storage Appliance, 9-16 Oracle Universal Installer, 14-11 outages unscheduled, 13-1
platform migrations, 14-16 endian format for, 14-26 with physical standby database, 14-26 point-in-time recovery TSPITR, 13-34 pool resizing, 5-9 power limit setting for rebalancing, 4-16 preferred read failure groups specifying Oracle ASM, 7-5 preventing login storms, 11-6 primary database reinstating after a fast-start failover, 13-44 restoring fault tolerance, 13-48 PROCESSES initialization parameter, 4-11
Q
quorum disk voting disk, 7-4
R
RAID protection, 4-6 real-time apply configuring for switchover, 8-19 real-time query Active Data Guard option, 8-24 rebalance operations, 4-13 monitoring, 14-15 Oracle ASM disk partitions, 4-5, 4-6 REBALANCE POWER limits, 4-13 rebalancing, 4-13 Oracle ASM disk groups, 4-13, 4-14 Oracle ASM disks after failure, 13-16 setting Oracle ASM power limit, 4-16 recommendations database configuration, 5-1 recovery coordinated, time-based, distributed database recovery, 13-35 options for fast recovery area, 13-20 steps for unscheduled outages, 13-1 testing procedures, 9-15 times optimizing, 9-10 recovery catalog including in regular backups, 9-15 RMAN repository, 9-7 recovery files created in the recovery area location, 5-3 Recovery Manager See Also RMAN recovery point objective (RPO) criticality of data, 9-5 defined, Glossary-2 for data area disk group failures, 13-19 solutions for disk group failures, 13-20 recovery time objective (RTO)
P
parallel recovery disabling, 5-9 PARALLEL_EXECUTION_MESSAGE_SIZE parameter, 5-4 partitions allocating disks for Oracle ASM use, 4-5 patch sets rolling upgrades, 14-10 patches managing with Enterprise Manager, 12-15 rolling, 6-4 using shared cluster file system, 6-4 path failures protection from, 4-10 performance application, tracking with Beacon, 12-4 asynchronous disk I/O, 5-8 automatic tuning, 5-11 Data Guard, 8-26 database, gathering requirements, 4-1 physical standby databases as snapshot standby databases, 8-25 failover, 13-11 location migrations, 14-29 real-time query, 8-24 switchover, 14-7
defined, Glossary-2 described, 13-6 for data-area disk group failures, 13-19 recovery time, 9-5 solutions for disk group failures, 13-20 RECOVERY_ESTIMATED_IOS initialization parameter for parallel recovery, 5-9 RECOVERY_PARALLELISM initialization parameter, 5-9 RECV_BUF_SIZE sqlnet.ora parameter, 8-11 recycle bin, 13-30 Redo Apply real-time query, 8-24 Redo Apply Rate event in Enterprise Manager, 12-12 redo data compressing, 8-12 redo log members fast recovery area disk group failure, 13-20 redo transport services best practices, 8-5 redundancy Automatic Storage Management (Oracle ASM), 4-7 CREATE DISKGROUP DATA statement, 4-6 dedicated connections, 7-2 disk devices, 4-6 restoring after disk failures, 4-8 reinstatement, 13-44 FastStartFailoverAutoReinstate property, 13-44 remote archiving, 8-9 REMOTE_LISTENER parameter, 6-7, 6-8 resetlogs on primary database restoring standby database, 13-48 resource consumption minimizing, 9-10 resource management using Database Resource Manager, 5-14 response times detecting slowdown, 13-13 restore points, 9-5 restoring client connections, 13-39 failed instances, 13-37 failed nodes, 13-37 services, 13-38 resumable space allocation, 5-14 RESUMABLE_TIMEOUT initialization parameter, 5-14 RESYNC CATALOG command resynchronize backup information, 9-7 RETENTION GUARANTEE clause, 5-12, 5-13, 9-9 retention policy for backups, 9-5 RMAN backup undo optimization, 9-4 BACKUP VALIDATE command, 8-13 calculates checksums, 5-7 creating standby databases, 8-6 database backups, 9-3
DUPLICATE command, 9-15 DUPLICATE TARGET DATABASE FOR STANDBY command, 8-7 FROM ACTIVE DATABASE command, 8-7 recovery catalog, 9-7 TSPITR, 13-34 unused block compression, 9-4 VALIDATE command, 9-14 RMAN BACKUP command KEEP option, 9-6 role transitions best practices, 8-24 role-based destinations, 8-9 rolling patches, 6-4 rolling upgrade Oracle RAC, 14-10 rolling upgrades patch set, 14-10 row and transaction inconsistencies, 13-30 RPO See recovery point objective (RPO) RTO See recovery time objective (RTO)
S
SALES scenarios setting initialization parameters, 8-9 SAME See stripe and mirror everything (SAME) scenarios fast-start failover, 13-45 HR service, 13-38 object reorganization, 14-30 Oracle ASM disk failure and repair, 13-16 recovering from human error, 13-30 SALES, 8-9 scheduled outages Data Guard standby-first patch apply, 14-9 described, 14-1 edition-based redefinition, 14-29 migration, 14-23 migration planning, 14-25 migration strategy, 14-24 online patching, 14-8 Oracle ASM rolling upgrade, 14-15 Oracle Real Application Clusters (Oracle RAC) rolling patch upgrades, 14-10 platform migration, 14-26 primary site, 14-1 recommended solutions, 14-1, 14-5 reducing downtime for, 14-6 to 14-32 secondary site, 14-5 switchback, 14-7 switchover, 14-7 transportable tablespaces upgrades, 14-22 upgrades with Oracle GoldenGate, 14-22 See Also unscheduled outages secondary site outage
restoring the standby database after, 13-46 SEND_BUF_SIZE sqlnet.ora parameter, 8-11 server parameter file (SPFILE), 5-12, 8-9 backup with RMAN, 9-8 server-based mirroring Oracle ASM, 7-5 service availability recovering, 13-38 service level agreements (SLA), 1-2 effect on monitoring and notification, 12-4 service tests and Beacons configuring, 12-4 SERVICE_TIME service-level goal, 6-9 services and FAN, 6-2 automatic relocation, 13-12 definition of, 6-2 making highly available, 6-4 Oracle RAC application failover, 13-13 Oracle RAC application workloads, 6-6 relocation after application failover, 13-14 tools for administration, 6-6 SGA_TARGET initialization parameter, 5-9 shared server configuring Oracle Database, 11-6 site failover network routes, 13-7 SMON process in a surviving instance, 7-1 sort operations improving, 5-14 space management, 5-13 space usage minimizing, 9-10 SQL Access Advisor, 5-11 SQL Apply, 14-18 SQL Tuning Advisor, 5-11 SRVCTL Oracle Restart, 4-2 standby databases configuring multiple, 8-16 creating, 8-6 restoring, 13-43 standby redo log files determining number of, 8-9 standby-first patch apply (Data Guard), 14-9 STATISTICS_LEVEL initialization parameter, 8-27 Statspack assessing database waits, 8-14 storage mirroring to RAID, 7-5 storage appliance, 9-16 Storage Area Network (SAN), 8-15 storage arrays mirroring across, 4-7 multiple disk failures in, 13-20 storage grid through clustered Oracle ASM, 4-2 storage subsystems, 4-1 to 4-14
configuring Oracle ASM, 4-2 configuring redundancy, 4-6 performance requirements, 4-1 stripe and mirror everything (SAME), 4-2 Support Workbench, 12-6 switchovers configuring real-time apply, 8-19 described, 14-7 in Oracle RAC, 8-19 reducing archiver (ARCn) processes, 8-19 See Also Data Guard setting the LOG_FILE_NAME_CONVERT initialization parameter, 8-19 to a logical standby database, 14-8 to a physical standby database, 14-7 SYSASM role Oracle ASM Authentication, 4-12 system failure recovery, 5-5 system maintenance, 14-32 system resources assessing, 8-16 SYSTEM tablespace moving the contents of, 14-28
T
table inconsistencies, 13-29 tablespace point-in-time recovery (TSPITR), 13-34 tablespaces locally managed, 5-13 resolving inconsistencies, 13-34 temporary, 5-14 targets in Enterprise Manager, 12-1 monitoring, 12-2 TCP Nagle algorithm disabling, 8-12 TCP.NODELAY sqlnet.ora parameter, 8-12 temporary tablespaces, 5-14 THROUGHPUT service-level goal, 6-9 transaction recovery determining how many processes are used, 7-2 transient logical standby database rolling upgrade, 14-19 Transport Lag event in Enterprise Manager, 12-12 transportable database, 14-27 transportable tablespaces database upgrades, 14-22 platform migration, 14-28
U
undo retention tuning, 5-13 undo space managing, 5-12 UNDO_MANAGEMENT initialization
parameter, 9-8 automatic undo management, 5-12 UNDO_RETENTION initialization parameter, 9-8 automatic undo management, 5-12 UNDO_TABLESPACE initialization parameter automatic undo management, 5-12 unscheduled outages Data Guard switchover, 14-7 described, 13-1 to 13-4 Oracle RAC recovery, 13-11 recovery from, 13-1, 13-5 to 13-35 types, 13-1 See Also scheduled outages unused block compression, 9-4 upgrades application, 14-30 applying interim patches, 14-8 best practices, 14-11 Database Upgrade Assistant (DBUA), 14-17 methods, 14-16 online patching, 14-8 Oracle RAC rolling best practices, 14-11 USABLE_FILE_MB column on the V$ASM_DISKGROUP view, 4-8 user error flashback technology, 13-27
volume manager Oracle ASM, 7-5 voting disk (Oracle RAC) best practices, 6-10 corrupted, 13-13 quorum disk, 7-4
W
wait events assessing with Active Data Guard and Statspack, 8-14 Web sites ASMLib, 4-12 MAA, 1-3 workloads examples, 4-1 gathering statistics, 4-1
Z
ZFS storage appliance, 9-16
V
V$ASM_DISK view, 8-16 V$ASM_DISK_IOSTAT view checking disk group imbalance, 4-14 V$ASM_DISKGROUP view REQUIRED_MIRROR_FREE_MB column, 4-8 USABLE_FILE_MB column, 4-8 V$ASM_OPERATION view monitoring rebalance operations, 14-15 V$EVENT_HISTOGRAM view, 8-14 V$INSTANCE_RECOVERY view tuning recovery processes, 5-9 V$IOSTAT_FILE view asynchronous I/O, 5-8 V$OSSTAT view, 8-16, 8-26 V$SESSION_WAIT view, 8-14 V$SYSMETRIC_HISTORY view, 8-26 V$SYSMETRIC_SUMMARY view, 8-26 V$SYSTEM_EVENT view, 8-14, 8-16 VALID_FOR attribute, 8-9 VALIDATE option on the RMAN BACKUP command, 8-13 validation checksums during RMAN backup, 5-7 verifying the interconnect subnet, 6-11 VIP address connecting to applications, 6-6 described, 6-6 during recovery, 13-38 Virtual Internet Protocol (VIP) Address See VIP address