CLSRSC-540: The root script failed to get a unique name of the OCR backup file on the ASM diskgroup

SYMPTOM

Rootupgrade.sh failed to upgrade GI from 12.1.0.2 to 12.2.0.1.

# /u01/app/12.2.0.1/grid/rootupgrade.sh
..
.
Relinking oracle with rac_on option
Using configuration parameter file: /u01/app/12.2.0.1/grid/crs/install/crsconfig_params
The log of current session can be found at:
 /u01/app/grid/crsdata/racnode1/crsconfig/rootcrs_racnode1_2018-02-05_12-10-50AM.log
2018/02/05 12:11:10 CLSRSC-595: Executing upgrade step 1 of 19: 'UpgradeTFA'.
2018/02/05 12:11:10 CLSRSC-4015: Performing install or upgrade action for Oracle Trace File Analyzer (TFA) Collector.
2018/02/05 12:12:44 CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.
2018/02/05 12:12:44 CLSRSC-595: Executing upgrade step 2 of 19: 'ValidateEnv'.
2018/02/05 12:13:00 CLSRSC-595: Executing upgrade step 3 of 19: 'GenSiteGUIDs'.
2018/02/05 12:13:03 CLSRSC-595: Executing upgrade step 4 of 19: 'GetOldConfig'.
2018/02/05 12:13:03 CLSRSC-464: Starting retrieval of the cluster configuration data
2018/02/05 12:13:16 CLSRSC-515: Starting OCR manual backup.
2018/02/05 12:13:38 CLSRSC-540: The root script failed to get a unique name of the OCR backup file on the ASM diskgroup using CLSTEST_backup12.1.0.2.0.ocr.
Died at /u01/app/12.2.0.1/grid/crs/install/oraocr.pm line 1353.
The command '/u01/app/12.2.0.1/grid/perl/bin/perl -I/u01/app/12.2.0.1/grid/perl/lib -I/u01/app/12.2.0.1/grid/crs/install /u01/app/12.2.0.1/grid/crs/install/rootcrs.pl -upgrade' execution failed

‘rootcrs_racnode1_2018-02-05_12-10-50AM’ under ‘/u01/app/grid/crsdata/racnode1/crsconfig’

2018-02-05 12:13:37: ORACLE_SID = +ASM1
2018-02-05 12:13:37: ORACLE_HOME = /u01/app/12.1.0.2/grid
2018-02-05 12:13:37: leftVersion=12.1.0.2.0; rightVersion=12.1.0.0.0
2018-02-05 12:13:37: [12.1.0.2.0] is higher than [12.1.0.0.0]
2018-02-05 12:13:37: Running as user grid: /u01/app/12.1.0.2/grid/bin/asmcmd --nocp find --type "OCRBACKUP" "*" "*CLSTEST_backup12.1.0.2.0.ocr*"
2018-02-05 12:13:37: s_run_as_user2: Running /bin/su grid -c ' echo CLSRSC_START; /u01/app/12.1.0.2/grid/bin/asmcmd --nocp find --type "OCRBACKUP" "*" "*CLSTEST_backup12.1.0.2.0.ocr*" '
2018-02-05 12:13:38: Removing file /tmp/rpd0KsVwL2
2018-02-05 12:13:38: Successfully removed file: /tmp/rpd0KsVwL2
2018-02-05 12:13:38: pipe exit code: 0
2018-02-05 12:13:38: /bin/su successfully executed

2018-02-05 12:13:38: Return code: 0 
Output: +OCR_VOTE1/CLSTEST/OCRBACKUP/clstest_backup12.1.0.2.0.ocr.276.967291997

2018-02-05 12:13:38: Reset the environment variable 'ORACLE_HOME' as: /u01/app/12.2.0.1/grid
2018-02-05 12:13:38: Failed to get the OMF name of the OCR backup file on DG.
2018-02-05 12:13:38: Executing cmd: /u01/app/12.2.0.1/grid/bin/clsecho -p has -f clsrsc -m 540 "CLSTEST_backup12.1.0.2.0.ocr"
2018-02-05 12:13:38: Command output:
> CLSRSC-540: The root script failed to get a unique name of the OCR backup file on the ASM diskgroup using CLSTEST_backup12.1.0.2.0.ocr. 
>End Command output
2018-02-05 12:13:38: CLSRSC-540: The root script failed to get a unique name of the OCR backup file on the ASM diskgroup using CLSTEST_backup12.1.0.2.0.ocr.
2018-02-05 12:13:38: ###### Begin DIE Stack Trace ######
2018-02-05 12:13:38: Package File Line Calling 
2018-02-05 12:13:38: --------------- -------------------- ---- ----------
2018-02-05 12:13:38: 1: main rootcrs.pl 287 crsutils::dietrap
2018-02-05 12:13:38: 2: oraClusterwareComp::oraocr oraocr.pm 1353 main::__ANON__
2018-02-05 12:13:38: 3: crsupgrade crsupgrade.pm 1370 oraClusterwareComp::oraocr::backupOcr4Restore
2018-02-05 12:13:38: 4: crsupgrade crsupgrade.pm 817 crsupgrade::get_oldconfig_info
2018-02-05 12:13:38: 5: crsupgrade crsupgrade.pm 485 crsupgrade::CRSUpgrade
2018-02-05 12:13:38: 6: main rootcrs.pl 296 crsupgrade::new
2018-02-05 12:13:38: ####### End DIE Stack Trace #######

CAUSE

It is a bug.  Apply patch 27440094: 12.1.0.2 TO 12.2.0.1 :UPGRADE IS FAILING WITH CLSRSC-540.

WORKAROUD

There are two workarounds for this issue.

a)Apply patch 27440094: 12.1.0.2 TO 12.2.0.1 :UPGRADE IS FAILING WITH CLSRSC-540.

or

b)Reboot all the cluster nodes, and rerun “rootupgrade.sh” on all RAC nodes again.

After either a) or b), then manually complete as per “How to Complete Grid Infrastructure Configuration Assistant(Plug-in) if OUI is not Available (Doc ID 1360798.1)”

ORA-16857: standby disconnected from redo source for longer than specified threshold

One single 11.2.0.4 instance Oracle database alert log shows below information;

 RFS[1]: No standby redo logfiles available for thread 1

Data Guard shows ORA-16857 error:

DGMGRL> show database "TESTSTY";

Database - TESTSTY

Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 10 minutes 32 seconds (computed 48 seconds ago)
 Apply Lag: 10 minutes 32 seconds (computed 48 seconds ago)
 Apply Rate: 39.97 MByte/s
 Real Time Query: OFF
 Instance(s):
 TESTSTY

Database Warning(s):
 ORA-16857: standby disconnected from redo source for longer than 
specified threshold

Database Status:
WARNING

Checked both primary and standby database, the  standby online redo logs have been created. But the size of the standby online redo logs are different from the database redo logs for both primary and secondary database.

— on standby :

SQL> select GROUP#,THREAD# ,BYTES/1024/1024 from v$standby_log;

GROUP#     THREAD#    BYTES/1024/1024
---------- ---------- ---------------
 4         1           50
 5         1           50
 6         1           50
 7         1           50

SQL> select GROUP#,THREAD#,BYTES/1024/1024 from v$log;

GROUP#     THREAD#    BYTES/1024/1024
---------- ---------- ---------------
 1         1          100
 3         1          100
 2         1          100

— On Primary

SQL> select GROUP#,THREAD# ,BYTES/1024/1024 from v$standby_log;

GROUP#     THREAD#    BYTES/1024/1024
---------- ---------- ---------------
 4         1          50
 5         1          50
 6         1          50
 7         1          50

SQL> select GROUP#,THREAD#,BYTES/1024/1024 from v$log;

GROUP#     THREAD#    BYTES/1024/1024
---------- ---------- ---------------
 1         1          100
 2         1          100
 3         1          100

Drop all standby online redo logs on both primary and standby databases, and recreate them again with same size as redo logfiles.

-- for standby db which is under recovery, recovery needs to be stopped first
SQL>alter database recover managed standby database cancel;

SQL>alter database add standby logfile thread 1 group 4 size 100m;

It will create standby online redo logs for one under +FRA, and another one under from “db_create_file_dest” parameter, if “db_create_online_log_dest_x” are not defined.

Finally restart the recovery process, then everything is fine.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE 
THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE;

Database altered.

alert.log:

RFS[1]: Selected log 4 for thread 1 sequence 41436 dbid 1345227970 branch 816878594
Mon Feb 19 14:38:25 2018
..
.
Recovery of Online Redo Log: Thread 1 Group 4 Seq 41436 Reading mem 0
..
.

DGMGRL of DataGuard:

DGMGRL> show database 'TESTSTY';

Database - TESTSTY

Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 0 seconds (computed 0 seconds ago)
 Apply Lag: 0 seconds (computed 0 seconds ago)
 Apply Rate: 208.00 KByte/s
 Real Time Query: OFF
 Instance(s):
 TESTSTY

Database Status:
SUCCESS

Warning: ORA-16857: standby disconnected from redo source for longer than specified threshold

One single 11.2.0.4 instance Oracle database alert log shows following information;

 RFS[1]: No standby redo logfiles available for thread 1

Data Guard shows ORA-16857 error:

DGMGRL> show database "TESTSTY";

Database - TESTSTY

Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 10 minutes 32 seconds (computed 48 seconds ago)
 Apply Lag: 10 minutes 32 seconds (computed 48 seconds ago)
 Apply Rate: 39.97 MByte/s
 Real Time Query: OFF
 Instance(s):
 TESTSTY

Database Warning(s):
 ORA-16857: standby disconnected from redo source for longer than 
specified threshold

Database Status:
WARNING

Checked both primary and standby database, the  standby online redo logs have been created. It is strange to see some of the thread id is different between primary and secondary database.

— on standby :

SQL> select GROUP#,THREAD#,BYTES/1024/1024 ,ARCHIVED,STATUS
       from v$standby_log; 2

GROUP#     THREAD#    BYTES/1024/1024 ARC STATUS
---------- ---------- --------------- --- ----------
 4         1          50              NO UNASSIGNED
 5         0          50              YES UNASSIGNED
 6         0          50              YES UNASSIGNED
 7         0          50              YES UNASSIGNED

— On Primary

SQL> select GROUP#,THREAD#,BYTES/1024/1024 ,ARCHIVED,STATUS 
       from v$standby_log;

GROUP#     THREAD#    BYTES/1024/1024 ARC STATUS
---------- ---------- --------------- --- ----------
 4         0          50              YES UNASSIGNED
 5         0          50              YES UNASSIGNED
 6         0          50              YES UNASSIGNED
 7         0          50              YES UNASSIGNED

Drop all standby online redo logs on both primary and standby databases, and recreate then again by specify “thread 1” explicitly.

-- for standby db which is under recovery, recovery needs to be stopped first
SQL>alter database recover managed standby database cancel;

SQL>alter database add standby logfile thread 1 group 4 size 50m;

It will create standby online redo logs for one under +FRA, and another one under from “db_create_file_dest” parameter, if “db_create_online_log_dest_x” are not defined.

Finally restart the recovery process, everything is fine.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE 
THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE;

Database altered.

“ORA-16644: apply instance not available” While DGMGRL Switchover

For some reason, DG broker does not recognise the applying instance of a standby database.

DGMGRL> switchover to STDBYDB;
Performing switchover NOW, please wait...
Operation requires a connection to instance "STDBYDB2" on database
 "STDBYDB"
Connecting to instance "STDBYDB2"...
Connected as SYSDBA.
Error: ORA-16644: apply instance not available

Failed.
Unable to switchover, primary database is still "PRIMDB"
DGMGRL>

DIAGNOSIS

a) Check and find one of the instance STDBYDB1 is the recovery instance. Stop the standby recovery process, and then start it again.  Data Guard still cannot find it.

b) Stop whole standby database, and restart it again. Check standby applying is undergoing. Data Guard still cannot find it.

RESOLUTION

DGMGRL> edit database STDBYDB set property 'PreferredApplyInstance'
='STDBYDB1';
Property "PreferredApplyInstance" updated

$srvctl stop database -d STDBYDB;
$srvctl start database -d STDBYDB;

DGMGRL> edit database STDBYDB set property 'PreferredApplyInstance'='';
Property "PreferredApplyInstance" updated

“ORA-01017: invalid username/password; logon denied” While Data Guard Switchover

Data Guard switchover failed with “ORA-01017” error :

DGMGRL> connect /
Connected as SYSDG.

DGMGRL> switchover to STDBYDB;
Performing switchover NOW, please wait...
Operation requires a connection to instance "STDBYDB2" on database "STDBYDB"
Connecting to instance "STDBYDB2"...
ORA-01017: invalid username/password; logon denied

Warning: You are no longer connected to ORACLE.

RESOLUTION

Logon as SYS, then switchover will be successful.

DGMGRL> connect SYS
Connected as SYSDG.
DGMGRL> switchover to STDBYDB;

Performing switchover NOW, please wait...

Operation requires a connection to instance "STDBYDB2" on database "STDBYDB"

Connecting to instance "STDBYDB2"...

Connected as SYSDBA.

New primary database "STDBYDB" is opening...

Oracle Clusterware is restarting database "PRIMDB" ...

Switchover succeeded, new primary is "STDBYDB"