CRS-2883: Resource ‘ora.cluster_interconnect.haip’ failed during Clusterware stack start

When starting CRS on the second node of a 12.1.0.2 RAC cluster, the startup failed with a “CRS-2883” error.

[root@racnode2]# /u01/app/12.1.0/grid/bin/crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'racnode2'
CRS-2672: Attempting to start 'ora.evmd' on 'racnode2'
CRS-2676: Start of 'ora.mdnsd' on 'racnode2' succeeded
CRS-2676: Start of 'ora.evmd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'racnode2'
CRS-2676: Start of 'ora.gpnpd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'racnode2'
CRS-2676: Start of 'ora.gipcd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'racnode2'
CRS-2676: Start of 'ora.cssdmonitor' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'racnode2'
CRS-2672: Attempting to start 'ora.diskmon' on 'racnode2'
CRS-2676: Start of 'ora.diskmon' on 'racnode2' succeeded
CRS-2676: Start of 'ora.cssd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'racnode2'
CRS-2672: Attempting to start 'ora.ctssd' on 'racnode2'
CRS-2883: Resource 'ora.cluster_interconnect.haip' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.

Check the ohasd_orarootagent_root.trc trace file:

...
..
.
2019-07-29 19:41:59.985295 : USRTHRD:3605423872: {0:9:3} HAIP: to set HAIP
2019-07-29 19:42:00.036464 : USRTHRD:3605423872: {0:9:3} HAIP: number of inf from clsinet -- 1
2019-07-29 19:42:00.037488 : CSSCLNT:3605423872: clssgsgrppridata: buffer too small - bufsize(4) < datasize(8)
2019-07-29 19:42:00.037795 : USRTHRD:3605423872: {0:9:3} CssGroup::getPrivateGroupData clssgsgrppridata() error, rc = 13
2019-07-29 19:42:00.037868 : USRTHRD:3605423872: {0:9:3} [NetHAMain] thread hit exception CssGroup::getPrivateGroupData clssgsgrppridata() error
2019-07-29 19:42:00.037881 : USRTHRD:3605423872: {0:9:3} [NetHAMain] thread stopping
...
..
.

CAUSE

Patch 29698592 (Grid Infrastructure Patch Set Update 12.1.0.2.190716) had been applied on the first node, but not on the second node. The patched node publishes HAIP group data in a larger format than the unpatched node expects, which is why the trace shows clssgsgrppridata failing with “buffer too small - bufsize(4) < datasize(8)”.
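The mismatch can be confirmed by comparing the GI home inventories on the two nodes:

[grid@racnode1]$ $GI_HOME/OPatch/opatch lspatches
[grid@racnode2]$ $GI_HOME/OPatch/opatch lspatches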

RESOLUTION

All nodes must be patched with the same GI patches. Since CRS cannot be started on the second node, opatchauto cannot be used there, so the GI patches have to be applied manually on the second node.

On second node:

1) Kill all the cluster processes manually.
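A minimal sketch for finding any leftover clusterware processes; the daemon list here is an assumption based on common 12.1 process names, so verify every PID before killing it:

# ps -ef | egrep 'ohasd|oraagent|orarootagent|mdnsd|evmd|gpnpd|gipcd|cssd|ctssd|crsd|ologgerd|osysmond' | grep -v egrep
# kill -9 <pid>        # repeat for each process; <pid> is a placeholder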

2) Make GI_HOME writable by the GI owner “grid”:

# chmod -R 775 $GI_HOME

3) Manually apply GI patches.

--OCW PSU
[grid@racnode2]$ $GI_HOME/OPatch/opatch apply -oh $GI_HOME -local /tmp/29698592/29509318

--ACFS PSU
[grid@racnode2]$ $GI_HOME/OPatch/opatch apply -oh $GI_HOME -local /tmp/29698592/29423125

--DBWLM PSU
[grid@racnode2]$ $GI_HOME/OPatch/opatch apply -oh $GI_HOME -local /tmp/29698592/26983807

--DB PSU
[grid@racnode2]$ $GI_HOME/OPatch/opatch apply -oh $GI_HOME -local /tmp/29698592/29494060

4) Starting CRS still fails:

[root@racnode2]# $GI_HOME/bin/crsctl start crs -wait
CRS-6706: Oracle Clusterware Release patch level ('3536172590') does 
not match Software patch level ('0'). Oracle Clusterware cannot be started.
CRS-4000: Command Start failed, or completed with errors.

Still on node 2.

For 12.2, run “<GI_HOME>/crs/install/rootcrs.pl -prepatch” and then “<GI_HOME>/crs/install/rootcrs.pl -postpatch” as the root user to correct the patch level.
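The 12.2 sequence would look like this (shown for reference only; this cluster is 12.1):

[root@racnode2]# <GI_HOME>/crs/install/rootcrs.pl -prepatch
[root@racnode2]# <GI_HOME>/crs/install/rootcrs.pl -postpatch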

For 12.1:

[root@racnode2 ]# $GI_HOME/crs/install/rootcrs.sh -patch
Using configuration parameter file: /u01/app/12.1.0/grid/crs/install/crsconfig_params
2019/07/29 22:26:41 CLSRSC-4015: Performing install or upgrade action for Oracle Trace File Analyzer (TFA) Collector.

2019/07/29 22:28:40 CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.

2019/07/29 22:28:56 CLSRSC-329: Replacing Clusterware entries in file 'oracle-ohasd.service'

CRS-4123: Oracle High Availability Services has been started.
CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'racnode2'
CRS-2672: Attempting to start 'ora.evmd' on 'racnode2'
CRS-2676: Start of 'ora.mdnsd' on 'racnode2' succeeded
CRS-2676: Start of 'ora.evmd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'racnode2'
CRS-2676: Start of 'ora.gpnpd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'racnode2'
CRS-2676: Start of 'ora.gipcd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'racnode2'
CRS-2676: Start of 'ora.cssdmonitor' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'racnode2'
CRS-2672: Attempting to start 'ora.diskmon' on 'racnode2'
CRS-2676: Start of 'ora.diskmon' on 'racnode2' succeeded
CRS-2676: Start of 'ora.cssd' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'racnode2'
CRS-2672: Attempting to start 'ora.ctssd' on 'racnode2'
CRS-2676: Start of 'ora.ctssd' on 'racnode2' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'racnode2' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'racnode2'
CRS-2681: Clean of 'ora.asm' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'racnode2'
CRS-2676: Start of 'ora.asm' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'racnode2'
CRS-2676: Start of 'ora.storage' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'racnode2'
CRS-2676: Start of 'ora.crf' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'racnode2'
CRS-2676: Start of 'ora.crsd' on 'racnode2' succeeded
CRS-6017: Processing resource auto-start for servers: racnode2
CRS-2672: Attempting to start 'ora.net1.network' on 'racnode2'
CRS-2676: Start of 'ora.net1.network' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.ons' on 'racnode2'
CRS-2673: Attempting to stop 'ora.racnode2.vip' on 'racnode1'
CRS-2677: Stop of 'ora.racnode2.vip' on 'racnode1' succeeded
CRS-2672: Attempting to start 'ora.racnode2.vip' on 'racnode2'
CRS-2676: Start of 'ora.ons' on 'racnode2' succeeded
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'racnode1'
CRS-2676: Start of 'ora.racnode2.vip' on 'racnode2' succeeded
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'racnode1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'racnode1'
CRS-2672: Attempting to start 'ora.LISTENER.lsnr' on 'racnode2'
CRS-2677: Stop of 'ora.scan1.vip' on 'racnode1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'racnode2'
CRS-2676: Start of 'ora.scan1.vip' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'racnode2'
CRS-2676: Start of 'ora.LISTENER.lsnr' on 'racnode2' succeeded
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'racnode2' succeeded
CRS-2672: Attempting to start 'ora.ractest.db' on 'racnode2'
CRS-2676: Start of 'ora.ractest.db' on 'racnode2' succeeded
CRS-6016: Resource auto-start has completed for server racnode2
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
Oracle Clusterware active version on the cluster is [12.1.0.2.0]. 
The cluster upgrade state is [NORMAL]. 
The cluster active patch level is [3536172590].
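The cluster and node patch levels can be re-checked at any time; both commands exist in 12.1.0.2:

[root@racnode2]# $GI_HOME/bin/crsctl query crs activeversion -f
[root@racnode2]# $GI_HOME/bin/crsctl query crs softwarepatch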

Finally, use opatchauto to apply the remaining non-GI patches; opatchauto is smart enough to skip the GI patches that are already applied.

[root@racnode2]# $GI_HOME/OPatch/opatchauto apply /tmp/29698592

SUMMARY

  • All nodes must be patched with the same patches, using the latest OPatch (patch 6880880).
  • opatchauto requires that CRS can be started and stopped on the node. If CRS cannot be started, apply the GI patches manually first, then use “rootcrs.sh” (12.1) or “rootcrs.pl” (12.2) to correct the patch level on the problematic node. After that, CRS can be started on all nodes.

How to Open ASM Trace File Starting With ‘+’

Oracle ASM trace file names always start with ‘+’. vi treats a leading ‘+’ on the command line as a directive (for example, ‘+10’ means “open at line 10”) rather than a file name, and since the ‘+’ is interpreted by vi itself rather than by the shell, quoting or escaping does not help. None of the following commands work:

$ vi +ASM1_ora_20007.trc

or

$ vi \+ASM1_ora_20007.trc

or

$ vi *ASM1_ora_20007.trc

or

$ vi "+ASM1_ora_20007.trc"

Such files can be opened in either of these two ways:

1) vi --

$ vi -- +ASM1_ora_20007.trc

OR

2) vi ./

$ vi ./+ASM1_ora_20007.trc
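The “./” prefix is the general-purpose escape for any file name a command might misinterpret; it works just as well for names starting with ‘-’, for example:

$ cat ./+ASM1_ora_20007.trc
$ rm ./-oldfile        # '-oldfile' is a hypothetical name starting with '-'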

ORA-00603 ORA-27504 ORA-27300 ORA-27301 ORA-27302 ORA-603 in ASM alert log file

The following error messages appear in the ASM alert log file:

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_39281.trc (incident=198012):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
opidrv aborting process M000 ospid (39281) as a result of ORA-603
2019-03-12T01:00:18.919871+11:00
Process m000 died, see its trace file

+ASM1_m000_39281.trc:

...
..
.
SKGXP:[7f620a621930.0]{0}: SKGXPVFYNET: Socket self-test could not verify successful transmission of 32768 bytes (mtype 61).
SKGXP:[7f620a621930.1]{0}: The network is required to support UDP protocol sends of this size. Socket is bound to 169.254.174.228.
SKGXP:[7f620a621930.2]{0}: phase 'send', 0 tries, 100 loops, 13590 ms (last)
struct ksxpp * ksxppg_ [0x7f620a68a770, 0x7f6204ff1350) = 0x7f6204ff1348
...
..
.

Workaround

1) Shrink the database instance SGA to give memory back to the OS. Once the OS has more free memory, the issue goes away.

Physical memory utilisation (%): [chart not preserved]
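A minimal sketch of the shrink, assuming an spfile; the 8G target and the “ractest” database name are placeholders, and a full instance restart is needed for the memory to be returned to the OS:

$ sqlplus / as sysdba <<'EOF'
-- placeholder target: size it to leave enough free memory for the OS
ALTER SYSTEM SET sga_max_size = 8G SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET sga_target   = 8G SCOPE = SPFILE SID = '*';
EOF
$ srvctl stop database -db ractest
$ srvctl start database -db ractest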

OR

2) Oracle documents this issue in Doc ID 2041723.1; the fix is to change the MTU of the loopback interface and the value of the kernel parameter min_free_kbytes.

a) Lower the loopback MTU to 16436 by adding the following line to /etc/sysconfig/network-scripts/ifcfg-lo:

MTU=16436

Then restart the network service:

# systemctl restart network.service
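The new MTU can be verified afterwards:

# ip link show lo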

b) Increase vm.min_free_kbytes to 0.4% of the server’s total physical memory by adding the following to /etc/sysctl.conf (2097152 KB = 2 GB, which corresponds to roughly 512 GB of RAM):

  vm.min_free_kbytes = 2097152

Then run the following command to make it take effect:

# sysctl -p
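To derive the value for a given server, 0.4% of MemTotal can be computed straight from /proc/meminfo; a one-liner sketch:

$ awk '/^MemTotal/ {printf "vm.min_free_kbytes = %d\n", $2 * 0.004}' /proc/meminfo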

srvctl failed to start up local listener

SYMPTOMS

After the server rebooted, all Oracle Grid Infrastructure (GI) resources came up as expected except the local listener. Manually starting the local listener fails with the following error messages.

$ srvctl start listener -node racnode1
PRCR-1013 : Failed to start resource ora.LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node racnode1
...
..
.
CRS-2674: Start of 'ora.LISTENER.lsnr' on 'racnode1' failed


Same SQL with Significantly Different Response Times on RAC Nodes

SITUATION

The same SQL query takes 2.5 times longer to run on node 1 than on node 2.

Both the Unix administrator and the DBA have checked and confirmed that the workload is the same on both nodes. All client connections to the RAC database are load balanced through a RAC service.
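The service configuration can be double-checked with srvctl. A minimal sketch; “ractest” is an assumed database name (it appears in a resource listing earlier in these notes), so adjust to your environment:

$ srvctl config service -db ractest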

INVESTIGATIONS

1) Open a new session, then run the same SQL on node 1. There are no GC (global cache) events on node 1 or node 2:

-- Check current session SID:

SQL> SELECT SID, SERIAL#
     FROM   V$SESSION
     WHERE  SID = Sys_Context('USERENV', 'SID'); 

SID        SERIAL#
---------- ----------
306         24423

-- Check session stats:

SQL> SELECT a.username, c.name, b.value
     FROM   v$session a,
            v$sesstat b,
            v$statname c
     WHERE  a.sid = b.sid
     AND    a.sid = 306
     AND    b.statistic# = c.statistic#
     ORDER  BY c.name;

USERNAME NAME                                   VALUE
-------- -------------------------------------- ----------
SYS      CPU used by this session                2419
SYS      CPU used when call started              2418
..
.

SYS     gc CPU used by this session                 0
SYS     gc blocks compressed                        0
SYS     gc blocks corrupt                           0
SYS     gc blocks lost                              0
SYS     gc claim blocks lost                        0
SYS     gc cr block build time                      0
SYS     gc cr block flush time                      0
SYS     gc cr block receive time                    0
SYS     gc cr block send time                       0
SYS     gc cr blocks received                       0
SYS     gc cr blocks served                         0
SYS     gc current block flush time                 0
SYS     gc current block pin time                   0
SYS     gc current block receive time               0
SYS     gc current block send time                  0
SYS     gc current blocks received                  0
SYS     gc current blocks served                    0
SYS     gc force cr disk read                       0
SYS     gc kbytes saved                             0
SYS     gc kbytes sent                              0
SYS     gc local grants                             0
SYS     gc read wait failures                       0
SYS     gc read wait time                           0
SYS     gc read wait timeouts                       0
SYS     gc read waits                               0
SYS     gc reader bypass grants                     0
SYS     gc remote grants                            0

-- node 2

USERNAME NAME                                   VALUE
-------- -------------------------------------- ----------
SYS     CPU used by this session               1006
SYS     CPU used when call started             1007
..
.

SYS    gc CPU used by this session                0
SYS    gc blocks compressed                       0
SYS    gc blocks corrupt                          0
SYS    gc blocks lost                             0
SYS    gc claim blocks lost                       0
SYS    gc cr block build time                     0
SYS    gc cr block flush time                     0
SYS    gc cr block receive time                   0
SYS    gc cr block send time                      0
SYS    gc cr blocks received                      0
SYS    gc cr blocks served                        0
SYS    gc current block flush time                0
SYS    gc current block pin time                  0
SYS    gc current block receive time              0
SYS    gc current block send time                 0
SYS    gc current blocks received                 0
SYS    gc current blocks served                   0
SYS    gc force cr disk read                      0
SYS    gc kbytes saved                            0
SYS    gc kbytes sent                             0
SYS    gc local grants                            0
SYS    gc read wait failures                      0
SYS    gc read wait time                          0
SYS    gc read wait timeouts                      0
SYS    gc read waits                              0
SYS    gc reader bypass grants                    0
SYS    gc remote grants                           0

2) Check the work done by the query with SET AUTOTRACE ON STATISTICS (or SET AUTOTRACE TRACEONLY STATISTICS).

“consistent gets” means reads from the buffer cache; note that there are no physical reads:

SQL> SET AUTOTRACE ON STATISTICS
SQL> @sqlscript.sql

no rows selected

Statistics
----------------------------------------------------------
0       recursive calls
0       db block gets
3038815 consistent gets
0       physical reads
0       redo size
5076    bytes sent via SQL*Net to client
4264    bytes received via SQL*Net from client
1       SQL*Net roundtrips to/from client
6       sorts (memory)
0       sorts (disk)
0       rows processed

3) tkprof shows that the time is spent in the Fetch phase. Both nodes do the same 3,038,773 buffer gets with zero disk reads, so the difference is pure CPU time:

-- node 1

call    count  cpu      elapsed  disk  query   current    rows
------- ------ -------- -------- ----- ------- ---------- ------
Parse   2      0.10     0.10     0     0       0          0
Execute 3      0.00     0.00     0     0       0          0
Fetch   1     24.29     24.55    0     3038773 0          0
------- ----- -------- -------- ---- ------- -------- ----------
total   6     24.40     24.65    0    3038773  0          0

-- node 2

call    count  cpu      elapsed  disk  query   current    rows
------- ------ -------- -------- ----- ------- ---------- -------
Parse   2      0.04     0.04     0     0       0           0
Execute 3      0.00     0.00     0     0       0           0
Fetch   1      9.86     9.91     0    3038773  0           0
------- ----- -------- -------- ---- ------- -------- ----------
total   6      9.91     9.95     0    3038773  0           0
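For reference, a minimal way to produce reports like the two above; the trace directory and file name are hypothetical:

SQL> ALTER SESSION SET sql_trace = TRUE;
SQL> @sqlscript.sql
SQL> ALTER SESSION SET sql_trace = FALSE;

$ tkprof /u01/app/oracle/diag/rdbms/ractest/ractest1/trace/ractest1_ora_12345.trc node1.prf sys=no sort=fchela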

4) Run a simple shell loop to confirm that node 1’s CPU performs worse than node 2’s:

-- node 1 (136 seconds)

$ date; i=0; while (( i < 10000000 )); do (( i ++ )); done; date

Thu Sep 27 14:51:58 AEST 2018
Thu Sep 27 14:54:14 AEST 2018

-- node 2 (52 seconds)

$ date; i=0; while (( i < 10000000 )); do (( i ++ )); done; date

Thu Sep 27 14:52:10 AEST 2018
Thu Sep 27 14:53:02 AEST 2018
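Equivalently, time(1) reports the elapsed time directly:

$ time bash -c 'i=0; while (( i < 10000000 )); do (( i++ )); done'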

5) Shut down both GI and the RAC database; the issue still exists.

6) After the server is rebooted, everything returns to normal; node 1 and node 2 perform the same.

CAUSE

The slowness persisted with both GI and the database shut down, and disappeared only after an OS reboot, so the root cause is at the OS kernel level rather than in Oracle.