CHM – Make DBA Life Easy

The Cluster Health Advisor (CHA) detected an unexpected increase in CPU utilization by databases or applications on this node. Identify CPU intensive processes and databases by reviewing Cluster Health Monitoring (CHM) data. Relocate databases to less busy machines, or limit the number of connections to databases on this node. Add nodes if more resources are required

A client keeps receiving the following alerts from OEM:

Host=racnode1.ractest
Target type=Cluster 
Target name=RACTEST-CLUSTER
Incident creation time=15/12/2021 2:11:20 PM AEDT 
Last updated time=15/12/2021 2:11:20 PM AEDT 
Message=Host CPU Utilization on Host racnode1 Database/Cluster RACTEST-CLUSTER Instance . The Cluster Health Advisor (CHA) detected an unexpected increase in CPU utilization by databases or applications on this node. Identify CPU intensive processes and databases by reviewing Cluster Health Monitoring (CHM) data. Relocate databases to less busy machines, or limit the number of connections to databases on this node. Add nodes if more resources are required. 
Severity=Warning 
Incident ID=501920 
Event count=1 
Incident Status=New 
Escalated=No 
Priority=High 
Incident owner=SYSMAN
Incident Acknowledged By Owner=No 
Categories= 
Rule Name=RuleSet - SYSMAN,Email SYSMAN About the Incidents 
Rule Owner=SYSMAN
...
..
.

Now let’s look into further where the alerts come from, the logs, GI processes and related metrics configurations on OEM.

Read more of this content when you subscribe today.

“Fatal NI Connect Error 12516″ and ”TNS-12564: TNS:connection refused” In Clusterware Alert.log when ologgerd Connects to GIMR

ISSUES

There are repeating following errors in 19c CRS alert.log by ologgered trying to connect to GIMR.

Fatal NI connect error 12516, connecting to:
(DESCRIPTION=(CONNECT_TIMEOUT=60)(ADDRESS=(PROTOCOL=tcp)(HOST=racnode-cluster-scan.virtuallab)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=GIMR_DSCREP_10)(CID=(PROGRAM=ologgerd)(HOST=racnode1.virtuallab)(USER=root)))(SECURITY=(ENCRYPTION_CLIENT=requested)))
VERSION INFORMATION:
TNS for Linux: Version 19.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production

Version 19.8.0.0.0
Time: 08-AUG-2020 18:15:01
Tracing not turned on.
Tns error struct:
ns main err code: 12564

TNS-12564: TNS:connection refused
ns secondary err code: 12560
nt main err code: 524

TNS-00524: Current operation is still in progress
nt secondary err code: 115
nt OS err code: 0

INVESTIGATION

Check mgmtlsnr or SCAN listeners, and found -MGMTDB PDB service ” gimr_dscrep_10″ is not registered properly.

[grid@racnode1 trace]$ lsnrctl services listener_scan2
...
..
.
Service "gimr_dscrep_10" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service…
Handler(s):
"DEDICATED" established:0 refused:0 state:blocked
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.100.62)(PORT=1526))

Subscribe to get access

Read more of this content when you subscribe today.

How to Recreate MGMTDB Database in 19c GI

This post demonstrates how to drop and recreate the 19c MGMTDB in a couple of minutes.

Check MGMTDB Current Status

[grid@racnode1 ~]$ srvctl status mgmtdb
Database is enabled
Instance -MGMTDB is running on node racnode1

Stop ora.crf Resource on All Nodes

[root@racnode1 ~]# /u01/app/19.0.0/grid/bin/crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'racnode1'
CRS-2677: Stop of 'ora.crf' on 'racnode1' succeeded
[root@racnode1 ~]#

[root@racnode2 ~]# /u01/app/19.0.0/grid/bin/crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'racnode2'
CRS-2677: Stop of 'ora.crf' on 'racnode2' succeeded
[root@racnode2 ~]#

Subscribe to get access

Read more of this content when you subscribe today.

Check MGMTDB Status after Recreation

[grid@racnode1 ~]$ srvctl status mgmtdb
Database is enabled
Instance -MGMTDB is running on node racnode2

ORA-12516 from “oclumon manage -repos ” command

SYMPTOM

While running oclumon command in 12.2 GI to check CHM retention, and get ORA-12516 error:

$ oclumon manage -repos checkretentiontime 259200
Failed change retention. Error returned ORA-12516: TNS:listener could not find available handler with matching protocol stack

INVESTIGATION

Check MGMTLSNR is enabled and running

$ srvctl status mgmtlsnr
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): racnode1

Services are all registered on MGMTLSNR

Both private and private HAIP are registered in listener.

$ lsnrctl status MGMTLSNR
...
..
.
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=MGMTLSNR)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.1.1.11)(PORT=1526)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=169.254.105.132)(PORT=1526)))
..
.
Service "_mgmtdb" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service…
Service "gimr_dscrep_10" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service…
The command completed successfully

MGMTDB is DISABLED

$ srvctl status mgmtdb
Database is disabled
Instance -MGMTDB is running on node racnode1

strace log

Log shows private IP access issue.

66684 [00007f94efff1687] getsockname(50, {sa_family=AF_INET, sin_port=htons(48322), sin_addr=inet_addr("169.254.105.132")}, [16]) = 0
66684 [00007f94efff1657] getpeername(50, {sa_family=AF_INET, sin_port=htons(61021), sin_addr=inet_addr("169.254.105.132")}, [16]) = 0
66684 [00007f94f2603aeb] recvfrom(50, 0x1a63888, 10240, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
66684 [00007f94f2603aeb] recvfrom(50, 0x1a63888, 10240, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
66684 [00007f94f2603aeb] recvfrom(50, 0x1a63888, 10240, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
66684 [00007f94f2603aeb] recvfrom(50, 0x1a63888, 10240, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
66684 [00007f94f2603aeb] recvfrom(50, 0x1a63888, 10240, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

LOCAL_LISTENER

local_listener shows without private IP.

SQL> show parameter local_listener

NAME                 TYPE     VALUE
-------------------  ------  -------------------------------
local_listener       string   (ADDRESS=(PROTOCOL=TCP)(HOST=
                                     10.1.1.11)(PORT=1526))

CAUSES

local_listener includes private IP only, but private HAIP is missing.

SOLUTION

ENABLE MGMTDB

$ srvctl enable mgmtdb

$ srvctl status mgmtdb
Database is enabled
Instance -MGMTDB is running on node racnode1

Add private HAIP into local_listener

SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)(PORT=1526))','(ADDRESS=(PROTOCOL=TCP)(HOST=169.254.105.132)(PORT=1526))' scope=both;

System altered.

SQL> show parameter local_listener

NAME          TYPE    VALUE
------------- ------- ------------------------------
local_listener string (ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)
                      (PORT=1526)), (ADDRESS=(PROTOCOL=TCP)
                      (HOST=169.254.105.132)(PORT=1526))

Run oclumon command again successfully.

$ oclumon manage -repos checkretentiontime 259200
The Cluster Health Monitor repository can support the desired retention for 2 hosts

How to Shrink Tablespace SYSMGMTDATA in -MGMTDB

Cluster Health Monitor ( CHM) tables could be fragmented in MGMTDB database

In another post Tablespace SYSMGMTDATA Is Full in GIMR Database (-MGMTDB), we introduced how to truncate Cluster Health Monitor ( CHM) tables to get space back, but did not solve the tablespace high water mark( HWM ) problem.

This post demonstrate how to shrink tablespace SYSMGMTDATA online without outage.

Subscribe to get access

Read more of this content when you subscribe today.

SHRINK tablespace SYSMGMTDATA

SYSMGMTDATA tablespace shrinked from 30GB down to 4GB, also give a maximum size to avoid out of space issue.

SQL>alter database datafile '+OCR_VOTE/_MGMTDB/FD9B43BF6A646F8CE043B6A9E80A2815/DATAFILE/sysmgmtdata.269.922884445' autoextend on maxsize 4096m;

Database altered.