How To Manage the Cluster Health Monitor ( CHM ) Repository

Cluster Health Monitor ( CHM ) Repository size should be reviewed periodically to meet business needs and OCR/VOTE disk availability.

Where is Cluster Health Monitor (CHM) Repository ?

In 11.2, the CHM repository is stored in a Berkley Database . The default location of the CHM repository is $GI_HOME/crf/db.

In 12.1, the CHM repository is hosted in the Grid Infrastructure Management Repository (GIMR). The default location for GIMR is stored in the ASM diskgroup which stores the OCR and voting disk .

What is the recommended CHM data retention ?

Oracle Support recommends that the CHM repository be sized according to 72 hours ( 259,200 seconds )(three days) of data retention (e.g.., one weekend worth).

What is the minimum size of  CHM repository ?

For 11.2 GI, one day of data retention for each node requires  867 MB around. So the size of the CHM repository needed to retain 72 hours of data would be as follows:

~72 hours of CHM data retention = NumberOfNodes * 3Days * 867 MB

So for a 2 nodes cluster :

~72 hours of CHM data retention = 2 ( nodes ) * 3 ( days ) * 867 ( per day per node )(5202 MB)

For 12.1, one day of data retention for each node requires 750 MB around, so the size of the CHM repository needed to retain 72 hours of data would be as follows:

~72 hours of CHM data retention = NumberOfNodes * 3Days * 750 MB

So for a 2 node cluster

~72 hours of CHM data retention = 2 ( nodes ) * 3( days ) * 750 ( per day per node ) (4500 MB)

How to see the current CHM repository retention in seconds ?

[grid@racnode1 ~]$ /u01/app/12.1.0/grid/bin/oclumon manage -get repsize

CHM Repository Size = 272580 seconds

How to resize the CHM Repository retention ?

For 11.2 GI:

To determine the current location of the CHM repository:

$oclumon manage -get reppath
 To move and resize the CHM repository for 3 days retention for a 2 nodes cluster:

$ oclumon manage -repos reploc path* -maxspace 5202


* where path = directory path for new location of the CHM repository

For 12.1:

To resize the CHM Repository with one command to result in 3 days retention, eg., for a 2  nodes cluster:

$ oclumon manage -repos changerepossize 4500

How to verify the change in repository size has met the desired retention ?

In 12.1.0.1

$ oclumon manage -repos changeretentiontime 260000

This command does not make any changes. It is more like a “what-if”, ie., what if I wanted to change the retention time, how much space would be required ?

In 12.1.0.2 the syntax was changed and should be used as follows :

[grid@racnode1 ~]$ oclumon manage -repos checkretentiontime 260000

The Cluster Health Monitor repository can support the desired retention for 2 hosts

Grid Infrastructure Management Repository (GIMR)

Since -MGMT DB uses OCR/Voting disk by default, It is strongly recommended to check MGMT database tablespace usage. In some GI versions, we see MGMT database used all OCR/Voting disk.

What is Management Repository?

Grid Infrastructure Management Repository ( GIMR ) is a single instance database managed by GI. It will be up and running on one node in the cluster. If the hosting node is down, the database will be automatically failed over to other node.

What’s the purpose of Management Database?

GIMR will be the central repository to store Cluster Health Monitor (aka CHM/OS, ora.crf) and other data in 12c.

Where does Management Database store it’s datafiles?

In 12R1, by default, Management database uses the same shared storage as OCR/Voting File.

Can Management Database  be turned on/off  when you want ?

In 12.1.0.1, GIMR is optional, if Management Database is not selected to be configured during installation/upgrade OUI, all features (Cluster Health Monitor (CHM/OS) etc) that depend on it will be disabled.

Note: there’s no supported procedure to enable Management Database once the GI stack is configured

This changed in 12.1.0.2, it’s mandatory to have GIMR and must not be turned off

What are the resources associated with Management Database?

The following resources from “crsctl stat res -t” are for Management Database:

ora.mgmtdb
1 ONLINE ONLINE racnode1 Open,STABLE
ora.MGMTLSNR
1 ONLINE ONLINE racnode1 169.254.146.121 172.16.100.61,STABLE

On OS level, the database “-MGMTDB” and listener MGMTLSNR are for Management Database:

[grid@racnode1 ~]$ ps -ef| grep pmon_-MGMTDB
grid 4210 1 0 12:07 ? 00:00:00 mdb_pmon_-MGMTDB

[grid@racnode1 ~]$ ps -ef| grep MGMTLSNR
grid 4015 1 0 12:07 ? 00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr MGMTLSNR -no_crs_notify -inherit

How to start/stop Management Database ?

If Management Database is  down for some reason, the following srvctl command can be used to start it:

Usage: srvctl start mgmtdb [-startoption <start_option>] [-node <node_name>]
Usage: srvctl start mgmtlsnr [-node <node_name>]
[grid@racnode1 ~]$ srvctl status MGMTDB
Database is enabled
Instance -MGMTDB is running on node racnode1

[grid@racnode1 ~]$ srvctl status MGMTLSNR
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): racnode1

[grid@racnode1 ~]$ srvctl stop MGMTDB
[grid@racnode1 ~]$ srvctl stop MGMTLSNR

[grid@racnode1 ~]$ srvctl status MGMTDB
Database is enabled
Database is not running.
[grid@racnode1 ~]$ srvctl status MGMTLSNR
Listener MGMTLSNR is enabled
Listener MGMTLSNR is not running

[grid@racnode1 ~]$ srvctl start MGMTDB
[grid@racnode1 ~]$ srvctl start MGMTLSNR
PRCC-1014 : MGMTLSNR was already running
PRCR-1004 : Resource ora.MGMTLSNR is already running
PRCR-1079 : Failed to start resource ora.MGMTLSNR
CRS-5702: Resource 'ora.MGMTLSNR' is already running on 'racnode1'

[grid@racnode1 ~]$ srvctl status MGMTDB
Database is enabled
Instance -MGMTDB is running on node racnode1

[grid@racnode1 ~]$ srvctl status MGMTLSNR
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): racnode1

How to access  to Management Database trace file etc?

Since the database name starts with “-” sign, “./” needs to be specified to avoid error: 

[grid@racnode1 grid]$ cd $ORACLE_BASE
[grid@racnode1 grid]$ cd diag
[grid@racnode1 rdbms]$ cd _mgmtdb
[grid@racnode1 _mgmtdb]$ ls -ltr
total 4
drwxr-x--- 16 grid oinstall 4096 Feb 16 21:59 -MGMTDB
-rw-r----- 1 grid oinstall 0 Feb 16 21:59 i_1.mif

[grid@racnode1 _mgmtdb]$ cd -MGMTDB
-bash: cd: -M: invalid option
cd: usage: cd [-L|[-P [-e]]] [dir]

[grid@racnode1 _mgmtdb]$ cd ./-MGMTDB
[grid@racnode1 trace]$ view -MGMTDB_mmon_26447.trc
VIM - Vi IMproved 7.4 (2013 Aug 10, compiled May 4 2014 20:16:04)
Unknown option argument: "-MGMTDB_mmon_26447.trc"
More info with: "vim -h"

[grid@racnode1 trace]$ view ./-MGMTDB_mmon_26447.trc

Is there any need to manually backup or tune Management Database?

As of now, there’s no such need.

How much (shared) disk space should be allocated for the Management Database?

For Oracle Cluster Registry (OCR) with external redundancy and the Grid Infrastructure Management Repository

Minimum:  At least 5.2 GB for the OCR volume that contains the Grid Infrastructure Management Repository (4.5 GB + 300 MB voting files + 400 MB OCR), plus 500 MB for each node for clusters greater than four nodes. For example, a six-node cluster allocation should be 6.2 GB.

Reference:  http://docs.oracle.com/database/121/CWLIN/storage.htm#CHDDCAHD

EMAGENT Stopped With “OutOfMemoryError”

Java paramemetes like ” -Xmx” and “-XX:MaxPermSize” might cause OEM agent up/down periodically.

SYMPTOMS

The agent fails to start, or keeps restarting  itself,  or becomes unresponsive and the following messages are seen in logs.

$AGENT_INST/bin/emctl start agent reports following error :

......
......
......
Starting agent .........................................
Consult emctl.log and emagent.nohup in: /db/oracle/product/agent12c/12.1.0.4/agent_inst/sysman/log

$AGENT_INST/sysman/log/emagent.nohup reports following error :

......
......
2016-03-06 20:03:08,691 [1:main] WARN - Missing filename for log handler 'opsscfg
Agent is going down due to an OutOfMemoryError
----- Sun Mar 6 20:03:28 2016::23451::Checking status of EMAgent : 30249 -----
----- Sun Mar 6 20:03:28 2016::23451::EMAgent exited at Sun Mar 6 20:03:28 2016 with return value 57. -----
----- Sun Mar 6 20:03:28 2016::23451::EMAgent will be restarted because of an Out of Memory Exception. 
......
......

$AGENT_INST/sysman/log/gcagent.log reports following error :

......
......
2016-03-06 20:02:52,549 [1:main] FATAL - Fatal error: Target Interaction Manager failed at Startup
java.lang.OutOfMemoryError: Java heap space
......
......
......
2016-03-06 20:02:52,552 [1:main] ERROR - agent main threw an error
java.lang.OutOfMemoryError: Java heap space

$GENT_INST/bin/emctl start agent reports following error :

Oracle Enterprise Manager Cloud Control 12c Release 4
Copyright (c) 1996, 2014 Oracle Corporation. All rights reserved.
Starting agent ............................................failed.
Fatal agent error: Target Interaction Manager failed at Startup
Fatal agent error: Target Interaction Manager failed at Startup
Fatal agent error: Target Interaction Manager failed at Startup
EMAgent is Thrashing. Exiting watchdog

$AGENT_INST/sysman/log/emagent.nohup reports following error :

......
......
2016-03-06 20:03:08,691 [1:main] WARN - Missing filename for log handler 'opsscfg
Agent is going down due to an OutOfMemoryError

$AGENT_INST/sysman/log/gcagent.log reports following error :

2016-03-06 20:02:52,549[1:3305B9] INFO - Invoking STARTUP_P1 (1) on Target Interaction Manager
2016-03-06 20:02:52,549[1:main] FATAL - Fatal error: Target Interaction Manager failed at Startup
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

OR

$AGENT_INST/sysman/log/gcagent.log reports following errors:

......
......
......
java.lang.OutOfMemoryError: PermGen space
......
......
......
java.lang.OutOfMemoryError: PermGen space
2016-03-06 20:02:52,660 [64:GC.Executor.8 (oracle_camm_manager:ADPManager_EMGC_ADPMANAGER2:Response) (oracle_camm_manager:ADPManager_EMGC_ADPMANAGER2:Response:Response)] ERROR - oracle_camm_manager:ADPManager_EMGC_ADPMANAGER2:Response:Response
java.lang.OutOfMemoryError: PermGen space

SOLUTION

1. Stop the agent first.

$ $AGENT_INST/bin/emctl stop agent

If agent does not shutdown gracefully, then kill all agent background processes by first grepping for agent perl and java processes only.

$ps -eaf | grep java | grep '/db/oracle/product/agent12c/12.1.0.4/agent_inst'

$ps -eaf | grep perl
$kill -9 <Process id>

2. Take backup and edit $AGENT_INST/sysman/config/emd.properties file.

Note : Tune -Xmx and -XX:MaxPermSize parameters , in the below example heap value is increased to 512M from 128M .

The default current value:

agentJavaDefines=-Xmx128M -XX:MaxPermSize=96M

The new value:

agentJavaDefines=-Xmx512M -XX:MaxPermSize=128M

For 13c, please refer to “EM 13c: Agent Tuning Parameters when the Agent is Monitoring Several Targets (Doc ID 2449197.1)

3. Restart agent successfully.

$ $AGENT_INST/bin/emctl start agent
Oracle Enterprise Manager Cloud Control 12c Release 4
Copyright (c) 1996, 2014 Oracle Corporation. All rights reserved.
Starting agent ............................................... started.

$AGENT_INST/bin/emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 4
Copyright (c) 1996, 2014 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version : 12.1.0.4.0
OMS Version : 12.1.0.4.0
Protocol Version : 12.1.0.1.0
Agent Home : /db/oracle/product/agent12c/12.1.0.4/agent_inst
Agent Log Directory : /db/oracle/product/agent12c/12.1.0.4/agent_inst/sysman/log
Agent Binaries : /db/oracle/product/agent12c/12.1.0.4/core/12.1.0.4.0
Agent Process ID : 6532
Parent Process ID : 6208
......
......
......
Operating System : Linux version 2.6.32-358.6.1.el6.x86_64 (amd64)
Last Reload : (none)
Last successful upload : 2016-03-06 20:19:15
Last attempted upload : 2016-03-06 20:19:15
Total Megabytes of XML files uploaded so far : 0.03
Number of XML files pending upload : 0
Size of XML files pending upload(MB) : 0
Available disk space on upload filesystem : 41.27%
Collection Status : Collections enabled
Heartbeat Status : Ok
Last attempted heartbeat to OMS : 2016-03-06 20:19:14
Last successful heartbeat to OMS : 2016-03-06 20:19:14
Next scheduled heartbeat to OMS : 2016-03-06 20:20:14

---------------------------------------------------------------
Agent is Running and Ready

ORA-19909: datafile 1 belongs to an orphan incarnation

One database might have multiple incarnations.

The following errors occurred in a standby database alert log :

MRP0: Detected orphaned datafiles!
ORA-19909: datafile 1 belongs to an orphan incarnation
ORA-01110: data file 1: '+DATA/PROD/datafile/system.238.875096219

Check standby database incarnation :

RMAN> list incarnation;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key Inc Key DB Name DB ID    STATUS Reset SCN    Reset Time
------ ------- ------ ---------- ----- ------------ ----------------
1      1       PROD   3779924890 PARENT 1           08-01-2015 10:00
2      2       PROD   3779924890 CURRENT 2620491565 01-02-2016 12:01

Reset database to old incarnation :

RMAN> reset database to incarnation 1; 

database reset to incarnation 1 


RMAN> list incarnation;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name DB ID      STATUS  Reset      SCN Reset Time
------- ------- ------- ---------- ------- ---------- ---------------
1       1       PROD    3779924890 CURRENT 1          08-01-2015 10:00
2       2       PROD    3779924890 ORPHAN  2620491565 01-02-2016 12:01

Then everything is working fine.

How to Understand and Use ESTIMATE and ESTIMATE_ONLY Parameters in DataPump

Be aware LOB segment size is not included in the estimation size of datapump.

ESTIMATE Parameter

The value for parameter ESTIMATE  is either BLOCKS (default) or STATISTICS.

BLOCKS: The estimated space used is calculated by multiplying the number of database blocks used by the target objects with the appropriate block sizes. It is  the least accurate because of:

  • The table was created with a much bigger initial extent size than was needed for the actual table data
  • Many rows have been deleted from the table, or a very small percentage of each block is used.

STATISTICS: The estimated space used is calculated by using statistics for each table. If the table has been recently analyzed, the “estimate=statistics” would be the most accurate.

ESTIMATE_ONLY Parameter

Using estimate_only parameter will not generate the dump file other than a logfile .

The value for this parameter is either Y (yes) or N (no = default).

Y: Export estimates the space that would be consumed, but quits without actually performing the export operation.
N: Export does not only estimate, it performs an actual export of data, too.

BLOB EXCLUDED FROM ESTIMATION

If a table has LOBs, ESTIMATE dump file size does NOT take LOB size into consideration.

The table segment size is 34.43GB, and the blog segment size is 38.3GB. It seems the dadapump estimation size excludes blob segment size.

SQL> select bytes/1024/1024/1024 
       from dba_segments 
       where owner='TESTUSER' 
         and segment_name='TEST' 
         and segment_type='TABLE';

BYTES/1024/1024/1024
--------------------
 34.4365234

SQL> desc TESTUSER.TEST
 Name    Null?     Type
 ------- -------  ----------
 NAME    NOT NULL VARCHAR2(30)
 CREATED NOT NULL DATE
 ......
 ......
 DOC              BLOB
 ......
 ......


SQL> select owner,table_name,column_name,segment_name 
       from dba_lobs 
      where owner='TESTUSER' and table_name='TEST';

OWNER      TABLE_NAME  COLUMN_NAME       SEGMENT_NAME 
---------- ----------  ----------------  ------------------------
TESTUSER   TEST        SESSIONITEMLONG   SYS_LOB0000062845C00010$

SQL> select owner, segment_Name,bytes/1024/1024/1024 
       from dba_segments 
      where segment_name='SYS_LOB0000062845C00010$' 
        and owner='TESTUSER';

OWNER       SEGMENT_NAME             BYTES/1024/1024/1024
----------- ------------------------ ------------------
TESTUSER    SYS_LOB0000062845C00010$  38.3


$ expdp \"/ as sysdba\" directory=DATAPUMP_DIR ESTIMATE_ONLY=y ESTIMATE=BLOCKS tables=TESTUSER.TEST;

....
....
....
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. estimated "TESTUSER"."TEST" 35.50 GB
Total estimation using BLOCKS method: 35.50 GB
Job "SYS"."SYS_EXPORT_TABLE_01" successfully completed at 14:50:11

$expdp \"/ as sysdba\" directory=DATAPUMP_DIR ESTIMATE_ONLY=y ESTIMATE=STATISTICS tables=TESTUSER.TEST;
......
......
......
Estimate in progress using STATISTICS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. estimated "TESTUSER"."TEST" 28.96 GB
Total estimation using STATISTICS method: 28.96 GB
Job "SYS"."SYS_EXPORT_TABLE_01" successfully completed at 14:51:31

From the above test, we can see LOB segment size is excluded from estimation size.