SITUATION
The same SQL query takes 2.5 times longer to run on node 1 than on node 2.
Both the Unix administrator and the DBA have checked and confirmed that the workload is the same on both nodes. All client connections to the RAC database are load balanced through a RAC service.
INVESTIGATIONS
1) Open a new session, then run the same SQL on node 1. There are no GC (global cache) events on either node 1 or node 2:
-- Check the current session SID:
SQL> SELECT SID, SERIAL# FROM V$SESSION WHERE SID = Sys_Context('USERENV', 'SID');

       SID    SERIAL#
---------- ----------
       306      24423
-- Check session stats:
SQL> SELECT a.username, c.name, b.value
       FROM v$session a, v$sesstat b, v$statname c
      WHERE a.sid = b.sid AND a.sid = 306
        AND b.statistic# = c.statistic#
      ORDER BY name;

USERNAME   NAME                                          VALUE
---------- ---------------------------------------- ----------
SYS        CPU used by this session                       2419
SYS        CPU used when call started                     2418
...
SYS        gc CPU used by this session                       0
SYS        gc blocks compressed                              0
SYS        gc blocks corrupt                                 0
SYS        gc blocks lost                                    0
SYS        gc claim blocks lost                              0
SYS        gc cr block build time                            0
SYS        gc cr block flush time                            0
SYS        gc cr block receive time                          0
SYS        gc cr block send time                             0
SYS        gc cr blocks received                             0
SYS        gc cr blocks served                               0
SYS        gc current block flush time                       0
SYS        gc current block pin time                         0
SYS        gc current block receive time                     0
SYS        gc current block send time                        0
SYS        gc current blocks received                        0
SYS        gc current blocks served                          0
SYS        gc force cr disk read                             0
SYS        gc kbytes saved                                   0
SYS        gc kbytes sent                                    0
SYS        gc local grants                                   0
SYS        gc read wait failures                             0
SYS        gc read wait time                                 0
SYS        gc read wait timeouts                             0
SYS        gc read waits                                     0
SYS        gc reader bypass grants                           0
SYS        gc remote grants                                  0
-- node 2
USERNAME   NAME                                          VALUE
---------- ---------------------------------------- ----------
SYS        CPU used by this session                       1006
SYS        CPU used when call started                     1007
...
SYS        gc CPU used by this session                       0
SYS        gc blocks compressed                              0
SYS        gc blocks corrupt                                 0
SYS        gc blocks lost                                    0
SYS        gc claim blocks lost                              0
SYS        gc cr block build time                            0
SYS        gc cr block flush time                            0
SYS        gc cr block receive time                          0
SYS        gc cr block send time                             0
SYS        gc cr blocks received                             0
SYS        gc cr blocks served                               0
SYS        gc current block flush time                       0
SYS        gc current block pin time                         0
SYS        gc current block receive time                     0
SYS        gc current block send time                        0
SYS        gc current blocks received                        0
SYS        gc current blocks served                          0
SYS        gc force cr disk read                             0
SYS        gc kbytes saved                                   0
SYS        gc kbytes sent                                    0
SYS        gc local grants                                   0
SYS        gc read wait failures                             0
SYS        gc read wait time                                 0
SYS        gc read wait timeouts                             0
SYS        gc read waits                                     0
SYS        gc reader bypass grants                           0
SYS        gc remote grants                                  0
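The same comparison can also be run from a single node with the GV$ views. A minimal sketch, assuming SID 306 on instance 1 (found above) and a hypothetical SID 415 on instance 2; substitute the real SIDs returned by the SYS_CONTEXT query on each node:

-- Minimal sketch; SID 415 on instance 2 is a hypothetical placeholder.
SELECT s.inst_id, n.name, s.value
  FROM gv$sesstat s, gv$statname n
 WHERE s.inst_id = n.inst_id
   AND s.statistic# = n.statistic#
   AND (n.name LIKE 'gc%' OR n.name LIKE 'CPU used%')
   AND ((s.inst_id = 1 AND s.sid = 306) OR (s.inst_id = 2 AND s.sid = 415))
 ORDER BY n.name, s.inst_id;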
2) Run the SQL with SET AUTOTRACE ON STATISTICS (or SET AUTOTRACE TRACEONLY STATISTICS).
"consistent gets" are reads from the buffer cache (logical reads); with zero physical reads, the query does no disk I/O and its time is spent almost entirely on CPU.
SQL> SET AUTOTRACE ON STATISTICS
SQL> @sqlscript.sql

no rows selected

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    3038815  consistent gets
          0  physical reads
          0  redo size
       5076  bytes sent via SQL*Net to client
       4264  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          6  sorts (memory)
          0  sorts (disk)
          0  rows processed
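Both nodes report roughly the same 3 million consistent gets, so the execution plans are presumably identical; this can be double-checked from GV$SQL by comparing the plan hash value and buffer gets per instance. A minimal sketch, where the SQL_ID is a hypothetical placeholder to be looked up first by SQL text:

-- Minimal sketch; 'a1b2c3d4e5f6g' is a hypothetical SQL_ID.
SELECT inst_id, sql_id, plan_hash_value, executions,
       ROUND(cpu_time / 1e6, 2)     AS cpu_sec,
       ROUND(elapsed_time / 1e6, 2) AS elapsed_sec,
       buffer_gets
  FROM gv$sql
 WHERE sql_id = 'a1b2c3d4e5f6g'
 ORDER BY inst_id;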
3) tkprof output shows the difference is in the Fetch phase:
-- node 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.10       0.10          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        1     24.29      24.55          0    3038773          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6     24.40      24.65          0    3038773          0           0
-- node 2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.04       0.04          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        1      9.86       9.91          0    3038773          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6      9.91       9.95          0    3038773          0           0
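For reference, the trace files behind these tkprof reports can be produced with DBMS_MONITOR. A minimal sketch for one node (the report file name passed to tkprof is a hypothetical example), to be repeated on each node:

-- Minimal sketch: enable extended SQL trace for the current session, run the
-- query, then locate the trace file to feed to tkprof.
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(waits => TRUE, binds => FALSE);
@sqlscript.sql
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE;

SELECT value FROM v$diag_info WHERE name = 'Default Trace File';

-- Then at the OS prompt:
--   tkprof <trace_file> node1_report.txt sys=no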
4) Run a simple shell loop to confirm that node 1's CPU is not performing as well as node 2's:
-- node 1 ( 136 seconds )
$ date; i=0; while (( i < 10000000 )); do (( i++ )); done; date
Thu Sep 27 14:51:58 AEST 2018
Thu Sep 27 14:54:14 AEST 2018
-- node 2 ( 52 seconds )
$ date; i=0; while (( i < 10000000 )); do (( i++ )); done; date
Thu Sep 27 14:52:10 AEST 2018
Thu Sep 27 14:53:02 AEST 2018
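The 136-second vs 52-second result (about 2.6x) lines up with the roughly 2.5x difference in the SQL elapsed times, pointing at raw CPU speed rather than anything database-specific. As a cross-check, a similar loop can be timed inside the database; a rough sketch using DBMS_UTILITY.GET_CPU_TIME:

-- Rough sketch of an in-database CPU loop, comparable to the shell test above.
SET SERVEROUTPUT ON
DECLARE
  t0 NUMBER      := DBMS_UTILITY.GET_CPU_TIME;  -- CPU time in hundredths of a second
  i  PLS_INTEGER := 0;
BEGIN
  WHILE i < 10000000 LOOP
    i := i + 1;
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('CPU centiseconds: ' || (DBMS_UTILITY.GET_CPU_TIME - t0));
END;
/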
5) Shut down both GI and the RAC database; the issue still exists, so the database stack and clusterware can be ruled out.
6) After the server is rebooted, everything goes back to normal and node 1 and node 2 perform the same.
CAUSES
So it should be an OS kernel issue on node 1: the slowdown shows up in a plain shell loop, persists with GI and the database shut down, and disappears after a reboot.