Diagnosing "IPC Send timeout" problems on Oracle RAC databases (repost)

Posted by: 肥兔 | Posted at: 2014-1-21 22:12:58
This post was last edited by 肥兔 on 2014-1-21 22:16

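Symptoms of the IPC Send timeout failure
"IPC Send timeout" is one of the more common problems on Oracle RAC databases. Once "IPC Send timeout" appears in the database alert log, it is often accompanied by ORA-29740 or "Waiting for clusterware split-brain resolution", and the database instance terminates abnormally or is evicted from the cluster as a result.
For example:
Alert log of instance 1: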
Thu Jul 02 05:24:50 2012
IPC Send timeout detected.Sender: ospid 6143755      <== sender
Receiver: inst 2 binc 1323620776 ospid 49715160        <== receiver
Thu Jul 02 05:24:51 2012
IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13
Thu Jul 02 05:24:51 2012
Communications reconfiguration: instance_number 2
Waiting for clusterware split-brain resolution       <== split brain occurs
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:34:51 2012
Evicting instance 2 from cluster   <== 10 minutes later, instance 2 is evicted from the cluster
Alert log of instance 2:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected. Receiver ospid 49715160       <== receiver
Thu Jul 02 05:24:50 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms6_49715160.trc:
Thu Jul 02 05:24:51 2012
Waiting for clusterware split-brain resolution
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lmon_6257780.trc:
ORA-29740: evicted by member 0, group incarnation 122  <== instance 2 hits ORA-29740 and is evicted from the cluster
Thu Jul 02 05:35:02 2012
LMON: terminating instance due to error 29740
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms7_49453031.trc:
ORA-29740: evicted by member , group incarnation
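In RAC, the main processes that communicate between instances are LMON, LMD, and LMS. Normally, after a message is sent to another instance, the sender expects the receiver to reply with an acknowledgement. If that acknowledgement is not received within the allowed time (300 seconds by default), the sender concludes that the message did not reach the receiver and reports "IPC Send timeout".
This problem usually has one of the following causes:
1. A network problem causing packet loss or communication failures.
2. Host resource (CPU or memory) problems that keep these processes from being scheduled or leave them unresponsive.
3. An Oracle bug.
4. On AIX, the fix for APAR IZ97457 has not been applied.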
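Whichever cause applies, diagnosis starts by identifying the sender and receiver processes from the alert log. The following is a minimal Python sketch that pulls those pairs out of alert log text. The function name find_ipc_send_timeouts and the regular expressions are illustrative assumptions based only on the log lines quoted above. Other Oracle versions may format these lines differently.

```python
import re
from typing import Dict, List

# Patterns modelled on the alert-log lines quoted in this post (an assumption:
# other releases may word these lines differently).
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (-?\d+) ospid (\d+)")


def find_ipc_send_timeouts(alert_log_text: str) -> List[Dict[str, int]]:
    """Return one record per sender-side 'IPC Send timeout detected' entry."""
    events = []
    pending_sender = None
    for line in alert_log_text.splitlines():
        sender = SENDER_RE.search(line)
        if sender:
            pending_sender = int(sender.group(1))
            continue
        receiver = RECEIVER_RE.search(line)
        if receiver and pending_sender is not None:
            events.append({
                "sender_ospid": pending_sender,
                "receiver_inst": int(receiver.group(1)),
                "receiver_ospid": int(receiver.group(3)),
            })
            pending_sender = None
    return events


SAMPLE = """\
Thu Jul 02 05:24:50 2012
IPC Send timeout detected.Sender: ospid 6143755
Receiver: inst 2 binc 1323620776 ospid 49715160
Thu Jul 02 05:24:51 2012
IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13
"""

for event in find_ipc_send_timeouts(SAMPLE):
    print(event)
```

The receiver instance and ospid tell you which node's OSWatcher data and which trace file to look at next.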

Example of "IPC Send timeout" caused by a network problem
The alert log of instance 1 shows that the receiver is process 49715160 on node 2:
Thu Jul 02 05:24:50 2012
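IPC Send timeout detected.Sender: ospid 6143755       <== sender
Receiver: inst 2 binc 1323620776 ospid 49715160       <== receiver
The OSWatcher vmstat output for node 2 at that time shows no CPU or memory pressure. The OSWatcher netstat output shows that, a few minutes before the problem occurred, a large number of packets were being transferred on the private interconnect interface.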
Node2:
zzz Thu Jul 02 05:12:38 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.2       4073847798     0 512851119     0     0 <== 4073847798 - 4073692530 = 155268 packets per 30 seconds
zzz Thu Jul 02 05:13:08 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.2       4074082951     0 513107924     0     0 <== 4074082951 - 4073847798 = 235153 packets per 30 seconds
Node1:
zzz Thu Jul 02 05:12:54 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.1       502159550     0 4079190700     0     0 <== 502159550 - 501938658 = 220892 packets per 30 seconds
zzz Thu Jul 02 05:13:25 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.1       502321317     0 4079342048     0     0 <== 502321317 - 502159550 = 161767 packets per 30 seconds
When this system is healthy, it transfers roughly a few thousand packets every 30 seconds:
zzz Thu Jul 02 04:14:09 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
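en1   1500  10.182.3    10.182.3.2       4074126796     0 513149195     0     0 <== 4074126796 - 4074122374 = 4422 packets per 30 seconds
A sudden burst of network traffic like this can cause network transmission errors, and lost UDP or IP packets on the network can cause the same error. In this situation, ask the network administrator to check the network. In some cases the problem stopped after the private-network switch was restarted or replaced. (Note that normal traffic volume varies with the hardware and the workload.)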
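The per-interval packet counts annotated above are differences between consecutive Ipkts values. The following is a minimal Python sketch of that arithmetic. The function name ipkts_deltas is hypothetical, and the parsing assumes the "zzz" timestamp lines and netstat column layout shown in the excerpts, with one row per interface in each snapshot. It does not handle counter wraparound.

```python
from datetime import datetime
from typing import List, Tuple


def ipkts_deltas(oswatcher_text: str, interface: str) -> List[Tuple[datetime, int, float]]:
    """Return (snapshot time, packets received since previous snapshot, packets/second)."""
    samples = []
    snapshot_time = None
    for line in oswatcher_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "zzz" and len(fields) >= 7:
            # e.g. "zzz Thu Jul 02 05:12:38 CDT 2012": skip the weekday and time zone.
            stamp = " ".join([fields[2], fields[3], fields[4], fields[6]])
            snapshot_time = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
        elif fields[0] == interface and snapshot_time is not None:
            samples.append((snapshot_time, int(fields[4])))  # 5th column is Ipkts
            snapshot_time = None  # keep only the first matching row per snapshot
    deltas = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds > 0:
            deltas.append((t1, p1 - p0, (p1 - p0) / seconds))
    return deltas


SAMPLE = """\
zzz Thu Jul 02 05:12:38 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.2       4073847798     0 512851119     0     0
zzz Thu Jul 02 05:13:08 CDT 2012
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  10.182.3    10.182.3.2       4074082951     0 513107924     0     0
"""

for when, packets, rate in ipkts_deltas(SAMPLE, "en1"):
    print(when, packets, round(rate, 1))
```

Dividing by the actual elapsed seconds rather than a fixed 30 keeps the rates comparable when snapshots are not exactly 30 seconds apart, as in the Node1 excerpt.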

Example of "IPC Send timeout" caused by high CPU load
The alert log of instance 1 shows that the receiver is process 1596935 on node 2:
Fri Aug 01 02:04:29 2008
IPC Send timeout detected.Sender: ospid 1506825 <== sender
Receiver: inst 2 binc -298848812 ospid 1596935  <== receiver
The OSWatcher vmstat output for node 2 at that time:
zzz ***Fri Aug 01 02:01:51 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr     memory             page              faults        cpu     
----- ----------- ------------------------ ------------ -----------
  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
25  1 7532667 19073986   0   0   0   0    5   0 9328 88121 20430 32 10 47 11
58  0 7541201 19065392   0   0   0   0    0   0 11307 177425 10440 87 13  0  0 <== idle CPU is 0, so the CPU is 100% busy
61  1 7552592 19053910   0   0   0   0    0   0 11122 206738 10970 85 15  0  0
zzz ***Fri Aug 01 02:03:52 CST 2008
   System Configuration: lcpu=32 mem=128000MB
   kthr     memory             page              faults        cpu     
----- ----------- ------------------------ ------------ -----------
  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
25  1 7733673 18878037   0   0   0   0    5   0 9328 88123 20429 32 10 47 11
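81  0 7737034 18874601   0   0   0   0    0   0 9081 209529 14509 87 13  0  0 <== the CPU run queue is very high
80  0 7736142 18875418   0   0   0   0    0   0 9765 156708 14997 91  9  0  0 <== idle CPU is 0, so the CPU is 100% busy
This example shows that when host CPU load is very high, the receiving process cannot respond to the sender, which triggers "IPC Send timeout".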
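The same check can be run over a longer OSWatcher vmstat archive. The following is a minimal Python sketch that flags samples where idle CPU is 0 and the run queue exceeds the number of logical CPUs. The function name saturated_samples is hypothetical, the "run queue greater than lcpu" threshold is a rule of thumb assumed here rather than something stated in the source, and the parsing assumes the 17-column AIX vmstat layout shown above.

```python
import re
from typing import List, Tuple


def saturated_samples(vmstat_text: str) -> List[Tuple[int, int, int]]:
    """Return (run queue, user %, system %) for samples with idle == 0 and r > lcpu."""
    lcpu = None
    flagged = []
    for line in vmstat_text.splitlines():
        config = re.search(r"lcpu=(\d+)", line)
        if config:
            lcpu = int(config.group(1))
            continue
        # Data rows in the excerpt have 17 integer columns:
        # r b avm fre re pi po fr sr cy in sy cs us sy id wa
        # Anything after column 17 (such as a "<==" annotation) is ignored.
        fields = line.split()[:17]
        if len(fields) != 17 or not all(f.isdigit() for f in fields):
            continue
        run_queue = int(fields[0])
        user_pct, sys_pct, idle_pct = int(fields[13]), int(fields[14]), int(fields[15])
        if lcpu is not None and idle_pct == 0 and run_queue > lcpu:
            flagged.append((run_queue, user_pct, sys_pct))
    return flagged


SAMPLE = """\
zzz ***Fri Aug 01 02:01:51 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
25  1 7532667 19073986   0   0   0   0    5   0 9328 88121 20430 32 10 47 11
58  0 7541201 19065392   0   0   0   0    0   0 11307 177425 10440 87 13  0  0
61  1 7552592 19053910   0   0   0   0    0   0 11122 206738 10970 85 15  0  0
"""

for sample in saturated_samples(SAMPLE):
    print(sample)
```

On the excerpt above this flags the rows with run queues of 58 and 61 and skips the first row, which still shows 47% idle.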

Common bugs that cause IPC Send timeout
On 10g, the common bugs behind this problem are Bug 5190596 and Bug 6200820. Both mostly appear on 10.2.0.3 and 10.2.0.4 and are fixed in 10.2.0.5. For details, see these MOS articles:
LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]
'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]
On 11g, the common bugs are Bug 6200820 and Bug 7653579. For details, see these MOS articles:
Bug 6200820  AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)
Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]

IPC Send timeout caused by a missing IZ97457 fix on AIX
On this point, the MOS article
AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
says the following:
Applies to:
Oracle Server - Enterprise Edition - Version 9.2.0.2 and later
IBM AIX on POWER Systems (64-bit)
Symptoms
Environment with IBM AIX VIO experiences one or some or all of the following symptoms:
Packet Loss
Cache Fusion "block lost"
IPC Send timeout
Instance Eviction
SKGXPSEGRCV: MESSAGE TRUNCATED user data nnnn bytes payload nnnn bytes
Cause
AIX issue APAR IZ97457 - A VIOS Server will not forward traffic from its VIO Clients to the external network
Solution
Please engage your OS vendor for fix.
Oracle's recommendation is to apply the fix. The description of APAR IZ97457 reads:
Error description
A VIOS Server will not forward traffic from its VIO Clients to the external network.
Packets from the VIO Client travel to the hypervisor(phype) but the packets are dropped by the hypervisor as it attempts to deliver the packet to the VIO Server's trunk adapter.
The hypervisor will have dropped the packets because there are no buffers to place the data in. On the VIOServer,interrupts are not activating the trunk adapter to read and remove data from its buffers. This results in having full buffers at the trunk adapter.
Since the trunk adapter's buffers are full, phype cannot deliver the data and so VIO Clients cannot get packets through the SEA adapter and out to the network.
The problem was discovered on P7 systems where Vlans on the SEA are used.
"Hypervisor Receive" errors on the trunk adapter will increase as this problem occurs and the VIO Clients are not able to reach the outside network.
Problem summary
Unresponsive VIO Clients with traffic not forwarded to external network.
Problem conclusion
Ensure proper locking around receive scheduling operations.
As shown above, the IZ97457 fix addresses the case where the network buffers fill up. AIX users are advised to check whether this fix has been applied.




Posted by: 肥兔 | Posted at: 2014-1-21 22:18:59
This post was last edited by 肥兔 on 2014-1-21 22:20

'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration

Applies to:  
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 10.2.0.4 [Release 10.2]
Information in this document applies to any platform.
Oracle Server Enterprise Edition - Version: 10.2.0.1 to 10.2.0.4


Symptoms
After RAC reconfiguration, intermittently the QMNC processes on different nodes cannot communicate with each other in order to allow queue table repartitioning to occur.  This results in node affinity not changing and propagation schedules remaining on a non-primary node and messages not being propagated.

You may see the qmnc process on node 3 send a message to the qmnc process on node 1, presumably because node 1 owns schedules which should be owned by node 3, so we see

alert_<dbname>.log

IPC Send timeout detected.Sender: ospid 22795
Receiver: inst 1 binc 4 ospid 17307

In node 1 alert_<dbname>.log we see messages like

IPC Send timeout detected.Sender: ospid 17307
Receiver: inst 2 binc 2 ospid 13661

and also
IPC Send timeout detected. Receiver ospid 17307
Wed Jul 11 07:42:01 2007
Errors in file /haclu/64bit/app/oracle/admin/H102/bdump/h1021_qmnc_17307.trc:
IPC Send timeout detected. Receiver ospid 17307
Wed Jul 11 07:42:04 2007
Errors in file /haclu/64bit/app/oracle/admin/H102/bdump/h1021_qmnc_17307.trc:
Wed Jul 11 07:42:04 2007

and

ORA-4021: timeout occurred while waiting to lock object D1.D115Q_EXAMPLE

In node 2 alert_<dbname>.log we see messages of the form

Wed Jul 11 07:41:42 2007
IPC Send timeout detected. Receiver ospid 13661
Wed Jul 11 07:41:42 2007
Errors in file /haclu/64bit/app/oracle/admin/H102/bdump/h1022_qmnc_13661.trc:

and in /haclu/64bit/app/oracle/admin/H102/bdump/h1022_qmnc_13661.trc:

kwqmnstslv: less than 10s since last failed ksvcreate.
Couldn't start a new slave


Changes

Cause
The cause of this issue has been identified in Bug 6200820. It is caused by a problem with the IPC messaging used by the QMON processes when communicating between instances in a RAC cluster causing them to wait on a response which never comes and then timeout after 5 minutes.

Solution
For 10.2.0.3 apply Patch 6326889.
For 10.2.0.4 onwards only the fix for Bug 6200820 is required.

The fix for Bug 6200820 will be included in 10.2.0.5 and the base release of 11.1 onwards. If the patch does not exist for your port, contact Global Customer Services.

Note that not all IPC send timeout messages are related to this bug; this is a specific bug involving QMNC Coordinator processes and IPC Send Timeout. If you are running RAC you may see IPC Send Timeout messages for other processes, which are not related to this issue.
