Oracle案例:SMON回滚异常导致实例Crash
ccwgpt 2024-12-04 11:59 58 浏览 0 评论
某省电力系统的一个4节点rac,2节点在早上的时候crash。
Wed Nov 03 07:59:05 2021
SMON: Restarting fast_start parallel rollback
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Wed Nov 03 07:59:07 2021
PMON (ospid: 100111): terminating the instance due to error 474
Wed Nov 03 07:59:07 2021
System state dump requested by (instance=2, osid=100111 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_diag_100469_20211103075907.trc
Instance terminated by PMON, pid = 100111
当SMON重新发起并行回滚时,实例被PMON终止,这里有一个隐藏错误,常常被忽视,PMON (ospid: 100111): terminating the instance due to error 474
[oracle@dbtest ~]$ oerr ora 474
00474, 00000, "SMON process terminated with error"
// *Cause: The system cleanup process died
// *Action: Warm start instance
说明是PMON发现SMON异常而终止的实例。那么看看SMON trace文件:
*** 2021-11-03 07:59:05.420
Parallel Transaction recovery caught error 30317
SMON: Restarting fast_start parallel rollback
*** 2021-11-03 07:59:06.283
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1]
DDE previous invocation failed before phase II
DDE was called in a 'No Invocation Mode'
----- Start Diag Diagnostic Dump -----
Diag diagnostic dump is performed due to an error in the diagfw code during error handling.
DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code.
Dump error and call stack for the diagnostic dump:
*** 2021-11-03 07:59:06.306
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
*** 2021-11-03 07:59:06.306
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
Warning! DDE is invoked in protected mode. DDE call is aborted.
Warning! DDE is invoked in protected mode. DDE call is aborted.
*** 2021-11-03 07:59:06.308
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/cdump'
SMON在重新发起回滚时异常之后进程被crash了,最终PMON终止了实例。继续往前查看alert日志发现从6点52分46秒开始报一些异常之后,就开始了blockrecover并伴随一些ora-00600[4193]错误。
Wed Nov 03 06:51:58 2021
Archived Log entry 220723 added for thread 2 sequence 33832 ID 0xbc36136c dest 1:
Wed Nov 03 06:52:46 2021
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Wed Nov 03 06:52:48 2021
Block recovery from logseq 33833, block 2030926 to scn 16125071688907
Recovery of Online Redo Log: Thread 2 Group 8 Seq 33833 Reading mem 0
Mem# 0: +FDATADG/newepmfdp/onlinelog/group_8.278.1083929519
Block recovery completed at rba 33833.2031098.16, scn 3754.1764459724
Block recovery from logseq 33832, block 4457757 to scn 16125071688920
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515
Block recovery interrupted at rba 33832.4613427.16
Block recovery from logseq 33832, block 4613427 to scn 16125071688920
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515
Block recovery interrupted at rba 33832.4769190.16
Block recovery from logseq 33832, block 4769190 to scn 16125071688920
...
Wed Nov 03 07:00:04 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc (incident=320530):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Nov 03 07:06:09 2021
Sweep [inc][320530]: completed
Wed Nov 03 07:06:16 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc:
ORA-12751: cpu time or run time policy violation
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc:
ORA-12751: cpu time or run time policy violation
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Wed Nov 03 07:06:22 2021
Dumping diagnostic data in directory=[cdmp_20211103070622], requested by (instance=2, osid=84504 (M000)), summary=[incident=320530].
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc:
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Wed Nov 03 07:06:28 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc:
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Wed Nov 03 07:11:57 2021
Sweep [inc2][320530]: completed
Wed Nov 03 07:31:57 2021
Suspending MMON action 'undo usage' for 82800 seconds
对于ora-00600[4193],经常搞恢复的太熟悉不过了。不过一般在异常断电强制open数据库的时候遇到,正常运行的环境很少见,其含义如下:
Meaning: seq# mismatch while adding an undo record to an undo block. This
is done by the application of redo.
---------------------------------------------------------------------------
Argument Description:
a. (ktubhseq): undo record seq# - this is the seq# of the block that
this undo record WILL BE APPLIED TO.
This is from the Undo Block. It is
NOT the seq# of the undo block itself.
b. (ktudbseq): redo RECORD seq# - this is the seq# number in the block
that this redo WILL BE APPLIED TO.
This is from the Redo Record.
When we try to apply redo to an undo block (forward changes are made by
the application of redo to a block) we check that the seq# in the undo
record matches the seq# in the redo record. These seq# should be the
same because when we apply a redo record we must apply it to the
correct version of the block. We can only apply a redo record to a
block that contains the same seq# as in the redo record.
If the seq# do not match then this error is raised. This implies some
kind of block corruption in either the redo or the undo block.
ksesic2(OERI(4193), ksenrg(ubh->ktubhseq), ksenrg(db->ktudbseq));
struct ktubh
{
kxid ktubhxid; /* txid of tx currently using or last used this block */
ub2 ktubhseq; /* undo block sequence number */
ub1 ktubhcnt; /* high water mark record index, number of undo entries */
ub1 ktubhirb; /* rollback record index, rec index to start the rollback */
ub1 ktubhicl; /* collecting record index, rec index to start retrieving col info */
ub1 ktubhflg; /* dummy */
ub2 ktubhidx[1]; /* byte offset of record in block, grows at runtime */
};
struct ktudb Kernel Transaction Undo Data operation Block (redo)
{
ub2 ktudbsiz; /* size of entry */
ub2 ktudbspc; /* verification: space left in undo block */
ub2 ktudbflg; /* flag to indicate the kind of redo operation */
kxid ktudbxid; /* current tx id */
ub2 ktudbseq; /* block sequence number */
ub1 ktudbrec; /* new record index for this change */
};
其实质就是ktubhseq和ktudbseq不一致,如果是ktudbrec和ktubhcnt不一致则为ORA-00600[4194],ktubh是undo block中的结构,ktudb是redo记录undo块操作时会记录的信息。
Undo Data
struct kcbh, 20 bytes @0
ub1 type_kcbh @0
ub1 frmt_kcbh @1
ub1 spare1_kcbh @2
ub1 spare2_kcbh @3
ub4 rdba_kcbh @4
ub4 bas_kcbh @8
ub2 wrp_kcbh @12
ub1 seq_kcbh @14
ub1 flg_kcbh @15
ub2 chkval_kcbh @16
ub2 spare3_kcbh @18
struct ktubh, 142 bytes @20
struct ktubhxid, 8 bytes @20
ub2 ktubhseq @28
ub1 ktubhcnt @30
ub1 ktubhirb @31
ub1 ktubhicl @32
ub1 ktubhflg @33
ub2 ktubhidx[64] @34
ub1 freespace[38] @162
ub1 undodata[7988] @200
ub4 tailchk @8188
回到案例,从trace中可以看到ktubh和ktudb的信息。
ktudb记录如下:
- xid: 0x750f.003.0093843a
- uba: 0x1cd49528.ddee.01
- seq: 0xddee
- rec: 0x02
*** ktuc_diag_dmp: dump of current change vector
ktudb redo: siz: 232 spc: 8068 flg: 0x0012 seq: 0xddee rec: 0x02
xid: 0x750f.003.0093843a
ktubl redo: slt: 3 rci: 0 opc: 11.1 [objn: 6729 objd: 6729 tsn: 1]
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
0x00000000 prev ctl uba: 0x1cd49528.ddee.01
prev ctl max cmt scn: 0x0eaa.692b5862 prev tx cmt scn: 0x0eaa.692b58c4
txn start scn: 0xffff.ffffffff logon user: 0 prev brb: 0 prev bcl: 0 BuExt idx: 0 flg2: 0
KDO undo record:
KTB Redo
op: 0x04 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: L itl: xid: 0x7607.013.00713adb uba: 0x1d001423.538a.15
flg: C--- lkc: 0 scn: 0x0eaa.68c33db2
KDO Op code: URP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x008011c0 hdba: 0x00801a02
itli: 1 ispac: 0 maxfr: 4858
tabn: 0 slot: 78(0x4e) flag: 0x2c lock: 0 ckix: 23
ncol: 10 nnew: 1 size: 0
col 8: [ 6] c5 02 0b 2a 4d 5c
uba指向的undo块ktudb信息如下:
- xid: 0x750d.009.00abc7f4
- seq: 0xd24
- cnt: 0x4d
********************************************************************************
UNDO BLK:
xid: 0x750d.009.00abc7f4 seq: 0xd24 cnt: 0x4d irb: 0x1 icl: 0x0 flg: 0x0000
Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x1f8c 0x02 0x1f18 0x03 0x1ebc 0x04 0x1e48 0x05 0x1dec
0x06 0x1d78 0x07 0x1d1c 0x08 0x1cc0 0x09 0x1c64 0x0a 0x1bf0
0x0b 0x1b94 0x0c 0x1b20 0x0d 0x1ac4 0x0e 0x1a50 0x0f 0x19f4
0x10 0x1980 0x11 0x1924 0x12 0x18b0 0x13 0x1854 0x14 0x17e0
0x15 0x1784 0x16 0x1710 0x17 0x16b4 0x18 0x1640 0x19 0x15e4
0x1a 0x1570 0x1b 0x1514 0x1c 0x14a0 0x1d 0x144c 0x1e 0x13d8
0x1f 0x137c 0x20 0x1320 0x21 0x12c4 0x22 0x1250 0x23 0x11f4
0x24 0x1180 0x25 0x1124 0x26 0x10b0 0x27 0x1054 0x28 0x0fe0
0x29 0x0f84 0x2a 0x0f10 0x2b 0x0eb4 0x2c 0x0e40 0x2d 0x0de4
0x2e 0x0d70 0x2f 0x0d14 0x30 0x0ca0 0x31 0x0c44 0x32 0x0be0
0x33 0x0b84 0x34 0x0b10 0x35 0x0ab4 0x36 0x0a40 0x37 0x09e4
0x38 0x0970 0x39 0x0914 0x3a 0x08a0 0x3b 0x0844 0x3c 0x07d0
0x3d 0x0774 0x3e 0x0700 0x3f 0x06a4 0x40 0x0630 0x41 0x05d4
0x42 0x0560 0x43 0x0504 0x44 0x0490 0x45 0x0434 0x46 0x03c0
0x47 0x0364 0x48 0x02f0 0x49 0x0294 0x4a 0x0220 0x4b 0x01c4
0x4c 0x0150 0x4d 0x00f4
这里看到undo块中已经有一个活动事务了,这就不仅仅是seq不一致的问题了,更严重的是一个新事务居然去找了一个存在活动事务的undo块来使用。我们知道undo段头会预分配最多5个block到free block pool里作为下一个事务准备,当需要分配undo块时,会去寻找free block pool里面的块。
看看报错事务指向的undo段头块(段头号为0x750f):
TRN CTL:: seq: 0xddee chd: 0x0003 ctl: 0x001d inc: 0x00000000 nfb: 0x0001
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x1cd49528.ddee.01 scn: 0x0eaa.692b5862
Version: 0x01
FREE BLOCK POOL::
uba: 0x1cd49528.ddee.01 ext: 0x16a6 spc: 0x1f84
uba: 0x00000000.ddc3.3e ext: 0x167b spc: 0x4e2
uba: 0x00000000.dc0b.4b ext: 0x47ab spc: 0x320
uba: 0x00000000.7727.48 ext: 0xd13 spc: 0x326
uba: 0x00000000.0000.00 ext: 0x0 spc: 0x0
TRN TBL::
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
------------------------------------------------------------------------------------------------
0x00 9 0x00 0x9416bd 0x0014 0x0eaa.692b7f1d 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564
0x01 9 0x00 0x93546c 0x000f 0x0eaa.692b5bde 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x02 9 0x00 0x9454fb 0x000b 0x0eaa.692b622a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x03 9 0x00 0x93842a 0x0019 0x0eaa.692b58c4 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x04 9 0x00 0x932b09 0x0007 0x0eaa.692b7254 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564
0x05 9 0x00 0x931008 0x0010 0x0eaa.692b5d5f 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x06 9 0x00 0x948b77 0x0021 0x0eaa.692b59f4 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x07 9 0x00 0x93c1e6 0x0009 0x0eaa.692b7540 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564
0x08 9 0x00 0x93c4b5 0x000d 0x0eaa.692b5d8a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x09 9 0x00 0x94ac14 0x0000 0x0eaa.692b79f1 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564
0x0a 9 0x00 0x944b33 0x0012 0x0eaa.692b704e 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564
0x0b 9 0x00 0x9453f2 0x000c 0x0eaa.692b667a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x0c 9 0x00 0x9468e1 0x001e 0x0eaa.692b6a9d 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
0x0d 9 0x00 0x94d5e0 0x0020 0x0eaa.692b5f85 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
可以看到free block pool里有一个块正好是0x1cd49528。综上所述,undo块0x1cd49528实际已经分配给undo段头0x750d,并且已经有事务xid: 0x750d.009.00abc7f4正在使用该undo块,但是undo段头0x750f里面的FREE BLOCK POOL并没有把这个块清理掉,导致了ORA-00600[4193]产生。
一般对于事务异常的问题会触发blockrecover操作,那6点52分46秒就出现了blockrecover,是否问题就是从这里开始的呢?PMON trace显示blockrecover是从6点52分49秒开始,file 2608 block 144即为undo段头0x750f的块地址:
*** 2021-11-03 06:52:49.027
Block recovery interrupted at rba 33832.4613427.16
Resuming block recovery (PMON) for file 2608 block 144
Block recovery from logseq 33832, block 4613427 to scn 16125071688920
*** 2021-11-03 06:52:49.031
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
Block recovery interrupted at rba 33832.4769190.16
Resuming block recovery (PMON) for file 2608 block 144
Block recovery from logseq 33832, block 4769190 to scn 16125071688920
...
从一些前台进程trace里可以看到,6点52分51秒的事务TX-750F0003-0093843A就已经无法进行。进程被crash。
*** 2021-11-03 06:52:51.294
*** SESSION ID:(2049.549) 2021-11-03 06:52:51.294
*** CLIENT ID:() 2021-11-03 06:52:51.294
*** SERVICE NAME:(SYS$USERS) 2021-11-03 06:52:51.294
*** MODULE NAME:(OGG-REPYCJ-OPEN_DATA_SOURCE) 2021-11-03 06:52:51.294
*** ACTION NAME:() 2021-11-03 06:52:51.294
ksqgtl:newtx: res: 321161d7b0 TX-750F0003-0093843A flag: 0x47
*** 2021-11-03 06:52:52.294
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1]
DDE previous invocation failed before phase II
DDE was called in a 'No Invocation Mode'
----- Start Diag Diagnostic Dump -----
Diag diagnostic dump is performed due to an error in the diagfw code during error handling.
DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code.
Dump error and call stack for the diagnostic dump:
*** 2021-11-03 06:52:52.319
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
*** 2021-11-03 06:52:52.319
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
Warning! DDE is invoked in protected mode. DDE call is aborted.
Warning! DDE is invoked in protected mode. DDE call is aborted.
*** 2021-11-03 06:52:52.321
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
ssexhd: crashing the process...
Shadow_Core_Dump = PARTIAL
该事务id和之前分析报错的ORA-00600[4193]是一样的。说明从6点52分46秒开始,undo段头0x750F就无法绑定事务了,原因就是free block pool里的undo块已经不属于该undo段了,被事务0x750d.009.00abc7f4使用了。通过redodump可以验证这一点。
REDO RECORD - Thread:2 RBA: 0x008429.001ef4a5.0184 LEN: 0x00bc VLD: 0x01
SCN: 0x0eaa.692b839b SUBSCN: 6 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59949 AFN:2608 DBA:0x8c800090 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
SETSTAT: exts:24996 blks:219215 lastmap:0x8c91f248 mapcnt:56
UPDXNT: extent:5798 add:TRUE
CHANGE #2 TYP:0 CLS:7 AFN:2608 DBA:0x8c859230 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
ADD: dba:0x1cd49528 len:8 at offset:1
ADDRET: offset:1 ctime:0
CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b839b SEQ:1 OP:22.2 ENC:0 RBL:0
ktfbhredo - File Space Header Redo:
Header Opcode:
Save: No Pending Op
...
REDO RECORD - Thread:2 RBA: 0x008429.001efbb0.0050 LEN: 0x00bc VLD: 0x01
SCN: 0x0eaa.692b8484 SUBSCN: 4 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59945 AFN:2608 DBA:0x8c800080 OBJ:4294967295 SCN:0x0eaa.692b8427 SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
SETSTAT: exts:25381 blks:221582 lastmap:0x8c8d46a8 mapcnt:57
UPDXNT: extent:13691 add:TRUE
CHANGE #2 TYP:0 CLS:7 AFN:115 DBA:0x1cd3aa10 OBJ:4294967295 SCN:0x0eaa.692b8425 SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
ADD: dba:0x1cd49528 len:8 at offset:189
ADDRET: offset:189 ctime:0
CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b8484 SEQ:1 OP:22.2 ENC:0 RBL:0
ktfbhredo - File Space Header Redo:
Header Opcode:
Save: No Pending Op
...
REDO RECORD - Thread:2 RBA: 0x008429.001efd6f.0010 LEN: 0x02a4 VLD: 0x01
SCN: 0x0eaa.692b84b8 SUBSCN: 6 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59946 AFN:115 DBA:0x1cd49528 OBJ:4294967295 SCN:0x0eaa.692b84b8 SEQ:3 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 92 spc: 6690 flg: 0x0022 seq: 0x0d24 rec: 0x0f
xid: 0x750d.009.00abc7f4
ktubu redo: slt: 9 rci: 14 opc: 11.1 objn: 6049924 objd: 6049924 tsn: 9
Undo type: Regular undo Undo type: Last buffer split: No
Tablespace Undo: No
0x00000000
KDO undo record:
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x1cd49528.0d24.07
KDO Op code: DRP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x45707392 hdba: 0x09ab6222
itli: 2 ispac: 0 maxfr: 4858
tabn: 0 slot: 20(0x14)
该undo块一系列操作都是在11/03/2021 06:52:46完成,这明显不符合undo块分配原则,怀疑是bug,但是mos没有找到类似文档:
- SCN: 0x0eaa.692b839b,undo块0x1cd49528分配给到undo段头地址为0x8c800090的undo段(段头号为0x750f)
- SCN: 0x0eaa.692b8484,undo块0x1cd49528分配到undo段头地址为0x8c800080的undo段(段头号为0x750d)
- SCN: 0x0eaa.692b84b8,事务0x750d.009.00abc7f4使用了该undo块
undo块分配原则:
1.If the current extent has more free blocks, return the next free block.
2.Otherwise if the next extent has expired, wrap into the next extent and return the first block in that extent.
3.Otherwise get space from the Undo tablespace. If a free extent is available, allocate the extent to the transaction table and return the first block in the new extent.
4.Steal from an offlined transaction table. If a free extent is available, deallocate it from the offlined transaction table and add it to the current transaction table. Return the first free block from the new extent.
5.Steal from an onlined transaction table. The procedure is similar to (4).
6.Extend the file in the undo tablespace.If the files can be extended, add an extent to the current transaction table and return a block to the requestor
7.Reuse unexpired extents from own transaction table. If all extents are currently busy (i.e.contain uncommitted undo), go to the next step. Otherwise wrap into the next extent
8.Steal unexpired extents from offlined transaction tables. If fail try the onlined transaction tables.
9.If all else fails, report the ORA-1562: “Failed to extend rollback segment number %s” error.
Note:Steps 7 and 8 are skipped if the undo retention guarantee (10g and higher) is specified.
综上分析我们得到的解决办法有如下几种:
- 重建该实例的undo表空间
- 删除异常的undo段(段头号为0x750f)
- bbed清空undo段(段头号为0x750f)的free block pool
那么故障的原因和解决方案分析完了,还有一个很异常的问题,为什么几乎同一时间undo块连续分配给两个undo段,当undo block分配给其他undo段使用的时候,以前的undo段的free block pool里没有清理掉呢?结合当时的awr来看有比较严重的undo争用,undo extent steal也很频繁,再往前翻看alert发现频繁的ORA-01628:
undo global data ktusm_stealext_2 0 1,029,644 827,502
undo global data ktusmupst: KSLBEGIN 0 44,653 83,762
Wed Nov 03 06:38:58 2021
Archived Log entry 220714 added for thread 2 sequence 33828 ID 0xbc36136c dest 1:
Wed Nov 03 06:39:51 2021
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$
......
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$
ORA-1628: max # extents 32765 reached for rollback segment
Wed Nov 03 06:41:12 2021
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30292_4090763577$
Wed Nov 03 06:41:27 2021
Thread 2 advanced to log sequence 33830 (LGWR switch)
Current log# 11 seq# 33830 mem# 0: +FDATADG/newepmfdp/onlinelog/group_11.305.1083929527
Wed Nov 03 06:41:31 2021
Archived Log entry 220716 added for thread 2 sequence 33829 ID 0xbc36136c dest 1:
Wed Nov 03 06:41:48 2021
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30343_1212715771$
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_j002_77822.trc:
ORA-12012: error on auto execute of job 1230
ORA-12008: error in materialized view refresh path
ORA-01628: max # extents (32765) reached for rollback segment _SYSSMU30343_1212715771$
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2809
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 3025
ORA-06512: at "SYS.DBMS_IREFRESH", line 689
ORA-06512: at "SYS.DBMS_REFRESH", line 195
ORA-06512: at line 1
Wed Nov 03 06:43:45 2021
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30359_11512129$
Wed Nov 03 06:43:45 2021
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30359_11512129$
我怀疑是频繁的undo extent扩展、收缩、偷取导致的undo块分配紊乱。由于没有对系统平时的负载做分析,这里暂时先给出一些粗略的后续建议:
- undo设置为undo retention guarantee
- 合理设置undo相关参数
- 合理设计事务,避免大事务产生
- 分析dba_undo_extents,很可能有过多的非常小的undo extent
墨天轮原文链接:https://www.modb.pro/db/175145?sjhy(复制链接至浏览器或点击文末阅读原文查看)
关于作者
李翔宇,云和恩墨西区交付技术顾问,长期服务移动运营商行业客户,熟悉Oracle性能优化,故障诊断,特殊恢复。
相关推荐
- VUE3前端开发入门系列教程二:使用iView框架辅助开发
-
1、安装iView新框架,支持VUE3npminstallview-ui-plus2、编辑src/main.js,添加以下内容,导入js和css到项目importViewUIPlusfrom...
- 万能前端框架uni app初探03:底部导航开发
-
前言本节我们使用uniapp的底部导航功能,点击不同tab会显示不同页面,这个功能在实际项目开发中几乎是必备的。一、基础知识1.tabBar如果应用是一个多tab应用,可以通过tabBar配...
- Rust Web 开发框架,前端你可以选择哪个?
-
Rust构建一切。在如今流行的语言中,Rust可谓是将构建和高效作为自己优美的身姿在大众视野中脱颖而出。它是一门赋予每个人构建可靠且高效软件能力的语言。它有什么特性呢?高性能。Rust速度惊人且内...
- 连载:前端开发中纠结的Javascript框架(上)
-
如今,前端开发有着许许多多的框架和库。其中一些好用,一些却不尽人意。通常我们会习惯性运用某一概念,模块或句法。事实上,并没有什么万能工具。这篇文章是关于未来框架的发展趋势——那就是没有框架!我从以下几...
- 前端开发框架的演进架构:提升用户体验和开发效率
-
前端开发框架是现代Web应用开发的重要工具,它不仅可以帮助开发者构建复杂的用户界面,还能够提升用户体验和开发效率。随着Web技术的不断发展,前端开发框架也在不断演进,为开发者提供了更丰富、更高效的工具...
- Google应用Mesh-TensorFlow框架,让CNN也能处理超高分辨率图像
-
为了要处理超高分辨率医疗图像数据,Google开发了一种空间数据分区(SpatialPartition)技术,在不牺牲图像分辨率的条件下,分析超高分辨率图像。Google使用Mesh-TensorF...
- 大模型安全挑战加剧:框架层漏洞成新靶心
-
近日,360数字安全集团发布了一份关于大模型安全漏洞的报告,揭示了当前大模型及围绕其构建的框架和应用中存在的严重安全问题。报告显示,360近期研究发现了近40个大模型相关的安全漏洞,其中既包括二进制内...
- Keras 3.0正式发布:可用于TensorFlow、JAX和PyTorch
-
机器之心报道编辑:陈萍经过5个月的更新迭代,Keras3.0终于来了。「大新闻:我们刚刚发布了Keras3.0版本!」Keras之父FrancoisChollet在X上激动的...
- TensorFlow和Keras入门必读教程(tensorflow与keras版本对应)
-
导读:本文对TensorFlow的框架和基本示例进行简要介绍。作者:本杰明·普朗什(BenjaminPlanche)艾略特·安德烈斯(EliotAndres)来源:华章科技01TensorFlo...
- 谷歌官方回应“TensorFlow遭弃”:还在投资开发,将与JAX并肩作战
-
鱼羊发自凹非寺量子位|公众号QbitAI终于,谷歌出面回应“TensorFlow遭弃”传闻:我们将继续致力于将TensorFlow打造为一流机器学习平台,与JAX并肩推动机器学习研究。这段时...
- 2025 年的PHP :现代 Web 开发的强大引擎
-
程序员还在吐槽PHP过时?2025年的PHP8.4直接封神了。看看最近更新的属性钩子、强类型系统,加上Laravel这些框架,老语言早就脱胎换骨。十年前说PHP弱类型容易崩代码的,现在脸疼不?联合类...
- 前端内卷终结者?htmx如何让开发者告别200行JS只做一个按钮
-
当你用React写一个点赞按钮需要引入3个状态管理库、编写80行JSX和120行钩子函数时,htmx只需要一行HTML:<buttonhx-post="/like"hx-sw...
- NativePHP桌面版V1.0正式发布(元气桌面电脑版下载)
-
导读:各位小伙伴,使用PHP构建桌面级系统的利器,NativePHP来了。概述NativePHP是一个用于使用PHP构建桌面应用的框架。它允许PHP开发人员使用熟悉的工具和技术创建跨平台的原生应用...
- PHP Laravel框架底层机制(php基本框架)
-
当然可以,Laravel是最受欢迎的PHP框架之一,以优雅的语法和丰富的生态而闻名。尽管开发体验非常“高端”,它的底层其实是由一系列结构清晰、职责分明的组件构成的。下面我从整体架构、核心流程、...
- PHP框架之Laravel框架教程:2. 控制器、路由、视图简单介绍
-
2.控制器、路由、视图简单介绍我们先建立控制器,目录是:app/Http/Controllers,新建控制器Ding.php,代码如下:Ding.php:<?phpnamespaceA...
你 发表评论:
欢迎- 一周热门
- 最近发表
-
- VUE3前端开发入门系列教程二:使用iView框架辅助开发
- 万能前端框架uni app初探03:底部导航开发
- Rust Web 开发框架,前端你可以选择哪个?
- 连载:前端开发中纠结的Javascript框架(上)
- 前端开发框架的演进架构:提升用户体验和开发效率
- Google应用Mesh-TensorFlow框架,让CNN也能处理超高分辨率图像
- 大模型安全挑战加剧:框架层漏洞成新靶心
- Keras 3.0正式发布:可用于TensorFlow、JAX和PyTorch
- TensorFlow和Keras入门必读教程(tensorflow与keras版本对应)
- 谷歌官方回应“TensorFlow遭弃”:还在投资开发,将与JAX并肩作战
- 标签列表
-
- 框架图 (58)
- flask框架 (53)
- quartz框架 (51)
- abp框架 (47)
- jpa框架 (47)
- springmvc框架 (49)
- 分布式事务框架 (65)
- scrapy框架 (56)
- shiro框架 (61)
- 定时任务框架 (56)
- java日志框架 (61)
- JAVA集合框架 (47)
- mfc框架 (52)
- abb框架断路器 (48)
- beego框架 (52)
- java框架spring (58)
- grpc框架 (65)
- tornado框架 (48)
- 前端框架bootstrap (54)
- orm框架有哪些 (51)
- ppt框架 (48)
- 内联框架 (52)
- cad怎么画框架 (58)
- ssm框架实现登录注册 (49)
- oracle字符串长度 (48)