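Doris version: 0.14.x
Flink loads data into the Doris BE nodes in real time through Stream load, with a maximum batch size of 5000 rows and a load interval of 2s. The error below is returned frequently.
What usually causes the error message highlighted in red? Thanks.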
{
"TxnId": 11169477,
"Label": "audit_20210520_150710_bd4c6dc578ad452f8b1c644ba7977611",
"Status": "Fail",
"Message": "already stopped, skip waiting for close. cancelled/!eos: : 1/0",
"NumberTotalRows": 0,
"NumberLoadedRows": 0,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 799,
"LoadTimeMs": 185,
"BeginTxnTimeMs": 3,
"StreamLoadPutTimeMs": 4,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 178,
"CommitAndPublishTimeMs": 0
}
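A size of 0 may indicate an empty version, for example when a load had no data for that bucket. It is worth checking for data skew here.
This is the current version situation in the cluster:
There is almost no base compaction (BC), only cumulative compaction (CC).
I currently plan to apply the following two parameters. Are there any other parameters or values you would recommend adjusting?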
echo "compaction_task_num_per_disk=5" >> /etc/doris/be/conf/be.conf
echo "max_cumulative_compaction_num_singleton_deltas=500" >> /etc/doris/be/conf/be.conf
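Also, while looking at version compaction I found a large number of versions with size 0. What causes this? Did the data loads fail?
The error tablet_id=2527704, txn_id=11178426, err=-215 most likely means that too many data versions have piled up. Doris currently limits a tablet to 500 versions, controlled by the BE parameter max_tablet_version_num. The limit exists to keep versions from accumulating indefinitely when loads are so frequent that compaction cannot keep up with writes.
The usual fix is to lower the load frequency and wait for compaction to work through the existing versions. Doris monitoring also includes compaction score metrics.
Alternatively, run show tablet 2527704 and then execute the show proc statement it returns to check each replica's version count (the versionCount column).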
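As a rough illustration of "lowering the load frequency", the client could slow down whenever a Stream load response reports a failure and return to its normal interval once loads succeed again. The sketch below is not Flink or Doris code: the class name `LoadBackoff`, the doubling policy and the 60 s cap are assumptions made for illustration, and only the 2 s base interval and the `Status` field come from the question above.

```python
import json


class LoadBackoff:
    """Hypothetical helper: tracks how long to wait between Stream load batches."""

    def __init__(self, base_s=2.0, max_s=60.0):
        self.base_s = base_s        # normal interval (2 s in the question)
        self.max_s = max_s          # assumed upper bound for the interval
        self.current_s = base_s

    def update(self, response_text):
        """Adjust the interval from a Stream load JSON response and return it."""
        status = json.loads(response_text).get("Status")
        if status == "Success":
            # Loads are going through again: return to the normal interval.
            self.current_s = self.base_s
        else:
            # Simplification: any non-Success status is treated as a signal
            # to slow down, doubling the interval up to the cap.
            self.current_s = min(self.current_s * 2, self.max_s)
        return self.current_s
```

The writer would sleep for the returned number of seconds before sending the next batch, which gives compaction time to reduce the version count.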
- AverageThreadTokens: 0.00
- FragmentCpuTime: 492.766us
- MemoryLimit: 2.00 GB
- PeakMemoryUsage: 1.21 MB
- PeakReservation: 0
- PeakUsedReservation: 0
- RowsProduced: 1
BlockMgr:
- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0.000ns
- TotalEncryptionTime: 0.000ns
- TotalIntegrityCheckTime: 0.000ns
- TotalReadBlockTime: 0.000ns
OlapTableSink:(Active: 21.307ms, non-child: 100.00%)
- CloseWaitTime: 20.370ms
- ConvertBatchTime: 0.000ns
- MaxAddBatchExecTime: 17.989ms
- NonBlockingSendTime: 1.409ms
- NonBlockingSendWorkTime: 298.548us
- SerializeBatchTime: 23.098us
- NumberBatchAdded: 10
- NumberNodeChannels: 10
- OpenTime: 744.186us
- RowsFiltered: 0
- RowsRead: 1
- RowsReturned: 1
- SendDataTime: 13.745us
- WaitMemLimitTime: 0.000ns
- TotalAddBatchExecTime: 39.369ms
- ValidateDataTime: 3.116us
BROKER_SCAN_NODE (id=0):(Active: 38.553us, non-child: 0.19%)
- BytesDecompressed: 0
- BytesRead: 284.00 B
- DecompressTime: 0.000ns
- FileReadTime: 5.751us
- MaterializeTupleTime(*): 11.861us
- NumDiskAccess: 0
- PeakMemoryUsage: 1.02 MB
- RowsRead: 1
- RowsReturned: 1
- RowsReturnedRate: 25.94 K/sec
- TotalRawReadTime(*): 32.899us
- TotalReadThroughput: 0.00 /sec
- WaitScannerTime: 0.000ns
W0520 15:51:04.718528 2596 stream_load_executor.cpp:90] fragment execute failed, query_id=d94952cdffe4ff08-131b39ecc22d6988, err_msg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215, id=d94952cdffe4ff08-131b39ecc22d6988, job_id=-1, txn_id=11178426, label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e
W0520 15:51:04.718605 2801 stream_load.cpp:142] handle streaming load failed, id=d94952cdffe4ff08-131b39ecc22d6988, errmsg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215
From be.INFO, here is the log of one Stream load that failed with -215:
Could you please take a look and suggest a good way to resolve this kind of error?
I0520 15:51:04.691452 2801 stream_load.cpp:214] new income streaming load request.id=d94952cdffe4ff08-131b39ecc22d6988, job_id=-1, txn_id=-1, label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e, db=ods_dental, tbl=ods_dental_operationlog
I0520 15:51:04.696164 2801 stream_load_executor.cpp:53] begin to execute job. label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e, txn_id=11178426, query_id=d94952cdffe4ff08-131b39ecc22d6988
I0520 15:51:04.696190 2801 plan_fragment_executor.cpp:76] Prepare(): query_id=d94952cdffe4ff08-131b39ecc22d6988 fragment_instance_id=d94952cdffe4ff08-131b39ecc22d6989 backend_num=0
I0520 15:51:04.696252 2801 plan_fragment_executor.cpp:140] Using query memory limit: 2.00 GB
I0520 15:51:04.696797 2596 plan_fragment_executor.cpp:239] Open(): fragment_instance_id=d94952cdffe4ff08-131b39ecc22d6989
I0520 15:51:04.697333 2759 tablets_channel.cpp:59] open tablets channel: (id=d94952cdffe4ff08-131b39ecc22d6988,index_id=2527663), tablets num: 3, timeout(s): 36000
I0520 15:51:04.699055 2749 tablets_channel.cpp:141] close tablets channel: (id=d94952cdffe4ff08-131b39ecc22d6988,index_id=2527663), sender id: 0
I0520 15:51:04.699074 20369 tablet_sink.cpp:979] all node channels are stopped(maybe finished/offending/cancelled), consumer thread exit.
I0520 15:51:04.699285 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527680.39817555.a14fc33f88e02df4-29c9453fca0a2196, rowsetid: 0200000000dd21cd6a432b7fb01433bddc002baa6e8f138e, version: 0
I0520 15:51:04.699295 2749 delta_writer.cpp:343] close delta writer for tablet: 2527680, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)
I0520 15:51:04.699322 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527668.39817555.754bf6c720eb9f87-282e5c4b1f464694, rowsetid: 0200000000dd21ce6a432b7fb01433bddc002baa6e8f138e, version: 0
I0520 15:51:04.699327 2749 delta_writer.cpp:343] close delta writer for tablet: 2527668, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)
I0520 15:51:04.699347 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527672.39817555.0f4841f222bb3bb5-8b4b4268a0aeecac, rowsetid: 0200000000dd21cf6a432b7fb01433bddc002baa6e8f138e, version: 0
I0520 15:51:04.699350 2749 delta_writer.cpp:343] close delta writer for tablet: 2527672, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)
I0520 15:51:04.699422 2749 load_channel_mgr.cpp:152] removing load channel d94952cdffe4ff08-131b39ecc22d6988 because it's finished
I0520 15:51:04.699430 2749 load_channel.cpp:38] load channel mem peak usage=4096, info=limit: 2147483648; consumption: 0; label: LoadChannel:d94952cdffe4ff08-131b39ecc22d6988; all tracker size: 3; limit trackers size: 2; parent is null: false; , load_id=d94952cdffe4ff08-131b39ecc22d6988
W0520 15:51:04.699900 2759 tablet_sink.cpp:168] NodeChannel[2527663-10008] add batch req success but status isn't ok, load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426, node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215
W0520 15:51:04.700928 2596 tablet_sink.cpp:733] NodeChannel[2527663-10008]: close channel failed, load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426. error_msg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215
I0520 15:51:04.718088 2596 tablet_sink.cpp:749] total mem_exceeded_block_ns=0, total queue_push_lock_ns=0, total actual_consume_ns=298548
I0520 15:51:04.718101 2596 tablet_sink.cpp:780] finished to close olap table sink. load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426, node add batch time(ms)/num: {10006:(17)(1)} {10009:(0)(1)} {10002:(0)(1)} {10007:(0)(1)} {10010:(0)(1)} {10003:(17)(1)} {10005:(0)(1)} {10008:(0)(1)} {10011:(0)(1)} {10004:(0)(1)}
W0520 15:51:04.718353 2596 fragment_mgr.cpp:230] Got error while opening fragment d94952cdffe4ff08-131b39ecc22d6989: Internal error: close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215
I0520 15:51:04.718505 2596 plan_fragment_executor.cpp:583] Fragment d94952cdffe4ff08-131b39ecc22d6989:(Active: 20.468ms, non-child: 0.00%)
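"already stopped, skip waiting for close. cancelled/!eos: : 1/0"
This error can have many causes, for example a failed RPC or a node that is blocked. In the current version the error message here is incomplete, which makes the problem hard to pin down.
You may need to use the txnId to find the query id of the corresponding load plan first, and then use that query id to find the related error messages.
Alternatively, try searching be.INFO for -215, -235 or similar negative error codes, which may be the reason the load failed.
We have improved the error reporting for this case in the next version.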
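The lookup described above (txnId to query id, then query id to error code) could be scripted roughly as follows. This is only a sketch: the function name `find_load_errors` is made up for illustration, and the regular expressions assume the be.INFO line formats seen in the log excerpts in this thread (`begin to execute job. ... txn_id=..., query_id=...` and `err=-NNN`), which may differ in other Doris versions.

```python
import re

# Matches the "begin to execute job" line that ties a txn_id to a query_id.
BEGIN_RE = re.compile(
    r"begin to execute job\..*?txn_id=(\d+), query_id=([0-9a-f]+-[0-9a-f]+)"
)
# Matches negative error codes such as err=-215.
ERR_RE = re.compile(r"err=(-\d+)")


def find_load_errors(log_lines, txn_id):
    """Return (query_id, sorted error codes) for one txn_id from be.INFO lines."""
    query_id = None
    codes = set()
    for line in log_lines:
        m = BEGIN_RE.search(line)
        if m and int(m.group(1)) == txn_id:
            query_id = m.group(2)
        # Only warning lines (prefix "W") that mention this load are inspected.
        if query_id and line.startswith("W") and query_id in line:
            codes.update(int(c) for c in ERR_RE.findall(line))
    return query_id, sorted(codes)
```

Called with an open be.INFO file (the path depends on the deployment) and the TxnId from the failed Stream load response, it returns the query id and any negative error codes logged in warnings for that load.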