xiaomi / rdsn
Has been migrated to https://github.com/apache/incubator-pegasus/tree/master/rdsn
License: Other
#290: While providing the CPU profiling feature, we hit a bug where the SVG graph shows function addresses instead of function names.
The current workaround is to recommend google/pprof instead of the pprof perl script shipped with gperftools.
The root cause is that the pprof perl script uses the curl command in FetchSymbols.
Since the FetchSymbols request carries a fairly large payload, curl adds "Expect: 100-continue" to the request when the HTTP version is not pinned. (Restricting the version with --http1.0 skips this flow; in testing, both HTTP/1.1 and HTTP/2 go through Expect: 100-continue.)
However, rdsn's HTTP parser does not support this mechanism.
We could consider supporting it when improving the rdsn HTTP server in the future.
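Supporting the mechanism would mean replying with an interim response before reading the request body. A minimal sketch, assuming a simple header map (hypothetical helper, not rdsn's actual API; the header lookup here is case-sensitive for brevity, while real HTTP header names are case-insensitive):

```cpp
#include <cassert>
#include <map>
#include <string>

// When a request carries "Expect: 100-continue", the server should send
// an interim "HTTP/1.1 100 Continue" response before reading the body,
// or reply 417 if it cannot honor the expectation (RFC 7231 §5.1.1).
std::string interim_response(const std::map<std::string, std::string> &headers)
{
    auto it = headers.find("Expect");
    if (it == headers.end())
        return ""; // no expectation: proceed to read the body directly
    if (it->second == "100-continue")
        return "HTTP/1.1 100 Continue\r\n\r\n";
    return "HTTP/1.1 417 Expectation Failed\r\n\r\n";
}
```

Once the interim response is written back to the client, curl proceeds to send the request body and the parser can continue as usual.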
What we need to do:
This is a minor release, including some improvements on CU calculation.
fixed by #165
for profiling slow query
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 1 issue was created.
It is closed now.
❤️ #307 build: use official thrift, cmake set default 3rdlibs, curl without idn, by vagetablechicken
🔈 #307 build: use official thrift, cmake set default 3rdlibs, curl without idn, by vagetablechicken
It received 1 comment.
Last week, 6 pull requests were created, updated or merged.
Last week, 2 pull requests were updated.
💛 #302 refactor: reimplement mutation_log::replay in a block by block way, by neverchanje
💛 #298 feat(throttle): support size-based write throttling, by neverchanje
Last week, 4 pull requests were merged.
💜 #307 build: use official thrift, cmake set default 3rdlibs, curl without idn, by vagetablechicken
💜 #305 test: add unit tests for task, by neverchanje
💜 #304 feat(dup): add interface mutation_duplicator & duplication procedure, by neverchanje
💜 #299 feat(split): parent replica prepare states, by hycdong
Last week there were 4 commits.
🛠️ feat(split): parent replica prepare states (#299) by hycdong
🛠️ feat(dup): add interface mutation_duplicator & duplication procedure (#304) by neverchanje
🛠️ build: change thrift compiler, set default 3rdlibs (#307) by vagetablechicken
🛠️ add unit tests for task (#305) by neverchanje
Last week there were 3 contributors.
👤 hycdong
👤 neverchanje
👤 vagetablechicken
Last week there was 1 stargazer.
⭐ IngrownMink4
You are the star! 🌟
Last week there were no releases.
That's all for last week, please 👀 Watch and ⭐ Star the repository XiaoMi/rdsn to receive next weekly updates. 😃
You can also view all Weekly Digests by clicking here.
Your Weekly Digest bot. 📆
The following situation occasionally occurs in production:
Config[C,B,A], LastDrop[]
Config[C,B], LastDrop[A]
Config[C], LastDrop[A,B]
Config[C,A], LastDrop[B]
Config[], LastDrop[B,A,C]
The above is just one example. In fact, once a partition enters the DDD state and one of the last two nodes in LastDrop cannot start normally, the partition ends up in a DDD state that requires manual intervention. With multiple nodes repeatedly starting and stopping in a production cluster, this situation arises easily.
If there are many such partitions, manual intervention becomes a huge workload; recovering them one by one by hand is unrealistic. So we should consider: can the recovery process for this situation be automated?
If one of the last two nodes in LastDrop has recovered and the other is a node to be removed, recovery can in fact be automated, e.g. by directly choosing the recovered node as the primary. But we need to prove that this choice definitely does not lose data.
Or, as a fallback, even without full automation, we could provide an automatic diagnosis tool in the shell that lists the partitions stuck in the manual-intervention DDD state, suggests remediation actions, and lets the user confirm and pick a suitable plan. This would greatly reduce the operational workload while still guaranteeing data correctness.
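The selection rule sketched above could look like the following (illustrative names, not rdsn's real meta-server API; whether choosing the surviving node is always safe still depends on the data-loss proof discussed above, so the ambiguous cases fall back to the operator):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical diagnosis helper for a DDD partition: inspect the last two
// nodes in LastDrop; if exactly one of them is alive again, propose it as
// the new primary. Returns an empty string when no safe automatic choice
// exists and human intervention is still required.
std::string propose_primary(const std::vector<std::string> &last_drop,
                            const std::vector<std::string> &alive_nodes)
{
    if (last_drop.size() < 2)
        return "";
    auto is_alive = [&](const std::string &n) {
        for (const auto &a : alive_nodes)
            if (a == n)
                return true;
        return false;
    };
    const std::string &a = last_drop[last_drop.size() - 2];
    const std::string &b = last_drop[last_drop.size() - 1];
    bool a_alive = is_alive(a), b_alive = is_alive(b);
    if (a_alive && !b_alive)
        return a;
    if (b_alive && !a_alive)
        return b;
    return ""; // both alive or both dead: ambiguous, ask the operator
}
```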
The shared log's size is calculated as "global_end_offset - global_start_offset". When all shared log files are deleted, "global_start_offset" is reset to 0 while "global_end_offset" remains unchanged, so we get an incorrect "size" for the shared log.
Please see rdsn/src/dist/replication/lib/mutation_log.cpp, line 1668 in 6ce6d43.
Setting "_global_start_offset" to "_global_end_offset" may fix this, but we need to test it.
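A minimal sketch of the bug and the proposed fix (field names follow the issue text, not necessarily mutation_log's exact members):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of the shared log's size bookkeeping.
struct shared_log_sketch {
    int64_t global_start_offset = 0;
    int64_t global_end_offset = 0;

    int64_t size() const { return global_end_offset - global_start_offset; }

    // Buggy behavior: resetting the start to 0 after all log files are
    // deleted makes size() report the entire historical write volume.
    void on_all_logs_deleted_buggy() { global_start_offset = 0; }

    // Proposed fix from the issue: align start with end so size() is 0.
    void on_all_logs_deleted_fixed() { global_start_offset = global_end_offset; }
};
```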
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 4 issues were created.
Of these, 0 issues have been closed and 4 issues are still open.
💚 #300 fix: use derror rather than dwarn for failed network bootstrap, by neverchanje
💚 #299 feat(split): parent replica prepare states, by hycdong
💚 #298 feat(throttle): support size-based write throttling, by neverchanje
💚 #297 feat(dup): implement duplication_sync on meta server side, by neverchanje
Last week, 5 pull requests were created, updated or merged.
Last week, 3 pull requests were updated.
💛 #299 feat(split): parent replica prepare states, by hycdong
💛 #298 feat(throttle): support size-based write throttling, by neverchanje
💛 #297 feat(dup): implement duplication_sync on meta server side, by neverchanje
Last week, 2 pull requests were merged.
💜 #296 http: improvement on http api, by neverchanje
💜 #291 split: parent replica create child replica, by hycdong
Last week there were 2 commits.
🛠️ split: parent replica create child replica (#291) by hycdong
🛠️ http: improvement on http api (#296) by neverchanje
Last week there were 2 contributors.
👤 hycdong
👤 neverchanje
Last week there were no stargazers.
Last week there were no releases.
That's all for last week, please 👀 Watch and ⭐ Star the repository XiaoMi/rdsn to receive next weekly updates. 😃
You can also view all Weekly Digests by clicking here.
Your Weekly Digest bot. 📆
When a replica needs to close because of an error, replica::close() closes _app.
However, other threads (gc / manual_compact / checkpoint) may still access _app, which leads to a core dump:
(gdb) bt
#0 dsn::replication::replica::last_durable_decree (this=<optimized out>) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica.cpp:357
#1 0x00007f0215f1760f in dsn::replication::replica_stub::on_gc (this=0x1cd9a70) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1488
#2 0x00007f021606c28e in dsn::timer_task::exec (this=0x7f00c0006140) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task.cpp:479
#3 0x00007f021606d199 in dsn::task::exec_internal (this=this@entry=0x7f00c0006140) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task.cpp:195
#4 0x00007f02160ff38d in dsn::task_worker::loop (this=0x1c2e320) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:323
#5 0x00007f02160ff559 in dsn::task_worker::run_internal (this=0x1c2e320) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:302
#6 0x00007f0213967600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#7 0x00007f02141e0dc5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f02130d173d in clone () from /lib64/libc.so.6
(gdb) f 1
#1 0x00007f0215f1760f in dsn::replication::replica_stub::on_gc (this=0x1cd9a70) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1488
1488 in /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp
(gdb) p r._obj._config
$5 = {_vptr.replica_configuration = 0x7f0216403ed0 <vtable for dsn::replication::replica_configuration+16>, pid = {_value = {u = {app_id = 2, partition_index = 3124}, value = 13417477832706}},
ballot = 4, primary = {static s_invalid_address = {static s_invalid_address = <same as static member of an already seen type>, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0,
uri = 0}, group = {type = 0, group = 0}, value = 0}}, _addr = {v4 = {type = 1, padding = 0, port = 34801, ip = 177090572}, uri = {type = 1, uri = 190149554362662912}, group = {type = 1,
group = 190149554362662912}, value = 760598217450651649}}, status = dsn::replication::partition_status::PS_ERROR, learner_signature = 17179869186, __isset = {pid = true, ballot = true,
primary = true, status = true, learner_signature = true}}
(gdb)
Related logs:
D2018-05-15 07:08:35.530 (1526339315530452861 123eb) replica.rep_long4.0405000208ad6b5f: replica_learn.cpp:936:on_copy_remote_state_completed(): [email protected]:34801: on_copy_remote_state_completed[0000000400000002]: learnee = 10.142.48.12:34801, learn_duration = 292299 ms, copy remote state done, err = ERR_TIMEOUT, copy_file_count = 2, copy_file_size = 0, copy_time_used = 29260 ms, local_committed_decree = 2871353, app_committed_decree = 2871353, app_durable_decree = 2869256, prepare_start_decree = -1, current_learning_status = replication::learner_status::LearningWithoutPrepare
D2018-05-15 07:08:35.530 (1526339315530511263 123b9) replica.replica2.040700040000bee7: replica_learn.cpp:1147:on_learn_remote_state_completed(): [email protected]:34801: on_learn_remote_state_completed[0000000400000002]: learnee = 10.142.48.12:34801, learn_duration = 292299 ms, err = ERR_TIMEOUT, local_committed_decree = 2871353, app_committed_decree = 2871353, app_durable_decree = 2869256, current_learning_status = replication::learner_status::LearningWithoutPrepare
E2018-05-15 07:08:35.530 (1526339315530524828 123b9) replica.replica2.040700040000bee7: replica_learn.cpp:1170:handle_learning_error(): [email protected]:34801: handle_learning_error[0000000400000002]: learnee = 10.142.48.12:34801, learn_duration = 292299 ms, err = ERR_TIMEOUT, local_error
D2018-05-15 07:08:35.530 (1526339315530639621 123b9) replica.replica2.040700040000bee7: replica_config.cpp:900:update_local_configuration(): [email protected]:34801: status change replication::partition_status::PS_POTENTIAL_SECONDARY @ 4 => replication::partition_status::PS_ERROR @ 4, pre(2871353, 2871353), app(2871353, 2869256), duration = 292300 ms, replica_configuration(pid=2.3124, ballot=4, primary=10.142.48.12:34801, status=2, learner_signature=17179869186)
D2018-05-15 07:08:35.530 (1526339315530650745 123b9) replica.replica2.040700040000bee7: replica_config.cpp:909:update_local_configuration(): [email protected]:34801: being close ...
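One possible shape of a fix, sketched with simplified types (not rdsn's real replica / replication_app classes): background tasks take a snapshot of the app pointer under a lock instead of dereferencing the member directly, so close() can release _app without racing against gc / manual_compact / checkpoint.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>

// Stand-in for the storage engine object closed in replica::close().
struct app_state {
    int64_t last_durable_decree = 0;
};

struct replica_sketch {
    std::mutex mu;
    std::shared_ptr<app_state> app = std::make_shared<app_state>();

    void close()
    {
        std::lock_guard<std::mutex> g(mu);
        app.reset(); // safe: concurrent readers hold their own reference
    }

    // Background task (e.g. on_gc): copy the pointer under the lock,
    // then use the copy, checking for null.
    int64_t last_durable_decree_safe()
    {
        std::shared_ptr<app_state> snapshot;
        {
            std::lock_guard<std::mutex> g(mu);
            snapshot = app;
        }
        return snapshot ? snapshot->last_durable_decree : -1;
    }
};
```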
time: 20190717 11:39:15
cluster: tjwqtst-staging
node: tj1-hadoop-pegasus-tst-ts09
version: 1.11.5
CALL [replica-server] [10.38.162.236:31801] succeed: Pegasus Server 1.11.5 (ba0661d17a96143164d7a0a5c17bb88c0c1dd44d) Release, Started at 2019-07-17 11:39:15
core file: core.replica.asio.3.120767.1563334709
code:
https://github.com/XiaoMi/rdsn/blob/709ea4117fd31b2bd2788dddb1b41f94e8307210/src/core/tools/common/thrift_message_parser.cpp
call stack:
#0 0x00007f19d0d7d1d7 in raise () from /lib64/libc.so.6
#1 0x00007f19d0d7e8c8 in abort () from /lib64/libc.so.6
#2 0x00007f19d48b743e in dsn_coredump () at /home/wutao1/pegasus-release/rdsn/src/core/core/service_api_c.cpp:73
#3 0x00007f19d493e843 in dsn::thrift_message_parser::parse_message (thrift_header=..., message_data=...)
at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/thrift_message_parser.cpp:275
#4 0x00007f19d493eb13 in dsn::thrift_message_parser::get_message_on_receive (this=0x50a07ee10, reader=0x59bc4a80, read_next=@0x7f19bc46e04c: 4096)
at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/thrift_message_parser.cpp:72
#5 0x00007f19d495a5ff in operator() (length=<optimized out>, __closure=0x7f19bc46e090, ec=...) at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/asio_rpc_session.cpp:114
#6 operator() (this=0x7f19bc46e090) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/bind_handler.hpp:127
#7 asio_handler_invoke<boost::asio::detail::binder2<dsn::tools::asio_rpc_session::do_read(int)::__lambda2, boost::system::error_code, long unsigned int> > (function=...)
at /home/wutao1/boost_1_58_0/output/include/boost/asio/handler_invoke_hook.hpp:69
#8 invoke<boost::asio::detail::binder2<dsn::tools::asio_rpc_session::do_read(int)::__lambda2, boost::system::error_code, long unsigned int>, dsn::tools::asio_rpc_session::do_read(int)::__lambda2> (context=..., function=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#9 boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, dsn::tools::asio_rpc_session::do_read(int)::__lambda2>::do_complete(boost::asio::detail::io_service_impl *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=<optimized out>, base=<optimized out>)
at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/reactive_socket_recv_op.hpp:110
#10 0x000000000074fec9 in complete (bytes_transferred=<optimized out>, ec=..., owner=..., this=<optimized out>)
at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/task_io_service_operation.hpp:38
#11 do_run_one (ec=..., this_thread=..., lock=..., this=0x29c4620) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:372
#12 boost::asio::detail::task_io_service::run (this=0x29c4620, ec=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:149
#13 0x00007f19d4952cc6 in run (this=<optimized out>, ec=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/impl/io_service.ipp:66
#14 operator() (__closure=0x299d930) at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/asio_net_provider.cpp:73
#15 _M_invoke<> (this=0x299d930) at /home/wutao1/app/include/c++/4.8.2/functional:1732
#16 operator() (this=0x299d930) at /home/wutao1/app/include/c++/4.8.2/functional:1720
#17 std::thread::_Impl<std::_Bind_simple<dsn::tools::asio_network_provider::start(dsn::rpc_channel, int, bool)::__lambda2()> >::_M_run(void) (this=0x299d918)
at /home/wutao1/app/include/c++/4.8.2/thread:115
#18 0x00007f19d16d5600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#19 0x00007f19d2348dc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f19d0e3f73d in clone () from /lib64/libc.so.6
Test the performance of DPDK. If the network card supports it, we can adopt DPDK where necessary.
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 4 issues were created.
Of these, 2 issues have been closed and 2 issues are still open.
💚 #317 feat(dup): implement procedure load_from_private_log, by neverchanje
💚 #316 add http interface for get_app_envs, by levy5307
❤️ #315 feat(dup): verify private log validity before starting to duplicate, by neverchanje
❤️ #314 feat: support table-level slow query on meta server, by levy5307
Last week, 9 pull requests were created, updated or merged.
Last week, 1 pull request was updated.
💛 #317 feat(dup): implement procedure load_from_private_log, by neverchanje
Last week, 8 pull requests were merged.
💜 #315 feat(dup): verify private log validity before starting to duplicate, by neverchanje
💜 #314 feat: support table-level slow query on meta server, by levy5307
💜 #312 feat(dup): implement ship_mutation stage and mutation_batch, by neverchanje
💜 #311 refactor: rename disk_aio to aio_context, by neverchanje
💜 #310 refactor: remove empty_aio_provider and posix aio_provider, by neverchanje
💜 #309 feat(split): child replica learn parent prepare list and checkpoint, by hycdong
💜 #302 refactor: introduce mutation_log::replay_block, by neverchanje
💜 #298 feat(throttle): support size-based write throttling, by neverchanje
Last week there were 8 commits.
🛠️ feat: support table-level slow query on meta server (#314) by levy5307
🛠️ feat(split): child replica learn parent prepare list and checkpoint (#309) by hycdong
🛠️ feat(dup): verify private log validity before starting to duplicate (#315) by neverchanje
🛠️ feat(dup): implement ship_mutation stage and mutation_batch (#312) by neverchanje
🛠️ feat(throttle): support size-based write throttling (#298) by neverchanje
🛠️ refactor: rename disk_aio to aio_context (#311) by neverchanje
🛠️ refactor: introduce mutation_log::replay_block (#302) by neverchanje
🛠️ refactor: remove empty_aio_provider and posix aio_provider (#310) by neverchanje
Last week there were 3 contributors.
👤 levy5307
👤 hycdong
👤 neverchanje
Last week there were no stargazers.
Last week there were no releases.
That's all for last week, please 👀 Watch and ⭐ Star the repository XiaoMi/rdsn to receive next weekly updates. 😃
You can also view all Weekly Digests by clicking here.
Your Weekly Digest bot. 📆
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 3 issues were created.
Of these, 0 issues have been closed and 3 issues are still open.
💚 #321 feat(http): add http interface for get_app_envs, by levy5307
💚 #320 feat(dup): protect private log from missing when duplication is enabled, by neverchanje
💚 #319 feat(split): child replica apply private logs, in-memory mutations and catch up parent, by hycdong
🔈 #321 feat(http): add http interface for get_app_envs, by levy5307
It received 1 comment.
Last week, 4 pull requests were created, updated or merged.
Last week, 1 pull request was opened.
💚 #320 feat(dup): protect private log from missing when duplication is enabled, by neverchanje
Last week, 2 pull requests were updated.
💛 #321 feat(http): add http interface for get_app_envs, by levy5307
💛 #319 feat(split): child replica apply private logs, in-memory mutations and catch up parent, by hycdong
Last week, 1 pull request was merged.
💜 #317 feat(dup): implement procedure load_from_private_log, by neverchanje
Last week there was 1 commit.
🛠️ feat(dup): implement procedure load_from_private_log (#317) by neverchanje
Last week there was 1 contributor.
👤 neverchanje
Last week there were no stargazers.
Last week there were no releases.
That's all for last week, please 👀 Watch and ⭐ Star the repository XiaoMi/rdsn to receive next weekly updates. 😃
You can also view all Weekly Digests by clicking here.
Your Weekly Digest bot. 📆
Currently replica_server, meta server, Poco and boost are dynamic libraries; we can change them to static libs for easier deployment.
If we change all libraries to static libraries, we can also link tcmalloc to try to resolve issue #28.
cluster: tjwqtst-staging
node: 10.38.162.227
version: Pegasus Server 1.9.0 (972fc0148f89935e828181a7a3cd6b0c69150467) Release
/home/work/app/pegasus/tjwqtst-staging/replica/log/log.502.txt
D2018-06-28 23:41:28.254 (1530200488254061413 e6f2) replica.io-thrd.59122: network.cpp:619:on_server_session_accepted(): server session accepted, remote_client = 10.38.166.13:26882, current_count = 777
E2018-06-28 23:41:28.254 (1530200488254075072 e6ef) replica.io-thrd.59119: network.cpp:244:prepare_parser(): invalid header type, remote_client = 10.38.166.13:26882, header_type = '\FF\00\00('
E2018-06-28 23:41:28.254 (1530200488254165882 e6ef) replica.io-thrd.59119: asio_rpc_session.cpp:128:operator()(): asio read from 10.38.166.13:26882 failed
D2018-06-28 23:41:28.254 (1530200488254175468 e6ef) replica.io-thrd.59119: network.cpp:639:on_server_session_disconnected(): server session disconnected, remote_client = 10.38.166.13:26882, current_count = 776
D2018-06-28 23:41:28.254 (1530200488254583273 e6f0) replica.io-thrd.59120: network.cpp:619:on_server_session_accepted(): server session accepted, remote_client = 10.38.166.13:26883, current_count = 777
E2018-06-28 23:41:28.254 (1530200488254598616 e6ef) replica.io-thrd.59119: network.cpp:244:prepare_parser(): invalid header type, remote_client = 10.38.166.13:26883, header_type = '\00\1E\00\06'
E2018-06-28 23:41:28.254 (1530200488254618606 e6ef) replica.io-thrd.59119: asio_rpc_session.cpp:128:operator()(): asio read from 10.38.166.13:26883 failed
D2018-06-28 23:41:28.254 (1530200488254625148 e6ef) replica.io-thrd.59119: network.cpp:639:on_server_session_disconnected(): server session disconnected, remote_client = 10.38.166.13:26883, current_count = 776
... ...
D2018-06-28 23:41:31.278 (1530200491278023227 e6f2) replica.io-thrd.59122: asio_rpc_session.cpp:102:operator()(): asio read from 10.118.45.2:49038 failed: End of file
D2018-06-28 23:41:31.278 (1530200491278032958 e6f2) replica.io-thrd.59122: network.cpp:639:on_server_session_disconnected(): server session disconnected, remote_client = 10.118.45.2:49038, current_count = 823
W2018-06-28 23:41:31.278 (1530200491278143007 e6f2) replica.io-thrd.59122: asio_rpc_session.cpp:189:safe_close(): asio socket shutdown failed, error = Bad file descriptor
W2018-06-28 23:41:31.278 (1530200491278148003 e6f2) replica.io-thrd.59122: asio_rpc_session.cpp:192:safe_close(): asio socket close failed, error = Bad file descriptor
gdb /home/work/app/pegasus/tjwqtst-staging/replica/package/bin/replica_server /home/core/core.replica.asio.3.59078.1530200491
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/work/app/pegasus/tjwqtst-staging/replica/package/bin/pegasus_server confi'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007ffab56dd1d7 in raise () from /lib64/libc.so.6
(gdb)
(gdb) bt
#0 0x00007ffab56dd1d7 in raise () from /lib64/libc.so.6
#1 0x00007ffab56de8c8 in abort () from /lib64/libc.so.6
#2 0x00007ffab64af9f5 in tcmalloc::Log (mode=mode@entry=tcmalloc::kCrash, filename=filename@entry=0x7ffab64c5cee "src/tcmalloc.cc", line=line@entry=332, a=..., b=..., c=..., d=...)
at src/internal_logging.cc:118
#3 0x00007ffab64a4564 in (anonymous namespace)::InvalidFree (ptr=<optimized out>) at src/tcmalloc.cc:332
#4 0x00007ffab8c2497d in weak_release (this=0xffffffff) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:160
#5 release (this=0xffffffff) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:147
#6 ~shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/detail/shared_count.hpp:443
#7 ~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/shared_ptr.hpp:323
#8 dsn::thrift_message_parser::parse_message (thrift_header=..., message_data=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/thrift_message_parser.cpp:255
#9 0x00007ffab8c24cc7 in dsn::thrift_message_parser::get_message_on_receive (this=0x32b94d9a0, reader=0x10275da10, read_next=@0x7ffaa0dce0cc: 4096)
at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/thrift_message_parser.cpp:72
#10 0x00007ffab8c2eb47 in operator() (length=<optimized out>, __closure=0x7ffaa0dce110, ec=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/asio_rpc_session.cpp:119
#11 operator() (this=0x7ffaa0dce110) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/bind_handler.hpp:127
#12 asio_handler_invoke<boost::asio::detail::binder2<dsn::tools::asio_rpc_session::do_read(int)::__lambda1, boost::system::error_code, long unsigned int> > (function=...)
at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/handler_invoke_hook.hpp:69
#13 invoke<boost::asio::detail::binder2<dsn::tools::asio_rpc_session::do_read(int)::__lambda1, boost::system::error_code, long unsigned int>, dsn::tools::asio_rpc_session::do_read(int)::__lambda1> (
context=..., function=...) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#14 boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, dsn::tools::asio_rpc_session::do_read(int)::__lambda1>::do_complete(boost::asio::detail::io_service_impl *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=<optimized out>, base=<optimized out>)
at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/reactive_socket_recv_op.hpp:110
#15 0x0000000000584540 in complete (bytes_transferred=<optimized out>, ec=..., owner=..., this=<optimized out>)
at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/task_io_service_operation.hpp:38
#16 do_run_one (ec=..., this_thread=..., lock=..., this=0x2ccc7e0) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:372
#17 boost::asio::detail::task_io_service::run (this=0x2ccc7e0, ec=...) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:149
#18 0x00007ffab8c13bef in run (this=0x2c9f9c8) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/asio/impl/io_service.ipp:59
#19 operator() (__closure=<optimized out>) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/asio_net_provider.cpp:69
#20 _M_invoke<> (this=<optimized out>) at /home/work/qinzuoyan/Pegasus/toolchain/output/include/c++/4.8.2/functional:1732
#21 operator() (this=<optimized out>) at /home/work/qinzuoyan/Pegasus/toolchain/output/include/c++/4.8.2/functional:1720
#22 std::thread::_Impl<std::_Bind_simple<dsn::tools::asio_network_provider::start(dsn::rpc_channel, int, bool, dsn::io_modifer&)::__lambda1()> >::_M_run(void) (this=<optimized out>)
at /home/work/qinzuoyan/Pegasus/toolchain/output/include/c++/4.8.2/thread:115
#23 0x00007ffab6035600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#24 0x00007ffab6ca2dc5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007ffab579f73d in clone () from /lib64/libc.so.6
(gdb)
(gdb) f 8
#8 dsn::thrift_message_parser::parse_message (thrift_header=..., message_data=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/thrift_message_parser.cpp:255
255 /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/tools/common/thrift_message_parser.cpp: No such file or directory.
(gdb) p dsn_hdr
$1 = (dsn::message_header *) 0x73a23f844
(gdb) p *dsn_hdr
$2 = {hdr_type = 1413892180, hdr_version = 0, hdr_length = 192, hdr_crc32 = 0, body_length = 79, body_crc32 = 0, id = 26, trace_id = 0, rpc_name = "RPC_RRDB_RRDB_GET", '\000' <repeats 30 times>,
rpc_code = {local_code = 0, local_hash = 0}, gpid = {_value = {u = {app_id = 49, partition_index = 1}, value = 4294967345}}, context = {u = {is_request = 1, is_forwarded = 0, unused = 0,
serialize_format = 1, is_forward_supported = 0, parameter_type = 0, parameter = 0}, context = 65}, from_address = {static s_invalid_address = {
static s_invalid_address = <same as static member of an already seen type>, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0},
value = 0}}, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0}, value = 0}}, client = {timeout_ms = 0, thread_hash = 388032,
partition_hash = 0}, server = {error_name = '\000' <repeats 47 times>, error_code = {local_code = 0, local_hash = 0}}}
(gdb) f 6
#6 ~shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/detail/shared_count.hpp:443
443 /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/detail/shared_count.hpp: No such file or directory.
(gdb) p this
$3 = (boost::detail::shared_count * const) <synthetic pointer>
(gdb) p *this
$4 = {pi_ = 0xffffffff}
(gdb) f 7
#7 ~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/shared_ptr.hpp:323
323 /home/work/qinzuoyan/software/boost_1_58_0/output/include/boost/smart_ptr/shared_ptr.hpp: No such file or directory.
(gdb) p this
$5 = (boost::shared_ptr<dsn::binary_reader_transport> * const) <synthetic pointer>
(gdb) p *this
$6 = {px = <optimized out>, pn = {pi_ = 0xffffffff}}
(gdb) p stream
$8 = {<dsn::binary_reader> = {_vptr.binary_reader = 0x9014d0 <vtable for dsn::rpc_read_stream+16>, _blob = {_holder = {<std::__shared_ptr<char, (__gnu_cxx::_Lock_policy)2>> = {
_M_ptr = 0x6e1402000 "THFT", _M_refcount = {_M_pi = 0x18c4eafc0}}, <No data fields>}, _buffer = 0x6e1402000 "THFT", _data = 0x6e1402d16 "\200\001", _length = 79}, _size = 79,
_ptr = 0x6e1402d33 "\f", _remaining_size = 50}, _msg = 0x73a23f780}
(gdb) p stream._msg
$9 = (dsn_message_t) 0x73a23f780
(gdb) p ('dsn::message_ex'*)stream._msg
$10 = (dsn::message_ex *) 0x73a23f780
(gdb) p *('dsn::message_ex'*)stream._msg
$11 = {<dsn::ref_counter> = {_vptr.ref_counter = 0x7ffab8f165b0 <vtable for dsn::message_ex+16>, _magic = 3735928559, _counter = {<std::__atomic_base<long>> = {
_M_i = 0}, <No data fields>}}, <dsn::extensible_object<dsn::message_ex, 4>> = {<dsn::extensible> = {_ptr = 0x73a23f7a8, _count = 4}, static INVALID_SLOT = <optimized out>,
static INVALID_VALUE = <optimized out>, _extensions = {0, 0, 0, 0}, static s_extensionDeletors = {0x0, 0x0, 0x0, 0x0}, static s_nextExtensionIndex = {<std::__atomic_base<unsigned int>> = {
_M_i = 1}, <No data fields>}}, <dsn::callocator_object<dsn::tls_trans_malloc, dsn::tls_trans_free>> = {<No data fields>}, header = 0x73a23f844,
buffers = {<std::_Vector_base<dsn::blob, std::allocator<dsn::blob> >> = {_M_impl = {<std::allocator<dsn::blob>> = {<__gnu_cxx::new_allocator<dsn::blob>> = {<No data fields>}, <No data fields>},
_M_start = 0x1b073f540, _M_finish = 0x1b073f590, _M_end_of_storage = 0x1b073f590}}, <No data fields>}, io_session = {_obj = 0x0}, to_address = {static s_invalid_address = {
static s_invalid_address = <same as static member of an already seen type>, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0},
value = 0}}, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0}, value = 0}}, server_address = {static s_invalid_address = {
static s_invalid_address = <same as static member of an already seen type>, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0},
value = 0}}, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0}, value = 0}}, local_rpc_code = {_internal_code = 0}, hdr_format = {
_internal_code = 0}, send_retry_count = 0, dl = {_next = 0x73a23f810, _prev = 0x73a23f810}, static _id = {<std::__atomic_base<unsigned long>> = {_M_i = 63889502}, <No data fields>}, _rw_index = 1,
_rw_offset = 0, _rw_committed = false, _is_read = true, static s_local_hash = 0}
(gdb) p *(('dsn::message_ex'*)stream._msg).header
$12 = {hdr_type = 1413892180, hdr_version = 0, hdr_length = 192, hdr_crc32 = 0, body_length = 79, body_crc32 = 0, id = 26, trace_id = 0, rpc_name = "RPC_RRDB_RRDB_GET", '\000' <repeats 30 times>,
rpc_code = {local_code = 0, local_hash = 0}, gpid = {_value = {u = {app_id = 49, partition_index = 1}, value = 4294967345}}, context = {u = {is_request = 1, is_forwarded = 0, unused = 0,
serialize_format = 1, is_forward_supported = 0, parameter_type = 0, parameter = 0}, context = 65}, from_address = {static s_invalid_address = {
static s_invalid_address = <same as static member of an already seen type>, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0},
value = 0}}, _addr = {v4 = {type = 0, padding = 0, port = 0, ip = 0}, uri = {type = 0, uri = 0}, group = {type = 0, group = 0}, value = 0}}, client = {timeout_ms = 0, thread_hash = 388032,
partition_hash = 0}, server = {error_name = '\000' <repeats 47 times>, error_code = {local_code = 0, local_hash = 0}}}
(gdb)
It looks like the memory of trans_ptr in thrift_message_parser::parse_message() was corrupted by a stray write.
When a node has been down for a fairly long time and then comes back, how is its data synchronized?
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 2 issues were created.
Of these, 0 issues have been closed and 2 issues are still open.
💚 #324 refactor: remove replay related codes to mutation_log_replay, by neverchanje
💚 #323 refactor: rename task::is_empty to is_callback_empty & move aio tests…, by neverchanje
Last week, 4 pull requests were created, updated or merged.
Last week, 2 pull requests were opened.
💚 #324 refactor: remove replay related codes to mutation_log_replay, by neverchanje
💚 #323 refactor: rename task::is_empty to is_callback_empty & move aio tests…, by neverchanje
Last week, 2 pull requests were updated.
💛 #321 feat(http): add http interface for get_app_envs, by levy5307
💛 #320 feat(dup): protect private log from missing when duplication is enabled, by neverchanje
Last week there were no commits.
Last week there were no contributors.
Last week there were no stargazers.
Last week there were no releases.
That's all for last week, please 👀 Watch and ⭐ Star the repository XiaoMi/rdsn to receive next weekly updates. 😃
You can also view all Weekly Digests by clicking here.
Your Weekly Digest bot. 📆
add http interface for get_app_envs
Support Kerberos authentication in the ZooKeeper client.
Since clang is now supported, we can run IWYU periodically (weekly, maybe?).
https://github.com/include-what-you-use/include-what-you-use
The shared log is mainly used for sequential writes to disk, which is an excellent design for spinning disks, or for SSDs when "fsync" is necessary.
However, in rdsn the data are currently stored on SSDs without "fsync", so we can try to remove the shared log for better performance and a simpler architecture.
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 4 issues were created.
Of these, 1 issue has been closed and 3 issues are still open.
💚 #305 test: add unit tests for task, by neverchanje
💚 #304 feat(dup): add interface mutation_duplicator & duplication procedure, by neverchanje
💚 #302 refactor: reimplement mutation_log::replay in a block by block way, by neverchanje
❤️ #303 build: remove MY_PROJ_INC_PATH, by vagetablechicken
Last week, 7 pull requests were created, updated or merged.
Last week, 4 pull requests were updated.
💛 #304 feat(dup): add interface mutation_duplicator & duplication procedure, by neverchanje
💛 #302 refactor: reimplement mutation_log::replay in a block by block way, by neverchanje
💛 #299 feat(split): parent replica prepare states, by hycdong
💛 #298 feat(throttle): support size-based write throttling, by neverchanje
Last week, 3 pull requests were merged.
💜 #303 build: remove MY_PROJ_INC_PATH, by vagetablechicken
💜 #300 fix(network): use derror rather than dwarn for failed network bootstrap, by neverchanje
💜 #297 feat(dup): implement duplication_sync on meta server side, by neverchanje
Last week there were 3 commits.
🛠️ build: remove MY_PROJ_INC_PATH (#303) by vagetablechicken
🛠️ fix(network): use derror rather than dwarn for failed network bootstrap (#300) by neverchanje
🛠️ feat(dup): implement duplication_sync on meta server side (#297) by neverchanje
Last week there were 2 contributors.
👤 vagetablechicken
👤 neverchanje
Last week there was 1 stargazer.
⭐ stone-wind
You are the star! 🌟
Last week there were no releases.
Currently we try to balance the replicas of each app across nodes & disks, so that the difference in partition count between any two nodes is no more than 1. But when there is more than one app, the differences may accumulate. We may need to improve this.
Time: 2018-05-11 01:52
Cluster: c3srv-stat
Node: 10.142.11.54
Version: 1.8.0 (dba2265cf29435729fa6d2a1e4e3e22b71b7d74f) Release
Core file: /home/core/core.replica.replica.98100.1525974558
Log file: /home/work/app/pegasus/c3srv-stat/replica/log/log.495.txt
Core was generated by `/home/work/app/pegasus/c3srv-stat/replica/package/bin/pegasus_server config.ini'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fadb38741d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fadb38741d7 in raise () from /lib64/libc.so.6
#1 0x00007fadb38758c8 in abort () from /lib64/libc.so.6
#2 0x00007fadb685fd8e in dsn_coredump () at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/service_api_c.cpp:176
#3 0x00007fadb67999d0 in dsn::replication::replica::update_local_configuration (this=this@entry=0x7fac401658d0, config=..., same_ballot=same_ballot@entry=true)
at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp:738
#4 0x00007fadb67f3d5a in dsn::replication::replica::on_add_learner (this=0x7fac401658d0, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_learn.cpp:1377
#5 0x00007fadb67687ed in dsn::replication::replica_stub::on_add_learner (this=0x1c36770, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:973
#6 0x00007fadb678406d in bool dsn::serverlet<dsn::replication::replica_stub>::register_rpc_handler<dsn::replication::group_check_request>(dsn::task_code, char const*, void (dsn::replication::replica_stub::*)(dsn::replication::group_check_request const&), dsn::gpid)::{lambda(void*, void*)#1}::_FUN(void*, void*) () at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/include/dsn/cpp/serverlet.h:184
#7 0x00007fadb68ba157 in run (req=<optimized out>, this=0x7fac640069b0) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/include/dsn/tool-api/task.h:249
#8 dsn::rpc_request_task::exec (this=<optimized out>) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/include/dsn/tool-api/task.h:276
#9 0x00007fadb68b95b9 in dsn::task::exec_internal (this=this@entry=0x7fa253c2ac44) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task.cpp:195
#10 0x00007fadb694a8ed in dsn::task_worker::loop (this=0x1ba5b00) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:323
#11 0x00007fadb694aab9 in dsn::task_worker::run_internal (this=0x1ba5b00) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:302
#12 0x00007fadb41cc600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#13 0x00007fadb4a45dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fadb393673d in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x00007fadb67f3d5a in dsn::replication::replica::on_add_learner (this=0x7fac401658d0, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_learn.cpp:1377
1377 in /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_learn.cpp
(gdb) p this._config
$4 = {_vptr.replica_configuration = 0x7fadb6c46570 <vtable for dsn::replication::replica_configuration+16>, pid = {_value = {u = {app_id = 8, partition_index = 5}, value = 21474836488}}, ballot = 27,
primary = {_addr = {u = {v4 = {type = 1, padding = 0, port = 43801, ip = 177081397}, uri = {type = 1, uri = 190139702928883712}, group = {type = 1, group = 190139702928883712},
value = 760558811715534849}}}, status = dsn::replication::partition_status::PS_POTENTIAL_SECONDARY, learner_signature = 115964116994, __isset = {pid = true, ballot = true, primary = true,
status = true, learner_signature = true}}
Code at replica_config.cpp:738:
Using a shell built at the current version 9377c4c to access an old (<=1.8.1) cluster:
>>>app_disk test -d
[Parameters]
app_name: test
detailed: true
[Result]
ERROR: decode perf counter info from node 10.132.42.1:35801 failed, result = {"result":"OK","timestamp":1528639103,"timestamp_str":"2018-06-10 21:58:23","counters":[{"name":"replica*app.pegasus*disk.storage.sst(MB)@1.13","type":"NUMBER","value":0},{"name":"replica*app.pegasus*disk.storage.sst(MB)@1.2","type":"NUMBER","value":0},{"name":"replica*app.pegasus*disk.storage.sst(MB)@1.3","type":"NUMBER","value":0},{"name":"replica*app.pegasus*[email protected]","type":"NUMBER","value":9},{"name":"replica*app.pegasus*[email protected]","type":"NUMBER","value":10},{"name":"replica*app.pegasus*[email protected]","type":"NUMBER","value":2}]}
like:
Currently we have both the last_durable decree and the last flushed decree, which is confusing and unnecessary. We should refactor this.
Currently, on every dsn_logv call, simple_logger checks whether the number of log files exceeds the limit; if so, a single logv call triggers a filesystem remove operation. If a file cannot be deleted, the next logv call has to try again, which may incur a serious performance downgrade on the critical path and can even lead to replica failure.
Failed to remove garbage log file /home/work/app/pegasus/c3srv-feedprofile2/replica/log/log.438.txt
(the line above repeated 16 times in total)
// TODO: move gc out of critical path
while (_index - _start_index > _max_number_of_log_files_on_disk) {
    std::stringstream str2;
    str2 << "log." << _start_index++ << ".txt";
    auto dp = utils::filesystem::path_combine(_log_dir, str2.str());
    if (::remove(dp.c_str()) != 0) {
        printf("Failed to remove garbage log file %s\n", dp.c_str());
        _start_index--;
        break;
    }
}
Replica failure:
E2019-05-11 21:58:20.075 (1557583100075617036 1966c) replica.local_app6.04009626ef59471f: replica_stub.cpp:1348:response_client(): 2.118@: read fail: client = , code = RPC_RRDB_RRDB_GET, timeout = 1000, status = Unknown, error = ERR_OBJECT_NOT_FOUND
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 4 issues were created.
Of these, 0 issues have been closed and 4 issues are still open.
💚 #312 feat(dup): implement ship_mutation stage and mutation_batch, by neverchanje
💚 #311 refactor: rename disk_aio to aio_context, by neverchanje
💚 #310 refactor: remove empty_aio_provider and posix aio_provider, by neverchanje
💚 #309 feat(split): child replica learn parent prepare list and checkpoint, by hycdong
Last week, 5 pull requests were created, updated or merged.
Last week, 1 pull request was opened.
💚 #312 feat(dup): implement ship_mutation stage and mutation_batch, by neverchanje
Last week, 4 pull requests were updated.
💛 #311 refactor: rename disk_aio to aio_context, by neverchanje
💛 #310 refactor: remove empty_aio_provider and posix aio_provider, by neverchanje
💛 #309 feat(split): child replica learn parent prepare list and checkpoint, by hycdong
💛 #302 refactor: introduce mutation_log::replay_block, by neverchanje
Last week there were no commits.
Last week there were no contributors.
Last week there was 1 stargazer.
⭐ netroby
You are the star! 🌟
Last week there were no releases.
Also, rpc_address::ipv4_from_network_interface() returns 0 if no proper IP address is found.
As a result, rpc_address::assign_ipv4() and rpc_address::assign_ipv4_local_address() may assign an unexpected IP (0.0.0.0), which can cause tricky problems.
This needs to be fixed.
Currently the app information isn't removed from the meta-state-service (ZooKeeper in our implementation).
To resolve this issue, we need a new field to track the max app id and store it in the meta-state-service.
See issue #126.
Currently we use a module, the "tls memory allocator", to allocate memory in blocks; we should try using tcmalloc/jemalloc so that this module can be removed. Refer to #25.
What we need to do:
Be sure we are able to build the whole project via ./run.sh build.
blob is unsafe to use as a shared buffer, since it can also be created without taking ownership of the internal buffer:
void assign(const char *buffer, int offset, unsigned int length);
blob(const char *buffer, int offset, unsigned int length);
This can easily lead to incorrect use, which may in turn cause a double free.
We should remove the two functions above and use dsn::string_view instead.
This feature is necessary when split(#69) is added to our project.
For example:
Then ideally each node will have 3 partitions, and these partitions will be placed on 3 disks, leaving the remaining 5 disks empty.
When the table is split, the partitions on these disks will be (2, 2, 2, 0, 0, 0, 0, 0), which is not balanced.
In addition, we may also need a command to move partitions from one disk to another.
and remove the replication.allow_non_idempotent_write option.
What we need to do:
[ ] flush the private log more frequently than when duplication is not enabled
[ ] make mutation log loading restart from the last global_start_offset when an error (like ERR_INVALID_DATA) is encountered
While building rdsn with clang-3.9 I encountered some errors:
/home/mi/git/release/pegasus/rdsn/include/dsn/tool-api/task.h:590:10: error: 'dsn::aio_task::enqueue' hides overloaded virtual function [-Werror,-Woverloaded-virtual]
void enqueue(error_code err, size_t transferred_size);
^
/home/mi/git/release/pegasus/rdsn/include/dsn/tool-api/task.h:197:18: note: hidden overloaded virtual function 'dsn::task::enqueue' declared here: different number of parameters
(0 vs 2)
virtual void enqueue();
In file included from /home/mi/git/release/pegasus/rdsn/src/dist/block_service/local/local_service.cpp:6:
/home/mi/git/release/pegasus/rdsn/src/dist/block_service/local/local_service.h:82:20: error: private field '_local_service' is not used [-Werror,-Wunused-private-field]
local_service *_local_service;
/home/mi/git/release/pegasus/rdsn/src/core/tests/autoref_ptr_test.cpp:140:9: error: explicitly moving variable of type 'dsn::ref_ptr<SelfAssign>' to itself [-Werror,-Wself-move]
var = std::move(var);
~~~ ^ ~~~
In file included from /home/mi/git/release/pegasus/rdsn/src/dist/replication/meta_server/meta_state_service_utils.cpp:32:
/home/mi/git/release/pegasus/rdsn/src/dist/replication/meta_server/meta_state_service_utils_impl.h:73:1: error: 'operation' defined as a struct here but previously declared as a
class [-Werror,-Wmismatched-tags]
struct operation : pipeline::environment
They do not actually seem to be serious problems.
Problems to solve:
The remote commands related to perf counters are currently scattered across two classes, pegasus_counter_updater and perf_counters; without a code cleanup, a newcomer has no way to discover this.
Let's look at how the related code is scattered:
It is hard for a newcomer to trace the actual call path through this tangle. We need to gather all monitoring-related code under a single path.
Because the original developers were not comfortable with rdsn's task model, they used boost io_service to implement the timers that periodically send requests to the falcon agent and periodically update percentile_value.
These four implementations leave us no real room to tune: apart from pegasus_perf_counter, each of the others has various problems. Keeping all four only makes the code harder to understand.
It is specifically responsible for:
This class has too many responsibilities, which is not good software design.
When defining a perf counter today, the code does not reveal which concrete counter type is used:
::dsn::perf_counter_wrapper _pfc_scan_qps;
From the variable name we can guess that it is some kind of counter, but only by reading the initialization code do we learn that it is of type COUNTER_TYPE_RATE:
_pfc_scan_qps.init_app_counter(
"app.pegasus", buf, COUNTER_TYPE_RATE, "statistic the qps of SCAN request");
It is hard to explain to a newcomer what a "counter rate" type is. One option is to write documentation; another is to adopt the industry-standard naming from the start:
Meters
A meter measures the rate of events over time (e.g., “requests per second”).
Because rdsn was originally a Microsoft project, we have hardly modified the documentation since we open-sourced Xiaomi's internal fork of rdsn last year. xiaomi/rdsn is now far ahead of microsoft/rdsn, which is barely maintained by its original author (the latest commit is from Aug 2017), so most of the documents are stale (e.g. the claims of Windows support).
We need to bring the documentation up to date with the actual current state of the project.
The split action is best performed after the same decree point on both the primary and the secondaries. That way, for any given write request, we can clearly determine whether the parent partition or the child partition is responsible for it.
Without a fixed decree point, the same request might be handled by different partitions on different replicas, which would be messy.
If we make the split action a mutation and run it through the consensus protocol, the replicas can reach agreement on exactly which point the split happens at. In fact this is precisely what a consensus protocol does: it maintains the "replicated state machine" between primary and secondaries, so treating the split as just another state change replicated between them is entirely natural.
For a committed split mutation, we only need to require that it be flushed/durable; then the replica will not replay this mutation on restart.
If the replica group undergoes a configuration change, the mutation will not be replayed either. This is guaranteed by our consensus protocol:
Taking a step back: if every split mutation carries the target partition version, i.e. "this split expects the partition count to go from X to Y", then even if the mutation is applied twice there is seemingly no harm, because when applying it we can check whether the split is still necessary.
Moreover, thanks to the consensus protocol, even if this mutation is replayed, its behavior is identical across the three replicas.
First, like any write mutation, it is prepared; once all replicas have prepared it, the primary can commit.
At commit time, the replica must:
The replica's commit process is a bit like a process fork: fork copies the memory image and modifies the kernel data structures; what we must do is copy the data and adjust the various control behaviors.
The commit process may look cumbersome, but under any split scheme, once a replica splits in two, how each of its control behaviors should change needs careful consideration.
This is also why I like 2PC: splitting at exactly the same decree point lets us reason about and control the behavior of the parent and child before and after the split.
After the primary commits, its split is finished.
The secondary's split process is basically the same as the primary's. But because this is 2PC, when the primary finishes its split, the secondaries have not yet split.
Only when a subsequent mutation is sent out will the secondaries' commit be driven forward. That mutation may be an ordinary write or a group check, but it must be a mutation sent by the parent; if it were sent by the child, it would be rejected because the other replicas do not have that partition yet. This case needs special handling:
Since all replica state changes are linear, we can discuss this precisely:
Imagine a scenario where the learner's decree is before the split while the primary is already past it; in that case a naive learn would leave the learner at a loss when it encounters the split mutation.
Here we can handle it simply: if the partition versions differ, force a "learn from scratch".
Another option is to skip the split control command.
When the meta server issues the split message, it can do so via config sync or via a dedicated command.
When the primary finishes the split, it must notify the meta server that the split is complete; the meta server then updates remote storage, refreshing the partition configurations of both the parent and the child. Before this, the child's partition configuration is empty.
During config sync, if some partitions are missing on a replica server because of a partition version mismatch, they can be ignored for the time being. Since the split mutation went through two-phase commit, it is guaranteed to happen and will not be lost.
This should remain consistent with the previous design.
A rough thought: without a synchronization point, a split would be rather chaotic at prepare time. And if it also followed a "notify the secondaries and wait for replies" flow, that would not differ much from 2PC anyway.
We also need to think through the scenarios where the primary crashes after completing the split and notifying the meta server to update the partition config.
Here's the Weekly Digest for XiaoMi/rdsn:
Last week 2 issues were created.
Of these, 2 issues have been closed and 0 issues are still open.
❤️ #327 fix(backup): delay clearing obsoleted backup when it's still checkpointing, by vagetablechicken
❤️ #326 refactor: remove useless functions from binary_reader, by neverchanje
🔈 #327 fix(backup): delay clearing obsoleted backup when it's still checkpointing, by vagetablechicken
It received 2 comments.
Last week, 6 pull requests were created, updated or merged.
Last week, 1 pull request was updated.
💛 #320 feat(dup): protect private log from missing when duplication is enabled, by neverchanje
Last week, 5 pull requests were merged.
💜 #327 fix(backup): delay clearing obsoleted backup when it's still checkpointing, by vagetablechicken
💜 #326 refactor: remove useless functions from binary_reader, by neverchanje
💜 #324 refactor: move replay related codes to mutation_log_replay, by neverchanje
💜 #323 refactor: move aio tests out from service_api_c, by neverchanje
💜 #321 feat(http): add http interface for get_app_envs, by levy5307
Last week there were 5 commits.
🛠️ refactor: move aio tests out from service_api_c (#323) by neverchanje
🛠️ refactor: remove useless functions from binary_reader (#326) by neverchanje
🛠️ fix(coldbackup): delay clean request when chkpting (#327) by vagetablechicken
🛠️ feat(http): add http interface for get_app_envs (#321) by levy5307
🛠️ refactor: move replay related codes to mutation_log_replay (#324) by neverchanje
Last week there were 3 contributors.
👤 neverchanje
👤 vagetablechicken
👤 levy5307
Last week there were no stargazers.
Last week there were no releases.