Comments (18)
hibari1 - console.log
2013-03-30 03:01:05.421 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 1 (true)
2013-03-30 03:01:06.144 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 1: UpRes ok, 12623 updates
2013-03-30 03:01:15.782 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 3 (true)
2013-03-30 03:01:16.386 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 3: UpRes ok, 10017 updates
2013-03-30 03:01:26.235 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330030126 [{0,4},{282,282},{900824,2357540}]
2013-03-30 03:01:29.376 [info] <0.25906.46>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 213 msec for 1 callers
2013-03-30 03:01:46.622 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 4 (true)
2013-03-30 03:01:49.194 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 4: UpRes ok, 38232 updates
2013-03-30 03:01:54.519 [info] <0.28395.46>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 312 msec for 1 callers
...
2013-03-30 03:11:14.029 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 Scavenger finished:
Options: [destructive,{skip_live_percentage_greater_than,90},{work_dir,"/scav/hibari1_scavenger"}]
Logs deleted without scavenging: 2 (24 bytes)
Logs scavenged: 21
Copied: hunks bytes errs = 763583 781908992 0
Reclaimed bytes = 308037249
...
2013-03-30 04:00:26.236 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330040026 [{1,4},{277,285},{900824,2357540}]
2013-03-30 04:00:57.401 [warning] <0.31010.58>@brick_ets:bigdata_dir_get_val:2922 stability_test_ch3_b3: error {error,enoent} at 1 36210172
2013-03-30 04:00:57.403 [warning] <0.610.0>@brick_ets:sequence_file_is_bad_common:3022 Brick commonLogServer bad sequence file: seq 1 offset 36210172: rename data/brick/hlog.commonLogServer/s/000000000001.HLOG or data/brick/hlog.commonLogServer/2/2/-000000000001.HLOG to data/brick/hlog.commonLogServer/BAD-CHECKSUM/1.at.offset.36210172 failed: enoent
2013-03-30 04:00:57.405 [info] <0.769.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch3_b3: purging keys with sequence 1, size 259805
2013-03-30 04:00:57.407 [info] <0.783.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch1_b1: purging keys with sequence 1, size 368467
2013-03-30 04:00:57.408 [info] <0.758.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch4_b2: purging keys with sequence 1, size 347070
2013-03-30 04:00:57.818 [info] <0.769.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch3_b3: 1 purged keys for log 1
2013-03-30 04:00:57.897 [info] <0.758.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch4_b2: 0 purged keys for log 1
2013-03-30 04:00:57.900 [info] <0.783.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch1_b1: 1 purged keys for log 1
2013-03-30 04:00:58.340 [info] <0.615.0>@brick_mboxmon:handle_info:152 node '[email protected]' down Extra [{nodedown_reason,connection_closed},{node_type,visible}]
2013-03-30 04:00:58.853 [error] <0.690.0>@brick_admin:run_client_monitor_procs:2171 Node '[email protected]' application gdss_client is stopped
2013-03-30 04:00:58.855 [info] <0.530.0> alarm_handler: {set,{{client_node_down,'[email protected]'},"GDSS application is not running on this node."}}
2013-03-30 04:00:59.510 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1307 set_all_chain_roles: stability_test_ch1: top
2013-03-30 04:00:59.510 [info] <0.530.0>@brick_admin_event_h:handle_event:120 handle_event: {set_alarm,{{disk_error,{stability_test_ch1_b1,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.510 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch1_b1,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.511 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1309 set_all_chain_roles: stability_test_ch1: brick read-only 1
2013-03-30 04:00:59.514 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1352 New head {stability_test_ch1_b2,'[email protected]'}: ChainDownSerial 368037 /= ChainDownAcked -4242, reflushing log downstream
2013-03-30 04:00:59.516 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1360 New head {stability_test_ch1_b2,'[email protected]'}: flush was ok
2013-03-30 04:00:59.516 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch1 status degraded belongs to tab stability_test
2013-03-30 04:00:59.528 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.529 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=74
2013-03-30 04:00:59.529 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch1 status degraded belongs to tab stability_test
2013-03-30 04:00:59.542 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.543 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=75
2013-03-30 04:00:59.697 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1307 set_all_chain_roles: stability_test_ch3: top
2013-03-30 04:00:59.697 [info] <0.530.0>@brick_admin_event_h:handle_event:120 handle_event: {set_alarm,{{disk_error,{stability_test_ch3_b3,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.697 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch3_b3,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.697 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1309 set_all_chain_roles: stability_test_ch3: brick read-only 1
2013-03-30 04:00:59.701 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1352 New head {stability_test_ch3_b1,'[email protected]'}: ChainDownSerial 259439 /= ChainDownAcked -4242, reflushing log downstream
2013-03-30 04:00:59.703 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1360 New head {stability_test_ch3_b1,'[email protected]'}: flush was ok
2013-03-30 04:00:59.703 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch3 status degraded belongs to tab stability_test
2013-03-30 04:00:59.715 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.715 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=76
2013-03-30 04:00:59.716 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch3 status degraded belongs to tab stability_test
from hibari.
Basho Bench - console.log.0
I'll update the driver so that it can handle those errors rather than crashing.
2013-03-29 23:07:44.851 [info] <0.184.0>@basho_bench_worker:worker_idle_loop:227 Starting 100.0 ms/req fixed rate worker: <0.184.0>
2013-03-29 23:07:44.851 [info] <0.182.0>@basho_bench_worker:worker_idle_loop:227 Starting 100.0 ms/req fixed rate worker: <0.182.0>
2013-03-30 04:00:57.820 [debug] <0.205.0>@basho_bench_worker:worker_next_op:265 Driver basho_bench_driver_hibari_native crashed: {{case_clause,{error,current_repair_state,disk_error}},[{basho_bench_driver_hibari_native,run,4,[{file,"src/basho_bench_driver_hibari_native.erl"},{line,78}]},{basho_bench_worker,worker_next_op2,2,[{file,"src/basho_bench_worker.erl"},{line,233}]},{basho_bench_worker,worker_next_op,1,[{file,"src/basho_bench_worker.erl"},{line,239}]},{basho_bench_worker,rate_worker_run_loop,2,[{file,"src/basho_bench_worker.erl"},{line,308}]}]}
2013-03-30 04:00:57.820 [debug] <0.215.0>@basho_bench_worker:worker_next_op:265 Driver basho_bench_driver_hibari_native crashed: {{case_clause,{error,current_repair_state,disk_error}},[{basho_bench_driver_hibari_native,run,4,[{file,"src/basho_bench_driver_hibari_native.erl"},{line,78}]},{basho_bench_worker,worker_next_op2,2,[{file,"src/basho_bench_worker.erl"},{line,233}]},{basho_bench_worker,worker_next_op,1,[{file,"src/basho_bench_worker.erl"},{line,239}]},{basho_bench_worker,rate_worker_run_loop,2,[{file,"src/basho_bench_worker.erl"},{line,308}]}]}
2013-03-30 04:00:57.820 [error] <0.204.0>@basho_bench_worker:handle_info:149 Worker <0.205.0> exited with crash
2013-03-30 04:00:57.820 [error] <0.214.0>@basho_bench_worker:handle_info:149 Worker <0.215.0> exited with crash
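A minimal sketch of the driver change I have in mind, based on the crash report above. The clause shape, the record field, and the brick_simple call here are assumptions about the driver's internals, not the actual code:

```erlang
%% Hypothetical fix in basho_bench_driver_hibari_native:run/4.
%% Today the reply {error,current_repair_state,disk_error} falls through the
%% case expression at line 78 and kills the worker with a case_clause error.
%% Matching it explicitly and returning {error, Reason, State} lets
%% basho_bench count the operation as failed and keep the worker running.
run(get, KeyGen, _ValueGen, State) ->
    Key = KeyGen(),
    case brick_simple:get(State#state.table, Key) of
        {ok, _TS, _Value} ->
            {ok, State};
        key_not_exist ->
            {ok, State};
        {error, current_repair_state, _Why} = Reason ->
            %% Brick is mid-repair after a disk error; report, don't crash.
            {error, Reason, State};
        Other ->
            {error, Other, State}
    end.
```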
from hibari.
SmartOS machine info
- Solaris Zone-based SmartOS machine at Joyent Cloud
- Machine type: smartmachine medium 4GB RAM, 4 vCPUs
- Image: base 1.9.1
- Pkgsrc: http://pkgsrc.joyent.com/packages/SmartOS/2012Q4/i386/All
# uname -a
SunOS 4f80810f-1aa5-4df3-9043-a625c327094d.local 5.11 joyent_20130111T180733Z i86pc i386 i86pc Solaris
# erl
Erlang R15B02 (erts-5.9.2) [source] [smp:24:24] [async-threads:0] [kernel-poll:false] [dtrace]
Eshell V5.9.2 (abort with ^G)
1>
User switch command
--> q
# sm-summary
* Gathering SmartMachine summary..
SM UUID ***
SM ID ***
Hostname ***
SmartOS build joyent_20130111T180733Z
Image base 1.9.1
Base Image NA
Documentation http://wiki.joyent.com/jpc2/SmartMachine+Base
Pkgsrc http://pkgsrc.joyent.com/packages/SmartOS/2012Q4/i386/All
Processes 81
Memory (RSS) Cap 4096M
Memory (RSS) Used 260M
Memory (RSS) Free 3836M
Swap Cap 8192M
Swap Used 125M
/tmp Used 4.0K
Disk Quota 121G
% Disk Used 8%
# find /data/hibari1_data
/data/hibari1_data
/data/hibari1_data/brick
/data/hibari1_data/brick/hlog.stability_test_ch3_b3
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/sequence_number
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/bad-sequence
/data/hibari1_data/brick/hlog.bootstrap_copy1
/data/hibari1_data/brick/hlog.bootstrap_copy1/s
/data/hibari1_data/brick/hlog.bootstrap_copy1/s/000000000001.HLOG
/data/hibari1_data/brick/hlog.bootstrap_copy1/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.bootstrap_copy1/sequence_number
/data/hibari1_data/brick/hlog.bootstrap_copy1/bad-sequence
/data/hibari1_data/brick/hlog.stability_test_ch4_b2
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/sequence_number
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch3_b1
/data/hibari1_data/brick/hlog.tab1_ch3_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch3_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch3_b1/s
/data/hibari1_data/brick/hlog.tab1_ch2_b1
/data/hibari1_data/brick/hlog.tab1_ch2_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch2_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch2_b1/s
/data/hibari1_data/brick/hlog.commonLogServer
/data/hibari1_data/brick/hlog.commonLogServer/flush
/data/hibari1_data/brick/hlog.commonLogServer/1
/data/hibari1_data/brick/hlog.commonLogServer/1/3
/data/hibari1_data/brick/hlog.commonLogServer/1/3/-000000000030.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/4
/data/hibari1_data/brick/hlog.commonLogServer/1/4/-000000000024.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/7
/data/hibari1_data/brick/hlog.commonLogServer/1/7/-000000000027.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/5
/data/hibari1_data/brick/hlog.commonLogServer/1/2
/data/hibari1_data/brick/hlog.commonLogServer/1/2/-000000000036.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/6
/data/hibari1_data/brick/hlog.commonLogServer/1/6/-000000000033.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/1
/data/hibari1_data/brick/hlog.commonLogServer/BAD-CHECKSUM
/data/hibari1_data/brick/hlog.commonLogServer/2
/data/hibari1_data/brick/hlog.commonLogServer/2/6
/data/hibari1_data/brick/hlog.commonLogServer/2/1
/data/hibari1_data/brick/hlog.commonLogServer/2/1/-000000000028.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/5
/data/hibari1_data/brick/hlog.commonLogServer/2/5/-000000000025.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/2
/data/hibari1_data/brick/hlog.commonLogServer/2/7
/data/hibari1_data/brick/hlog.commonLogServer/2/7/-000000000034.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/3
/data/hibari1_data/brick/hlog.commonLogServer/2/3/-000000000037.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/4
/data/hibari1_data/brick/hlog.commonLogServer/2/4/-000000000031.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/s
/data/hibari1_data/brick/hlog.commonLogServer/s/000000000039.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/s/Config
/data/hibari1_data/brick/hlog.commonLogServer/3
/data/hibari1_data/brick/hlog.commonLogServer/3/1
/data/hibari1_data/brick/hlog.commonLogServer/3/1/-000000000035.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/6
/data/hibari1_data/brick/hlog.commonLogServer/3/6/-000000000026.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/2
/data/hibari1_data/brick/hlog.commonLogServer/3/2/-000000000029.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/5
/data/hibari1_data/brick/hlog.commonLogServer/3/5/-000000000032.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/7
/data/hibari1_data/brick/hlog.commonLogServer/3/4
/data/hibari1_data/brick/hlog.commonLogServer/3/4/-000000000038.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/3
/data/hibari1_data/brick/hlog.commonLogServer/register
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch3_b3
/data/hibari1_data/brick/hlog.commonLogServer/register/bootstrap_copy1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch3_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch2_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch1_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch1_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch4_b2
/data/hibari1_data/brick/hlog.stability_test_ch1_b1
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/bad-sequence
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch1_b1
/data/hibari1_data/brick/hlog.tab1_ch1_b1/s
/data/hibari1_data/brick/hlog.tab1_ch1_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch1_b1/sequence_number
# df -h
Filesystem       Size  Used Avail Use% Mounted on
zones/**** 121G 9.1G 112G 8% /
/lib 269M 246M 24M 92% /lib
/lib/svc/manifest 2.3T 567K 2.3T 1% /lib/svc/manifest
/lib/svc/manifest/site 121G 9.1G 112G 8% /lib/svc/manifest/site
/sbin 269M 246M 24M 92% /sbin
/usr 377M 353M 24M 94% /usr
/usr/ccs 121G 9.1G 112G 8% /usr/ccs
/usr/local 121G 9.1G 112G 8% /usr/local
swap 8.0G 86M 8.0G 2% /etc/svc/volatile
/usr/lib/libc/libc_hwcap1.so.1 377M 353M 24M 94% /lib/libc.so.1
swap 8.0G 86M 8.0G 2% /tmp
swap 8.0G 86M 8.0G 2% /var/run
from hibari.
Some thoughts
- Hibari 0.3.0 is going to introduce a new client API, brick_simple:rename/6, which internally uses ?KEY_SWITCHAROO. There may be a chance that the scavenger is not aware of this operation.
- OR, the scavenger silently failed to update the log file locations of those 2 purged keys under high workload.
- I had run the same stability tests on the following machines before trying Joyent Cloud:
  - Ran 2 times and found no issue on a Zone-based SmartOS machine (2GB RAM, 8 vCPUs) on my Mac mini (Late 2012, Core i7 at 2.6GHz) + VMware Fusion. Pkgsrc is 2012Q4/X86_64, not 2012Q4/i386.
  - Ran 1 time and found no issue on a CentOS 6.4 virtual machine (6GB RAM, 4 vCPUs) on my Mac mini + VMware Fusion.
By the way, I wanted to try Joyent Cloud just to play with DTrace-based sub-second offset heat maps, which are cool: http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/
from hibari.
This may not be related to this GitHub issue, but I'm curious whether having smp:24:24 is OK while the number of vCPUs in the Solaris Zone is capped to 4.
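If it matters, the scheduler count can be pinned explicitly rather than letting the VM size it from the 24 host CPUs; +S is a standard erl flag that can go in vm.args (whether this interacts with the issue at all is only a guess):

```
## vm.args fragment: cap Erlang schedulers to match the zone's 4-vCPU cap
+S 4:4
```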
Hibari - erlang.log.1
=====
===== LOGGING STARTED Fri Mar 29 23:06:25 CST 2013
=====
Exec: /opt/hibari/hibari1/hibari/erts-5.9.2/bin/erlexec -boot /opt/hibari/hibari1/hibari/releases/0.3.0/hibari -mode embedded -config /opt/hibari/hibari1/hibari/releases/0.3.0/sys.config -args_file /opt/hibari/hibari1/hibari/releases/0.3.0/vm.args -- console
Root: /opt/hibari/hibari1/hibari
Erlang R15B02 (erts-5.9.2) [source] [smp:24:24] [async-threads:64] [kernel-poll:true] [dtrace]
from hibari.
Still investigating this issue; I have updated gdss-brick to make the logging more informative.
Commit: hibari/gdss-brick@4dc196e
hibari >> GH33 (bad sequence file) related:
- Revise and add log messages to record purged keys in console.log.
- Raise the logging level from [info] to [notice].
log messages - before
[warning] ... tab_ch3_b3: error {error,enoent} at 1 36210172
[info] ... tab_ch3_b3: purging keys with sequence 1, size 259805
[info] ... tab_ch3_b3: N purged keys for log 1
log messages - after
[warning] ... tab_ch3_b3: read error {error,enoent} at seq 1 offset 36210172
for the value of key <<"...">>
[notice] ... tab_ch3_b3: purging keys with sequence 1, size 259805
[notice] ... tab_ch3_b3: purged a key from log. key <<"...">>
...
[notice] ... tab_ch3_b3: purged a key from shadowtab. key <<"...">>
...
[notice] ... tab_ch3_b3: purged N keys for log 1
from hibari.
I rebuilt the Hibari 0.3.0 RC with the above logging change, deleted the table, and restarted the same 8-hour stability test on the same smartmachine at Joyent Cloud. Hopefully this will reproduce the issue.
from hibari.
It did reproduce the same issue, on the same node [email protected]. That is the only node running the admin server.
Replying to an earlier comment:
- I had run the same stability tests on the following machines before trying Joyent Cloud:
  - Ran 2 times and found no issue on a Zone-based SmartOS machine (2GB RAM, 8 vCPUs) on my Mac mini (Late 2012, Core i7 at 2.6GHz) + VMware Fusion. Pkgsrc is 2012Q4/X86_64, not 2012Q4/i386.
  - Ran 1 time and found no issue on a CentOS 6.4 virtual machine (6GB RAM, 4 vCPUs) on my Mac mini + VMware Fusion.
I actually selected the 32-bit version of SmartOS on Joyent Cloud by mistake. I used the 64-bit versions of SmartOS and CentOS for my local environments described above. Perhaps this issue is related to the 32-bit Erlang VM? Maybe only on the illumos kernel?
To confirm this, I'm running the same test on the following machines now:
- 32-bit SmartOS on Joyent Cloud
- 64-bit SmartOS on Joyent Cloud
- 64-bit SmartOS on my Mac mini + VMware Fusion
Note that the Erlang/OTP version is R15B02 for SmartOS and R15B03-1 for CentOS.
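Incidentally, an easy way to confirm whether a node is running a 32-bit or a 64-bit VM is to ask the shell for the native word size; erlang:system_info(wordsize) returns the word size in bytes:

```erlang
%% Run in the Erlang shell on each node:
%% a 32-bit VM reports 4, a 64-bit VM reports 8.
erlang:system_info(wordsize).
```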
from hibari.
- 32-bit SmartOS on Joyent Cloud
- 64-bit SmartOS on Joyent Cloud
- 64-bit SmartOS on my Mac mini + VMware Fusion
Unfortunately, I was able to reproduce the same issue on #1 and #3. I wonder why my earlier attempts on #3 didn't fail. I don't have a record, but I think I ran the scavenger only on the first node. (Today, I'm running the scavenger on all 4 nodes at the same time.) Or maybe it happens more often on the 32-bit version than on the 64-bit version; #2 is still running without the issue.
- For #1 (32-bit), it again happened on [email protected], which had the admin server running.
- For #3 (64-bit), it happened on [email protected].
Log from #3
2013-03-30 19:58:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330195823 [{2,3},{197,201},{1113712,1113712}]
2013-03-30 19:59:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330195923 [{1,3},{199,201},{1113712,1113712}]
2013-03-30 20:00:02.354 [info] <0.610.0>@gmt_hlog_common:schedule_next_daily_scavenger:1496 Scheduling next scavenger 86398 seconds from now.
2013-03-30 20:00:02.354 [warning] <0.610.0>@gmt_hlog_common:handle_info:348 async writeback proc <0.7652.27> hasn't finished yet
2013-03-30 20:00:02.356 [info] <0.7651.27>@brick_ets:make_info_log_fun:3561 SCAV: start with [stability_test_ch2_b3,stability_test_ch4_b1,stability_test_ch3_b2]
2013-03-30 20:00:02.358 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 Scavenger <0.7654.27> starting now.
2013-03-30 20:00:04.017 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,52},{old_heap_block_size,514229},{heap_block_size,121393},{mbuf_size,0},{stack_size,72},{old_heap_size,441054},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{brick_ets,storetuple_flags,1}}]
2013-03-30 20:00:06.032 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,54},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,94604},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:06.329 [info] <0.8774.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 297 msec for 1 callers
2013-03-30 20:00:07.486 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,68},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,94614},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:08.174 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.7654.27>,long_gc,info_filtered}
2013-03-30 20:00:10.275 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,293923},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{brick_ets,'-prepend_rs/3-lc$^0/1-0-',3}}]
2013-03-30 20:00:13.994 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,62},{old_heap_size,59717},{heap_size,35049}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{dict,get_slot,2}}]
2013-03-30 20:00:16.101 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,52},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,76},{old_heap_size,0},{heap_size,96281}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:16.466 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,62},{old_heap_size,59717},{heap_size,35056}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{dict,store,3}}]
2013-03-30 20:00:16.746 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 1
2013-03-30 20:00:16.924 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 3
2013-03-30 20:00:17.460 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,84},{old_heap_block_size,0},{heap_block_size,832040},{mbuf_size,0},{stack_size,55},{old_heap_size,0},{heap_size,460242}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_bin,2}}]
2013-03-30 20:00:18.188 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,354},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,50},{old_heap_size,639591},{heap_size,189660}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,fun_loop,6}}]
2013-03-30 20:00:19.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 2 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:19.200 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,70},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,48},{old_heap_size,1561208},{heap_size,370568}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_term,2}}]
2013-03-30 20:00:20.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:20.446 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,135},{old_heap_block_size,514229},{heap_block_size,196418},{mbuf_size,0},{stack_size,50},{old_heap_size,205485},{heap_size,51143}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,monitor_request,2}}]
2013-03-30 20:00:21.182 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 3 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:21.186 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,147},{old_heap_block_size,0},{heap_block_size,1346269},{mbuf_size,0},{stack_size,55},{old_heap_size,0},{heap_size,914833}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_bin,2}}]
2013-03-30 20:00:22.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:22.251 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,218},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,48},{old_heap_size,1392494},{heap_size,514196}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_term,2}}]
2013-03-30 20:00:22.810 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12838.27>,long_gc,[{timeout,89},{old_heap_block_size,832040},{heap_block_size,514229},{mbuf_size,0},{stack_size,23},{old_heap_size,318069},{heap_size,196134}]} pinfo: [{initial_call,{proc_lib,init_p,5}},{current_function,{disk_log_1,logl,3}}]
2013-03-30 20:00:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200023 [{1,3},{201,201},{1113712,1113712}]
2013-03-30 20:00:23.565 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,75},{old_heap_block_size,0},{heap_block_size,1346269},{mbuf_size,0},{stack_size,51},{old_heap_size,0},{heap_size,681155}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,monitor_request,2}}]
2013-03-30 20:00:24.174 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:24.254 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,51},{old_heap_block_size,0},{heap_block_size,2103540},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,1403643}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,merge3_12_3,6}}]
2013-03-30 20:00:25.629 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,353},{old_heap_block_size,0},{heap_block_size,1682835},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,1417808}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,rmerge3_1,6}}]
2013-03-30 20:00:26.179 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:26.546 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,171},{old_heap_block_size,2629425},{heap_block_size,832040},{mbuf_size,0},{stack_size,27763},{old_heap_size,725412},{heap_size,234623}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,terms2bins,1}}]
2013-03-30 20:00:27.182 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:27.664 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,76},{old_heap_block_size,2629425},{heap_block_size,514229},{mbuf_size,0},{stack_size,42},{old_heap_size,1331163},{heap_size,78432}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,mergel,2}}]
2013-03-30 20:00:28.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 2 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:30.196 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,121},{old_heap_block_size,0},{heap_block_size,2629425},{mbuf_size,0},{stack_size,58521},{old_heap_size,0},{heap_size,704062}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,terms2bins,1}}]
2013-03-30 20:00:30.421 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.13593.27>,long_gc,[{timeout,53},{old_heap_block_size,987},{heap_block_size,317811},{mbuf_size,0},{stack_size,2},{old_heap_size,304},{heap_size,312279}]} pinfo: [{initial_call,{proc_lib,init_p,5}},{current_function,{disk_log,log_end,4}}]
2013-03-30 20:00:34.318 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 4: external child proc, sizes 65 and 98
2013-03-30 20:00:34.319 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 5
2013-03-30 20:00:34.324 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 1 percent 75.8263236019884
2013-03-30 20:00:35.332 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 3 percent 75.91776818694566
2013-03-30 20:00:35.332 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 4 percent 75.94176434380742
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 5 percent 76.15638196194216
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 6 percent 76.25145809103331
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 7 percent 76.29444069407096
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 8 percent 76.47620704197605
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 9 percent 76.54912658858676
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 10 percent 76.68048296028171
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 11 percent 76.77855474674008
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 12 percent 76.88504257958091
2013-03-30 20:00:35.335 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 6
2013-03-30 20:00:35.337 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 7
2013-03-30 20:00:37.691 [info] <0.14180.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 390 msec for 1 callers
2013-03-30 20:00:37.740 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 89 ms
2013-03-30 20:00:37.921 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 59 ms
2013-03-30 20:00:38.241 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 80 ms
2013-03-30 20:00:38.341 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 80 ms
2013-03-30 20:00:39.008 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 556 ms
2013-03-30 20:00:39.035 [info] <0.14204.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 654 msec for 1 callers
2013-03-30 20:00:39.375 [info] <0.14228.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 205 msec for 1 callers
2013-03-30 20:00:55.542 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 1 (true)
2013-03-30 20:00:56.777 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 1: UpRes ok, 38825 updates
2013-03-30 20:01:15.154 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 3 (true)
2013-03-30 20:01:17.499 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 3: UpRes ok, 38902 updates
2013-03-30 20:01:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200123 [{1,3},{206,206},{2914352,2914352}]
2013-03-30 20:01:36.159 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 4 (true)
2013-03-30 20:01:38.169 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 4: UpRes ok, 38895 updates
2013-03-30 20:01:56.828 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 5 (true)
2013-03-30 20:01:58.315 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 5: UpRes ok, 39020 updates
2013-03-30 20:02:16.874 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 147 ms
2013-03-30 20:02:17.186 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 6 (true)
2013-03-30 20:02:18.744 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 6: UpRes ok, 39046 updates
2013-03-30 20:02:22.096 [info] <0.22524.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 3351 msec for 1 callers
2013-03-30 20:02:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200223 [{1,3},{207,207},{1113712,2914352}]
2013-03-30 20:02:40.995 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 7 (true)
2013-03-30 20:02:42.385 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 7: UpRes ok, 39065 updates
2013-03-30 20:03:01.528 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 8 (true)
2013-03-30 20:03:02.753 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 8: UpRes ok, 39188 updates
2013-03-30 20:03:09.928 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 65 ms
2013-03-30 20:03:10.962 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 176 ms
2013-03-30 20:03:21.257 [info] <0.27351.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 241 msec for 1 callers
2013-03-30 20:03:21.446 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 9 (true)
2013-03-30 20:03:22.895 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 9: UpRes ok, 39232 updates
2013-03-30 20:03:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200323 [{0,3},{207,207},{1113712,2914352}]
2013-03-30 20:03:40.726 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 10 (true)
2013-03-30 20:03:42.312 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 10: UpRes ok, 39267 updates
2013-03-30 20:03:59.104 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 62 ms
2013-03-30 20:04:00.083 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 11 (true)
2013-03-30 20:04:01.613 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 11: UpRes ok, 39334 updates
2013-03-30 20:04:02.303 [info] <0.30860.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 446 msec for 1 callers
2013-03-30 20:04:21.356 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 12 (true)
2013-03-30 20:04:21.751 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200423 [{1,3},{205,207},{1114896,2914352}]
2013-03-30 20:04:22.715 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 12: UpRes ok, 39366 updates
2013-03-30 20:04:22.716 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 10
2013-03-30 20:04:22.716 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 Scavenger finished:
Options: [destructive,{skip_live_percentage_greater_than,90},{work_dir,"/scav/hibari4_scavenger"}]
Logs deleted without scavenging: 1 (12 bytes)
Logs scavenged: 11
Copied: hunks bytes errs = 430140 440463360 0
Reclaimed bytes = 136500433
2013-03-30 20:05:22.357 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200523 [{1,3},{199,207},{1572376,2914352}]
2013-03-30 20:06:22.970 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200623 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:07:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:08:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:09:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:10:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201023 [{1,3},{203,207},{1113712,2914352}]
2013-03-30 20:11:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:12:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201223 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:13:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:14:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:15:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201523 [{0,3},{202,207},{1113712,2914352}]
2013-03-30 20:16:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:17:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:18:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:19:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:20:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:21:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:22:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202223 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:23:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:24:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:25:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202523 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:26:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202623 [{1,3},{198,207},{1113712,2914352}]
2013-03-30 20:27:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:28:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202823 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:29:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202923 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:30:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203023 [{0,3},{202,207},{1113712,2914352}]
2013-03-30 20:31:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:32:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:33:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:34:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:35:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:36:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:36:48.518 [info] <0.27121.32>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 1113 msec for 1 callers
2013-03-30 20:37:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:38:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203823 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:39:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:40:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204023 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:41:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:42:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204223 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:43:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:44:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204423 [{0,3},{203,207},{1113712,2914352}]
2013-03-30 20:45:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204523 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:46:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:47:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:48:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:49:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:50:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:51:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:52:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:53:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205323 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:54:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:55:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:56:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:57:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:58:12.566 [info] <0.1358.36>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 453 msec for 1 callers
2013-03-30 20:58:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205823 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:59:15.233 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 79 ms
2013-03-30 20:59:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205923 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:00:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:01:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210123 [{1,3},{198,207},{1113712,2914352}]
2013-03-30 21:02:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 21:03:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210323 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:04:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210423 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:05:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 21:05:48.045 [warning] <0.6250.37>@brick_ets:bigdata_dir_get_val:2925 stability_test_ch2_b3: read error {error,enoent} at seq 12 offset 12913308 the stored value of key <<51,51,47,0,11,75,39>>
2013-03-30 21:05:48.046 [warning] <0.610.0>@brick_ets:sequence_file_is_bad_common:3025 Brick commonLogServer bad sequence file: seq 12 offset 12913308: rename data/brick/hlog.commonLogServer/s/000000000012.HLOG or data/brick/hlog.commonLogServer/1/6/-000000000012.HLOG to data/brick/hlog.commonLogServer/BAD-CHECKSUM/12.at.offset.12913308 failed: enoent
2013-03-30 21:05:48.048 [notice] <0.744.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch2_b3: purging keys with sequence 12, size 92306
2013-03-30 21:05:48.049 [notice] <0.716.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch4_b1: purging keys with sequence 12, size 294398
2013-03-30 21:05:48.051 [notice] <0.730.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch3_b2: purging keys with sequence 12, size 220454
2013-03-30 21:05:48.091 [notice] <0.744.0>@brick_ets:purge_rec_from_log:2029 stability_test_ch2_b3: purged a key from log. key <<51,51,47,0,11,75,39>>
2013-03-30 21:05:48.127 [notice] <0.744.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch2_b3: purged 1 keys for log 12
2013-03-30 21:05:48.151 [notice] <0.730.0>@brick_ets:purge_rec_from_log:2029 stability_test_ch3_b2: purged a key from log. key <<52,48,47,0,1,140,130>>
2013-03-30 21:05:48.247 [notice] <0.730.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch3_b2: purged 1 keys for log 12
2013-03-30 21:05:48.324 [notice] <0.716.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch4_b1: purged 0 keys for log 12
2013-03-30 21:05:48.708 [info] <0.615.0>@brick_mboxmon:handle_info:152 node '[email protected]' down Extra [{nodedown_reason,connection_closed},{node_type,visible}]
2013-03-30 21:05:50.402 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch3_b2,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 21:05:50.487 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch2_b3,'[email protected]'}},"Administrator intervention is required."}}
2013-03-30 21:05:50.626 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=38
2013-03-30 21:05:50.640 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=39
2013-03-30 21:05:50.717 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=40
2013-03-30 21:05:50.730 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=41
2013-03-30 21:06:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210623 [{1,3},{166,207},{1113712,2914352}]
2013-03-30 21:07:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210723 [{1,3},{139,207},{1113712,2914352}]
hibari-smartos1% sm-summary
* Gathering SmartMachine summary..
SM UUID 78c849af-be16-4463-a90a-4e85bbdef5eb
SM ID 1
Hostname hibari-smartos1
SmartOS build joyent_20130321T213641Z
Image base64 1.9.1
Base Image NA
Documentation http://wiki.joyent.com/jpc2/SmartMachine+Base
Pkgsrc http://pkgsrc.joyent.com/packages/SmartOS/2012Q4/x86_64/All
Processes 71
Memory (RSS) Cap 2048M
Memory (RSS) Used 1160M
Memory (RSS) Free 888M
Swap Cap 2048M
Swap Used 1113M
/tmp Used 16K
Disk Quota 21G
% Disk Used 24%
hibari-smartos1% cat /opt/hibari/hibari1/hibari/log/erlang.log.1
=====
===== LOGGING STARTED Sat Mar 30 16:54:16 CST 2013
=====
Exec: /opt/hibari/hibari1/hibari/erts-5.9.2/bin/erlexec -boot /opt/hibari/hibari1/hibari/releases/0.3.0/hibari -mode embedded -config /opt/hibari/hibari1/hibari/releases/0.3.0/sys.config -args_file /opt/hibari/hibari1/hibari/releases/0.3.0/vm.args -- console
Root: /opt/hibari/hibari1/hibari
Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:8:8] [async-threads:64] [kernel-poll:true] [dtrace]
hibari-smartos1% prstat -Z
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
41322 hibari 94M 79M sleep 59 0 0:10:15 1.5% beam.smp/30
40844 hibari 67M 51M sleep 59 0 0:06:10 0.8% beam.smp/93
40881 hibari 62M 45M sleep 58 0 0:04:25 0.6% beam.smp/94
40920 hibari 56M 42M sleep 55 0 0:04:02 0.6% beam.smp/94
40960 hibari 58M 42M sleep 51 0 0:03:44 0.5% beam.smp/93
41466 root 4472K 3728K cpu22 59 0 0:00:01 0.0% prstat/1
40842 hibari 4084K 2064K sleep 59 0 0:00:00 0.0% run_erl/1
40982 hibari 3972K 2548K sleep 1 0 0:00:00 0.0% inet_gethost/1
40964 hibari 3972K 2548K sleep 55 0 0:00:00 0.0% inet_gethost/1
40959 hibari 4084K 2064K sleep 56 0 0:00:00 0.0% run_erl/1
41222 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
76138 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
65581 root 2476K 1572K sleep 59 0 0:00:00 0.0% pfexecd/3
65719 root 6612K 3292K sleep 59 0 0:00:00 0.0% inetd/3
76187 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
40928 hibari 4036K 2000K sleep 59 0 0:00:00 0.0% inet_gethost/1
76134 hibari 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
41230 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
77086 hibari 5048K 3436K sleep 59 0 0:00:00 0.0% ksh93/1
76194 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
65705 root 1864K 1296K sleep 59 0 0:00:00 0.0% cron/1
40879 hibari 4084K 2064K sleep 59 0 0:00:00 0.0% run_erl/1
76863 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
41190 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
65538 netadm 4568K 3276K sleep 59 0 0:00:00 0.0% ipmgmtd/3
76050 hibari 3332K 2452K sleep 59 0 0:00:00 0.0% bash/1
75788 root 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
65717 root 2160K 1312K sleep 59 0 0:00:00 0.0% sac/1
75813 root 3340K 2456K sleep 59 0 0:00:00 0.0% bash/1
76145 hibari 3332K 2452K sleep 59 0 0:00:00 0.0% bash/1
41343 hibari 4036K 2000K sleep 1 0 0:00:00 0.0% inet_gethost/1
75787 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
77066 hibari 6700K 4932K sleep 1 0 0:00:09 0.0% tmux/1
65456 root 13M 12M sleep 59 0 0:00:05 0.0% svc.configd/14
77099 hibari 3340K 2460K sleep 59 0 0:00:00 0.0% bash/1
76030 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
76040 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
76795 hibari 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
65722 root 1624K 944K sleep 59 0 0:00:00 0.0% utmpd/1
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
139 71 646M 456M 0.4% 0:28:57 3.9% 4f80810f-1aa5-4df3-9043-a625c327094d
Total: 71 processes, 540 lwps, load averages: 0.94, 0.91, 0.82
from hibari.
Approach 2. Track rename operations during the scavenger process. While the scavenger is running, record rename info (key: OldKey, value: NewKey) in a new ETS table (e.g. key: <<"a/1">>, value: <<"a/2">>). If the scavenger fails step 3 for key <<"a/1">>, it reads the ETS table to get key <<"a/2">> and retries step 3. This seems the most efficient approach, as all the additional steps are in-memory operations. However, there is a chance that the scavenger will never catch up with rename operations if <<"a/2">> is renamed to <<"a/3">>, then to <<"a/4">>, and so on very quickly.
Thinking about two cases:
1. the brick server crashes during scavenging
2. suspending/resuming the scavenger
1 should be okay because the scavenger will be restarted from the first step after the brick server recovers. 2 will be a bit problematic; the brick server has to keep recording rename operations to the ETS table while the scavenger is suspended.
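A minimal sketch of this rename-tracking idea, modeling the ETS table with a plain dict. All names here (RenameTracker, scavenge_update, MAX_RETRIES) are hypothetical, not Hibari APIs:

```python
MAX_RETRIES = 8  # bound the chase along a rename chain (hypothetical limit)

class RenameTracker:
    """Records OldKey -> NewKey while the scavenger runs (stands in for the ETS table)."""
    def __init__(self):
        self.renames = {}

    def record(self, old_key, new_key):
        self.renames[old_key] = new_key

    def follow(self, key):
        """Return the key's current name, or None if it was never renamed."""
        return self.renames.get(key)

def scavenge_update(store, tracker, key, new_location):
    """Try to update `key`'s hlog location; on key_not_exist, chase renames."""
    for _ in range(MAX_RETRIES):
        if key in store:                    # step 3: update the location
            store[key] = new_location
            return True
        renamed_to = tracker.follow(key)
        if renamed_to is None:
            return False                    # genuinely deleted; safe to ignore
        key = renamed_to                    # retry step 3 on the new key
    return False                            # never caught up; caller must keep the old hlog
```

The bounded loop is where the "never catch up" risk shows: if keys keep getting renamed faster than the loop can follow, the function gives up and the old hlog must be kept.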
Perhaps a tombstone-like approach on the old key would be more robust? I'll explain it in the next comment.
from hibari.
Perhaps a tombstone-like approach on the old key would be more robust? I'll explain it in the next comment.
Approach 4. Tombstone-like moved marker on the old key.
Currently, the rename/6 operation runs delete/4 with testset on the old key at the end. Instead, this approach makes rename/6 run set/6 with testset and a custom property (and empty data) on the old key to indicate that this key has been renamed to the new key. This write will be recorded to the write-ahead log just like other write operations, and rename/6 can take this behavior regardless of whether the scavenger is running or suspended, to be sure there are no race conditions.
All read (get/6 and get_many/4) and write (set/6 etc.) operations should be aware of this property and handle the old key as deleted (as should the brick repair process). The scavenger will use the custom property to track rename operations, and it runs delete/4 with testset in the last step of scavenging.
This approach will address the "suspending/resuming the scavenger" issue but still has the "will never catch up to rename operations" issue, so the scavenger should have a maximum retry count for the attempts.
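A toy model of this moved-marker scheme. The marker property, record shapes, and function names are all illustrative, not gdss-brick's actual API:

```python
MOVED = "moved_to"  # custom property standing in for the marker written by rename/6

def rename(store, old_key, new_key):
    """rename/6 analogue: instead of deleting old_key, overwrite it with a marker."""
    value = store[old_key]
    store[new_key] = value
    store[old_key] = {MOVED: new_key}       # empty data + marker; goes through the WAL

def get(store, key):
    """Readers must treat a marked key as deleted."""
    value = store.get(key)
    if value is None or (isinstance(value, dict) and MOVED in value):
        return None
    return value

def scavenger_pass(store):
    """Last scavenging step: use the markers to track renames, then delete them."""
    moved = {k: v[MOVED] for k, v in store.items()
             if isinstance(v, dict) and MOVED in v}
    for old_key in moved:
        del store[old_key]                  # delete/4-with-testset analogue
    return moved                            # OldKey -> NewKey map, for location fix-ups
```

Because the marker is a normal write, it survives a brick restart in the write-ahead log, which is what makes this approach robust against suspend/resume.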
from hibari.
Getting ready for fix; refactoring etc.
gdss-admin (on a topic branch) - commit hibari/gdss-admin@512bc92
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Update brick_admin:fast_sync_scav/6 to use the renamed bottom_fun field
  of #scav{}
gdss-brick (on a topic branch) - commit hibari/gdss-brick@90b3a70
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Temporary changes
  - For debugging, temporarily make the scavenger run every two
    minutes
- Refactoring the scavenger code
  - Break down the gigantic gmt_hlog_commonlog:scavenger_commonlog/1
    function into five smaller functions
  - Rename gmt_hlog_common:scavenger_commonlog_bottom10/7 to
    scavenger_commonlog_bottom/2
  - Update #scav{}; add dead_paths, dead_seq_bytes, and live_seq_bytes
    fields, and rename the phase10_fun field to bottom_fun
  - Remove parameters like TmpPath and Finfolog from various scavenger
    functions because they are also in the first parameter #scav{}
  - Change variable names, for example, from the cryptic Tmp4List to
    LiveHunkSizesOrderBySeq
  - Delete debugging-only code, e.g. os:cmd("find -name ...") and
    writing a file with debug info. Instead, add some debug level
    logging
  - brick_ets module: Remove the unused functions make_info_log_fun/1
    and scavenge_one_seq_file_fun/4
  - Change some of "==" and "/=" to "=:=" and "=/="
  - Change some of "if X == true" and "if X == false" to "if X" and
    "if not X" when X is guaranteed to be of boolean type
- Enhancements to the scavenging process
  - Move the step that gets all eligible sequence numbers to the very
    top, so that the scavenger can quit when there are no such
    sequences
  - Reorganize the contents of the scavenger resume file. It's no
    longer compatible with earlier versions
- Add more info and debug level logs
  - Add more info level logs to each step of the scavenging process
  - Add info-level logs to the gmt_hlog module: create_new_log_file/2,
    safe_log_close_short/1 and safe_log_close_long/2
  - Add ?WAIT_BEFORE_EXIT to ensure the error log is written before
    exit
  - Use ?E_INFO(Fmt, Args) instead of the logger from
    brick_ets:make_info_log_fun/1
  - Change ?E_DGB(Cat, Fmt, Args) to ?E_DBG(Fmt, Args)
from hibari.
Getting ready for fix; refactoring etc.
gdss-admin (on a topic branch) - commit hibari/gdss-admin@0f2c65f
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Update brick_admin:fast_sync_scav/6 not to set log_fun in #scav{}
record. (That function is not used in gdss-admin anyway)
gdss-brick (on a topic branch) - commit hibari/gdss-brick@8cf9850
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Refactoring hlog, hlog_common, hlog_local, and brick_ets modules:
  - Update #scav{} in brick.hrl: remove the log_fun field, add a type
    spec on each field
  - Reorganize the hlog deletion code in the checkpoint process
  - Add specs to some functions
  - Move count_live_bytes_in_log/1 from the brick_ets module to
    gmt_hlog_common
  - Delete delete_seq/2 from the brick_ets module. Add a similar
    function (but with a more informative info level log) to the
    gmt_hlog_common module
  - Change some "if" expressions to "case"
  - Change some of "==" and "/=" to "=:=" and "=/=" when possible
  - Delete debug prints like io:format/2
- Add more info and debug level logs
  - Add file size information to various info level logs
  - Add info level logs for local hlog creation and deletion
  - Add info and debug level logs to the checkpoint process
  - Disable ?DBG_TLOG logs in the write_back functions. Will update
    them later with more helpful debug messages
from hibari.
Getting ready for fix; scavenging process enhancement.
gdss-admin (on a topic branch) - commit hibari/gdss-admin@a1d7b2f
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Update brick_admin module to work with #scav{} record with
destructive field deleted
gdss-brick (on a topic branch) - commit hibari/gdss-brick@70c13cb
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Enhancements on the scavenging process:
  - Check eligible sequence numbers before creating their work files,
    so that the scavenger can avoid creating and sorting the contents
    of unnecessary work files
  - Optimize the steps for (1) copying live hunks to the latest hlog
    and (2) updating their locations, so that we can eliminate the
    resuming-scavenger feature:
    ** Before: Run (1) on all live hunks (e.g. 100 thousand), then
       run (2) on all copied hunks
    ** After: Run (1) on a chunk of live hunks (e.g. 100), then run
       (2) on those hunks, go to (1) and repeat. The chunk size is
       controlled by the disk_log module in Erlang/OTP
  - Remove resume_scavenger_commonlog/1 and related functions
  - Remove the 'destructive' and 'skip_reads' fields. They only appear
    to be useful for debugging the scavenging process
- Refactoring the hlog_common and brick_ets modules:
  - Update #scav{}: remove update_locations. We don't switch that
    function
  - Rewrite the "try catch" expression and its child "begin end" and
    "if" expressions in brick_ets:copy_one_hunk/6 to be proper
    "try catch" syntax
- Add more info level logs:
  - Add file size, copied hunks, and copied bytes information to some
    info level logs
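The Before/After chunking above can be sketched roughly as follows. scavenge_chunked and CHUNK_SIZE are hypothetical names, and in the real system the chunk size comes from OTP's disk_log module rather than a constant:

```python
CHUNK_SIZE = 100  # illustrative; the real chunk size is governed by disk_log

def scavenge_chunked(live_hunks, copy_hunk, update_location):
    """Alternate (1) copy and (2) update per chunk, so no resume file is needed."""
    for start in range(0, len(live_hunks), CHUNK_SIZE):
        chunk = live_hunks[start:start + CHUNK_SIZE]
        copied = [copy_hunk(h) for h in chunk]     # (1) copy to the latest hlog
        for hunk, new_loc in zip(chunk, copied):   # (2) update key locations
            update_location(hunk, new_loc)
        # If we crash here, at most one chunk of copies lacks location
        # updates, and those hunks are simply re-copied on the next run.
```

The design point is that interleaving bounds the amount of work that can be lost, which is what makes the separate resume-file feature unnecessary.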
from hibari.
Getting ready for fix; scavenger process enhancement.
gdss-brick (on a topic branch) - commit hibari/gdss-brick@7749814
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Enhancements to the scavenging process:
  - The fun created by gmt_hlog_common:scavenge_one_seq_file_fun/1
    now handles unexpected errors so that the scavenging process can
    continue on other seq files
  - gmt_hlog_common:scavenger_move_hunks1/8 now tries to continue
    moving live hunks even if some errors have occurred
  - gmt_hlog_common:scav_check_shutdown/1 now reports a summary of a
    canceled scavenging process (done by adding #scav_progress{} and
    scav_print_summary/2). It now deletes work files too
  - scavenger_commonlog_bottom/1 now skips copying live hunks when
    there are no such hunks
- A bug fix in write_hunk_bytes of gmt_hlog_local:
  - Fix a potential bug in handle_call/3 for 'write_hunk_bytes' with
    'metadata', which caused it to return a wrong offset. (The caller
    doesn't use the returned value, so this bug had no actual impact)
- Refactoring the hlog_common and brick_ets modules:
  - Rename the type eee() in gmt_hlog.hrl to metadata_tuple()
  - Move the expression that creates a metadata_tuple() from
    gmt_hlog_local:handle_call/3 to make_metadata_tuple/7 for clarity
  - Add a live_hunk_sizes field to the #scav{} record
  - Add a type called scavenger_move_hunks_result
  - Add the type do_mod() and some specs to the brick_ets module
  - Remove the unused 'relocate_*' fields from the #wb{} record in the
    gmt_hlog_common module
  - Replace some "if" expressions with pattern matches
- Misc
  - gdss-brick >> GH9 (refactor #state{} record) - Add some comments
    to the #state{} record in brick_ets.erl
  - Replace double-space with single-space in the @doc in the
    gmt_hlog_local module
  - Replace "GDSS" with "Hibari" in the @doc too
from hibari.
OK. I'm now much more comfortable working on the code relating to the hlog (write-ahead log) and the scavenging process. I want to work on one more scavenger enhancement, and then I can proceed to a solution for this issue.
%% @TODO Run step 5 before step 3
%% - Move the part that scans for live hunk bytes from step 3 to 5
%% - Update step 5 to delete the work files for the sequences that
%%   have greater live hunk percentages than the threshold
%% - Move step 3 under step 5
from hibari.
I accidentally deleted a comment from 9 months ago while I was trying to quote a part of it. I am re-posting part of the comment at the bottom of this comment, but I lost the other parts :(
Anyway, during the holidays, I developed prototypes of approaches 2 and 3 below for some experiments.
- Approach 2 almost worked, but I found that it must have a write barrier (a global write-lock in a physical brick) to avoid a race condition. Without the barrier, there is still a chance that a brick server will create a reference to a hunk in a log file that is being deleted by the common hlog process. I think a write barrier could last a few minutes on a busy system, so this will not be a good solution.
- Approach 3 worked as expected, and I decided to take this approach. I introduced a frozen state for log files; if a log file is frozen, rename/6 avoids doing ?KEY_SWITCHAROO and deleting the old key; instead, it loads the value of the old key from disk, sets the value on the new key, and deletes the old key. My prototype dynamically injects a get op into the do-op list to trigger the squidflash primer to asynchronously load the value from disk into the OS disk cache, so it has minimal impact on performance.
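A rough sketch of the frozen-log branch in rename, under the assumption that the old value's sequence number can be tested against a set of frozen sequences. All names here are illustrative, not the actual gdss-brick code:

```python
def rename(store, locations, frozen_seqs, old_key, new_key):
    """Toy rename: fall back to a real copy when the old key's log is frozen."""
    seq, _offset = locations[old_key]
    if seq in frozen_seqs:
        # Slow path: load the value from disk, write it under the new key
        # (the new hunk lands in the latest, non-frozen log), delete the old key.
        value = store[old_key]
        store[new_key] = value
        locations[new_key] = ("latest", 0)  # new hunk in the current log
    else:
        # Fast path: ?KEY_SWITCHAROO analogue, just repoint the key in memory.
        store[new_key] = store[old_key]
        locations[new_key] = locations[old_key]
    del store[old_key]
    del locations[old_key]
```

The key property: once a log is frozen, no rename can create a fresh reference into it, so the scavenger may safely delete it after moving the live hunks.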
A quote of the comment:
I reviewed the codes ...
...
Scavenger works in the following steps:
1. List the hlog locations of all value-hunks in all bricks, sorted by hlog sequence number.
2. Calculate each hlog's live percentage, and if it is lower than the configured setting, copy all live value-hunks to the latest hlog.
3. Perform do operations to update the keys with the new hlog locations of the copied value-hunks.
4. Delete the old hlogs whose value-hunks have all been moved to the latest hlog.

If key <<"a/1">> is renamed to <<"a/2">> by ?KEY_SWITCHAROO between steps 1 and 3, step 3 for key <<"a/1">> will fail with key_not_exist. The current implementation of the scavenger silently ignores this, because before rename/6 was introduced this could only happen when the key <<"a/1">> had been deleted. That assumption no longer holds with rename/6, as the value-hunk in the old hlog is still referenced from <<"a/2">>.

To address this, I think there are three approaches:
1. Don't delete the old hlog file. If the scavenger fails step 3 for key <<"a/1">>, skip step 4 (deleting the hlog file) for the old hlog holding the value-hunk for <<"a/1">>. That hlog will eventually be deleted by another scavenger run on the next day or so. This is safe but not good for reclaiming disk space.
2. Track rename operations during the scavenger run. While the scavenger is running, record the rename info (key: OldKey, value: NewKey) in a new ETS table (e.g. key: <<"a/1">>, value: <<"a/2">>). If the scavenger fails step 3 for key <<"a/1">>, it reads the ETS table to get key <<"a/2">> and retries step 3. This seems the most efficient approach, as all the additional steps are in-memory operations. However, there is a chance that the scavenger will never catch up with the rename operations if <<"a/2">> is renamed to <<"a/3">>, then to <<"a/4">>, and so on very quickly.
3. Don't use ?KEY_SWITCHAROO during the scavenger run. While the scavenger is running, rename/6 avoids doing ?KEY_SWITCHAROO and deleting the old key; instead, it gets the value of the old key, sets the value on the new key, and deletes the old key. However, doing get and set on a value in an hlog (on disk) is much slower than the in-memory ?KEY_SWITCHAROO and delete operations. Also, the value-hunk will be copied to the latest hlog twice: once by the scavenger, and once more by rename/6.

Approach 2 seems a better solution than the others, but it can't be done quickly, as extra investigation will be needed to avoid race conditions etc. Approaches 1 and 3 can be used as a quick temporary fix (3 could be a permanent fix too if 2 turns out to be technically difficult).
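Approach 2's retry logic can be sketched as follows. This is a minimal Python sketch of the idea only: the dict standing in for the ETS table, the function name, and the max_hops guard are all hypothetical, not part of Hibari.

```python
# Sketch of approach 2: renames observed during a scavenger run are
# recorded in a side table (standing in for the ETS table); on a
# key_not_exist failure in step 3, follow the OldKey -> NewKey chain
# and retry the location update. All names are hypothetical.

def update_location(store, renames, key, new_loc, max_hops=16):
    """Retry the step-3 location update across recorded renames."""
    for _ in range(max_hops):
        if key in store:
            store[key] = new_loc      # step 3 succeeded for this key
            return key
        if key not in renames:
            raise KeyError(key)       # genuinely deleted: safe to ignore
        key = renames[key]            # follow the rename and retry
    # The chain is growing faster than we can follow it -- the caveat
    # noted above for keys renamed repeatedly in quick succession.
    raise RuntimeError("rename chain too long; give up for this run")
```

For example, with renames {<<"a/1">> -> <<"a/2">>, <<"a/2">> -> <<"a/3">>}, an update that fails for <<"a/1">> is retried and lands on <<"a/3">>. The max_hops bound is one simple way to acknowledge the "never catches up" caveat rather than loop forever.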
from hibari.
Closing this issue, as it has already been addressed for Hibari v0.1.11.
- For Hibari v0.1.11, rename/6 acts as if log files are always frozen. For more details, see hibari/gdss-brick#2.
- For Hibari v0.3.x, rename/6 will act as in approach 3 above. This will be done by hibari/gdss-brick#17, Redesign disk storage and checkpoint/scavenger processes.