[Bug] Memory leak of be (doris, open issue)

xingyingone commented on June 8, 2024
[Bug] Memory leak of be

from doris.

Comments (11)

liujunfei2230 commented on June 8, 2024

I hit the same problem on version 2.0.3.

xinyiZzz commented on June 8, 2024

Hi @xingyingone, can the problem be reproduced? If possible, refer to https://doris.apache.org/community/developer-guide/debug-tool/#jemalloc-1 to print the jemalloc heap profile, and we will analyze it later.

xingyingone commented on June 8, 2024

Unfortunately, it can't be reproduced at the moment.

seanfulton commented on June 8, 2024

heapfile.zip
I am seeing the same thing with apache-doris-2.0.3-bin-x64-noavx2.

We have a six-node cluster: three BE and three FE nodes, each with 64 GB of RAM. We are loading a series of MySQL tables from files. The largest table is 1.5 GB. Each file is a series of INSERT statements.

The BE nodes had 64G of RAM and we consistently ran into the limit of 52 G. I increased the RAM to 96G and now we are getting the error below. Restarting the BE process causes the RAM to drop, but once we restart the load, the RAM creeps up until it exceeds the limit.

This is from the BE node:
ON the
FreeTopOvercommitMemoryQuery:
WorkloadGroup:
I0103 08:02:27.237190 19832 daemon.cpp:262] [MemoryGC] start minor GC, process memory used 72.75 GB exceed soft limit 69.12 GB or sys available memory 18.70 GB less than warning water mark 3.20 GB..
I0103 08:02:27.237272 19832 mem_tracker_limiter.cpp:522] [MemoryGC] GC free process top memory overcommit query, , start seek all query, running query and load num: 63
I0103 08:02:27.237390 19832 mem_tracker_limiter.cpp:560] [MemoryGC] GC free process top memory overcommit query, seek finished, seek 63 tasks. among them, 0 tasks can be canceled; 63 small tasks that were skipped; 0 tasks is being canceled and has not been completed yet;
I0103 08:02:27.237396 19832 mem_tracker_limiter.cpp:568] [MemoryGC] GC free process top memory overcommit query, finished, no task need be canceled.
I0103 08:02:27.237417 19832 mem_info.cpp:128] [MemoryGC] end minor GC, free memory 0. cost(us): 192, details: :
FreeTopOvercommitMemoryQuery:
WorkloadGroup:
I0103 08:02:27.793373 20963 olap_server.cpp:1065] cooldown producer get tablet num: 0
I0103 08:02:28.338559 19832 daemon.cpp:262] [MemoryGC] start minor GC, process memory used 72.75 GB exceed soft limit 69.12 GB or sys available memory 18.70 GB less than warning water mark 3.20 GB..
I0103 08:02:28.338682 19832 mem_tracker_limiter.cpp:522] [MemoryGC] GC free process top memory overcommit query, , start seek all query, running query and load num: 63
I0103 08:02:28.338913 19832 mem_tracker_limiter.cpp:560] [MemoryGC] GC free process top memory overcommit query, seek finished, seek 63 tasks. among them, 0 tasks can be canceled; 63 small tasks that were skipped; 0 tasks is being canceled and has not been completed yet;
I0103 08:02:28.338925 19832 mem_tracker_limiter.cpp:568] [MemoryGC] GC free process top memory overcommit query, finished, no task need be canceled.
I0103 08:02:28.338955 19832 mem_info.cpp:128] [MemoryGC] end minor GC, free memory 0. cost(us): 339, details: :
FreeTopOvercommitMemoryQuery:
WorkloadGroup:
I0103 08:02:29.440176 19832 daemon.cpp:262] [MemoryGC] start minor GC, process memory used 72.75 GB exceed soft limit 69.12 GB or sys available memory 18.70 GB less than warning water mark 3.20 GB..
I0103 08:02:29.440274 19832 mem_tracker_limiter.cpp:522] [MemoryGC] GC free process top memory overcommit query, , start seek all query, running query and load num: 63
I0103 08:02:29.440418 19832 mem_tracker_limiter.cpp:560] [MemoryGC] GC free process top memory overcommit query, seek finished, seek 63 tasks. among them, 0 tasks can be canceled; 63 small tasks that were skipped; 0 tasks is being canceled and has not been completed yet;
I0103 08:02:29.440428 19832 mem_tracker_limiter.cpp:568] [MemoryGC] GC free process top memory overcommit query, finished, no task need be canceled.
I0103 08:02:29.440454 19832 mem_info.cpp:128] [MemoryGC] end minor GC, free memory 0. cost(us): 237, details: :
FreeTopOvercommitMemoryQuery:
WorkloadGroup:
I0103 08:02:30.061405 21622 heartbeat_server.cpp:61] get heartbeat from FE.host:10.10.21.11, port:9020, cluster id:1280404001, counter:9637, BE start time: 1704238503931
I0103 08:02:30.541870 19832 daemon.cpp:262] [MemoryGC] start minor GC, process memory used 72.75 GB exceed soft limit 69.12 GB or sys available memory 18.70 GB less than warning water mark 3.20 GB..
I0103 08:02:30.541988 19832 mem_tracker_limiter.cpp:522] [MemoryGC] GC free process top memory overcommit query, , start seek all query, running query and load num: 63
I0103 08:02:30.542172 19832 mem_tracker_limiter.cpp:560] [MemoryGC] GC free process top memory overcommit query, seek finished, seek 63 tasks. among them, 0 tasks can be canceled; 63 small tasks that were skipped; 0 tasks is being canceled and has not been completed yet;
I0103 08:02:30.542183 19832 mem_tracker_limiter.cpp:568] [MemoryGC] GC free process top memory overcommit query, finished, no task need be canceled.
I0103 08:02:30.542214 19832 mem_info.cpp:128] [MemoryGC] end minor GC, free memory 0. cost(us): 292, details: :
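
For anyone who wants to track these numbers over time, here is a minimal Python sketch that pulls the memory figures out of the `[MemoryGC]` lines shown above. The regular expressions are written against the exact wording in this log excerpt and are an assumption about the format; other Doris versions may phrase these lines differently. The function name `summarize_gc` is only illustrative.

```python
import re

# Patterns match the wording of the [MemoryGC] lines quoted in this comment.
GC_START = re.compile(
    r"process memory used (?P<used>[\d.]+) GB exceed soft limit (?P<soft>[\d.]+) GB "
    r"or sys available memory (?P<avail>[\d.]+) GB "
    r"less than warning water mark (?P<mark>[\d.]+) GB"
)
GC_END = re.compile(r"end minor GC, free memory (?P<freed>.+?)\. cost")


def summarize_gc(lines):
    """Return (start_records, freed_values) parsed from BE log lines."""
    starts, freed = [], []
    for line in lines:
        match = GC_START.search(line)
        if match:
            # Convert every captured figure to a float (all are in GB).
            starts.append({k: float(v) for k, v in match.groupdict().items()})
        match = GC_END.search(line)
        if match:
            # Keep the freed amount as text, since the log prints "0" without a unit.
            freed.append(match.group("freed"))
    return starts, freed
```

Run over the excerpt above, every GC round would report the same usage against the same soft limit and a freed amount of `0`, which matches what the log shows: the GC finds nothing it can cancel.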

This is from the FE node:
0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6
Done with web_day_hour.sql!
Loading web_device.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=c594ca6f2ed04809-8adbb8493c34128e>, peak used 0, current used 0, exec node:<>, process memory used 78.23 GB exceed limit 76.80 GB or sys available memory 17.95 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_device.sql!
Loading web_os.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=d1ce5c961b374d1b-bcedc4451cc26e2e>, peak used 0, current used 0, exec node:<>, process memory used 78.27 GB exceed limit 76.80 GB or sys available memory 17.92 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_os.sql!
Loading web_referrer.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=b323af1e0484c9e-9238b97e47535d7a>, peak used 0, current used 0, exec node:<>, process memory used 78.33 GB exceed limit 76.80 GB or sys available memory 17.86 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64

Done with web_referrer.sql!
Loading web_segment.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=7ca28dcb9779422b-964952796354906f>, peak used 0, current used 0, exec node:<>, process memory used 78.52 GB exceed limit 76.80 GB or sys available memory 17.67 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_segment.sql!
Loading web_social.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=bec21df0a24e4882-a1c9791c70060ae8>, peak used 0, current used 0, exec node:<>, process memory used 78.57 GB exceed limit 76.80 GB or sys available memory 17.62 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6
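
The limits in the two logs look mutually consistent. The short Python check below assumes the process limit is 80% of the 96 GB of RAM and the soft limit is 90% of that limit. Both fractions are inferred from the numbers in the logs, not confirmed from the Doris source or configuration, and `derived_limits` is a hypothetical helper.

```python
def derived_limits(total_gb, limit_frac=0.8, soft_frac=0.9):
    """Return (process_limit_gb, soft_limit_gb) under the assumed fractions."""
    limit = total_gb * limit_frac   # assumed: limit is a fraction of physical RAM
    soft = limit * soft_frac        # assumed: soft limit is a fraction of the limit
    return round(limit, 2), round(soft, 2)


def overshoot_gb(used_gb, limit_gb):
    """How far process memory is above the limit, in GB."""
    return round(used_gb - limit_gb, 2)
```

With 96 GB this gives a 76.80 GB limit and a 69.12 GB soft limit, the same figures as in the BE and FE messages. The last error above (78.57 GB used) is 1.77 GB over the limit, even though the system still reports more than 17 GB available.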

xinyiZzz commented on June 8, 2024

> Unfortunately, it can't be reproduced at the moment

@xingyingone Try reducing the segment cache capacity by adding segment_cache_capacity=10000 to be.conf.
In Doris 2.0.x, memory that is not freed is often caused by an oversized segment cache, because the segment cache is not accurately tracked in that version.

xinyiZzz commented on June 8, 2024

@sepastian
Run grep "Memory Tracker Summary" be.INFO and find the last entry logged before the BE ran out of memory. It shows the BE memory statistics at that time.
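
If be.INFO is large, the same search can be scripted. This is a rough Python sketch, assuming the summary lines begin with the same glog-style prefix (`I0103 08:02:27.237190 ...`) as the BE log lines quoted earlier in this thread. That prefix is an assumption, and `last_summary_before` is a hypothetical helper, not a Doris tool. The timestamp carries no year, so the comparison is only valid within one calendar year.

```python
import re

# glog-style prefix: severity letter, MMDD, then HH:MM:SS.microseconds
STAMP = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def last_summary_before(lines, cutoff):
    """Return the index of the last 'Memory Tracker Summary' line at or before cutoff.

    cutoff is a ('MMDD', 'HH:MM:SS.ffffff') tuple, e.g. ('0103', '08:02:30.000000').
    Returns None when no such line exists.
    """
    best = None
    for index, line in enumerate(lines):
        if "Memory Tracker Summary" not in line:
            continue
        match = STAMP.match(line)
        # Zero-padded strings compare correctly as (date, time) tuples.
        if match and match.groups() <= cutoff:
            best = index
    return best
```

Typical use would be `last_summary_before(open("be.INFO").read().splitlines(), ("0103", "08:02:30.000000"))`, then reading the lines that follow the returned index.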

seanfulton commented on June 8, 2024

I updated be.conf with the segment_cache_capacity=10000 and restarted the three BE nodes. Still getting the same error in the same place. Again, these are relatively small files with mysql insert statements generated by mysqldump.

I have a shell script that pipes the contents of each file into mysql -P9030, so each table is loaded over a new connection to Doris. This is happening on every single file, and all but one are smaller than 1 GB.
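
For reference, a rough Python equivalent of that loader looks like the sketch below. It is not the actual script: the host name, the `root` user, and the function names are placeholders for illustration, and only the port 9030 and the one-connection-per-file behaviour come from the description above.

```python
import subprocess
from pathlib import Path


def mysql_command(host, port=9030, user="root"):
    """Build the mysql client command line (host and user are placeholders)."""
    return ["mysql", "-h", host, "-P", str(port), "-u", user]


def load_dump(path, host):
    """Pipe one dump file into a fresh mysql client process, i.e. a new connection."""
    path = Path(path)
    print(f"Loading {path.name} ...")
    with path.open("rb") as dump:
        # The file becomes the client's stdin, like `mysql ... < file.sql`.
        result = subprocess.run(mysql_command(host), stdin=dump)
    print(f"Done with {path.name}!")
    return result.returncode
```

Calling `load_dump` once per file starts a separate client process each time, so no session state is shared between tables.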

Error:

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_device.sql!
Loading web_os.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=ba3be2681a0d4635-8bea3f5473715482>, peak used 0, current used 0, exec node:<>, process memory used 77.24 GB exceed limit 76.80 GB or sys available memory 18.85 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_os.sql!
Loading web_referrer.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=e4aa2acb46e94cd9-8906c1b443022160>, peak used 0, current used 0, exec node:<>, process memory used 77.24 GB exceed limit 76.80 GB or sys available memory 18.85 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_6

Done with web_referrer.sql!
Loading web_segment.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=ae5c53468384e87-9bf5c7ef84169cf4>, peak used 0, current used 0, exec node:<>, process memory used 77.24 GB exceed limit 76.80 GB or sys available memory 18.85 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64

Done with web_segment.sql!
Loading web_social.sql ...
ERROR 1105 (HY000) at line 35: errCode = 2, detailMessage = (10.10.21.14)[MEM_ALLOC_FAILED]Create Expr failed because [E11] Allocator sys memory check failed: Cannot alloc:64, consuming tracker:<Load#Id=550b7cabf404250-940d301bb7dca3be>, peak used 0, current used 0, exec node:<>, process memory used 77.24 GB exceed limit 76.80 GB or sys available memory 18.85 GB less than low water mark 1.60 GB.

0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> >) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64

seanfulton commented on June 8, 2024

Is there any update on this?

seanfulton commented on June 8, 2024

This is definitely a memory leak. After the above import, I tried creating smaller dump files, and the BE servers immediately ran out of memory because the prior leak had not been resolved. Only restarting the BE process seems to clear it.
This appears to me to be a pretty substantial bug.
Is no one else experiencing this? Is there anything I can do to move this forward?
I am trying to build a POC, but if I can't get past this, we'll need to abandon it and look at other options.

xinyiZzz commented on June 8, 2024

> This is definitely a memory leak. After the above import, I tried creating smaller dump files, and the BE servers immediately ran out of memory because the prior leak had not been resolved. Only restarting the BE process seems to clear it. This appears to me to be a pretty substantial bug. Is no one else experiencing this? Is there anything I can do to move this forward? I am trying to build a POC, but if I can't get past this, we'll need to abandon it and look at other options.

@seanfulton
Print the heap profile when the error "Allocator sys memory check failed" is reported:

1. Change prof:false of JEMALLOC_CONF in be.conf to prof:true and restart BE
2. curl http://be_host:be_webport/jeheap/dump
3. jeprof --dot lib/doris_be heap_dump_file_1
4. paste it to the http://www.webgraphviz.com/

Refer to https://doris.apache.org/community/developer-guide/debug-tool/#jemalloc-1
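
The steps above can be sketched in Python as follows. This is only an illustration: the helper names are made up, and the host, port, and heap file name must be replaced with real values. The `/jeheap/dump` path and the `jeprof --dot lib/doris_be` invocation are taken from the steps above. What the endpoint returns is not specified here, so the sketch simply returns the response text.

```python
import re
import urllib.request


def enable_jemalloc_prof(conf_text):
    """Step 1: flip the standalone prof:false option in JEMALLOC_CONF to prof:true.

    The lookbehind keeps longer option names that merely end in 'prof' untouched.
    """
    return re.sub(r"(?<![A-Za-z_])prof:false", "prof:true", conf_text)


def heap_dump_url(be_host, be_webport):
    """Step 2: URL that asks the BE to dump a heap profile."""
    return f"http://{be_host}:{be_webport}/jeheap/dump"


def trigger_heap_dump(be_host, be_webport):
    """Step 2: request the dump and return the BE's response as text."""
    with urllib.request.urlopen(heap_dump_url(be_host, be_webport), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def jeprof_dot_command(heap_file, binary="lib/doris_be"):
    """Step 3: command line that renders the heap file as Graphviz dot text."""
    return ["jeprof", "--dot", binary, heap_file]
```

The command list from `jeprof_dot_command` can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the resulting dot text pasted into the viewer from step 4. BE still has to be restarted by hand after the be.conf change.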

xinyiZzz commented on June 8, 2024

> I hit the same problem on version 2.0.3.

@liujunfei2230 Same as the previous comment: #29378 (comment)
