
Possible memory leak about mediasoup (CLOSED, 34 comments)

PaulOlteanu commented on September 27, 2024
Possible memory leak


Comments (34)

nazar-pc commented on September 27, 2024

You do have access, it is just a hidden method in the documentation because you're not supposed to use it or rely on it normally, but it might be helpful for debugging purposes.
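
For reference, a minimal sketch of calling it from the Rust crate (the async signature and Debug-printable return type are assumptions, since the method is hidden from the docs):

use mediasoup::worker::Worker;

// Sketch: dump the worker state for debugging. `Worker::dump()` is the
// doc-hidden method mentioned above; the exact signature and error type
// are assumed here.
async fn debug_dump_worker(worker: &Worker) {
    match worker.dump().await {
        Ok(dump) => println!("{dump:#?}"),
        Err(error) => eprintln!("worker.dump() failed: {error}"),
    }
}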

ibc commented on September 27, 2024

Clear, thanks, so there is a clear bug then. I will dig into this next week. Can you please copy your exact previous comment into the issue description?

ibc commented on September 27, 2024

Ok, I have a fix. Will write a PR, probably tomorrow.

ibc commented on September 27, 2024

@PaulOlteanu, mediasoup crate 0.17.0 has been published.

satoren commented on September 27, 2024

Does your application start Worker statically at server startup? In other words, do you ever drop a Worker?

PaulOlteanu commented on September 27, 2024

Yes, we only create one worker at startup and never drop it.

satoren commented on September 27, 2024

If possible, run worker.dump to make sure nothing is running in the worker. Also, have you confirmed that router, transport, producer, consumer, etc. on the Rust side are all dropped?

PaulOlteanu commented on September 27, 2024

We don't seem to have access to a worker.dump method from the Rust bindings.

I added counters that get incremented on the creation of all routers, transports, producers, and consumers, and get decremented in the on_close callbacks. After everyone disconnects they return to 0, so I believe everything is getting dropped.
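
The counters are roughly along these lines (a simplified sketch; only the on_close callback is the crate's API, the counter itself is illustrative):

use std::sync::atomic::{AtomicUsize, Ordering};
use mediasoup::router::Router;

// Illustrative gauge of live routers; transports, producers and consumers
// can be tracked the same way.
static LIVE_ROUTERS: AtomicUsize = AtomicUsize::new(0);

fn track_router(router: &Router) {
    LIVE_ROUTERS.fetch_add(1, Ordering::Relaxed);

    // `on_close` returns a handler id; it should be kept alive alongside the
    // router so the callback stays registered (storing it is omitted here
    // for brevity).
    let _close_handler = router.on_close(|| {
        LIVE_ROUTERS.fetch_sub(1, Ordering::Relaxed);
    });
}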

PaulOlteanu commented on September 27, 2024

Here are the results of a worker dump from a server that had been running for 2 hours before I took it offline:

WorkerDump {
    router_ids: [RouterId(3e91ba78-9c9c-44b1-aac0-ad31cab9e210)],
    webrtc_server_ids: [WebRtcServerId(4dd58dce-29ac-4531-8466-076847ee15e9)],
    channel_message_handlers: ChannelMessageHandlers {
        channel_request_handlers: [
            61b86584-ea8c-4286-80e4-d55a83e29d5d,
            4dd58dce-29ac-4531-8466-076847ee15e9,
            3e91ba78-9c9c-44b1-aac0-ad31cab9e210,
            7c8d0474-c584-47bd-b327-7d7a9f5679bb
        ],
        channel_notification_handlers: [61b86584-ea8c-4286-80e4-d55a83e29d5d, 7c8d0474-c584-47bd-b327-7d7a9f5679bb]
    },
    liburing: None
}

I don't know what channel_message_handlers, channel_request_handlers, and channel_notification_handlers refer to. The WebRtcServer is expected because we create a static one per worker. The single remaining router is also expected, as we do have a global producer to work around a bug in our client. This memory issue was already occurring before I moved that producer to its own global router, so I don't believe it is the cause of this issue, but I can try to remove it if you think it would make a difference.

ibc commented on September 27, 2024

I'm afraid we cannot manage this potential memory leak as a bug here. There is nothing we can specifically investigate to conclude whether this is a real memory leak (a real bug in mediasoup) or some OS memory management stuff. So I'm afraid I will close this issue. If a real leak is detected (I mean, by looking at the code) then please comment here or file a new issue.

For example, here is a real memory leak bug, #1382, that's being fixed in PR #1383.

PaulOlteanu commented on September 27, 2024

Okay I got a valgrind dump (while the server was still running - https://stackoverflow.com/questions/9720632/can-valgrind-output-partial-reports-without-having-to-quit-the-profiled-applicat)

==1== 15,138,816 bytes in 231 blocks are possibly lost in loss record 3,262 of 3,262
==1==    at 0x4886B58: operator new[](unsigned long) (vg_replace_malloc.c:640)
==1==    by 0xCC4D9F: onAlloc(uv_handle_s*, unsigned long, uv_buf_t*) (in /usr/bin/mercury)
==1==    by 0xA7BA23: uv__read (in /usr/bin/mercury)
==1==    by 0xA7C573: uv__stream_io (in /usr/bin/mercury)
==1==    by 0xA83057: uv__io_poll (in /usr/bin/mercury)
==1==    by 0xA76B23: uv_run (in /usr/bin/mercury)
==1==    by 0xA86ABB: DepLibUV::RunLoop() (in /usr/bin/mercury)
==1==    by 0xA91787: Worker::Worker(Channel::ChannelSocket*) (in /usr/bin/mercury)
==1==    by 0xA8566F: mediasoup_worker_run (in /usr/bin/mercury)
==1==    by 0x60B35F: {closure#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>> (utils.rs:77)
==1==    by 0x60B35F: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154)
==1==    by 0x60D7CF: UnknownInlinedFun (mod.rs:529)
==1==    by 0x60D7CF: UnknownInlinedFun (unwind_safe.rs:272)
==1==    by 0x60D7CF: UnknownInlinedFun (panicking.rs:552)
==1==    by 0x60D7CF: UnknownInlinedFun (panicking.rs:516)
==1==    by 0x60D7CF: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<mediasoup::worker::utils::run_worker_with_channels::{closure_env#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>>, ()>>, ()> (panic.rs:142)
==1==    by 0x60D7CF: {closure#1}<mediasoup::worker::utils::run_worker_with_channels::{closure_env#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>>, ()> (mod.rs:528)
==1==    by 0x60D7CF: core::ops::function::FnOnce::call_once{{vtable.shim}} (function.rs:250)
==1==    by 0x17EFB87: call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> (boxed.rs:2007)
==1==    by 0x17EFB87: call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> (boxed.rs:2007)
==1==    by 0x17EFB87: std::sys::unix::thread::Thread::new::thread_start (thread.rs:108)

That call stack lines up with the flamegraphs I posted earlier, but the operator new[](unsigned long) stands out - is that referring to this inside of TcpConnectionHandle::OnUvReadAlloc, which is called from onAlloc inside of TcpConnectionHandle.cpp? I see that in UdpSocketHandle the buffer is static, so the call stack can't be referring to the onAlloc defined in UdpSocketHandle, right (since there's no new)?

I see that the buffer gets freed in the destructor of TcpConnectionHandle, but I have no idea what the lifecycle of the TcpConnectionHandle looks like, so I'm not sure what could be keeping it around if this is the issue. I'll keep digging to try and figure it out, but of course I'd appreciate pointers on what to look for.

ibc commented on September 27, 2024

On UDP the buffer is static because we receive datagrams that can only contain a complete packet. That's not true for TCP, where we receive bytes in arbitrary chunks until we detect a complete packet.

I won't be able to check this during the next few weeks but will reopen the ticket. Any more debugging you can do is welcome. Thanks.

PaulOlteanu commented on September 27, 2024

I've found that dropping the WebRtcServer struct associated with each worker frees that memory:

Before dropping the WebRtcServer:
[screenshot: allocation profile]

After dropping the WebRtcServer (the allocations for that call stack are completely gone):
[screenshot: allocation profile]

So it seems the connections associated with the server aren't being removed, so those buffers aren't being deleted.
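
(For context, the drop test was essentially the following; holding the server in an Option is just one way to drop it on demand, and the surrounding type is made up:)

use mediasoup::webrtc_server::WebRtcServer;

// Hypothetical app state holding the per-worker WebRtcServer; wrapping it in
// an Option makes it easy to drop on demand while the worker keeps running.
struct MediaState {
    webrtc_server: Option<WebRtcServer>,
}

impl MediaState {
    fn drop_webrtc_server(&mut self) {
        // Dropping the handle (assuming it is the last one) closes the server
        // in the worker; as observed above, this is when the leaked TCP read
        // buffers get freed.
        self.webrtc_server.take();
    }
}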

ibc commented on September 27, 2024

So you mean this is a problem just on the Rust side?

PaulOlteanu commented on September 27, 2024

I'm not sure. I'm still not super familiar with the mediasoup codebase so I'm just trying to understand when the destructor of TcpConnectionHandle should actually be called to free that allocation.

It seems to get deleted from TcpServerHandle::OnTcpConnectionClosed which seems like it should get called from TcpConnectionHandle::OnUvRead when it detects the connection was closed, in which case the lifecycle of the TcpConnectionHandle is unrelated to the Rust side (I could be wrong here).

I'm gonna look at what the C++ worker does when it receives a WorkerWebrtcserverClose message from Rust and see why that would clear out those connections.

ibc commented on September 27, 2024

The TcpConnectionHandle is destroyed when:

  • The client closes the TCP connection, so that callback is called and in there the instance will be freed.
  • The app closes the WebRtcTransport via its close() method.
  • The app closes the parent Router or Worker.
  • The app closes the WebRtcServer that is managing that WebRtcTransport (in case the transport was created with a WebRtcServer).

All this is heavily tested on the C++ side, but of course I may be wrong and something is not right.

PaulOlteanu commented on September 27, 2024

So far, what I can tell is this:

  1. The on_close callbacks for all WebRtcTransports, Routers, Consumers, and Producers are being called correctly (so none of those structs are still around on the Rust side).
  2. If I call WebRtcServer::dump in Rust while the server is under normal use, I get a long list of transport ids. If I call WebRtcServer::dump once everyone has disconnected, there are no transport ids listed (see the sketch at the end of this comment). This information seems to be filled in WebRtcServer.cpp, so it seems to me that even the WebRtcServer sees the transports as closed.
  3. The same applies for routers (in the corresponding call to Worker::dump).

I'm quite confused as to how the handle isn't being closed, given all this.
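
For reference, the check from point 2 is essentially this (a sketch; like Worker::dump, WebRtcServer::dump is doc-hidden and its exact signature is assumed):

use mediasoup::webrtc_server::WebRtcServer;

// Sketch: dump the WebRtcServer and inspect the transport ids it still has
// registered; the dump should list none once every client has disconnected.
async fn check_webrtc_server(server: &WebRtcServer) {
    match server.dump().await {
        Ok(dump) => println!("{dump:#?}"),
        Err(error) => eprintln!("WebRtcServer::dump failed: {error}"),
    }
}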

PaulOlteanu commented on September 27, 2024

I see that the destructor for WebRtcTransport does delete tcpServer for each entry in its list of TcpServers, but when the transport is constructed with a WebRtcServer, it never actually inserts anything into that list.

So I think when the transport is created with a server it doesn't close the connection when the transport is closed? I must be missing something...

Lynnworld commented on September 27, 2024

Try lsof to check if there is a TCP connection leak; if so, there will be many CLOSE_WAIT connections.

PaulOlteanu commented on September 27, 2024

lsof -n | grep -i "close_wait" doesn't show any connections in CLOSE_WAIT.

PaulOlteanu commented on September 27, 2024

I added a log to the destructor of UdpSocketHandle (I know this issue is with TCP, but it's easier for me to test with a UDP connection), and I see that it is not destroyed when the transport is closed, but only once I drop the WebRtcServer.

I'll try again with a TCP connection, but I think the issue might actually be that the connections aren't closed on transport drop when using a WebRtcServer.

ibc commented on September 27, 2024

UDP sockets must remain open until the WebRtcServer is destroyed because the same UDP sockets are shared by all WebRtcTransports created using that WebRtcServer.

TCP is different. TCP servers are shared by all WebRtcTransports using the same WebRtcServer, but each TCP connection belongs to its corresponding transport. This is where the bug could be.

BTW, not sure when I will be able to jump into this; I am off for some days.

PaulOlteanu commented on September 27, 2024

I've found a way to somewhat reliably reproduce this issue in our staging environment.

I enable only TCP on the server, then from an Android client I create a WebRtcTransport, get it to the connected state, and then turn airplane mode on. I see the destructor for the WebRtcTransport run, but not the destructor for the TcpConnectionHandle.

There is definitely a bug when a WebRtcTransport is created with a WebRtcServer: it doesn't close the connection (without the WebRtcServer it works fine because it deletes the tcpServer, but with a WebRtcServer, this doesn't happen).

Most of the time the connection closes itself (I think from the read loop receiving UV_EOF or UV_ECONNRESET or an error) but when closed improperly, for example by turning airplane mode on, it doesn't seem to close itself and then also doesn't get closed by the transport closing.

I'm not sure what the best way to proceed is. I can try to find out why TcpConnectionHandle::OnUvRead isn't properly closing itself (my guess would be that we're removing the handle from the uv loop), or I can try to find a way to close it from the WebRtcTransport destructor in a way that won't result in a double free :p

ibc commented on September 27, 2024

NOTE: the WebRtcTransport NEVER auto-closes itself, even if the client closed its TCP side by sending TCP FIN or RST to mediasoup. In those cases the WebRtcTransport emits an "icestatechange" event with state "disconnected" (or "closed", I don't remember) and the Node application may want to close the transport by calling its close() method. I assume this is what your Node app does.

So in your case you say that the client just abruptly disconnects without sending TCP FIN or TCP RST. Then it's up to the Node application to detect that the client went away and close the transport. There is also an option in WebRtcTransportOptions to monitor ICE status by expecting periodic ICE binding requests (AKA PING messages) from the client to mediasoup. If those stop arriving then the WebRtcTransport emits an "icestatechange" event with state "disconnected" and the Node app can close the transport. But if such a mechanism is disabled then mediasoup cannot do anything on its own to detect the disconnection.

So let me ask a question: when that Android client abruptly disconnects its TCP connection side, is your Node app calling webRtcTransport.close() or not? If not, then this is not a leak in mediasoup but wrong usage.

PaulOlteanu commented on September 27, 2024

So this is a Rust application, but the same logic obviously applies. We don't close the transport when we see an ICE state change to "disconnected", so maybe we need to add that (see the sketch at the end of this comment). But that is not the issue here:

We notice the client disconnect via our signaling layer (websocket). We close the transport (I can see that the transport is closed from the transport destructor logs). The TcpConnectionHandle does not get closed (I can see that the log in its destructor does not appear). Once I close the WebRtcServer, I see the TcpConnectionHandle close (again, through the log in its destructor).
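
For what it's worth, closing the transport on ICE disconnect would look roughly like this in the Rust crate (a sketch: the on_ice_state_change handler and IceState enum are taken from the crate from memory, while the command channel and enum are made up for illustration):

use mediasoup::data_structures::IceState;
use mediasoup::webrtc_transport::WebRtcTransport;
use tokio::sync::mpsc;

// Made-up command type: the task that owns the transport drops it on Close,
// which is what actually closes the transport in the Rust crate.
enum TransportCommand {
    Close,
}

fn close_transport_on_ice_disconnect(
    transport: &WebRtcTransport,
    commands: mpsc::UnboundedSender<TransportCommand>,
) {
    // The returned handler id should be kept alive for the transport's lifetime.
    let _handler_id = transport.on_ice_state_change(move |ice_state| {
        if matches!(ice_state, IceState::Disconnected) {
            // Ask the transport's owner to drop it; dropping the last handle
            // closes the transport (and, after the fix, its TCP connection).
            let _ = commands.send(TransportCommand::Close);
        }
    });
}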

PaulOlteanu commented on September 27, 2024

What I meant earlier by "TcpConnectionHandle closing itself" is that TcpConnectionHandle::OnUvRead handles a connection close by calling OnTcpConnectionClosed:

	// Client disconnected.
	else if (nread == UV_EOF || nread == UV_ECONNRESET)
	{
		MS_DEBUG_DEV("connection closed by peer, closing server side");

		this->isClosedByPeer = true;

		// Close server side of the connection.
		Close();

		// Notify the listener.
		this->listener->OnTcpConnectionClosed(this);
	}

which then deletes that connection (and thus frees the buffer):

inline void TcpServerHandle::OnTcpConnectionClosed(TcpConnectionHandle* connection)
{
	MS_TRACE();

	MS_DEBUG_DEV("TCP connection closed");

	// Remove the TcpConnectionHandle from the set.
	this->connections.erase(connection);

	// Notify the subclass.
	UserOnTcpConnectionClosed(connection);

	// Delete it.
	delete connection;
}

I believe sometimes this code path is hit, and the connection is correctly freed. When I disconnected my client by turning airplane mode on, I believe that code path isn't getting hit, and so the connection is not correctly freed.

ibc commented on September 27, 2024

> I believe sometimes this code path is hit, and the connection is correctly freed. When I disconnected my client by turning airplane mode on, I believe that code path isn't getting hit, and so the connection is not correctly freed.

So in this case of course that path is not hit because the client didn't send any TCP FIN or RST, but then your Node app detects the WebSocket disconnection (or whatever) and finally calls transport.close(), doesn't it? And assuming the answer is yes, doesn't that transport.close() destroy the transport and its buffer and its associated TcpConnectionHandles? Sorry, I'm not on my computer for some days so maybe you confirmed this already and I don't remember.

ibc commented on September 27, 2024

Ok, testing. This is even worse since I'm getting a crash when closing the WebRtcServer via its close() method. Reporting it in a different ticket:

#1388

ibc commented on September 27, 2024

Ok, so I'm testing this scenario running the mediasoup-demo locally:

  • Connect a browser with forceTcp=true&consume=false to ensure a single WebRtcTransport is created with a single RTC::TcpConnection.
  • Browser is connected at ICE and DTLS levels.
  • In the server interactive console, enter terminal mode by typing "t" + ENTER.
  • Then type transport.close().

Here are the logs. Notice that I've added dump logs in the constructors, destructors and close() methods of TcpConnectionHandle, RTC::Transport and a few others:

During the initial connection the browser creates 2 TCP connections, then it closes one of them (expected):

TcpConnectionHandle::TcpConnectionHandle() | ++++++++++ TcpConnectionHandle constructor
RTC::TcpConnection::TcpConnection() | ++++++++++ RTC::TcpConnection constructor
TcpConnectionHandle::TcpConnectionHandle() | ++++++++++ TcpConnectionHandle constructor
RTC::TcpConnection::TcpConnection() | ++++++++++ RTC::TcpConnection constructor
TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close()
RTC::TcpConnection::~TcpConnection() | ---------- RTC::TcpConnection DESTRUCTOR
TcpConnectionHandle::~TcpConnectionHandle() | ----------- TcpConnectionHandle DESTRUCTOR

Now I call transport.close() on the server side:

terminal> TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close()
RTC::WebRtcTransport::~WebRtcTransport() | ---------- calling this->webRtcTransportListener->OnWebRtcTransportClosed()

So the problem is that we are calling TcpConnectionHandle::Close() but we are NOT calling its destructor so we are not freeing this->buffer:

TcpConnectionHandle::~TcpConnectionHandle()
{
	MS_TRACE();
	MS_DUMP("----------- TcpConnectionHandle DESTRUCTOR");

	if (!this->closed)
	{
		Close();
	}

	delete[] this->buffer;
}

void TcpConnectionHandle::Close()
{
	MS_TRACE();
	MS_DUMP("----------- TcpConnectionHandle Close()");

	if (this->closed)
	{
		return;
	}

	int err;

	this->closed = true;

	// Tell the UV handle that the TcpConnectionHandle has been closed.
	this->uvHandle->data = nullptr;

	// Don't read more.
	err = uv_read_stop(reinterpret_cast<uv_stream_t*>(this->uvHandle));

	if (err != 0)
	{
		MS_ABORT("uv_read_stop() failed: %s", uv_strerror(err));
	}

	// If there is no error and the peer didn't close its connection side then close gracefully.
	if (!this->hasError && !this->isClosedByPeer)
	{
		// Use uv_shutdown() so pending data to be written will be sent to the peer
		// before closing.
		auto* req = new uv_shutdown_t;
		req->data = static_cast<void*>(this);
		err       = uv_shutdown(
      req, reinterpret_cast<uv_stream_t*>(this->uvHandle), static_cast<uv_shutdown_cb>(onShutdown));

		if (err != 0)
		{
			MS_ABORT("uv_shutdown() failed: %s", uv_strerror(err));
		}
	}
	// Otherwise directly close the socket.
	else
	{
		uv_close(reinterpret_cast<uv_handle_t*>(this->uvHandle), static_cast<uv_close_cb>(onCloseTcp));
	}
}

ibc commented on September 27, 2024

So honestly I don't know where we are calling TcpConnectionHandle::Close() after we call the destructor of WebRtcTransport; see the second log line here:

terminal> transport.close()

terminal> RTC::WebRtcTransport::~WebRtcTransport() | ----------- WebRtcTransport DESTRUCTOR
TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close()  // <----- why are we calling this??
RTC::WebRtcTransport::~WebRtcTransport() | ---------- calling this->webRtcTransportListener->OnWebRtcTransportClosed()

UPDATE: Ah, ok, it's in delete this->iceServer in ~WebRtcTransport.

ibc commented on September 27, 2024

The issue got closed because I accidentally pushed to the v3 branch. Reverted. PR here: #1389

ibc commented on September 27, 2024

PR merged. Will release new version soon. Thanks a lot @PaulOlteanu

PaulOlteanu commented on September 27, 2024

Thanks for fixing it so quickly, @ibc. It'd be great to get a Rust release out as well with this fix, but we're already running 2cae8ec in production and the leak seems to be fixed, with no other issues.

ibc commented on September 27, 2024

I'll release in a few days, when I am back in the real world.
