Comments (34)
You do have access; it's just a hidden method in the documentation because you're not supposed to use or rely on it normally, but it might be helpful for debugging purposes.
from mediasoup.
Clear, thanks, so there is a clear bug then. I will dig into this next week. Can you please copy your exact previous comment into the issue description?
Ok, I have a fix. Will write a PR, probably tomorrow.
@PaulOlteanu, mediasoup crate 0.17.0 has been published.
Does your application start the `Worker` statically at server startup? In other words, do you ever drop a `Worker`?
Yes, we only create one worker at startup and never drop it
If possible, run `worker.dump` to make sure nothing is running in the worker. Also, have you confirmed that the router, transport, producer, consumer, etc. on the Rust side are all dropped?
We don't seem to have access to a `worker.dump` method from the Rust bindings.

I added counters that get incremented on the creation of all routers, transports, producers, and consumers, and decremented in the `on_close` callbacks. After having everyone disconnect they return to 0, so I believe everything is getting dropped.
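The bookkeeping described above can be sketched in a few lines of standalone Rust. This is a minimal illustration only; the struct and method names are made up for the sketch and are not part of the mediasoup-rust API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch: bump a counter when an entity (router/transport/producer/
// consumer) is created, and decrement it in the corresponding on_close
// callback. If everything is dropped correctly, all counters return to zero
// once every client has disconnected.
#[derive(Default)]
pub struct EntityCounters {
    pub routers: AtomicUsize,
    pub transports: AtomicUsize,
}

impl EntityCounters {
    // Call on entity creation.
    pub fn created(counter: &AtomicUsize) {
        counter.fetch_add(1, Ordering::SeqCst);
    }

    // Call from the entity's on_close callback.
    pub fn closed(counter: &AtomicUsize) {
        counter.fetch_sub(1, Ordering::SeqCst);
    }

    pub fn all_zero(&self) -> bool {
        self.routers.load(Ordering::SeqCst) == 0
            && self.transports.load(Ordering::SeqCst) == 0
    }
}

fn main() {
    let counters = EntityCounters::default();
    EntityCounters::created(&counters.transports);
    EntityCounters::created(&counters.transports);
    EntityCounters::closed(&counters.transports);
    EntityCounters::closed(&counters.transports);
    // After everyone disconnects, every counter should be back at zero.
    assert!(counters.all_zero());
}
```

A non-zero counter after all clients disconnect points at a struct still held somewhere on the Rust side, which is exactly what this check rules out.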
Here are the results of a worker dump from a worker that had been running for 2 hours before I took it offline:

```
WorkerDump {
    router_ids: [RouterId(3e91ba78-9c9c-44b1-aac0-ad31cab9e210)],
    webrtc_server_ids: [WebRtcServerId(4dd58dce-29ac-4531-8466-076847ee15e9)],
    channel_message_handlers: ChannelMessageHandlers {
        channel_request_handlers: [
            61b86584-ea8c-4286-80e4-d55a83e29d5d,
            4dd58dce-29ac-4531-8466-076847ee15e9,
            3e91ba78-9c9c-44b1-aac0-ad31cab9e210,
            7c8d0474-c584-47bd-b327-7d7a9f5679bb
        ],
        channel_notification_handlers: [
            61b86584-ea8c-4286-80e4-d55a83e29d5d,
            7c8d0474-c584-47bd-b327-7d7a9f5679bb
        ]
    },
    liburing: None
}
```
I don't know what `channel_message_handlers`, `channel_request_handlers`, and `channel_notification_handlers` refer to. The WebRTC server is expected because we create a static one per worker. The single remaining router is also expected, as we have a global producer to work around a bug in our client. This memory issue was occurring before I moved that producer to its own global router, so I don't believe it's the cause of this issue, but I can try removing it if you think it would make a difference.
I'm afraid we cannot manage this potential memory leak as a bug here. There is nothing we can specifically investigate to conclude whether this is a real memory leak (a real bug in mediasoup) or some OS memory management behavior, so I'm afraid I will close this issue. If a real leak is detected (I mean, by looking at the code), then please comment here or file a new issue.

For example, here is a real memory leak bug #1382 that is being fixed in PR #1383.
Okay, I got a valgrind dump (while the server was still running: https://stackoverflow.com/questions/9720632/can-valgrind-output-partial-reports-without-having-to-quit-the-profiled-applicat):
```
==1== 15,138,816 bytes in 231 blocks are possibly lost in loss record 3,262 of 3,262
==1==    at 0x4886B58: operator new[](unsigned long) (vg_replace_malloc.c:640)
==1==    by 0xCC4D9F: onAlloc(uv_handle_s*, unsigned long, uv_buf_t*) (in /usr/bin/mercury)
==1==    by 0xA7BA23: uv__read (in /usr/bin/mercury)
==1==    by 0xA7C573: uv__stream_io (in /usr/bin/mercury)
==1==    by 0xA83057: uv__io_poll (in /usr/bin/mercury)
==1==    by 0xA76B23: uv_run (in /usr/bin/mercury)
==1==    by 0xA86ABB: DepLibUV::RunLoop() (in /usr/bin/mercury)
==1==    by 0xA91787: Worker::Worker(Channel::ChannelSocket*) (in /usr/bin/mercury)
==1==    by 0xA8566F: mediasoup_worker_run (in /usr/bin/mercury)
==1==    by 0x60B35F: {closure#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>> (utils.rs:77)
==1==    by 0x60B35F: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154)
==1==    by 0x60D7CF: UnknownInlinedFun (mod.rs:529)
==1==    by 0x60D7CF: UnknownInlinedFun (unwind_safe.rs:272)
==1==    by 0x60D7CF: UnknownInlinedFun (panicking.rs:552)
==1==    by 0x60D7CF: UnknownInlinedFun (panicking.rs:516)
==1==    by 0x60D7CF: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<mediasoup::worker::utils::run_worker_with_channels::{closure_env#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>>, ()>>, ()> (panic.rs:142)
==1==    by 0x60D7CF: {closure#1}<mediasoup::worker::utils::run_worker_with_channels::{closure_env#0}<mediasoup::worker::{impl#6}::new::{async_fn#0}::{closure_env#0}<mediasoup::worker_manager::{impl#3}::create_worker::{async_fn#0}::{closure_env#0}>>, ()> (mod.rs:528)
==1==    by 0x60D7CF: core::ops::function::FnOnce::call_once{{vtable.shim}} (function.rs:250)
==1==    by 0x17EFB87: call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> (boxed.rs:2007)
==1==    by 0x17EFB87: call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> (boxed.rs:2007)
==1==    by 0x17EFB87: std::sys::unix::thread::Thread::new::thread_start (thread.rs:108)
```
That call stack lines up with the flamegraphs I posted earlier, but the `operator new[](unsigned long)` stands out. Is that referring to this inside of `TcpConnectionHandle::OnUvReadAlloc`, which is called from `onAlloc` inside of `TcpConnectionHandle.cpp`? I see that in `UdpConnectionHandle` the buffer is static, so the call stack can't be referring to the `onAlloc` defined in `UdpConnectionHandle`, right (since there's no `new`)?

I see that the buffer gets freed in the destructor of `TcpConnectionHandle`, but I have no idea what the lifecycle of the `TcpConnectionHandle` looks like, so I'm not sure what could be keeping it around if this is the issue. I'll keep digging to try and figure it out, but of course I'd appreciate pointers on what to look for.
On UDP the buffer is static because we receive datagrams that can only contain a complete packet. That's not true for TCP, where we receive bytes in arbitrary chunks until we detect a complete packet.

I won't be able to check this in the next few weeks, but I will reopen the ticket. Any more debugging you can do is welcome. Thanks.
I've found that dropping the `WebRtcServer` struct associated with each worker frees that memory.

Before dropping the `WebRtcServer`:

After dropping the `WebRtcServer` (the allocations for that call stack are completely gone):

So it seems the connections associated with the server aren't being removed, so those buffers aren't being `delete`d.
So you mean this is a problem just on the Rust side?
I'm not sure. I'm still not super familiar with the mediasoup codebase, so I'm just trying to understand when the destructor of `TcpConnectionHandle` should actually be called to free that allocation.

It seems to get deleted from `TcpServerHandle::OnTcpConnectionClosed`, which seems like it should get called from `TcpConnectionHandle::OnUvRead` when it detects the connection was closed, in which case the lifecycle of the `TcpConnectionHandle` is unrelated to the Rust side (I could be wrong here).

I'm going to look at what the C++ worker does when it receives a `WorkerWebrtcserverClose` message from Rust and see why that would clear out those connections.
The `TcpConnectionHandle` is destroyed when:

- The client closes the TCP connection, so that callback is called and in there the instance is freed.
- The app closes the `WebRtcTransport` via its `close()` method.
- The app closes the parent `Router` or `Worker`.
- The app closes the `WebRtcServer` that is managing that `WebRtcTransport` (in case it was created with a `WebRtcServer`).

All of this is heavily tested on the C++ side, but of course I may be wrong and something may be off.
So far what I can tell is this:

- The `on_close` callbacks for all `WebRtcTransport`s, `Router`s, `Consumer`s, and `Producer`s are being called correctly (so none of those structs are still around on the Rust side).
- If I call `WebRtcServer::dump` in Rust while the server is under normal use, I get a long list of transport ids. If I call `WebRtcServer::dump` once everyone has disconnected, there are no transport ids listed. This information seems to be filled here in `WebRtcServer.cpp`, so it seems to me even the `WebRtcServer` sees the transports as closed as well.
- The same as the above applies for routers (in the corresponding call to `Worker::dump`).

I'm quite confused how the handle isn't being closed given all this.
I see that the destructor of `WebRtcTransport` does `delete tcpServer` on its list of `TcpServer`s, but when constructed with a `WebRtcServer`, it never actually inserts anything into that list.

So I think when the transport is created with a server, it doesn't close the connection when the transport is closed? I must be missing something...
Try `lsof` to check if there is a TCP connection leak; if so, there will be many connections stuck in `CLOSE_WAIT`.
`lsof -n | grep -i "close_wait"` doesn't show any connections in `CLOSE_WAIT`.
I added a log to the destructor of `UdpSocketHandle` (I know this issue is with TCP, but it's easier for me to test with a UDP connection), and I see that it is not destroyed when the transport is closed, but only once I drop the `WebRtcServer`.

I'll try again with a TCP connection, but I think the issue might actually be that the connections aren't closed on transport drop when using a `WebRtcServer`.
UDP sockets must remain open until the `WebRtcServer` is destroyed because the same UDP sockets are shared by all `WebRtcTransport`s created using that `WebRtcServer`.

TCP is different. TCP servers are shared by all `WebRtcTransport`s using the same `WebRtcServer`, but each TCP connection belongs to its corresponding transport. Here is where the bug could be.
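To make that ownership difference concrete, here is a toy Rust model of what the comment describes (all names are illustrative; this is not mediasoup code): the listening TCP server is owned by the `WebRtcServer` and shared, while each accepted connection belongs to exactly one transport and must be removed when that transport closes.

```rust
use std::collections::HashMap;

// Toy ownership model: listening TCP servers live as long as the WebRtcServer,
// while each accepted TCP connection is owned by a single transport.
#[derive(Default)]
pub struct WebRtcServerModel {
    pub tcp_servers: Vec<&'static str>, // shared listeners, live until server drop
    pub connections: HashMap<u32, u32>, // connection id -> owning transport id
}

impl WebRtcServerModel {
    // Closing a transport must also remove the TCP connections it owns.
    // The reported leak is consistent with this step being skipped.
    pub fn close_transport(&mut self, transport_id: u32) {
        self.connections.retain(|_, owner| *owner != transport_id);
    }
}

fn main() {
    let mut server = WebRtcServerModel::default();
    server.tcp_servers.push("0.0.0.0:44444");
    server.connections.insert(1, 100); // connection 1 owned by transport 100
    server.connections.insert(2, 200); // connection 2 owned by transport 200

    server.close_transport(100);
    assert!(!server.connections.contains_key(&1)); // transport 100's connection gone
    assert!(server.connections.contains_key(&2));  // other transports unaffected
    assert_eq!(server.tcp_servers.len(), 1);       // shared listener stays open
}
```

In this model, forgetting the `close_transport` cleanup step leaves entries in `connections` alive until the whole server is dropped, which matches the behavior reported above.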
BTW, I'm not sure when I will be able to jump into this; I am off for some days.
I've found a way to somewhat reliably reproduce this issue in our staging environment.

I enable only TCP on the server, then from an Android client I create a `WebRtcTransport`, get it to the connected state, and then turn airplane mode on. I see the destructor for the `WebRtcTransport` run, but not the destructor for the `TcpConnectionHandle`.

There is definitely a bug when a `WebRtcTransport` is created with a `WebRtcServer`: it doesn't close the connection (without the `WebRtcServer` it works fine because it deletes the `tcpServer`, but with a `WebRtcServer` this doesn't happen).

Most of the time the connection closes itself (I think from the read loop receiving `UV_EOF` or `UV_ECONNRESET` or an error), but when closed improperly, for example by turning airplane mode on, it doesn't seem to close itself and then also doesn't get closed by the transport closing.

I'm not sure what the best way to proceed is. I can try to find out why `TcpConnectionHandle::OnUvRead` isn't properly closing itself (my guess would be that we're removing the handle from the uv loop), or I can try to find a way to close it from the `WebRtcTransport` destructor in a way that won't result in a double free :p
NOTE: the `WebRtcTransport` NEVER auto-closes itself, even if the client closed its TCP side by sending TCP FIN or RST to mediasoup. In those cases the `WebRtcTransport` emits an "iceconnectionstate" event with state "disconnected" (or "closed", I don't remember) and the Node application may want to close the transport by calling its `close()` method. I assume this is what your Node app does.

So in your case you say that the client just abruptly disconnects without sending TCP FIN or TCP RST. Then it's up to the Node application to detect that the client went away and close the transport. There is also an option in `WebRtcTransportOptions` to monitor ICE status by expecting periodic ICE binding requests (AKA PING messages) from client to mediasoup. If those stop arriving, the `WebRtcTransport` emits an "icestatechange" event with state "disconnected" and the Node app can close the transport. But if such a mechanism is disabled, then mediasoup cannot do anything on its own to detect the disconnection.

So let me ask a question: when that Android client abruptly disconnects its TCP connection side, is your Node app calling `webRtcTransport.close()` or not? If not, then this is not a leak in mediasoup but incorrect usage.
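The application-side rule described here can be sketched as a small Rust decision function. The enum and function names below are assumptions made for the illustration, not the real mediasoup-rust types:

```rust
// Illustrative sketch of the app-side rule described above: when the ICE
// state reported for a transport becomes "disconnected" (or "closed"), the
// application itself should call the transport's close() method. The enum
// below is an assumption for the sketch, not the real mediasoup-rust type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IceState {
    New,
    Connected,
    Completed,
    Disconnected,
    Closed,
}

// Decide whether the app should close the transport for a given ICE state.
pub fn should_close_transport(state: IceState) -> bool {
    matches!(state, IceState::Disconnected | IceState::Closed)
}

fn main() {
    assert!(!should_close_transport(IceState::Connected));
    assert!(should_close_transport(IceState::Disconnected));
}
```

The point of the comment stands either way: mediasoup only reports the state change; the call to `close()` has to come from the application.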
So this is a Rust application, but the same logic applies obviously. We don't close the transport when we see an "icestatechange", so maybe we need to add that. But that is not the issue here:

We notice the client disconnect via our signaling layer (a websocket). We close the transport (I can see that the transport is closed because of the transport destructor logs). The `TcpConnectionHandle` does not get closed (I can see that the log in its destructor does not appear). Once I close the `WebRtcServer`, I see the `TcpConnectionHandle` close (again, through the log in its destructor).
What I meant earlier by "`TcpConnectionHandle` closing itself" is that `TcpConnectionHandle::OnUvRead` handles a connection close by calling `OnTcpConnectionClosed`:

```cpp
// Client disconnected.
else if (nread == UV_EOF || nread == UV_ECONNRESET)
{
	MS_DEBUG_DEV("connection closed by peer, closing server side");

	this->isClosedByPeer = true;

	// Close server side of the connection.
	Close();

	// Notify the listener.
	this->listener->OnTcpConnectionClosed(this);
}
```
which then deletes that connection (and thus frees the buffer):

```cpp
inline void TcpServerHandle::OnTcpConnectionClosed(TcpConnectionHandle* connection)
{
	MS_TRACE();

	MS_DEBUG_DEV("TCP connection closed");

	// Remove the TcpConnectionHandle from the set.
	this->connections.erase(connection);

	// Notify the subclass.
	UserOnTcpConnectionClosed(connection);

	// Delete it.
	delete connection;
}
```
I believe sometimes this code path is hit, and the connection is correctly freed. When I disconnected my client by turning airplane mode on, I believe that code path isn't getting hit, and so the connection is not correctly freed.
> I believe sometimes this code path is hit, and the connection is correctly freed. When I disconnected my client by turning airplane mode on, I believe that code path isn't getting hit, and so the connection is not correctly freed.
So in this case of course that path is not hit, because the client didn't send any TCP FIN or RST. But then your app detects the WebSocket disconnection (or whatever) and finally calls `transport.close()`, doesn't it? And assuming the answer is yes, doesn't that `transport.close()` destroy the transport and its buffer and its associated `TcpConnectionHandle`s? Sorry, I'm not at my computer for some days, so maybe you confirmed this already and I don't remember.
Ok, testing. This is even worse, since I'm getting a crash when closing the `WebRtcServer` via its `close()` method. Reporting it in a different ticket:
Ok, so I'm testing this scenario running the mediasoup-demo locally:

- Connect a browser with `forceTcp=true&consume=false` to ensure a single `WebRtcTransport` is created with a single `RTC::TcpConnection`.
- Browser is connected at ICE and DTLS levels.
- In the server interactive console, enter terminal mode by typing "t" + ENTER.
- Then type `transport.close()`.

Here are the logs. Notice that I've added dump logs in the constructors, destructors and `close()` methods of `TcpConnectionHandle` and `RTC::Transport` and a few others.
During the initial connection the browser creates 2 TCP connections, then it closes one of them (expected):

```
TcpConnectionHandle::TcpConnectionHandle() | ++++++++++ TcpConnectionHandle constructor
RTC::TcpConnection::TcpConnection() | ++++++++++ RTC::TcpConnection constructor
TcpConnectionHandle::TcpConnectionHandle() | ++++++++++ TcpConnectionHandle constructor
RTC::TcpConnection::TcpConnection() | ++++++++++ RTC::TcpConnection constructor
TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close()
RTC::TcpConnection::~TcpConnection() | ---------- RTC::TcpConnection DESTRUCTOR
TcpConnectionHandle::~TcpConnectionHandle() | ----------- TcpConnectionHandle DESTRUCTOR
```
Now I call `transport.close()` on the server side:

```
terminal> TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close()
RTC::WebRtcTransport::~WebRtcTransport() | ---------- calling this->webRtcTransportListener->OnWebRtcTransportClosed()
```

So the problem is that we are calling `TcpConnectionHandle::Close()` but we are NOT calling its destructor, so we are not freeing `this->buffer`:
```cpp
TcpConnectionHandle::~TcpConnectionHandle()
{
	MS_TRACE();

	MS_DUMP("----------- TcpConnectionHandle DESTRUCTOR");

	if (!this->closed)
	{
		Close();
	}

	delete[] this->buffer;
}

void TcpConnectionHandle::Close()
{
	MS_TRACE();

	MS_DUMP("----------- TcpConnectionHandle Close()");

	if (this->closed)
	{
		return;
	}

	int err;

	this->closed = true;

	// Tell the UV handle that the TcpConnectionHandle has been closed.
	this->uvHandle->data = nullptr;

	// Don't read more.
	err = uv_read_stop(reinterpret_cast<uv_stream_t*>(this->uvHandle));

	if (err != 0)
	{
		MS_ABORT("uv_read_stop() failed: %s", uv_strerror(err));
	}

	// If there is no error and the peer didn't close its connection side then close gracefully.
	if (!this->hasError && !this->isClosedByPeer)
	{
		// Use uv_shutdown() so pending data to be written will be sent to the peer
		// before closing.
		auto* req = new uv_shutdown_t;
		req->data = static_cast<void*>(this);
		err = uv_shutdown(
		  req, reinterpret_cast<uv_stream_t*>(this->uvHandle), static_cast<uv_shutdown_cb>(onShutdown));

		if (err != 0)
		{
			MS_ABORT("uv_shutdown() failed: %s", uv_strerror(err));
		}
	}
	// Otherwise directly close the socket.
	else
	{
		uv_close(reinterpret_cast<uv_handle_t*>(this->uvHandle), static_cast<uv_close_cb>(onCloseTcp));
	}
}
```
So honestly I don't know where we are calling `TcpConnectionHandle::Close()` after we call the destructor of `WebRtcTransport`, second log line here:

```
terminal> transport.close()
terminal> RTC::WebRtcTransport::~WebRtcTransport() | ----------- WebRtcTransport DESTRUCTOR
TcpConnectionHandle::Close() | ----------- TcpConnectionHandle Close() // <----- why are we calling this??
RTC::WebRtcTransport::~WebRtcTransport() | ---------- calling this->webRtcTransportListener->OnWebRtcTransportClosed()
```

UPDATE: Ah, ok, it's in `delete this->iceServer` in `~WebRtcTransport`.
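The root cause described in this thread (`Close()` runs but the destructor never does, so `this->buffer` stays allocated) can be modeled in a few lines of standalone Rust. This is a toy model of the lifecycle, not mediasoup code:

```rust
// Toy model of the C++ lifecycle above: `close()` only stops I/O and marks the
// connection closed; the receive buffer is freed by `destroy()` (standing in
// for the C++ destructor). Calling close() without ever destroying the handle
// leaves the buffer allocated, which is the per-connection leak observed.
pub struct ConnectionModel {
    buffer: Option<Vec<u8>>,
    closed: bool,
}

impl ConnectionModel {
    pub fn new(buffer_size: usize) -> Self {
        Self { buffer: Some(vec![0u8; buffer_size]), closed: false }
    }

    // Mirrors TcpConnectionHandle::Close(): stop reading, keep the buffer.
    pub fn close(&mut self) {
        self.closed = true;
    }

    // Mirrors ~TcpConnectionHandle(): close if needed, then free the buffer.
    pub fn destroy(&mut self) {
        if !self.closed {
            self.close();
        }
        self.buffer = None;
    }

    // Bytes still held by this handle.
    pub fn retained_bytes(&self) -> usize {
        self.buffer.as_ref().map_or(0, |b| b.len())
    }
}

fn main() {
    let mut conn = ConnectionModel::new(65536);
    conn.close();
    // Closed but never destroyed: the buffer is still held.
    assert_eq!(conn.retained_bytes(), 65536);
    conn.destroy();
    assert_eq!(conn.retained_bytes(), 0);
}
```

With one such handle left per abruptly-disconnected TCP client, the retained buffers add up to exactly the kind of growth valgrind reported earlier in the thread.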
Issue accidentally closed because I pushed to the v3 branch by accident. Reverted. PR here: #1389
PR merged. Will release a new version soon. Thanks a lot @PaulOlteanu.
Thanks for fixing it so quickly @ibc. It'd be great to get a Rust release out with this fix as well, but we're already running 2cae8ec in production and the leak seems to be fixed, with no other issues.
I'll release in a few days, when I am back in the real world.