Comments (19)

ibc commented on May 29, 2024

@jmillan and I have discussed this at length and we have identified the problem. I'm explaining it in detail in a new issue: #1352

I'm closing this issue, so please let's move the discussion to the new one.

CC @satoren @PaulOlteanu @nazar-pc @GEverding

PaulOlteanu commented on May 29, 2024

Yeah, I've been using mediasoup 0.15 since January 8 and never saw this issue in a single-worker setup. As soon as I tried multiple workers I saw many of these kinds of errors within 24 hours.

I did experiment with multiple workers for a week a few months ago and I also never saw this issue (I think I was using mediasoup 0.12 at the time, which was before the flatbuffers change).

nazar-pc commented on May 29, 2024

@jmillan is it possible that some non-thread-local statics were introduced recently?
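
For clarity, this is the kind of thing I mean. A minimal sketch (hypothetical names, not mediasoup code) of the difference between a plain static shared by all Worker threads and a thread_local one:

#include <cstdio>
#include <thread>

static int sharedState = 0;                 // one instance for the whole process
thread_local static int perWorkerState = 0; // one instance per thread

void WorkerBody()
{
	++sharedState;    // racy: every Worker thread mutates the same variable
	++perWorkerState; // safe: each Worker thread increments its own copy
	std::printf("shared=%d, perWorker=%d\n", sharedState, perWorkerState);
}

int main()
{
	std::thread worker1(WorkerBody);
	std::thread worker2(WorkerBody);
	worker1.join();
	worker2.join();
}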

jmillan commented on May 29, 2024

This seems related to flatbuffers. I'll be checking it in a few hours.

@PaulOlteanu, I'd suggest using a single worker if possible while we troubleshoot this.

jmillan commented on May 29, 2024

For now I don't see any statics that are not thread-local.

Rust definitely cannot read the incoming buffer properly because it is malformed.

In the following case the payload is a log message (LogRef), but Rust interprets it as a notification (NotificationRef).

EDIT: The following message contains a Notification (DataconsumerBufferedAmountLow) and also the log "WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment", which clearly shows the corruption.

thread 'mediasoup-worker-cd9eec89-8672-4293-b4eb-3a857fdd0341' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }

jmillan commented on May 29, 2024

@PaulOlteanu, can you confirm this only happens when using multiple workers?

jmillan commented on May 29, 2024

@PaulOlteanu, I assume you're back to using a single worker. Please let us know if you get the crash in this scenario. So far we've been unable to reproduce the crash with multiple workers enabled.

GEverding commented on May 29, 2024

I work with @PaulOlteanu. The majority of our traffic comes from iOS and Android. Could that be a factor in reproducing this crash? What other info do you need to help debug it?

nazar-pc commented on May 29, 2024

Ideally we'd have a reliable reproduction. If you can take one of the example apps (videoroom is a good one), modify it (if necessary) to reproduce the issue and share the code with us, that would be ideal.

PaulOlteanu commented on May 29, 2024

One thing that comes to mind is that we're running on aarch64, which I believe has weaker memory-model guarantees than x86. So if the bug is an incorrect atomic usage somewhere, that could explain why it isn't reproducing for you (though I have no idea why this issue would be related to atomics).
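
To illustrate what I mean by different memory-model guarantees, here is a minimal sketch (purely hypothetical, not mediasoup code) of a relaxed-atomic publish that typically appears to work on x86 but can misbehave on aarch64:

#include <atomic>
#include <cstdio>
#include <thread>

static int data = 0;
static std::atomic<bool> ready{ false };

int main()
{
	std::thread producer([] {
		data = 42;
		// Should be memory_order_release to publish `data` correctly;
		// with relaxed ordering there is no happens-before edge.
		ready.store(true, std::memory_order_relaxed);
	});
	std::thread consumer([] {
		// Should be memory_order_acquire to pair with the release store.
		while (!ready.load(std::memory_order_relaxed)) {}
		std::printf("data = %d\n", data); // may print 0 on weakly ordered CPUs
	});
	producer.join();
	consumer.join();
}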

I've also only seen this issue happen in production, where we have 1000-2000 users per server at the moment, so that would also make it difficult to reproduce.

PaulOlteanu commented on May 29, 2024

We tried running with multiple workers in prod again, with mediasoup logs set to debug level in the Rust filter and the following in the worker settings:

    let mut worker_settings = WorkerSettings::default();
    worker_settings.log_level = WorkerLogLevel::Debug;
    worker_settings.log_tags = vec![
        WorkerLogTag::Info,
        WorkerLogTag::Ice,
        WorkerLogTag::Dtls,
        WorkerLogTag::Rtp,
        WorkerLogTag::Srtp,
        WorkerLogTag::Rtcp,
        WorkerLogTag::Rtx,
        WorkerLogTag::Bwe,
        WorkerLogTag::Score,
        WorkerLogTag::Simulcast,
        WorkerLogTag::Svc,
        WorkerLogTag::Sctp,
    ];

We didn't see the same issue as previously mentioned, but instead saw a segfault happen twice. I'll experiment more tomorrow and see if I can get a core dump. In one of the logs for the segfault, this is the last output:

2024-02-07T01:11:12.541426Z DEBUG mediasoup::worker::channel: request() [method:TransportCloseDataproducer, id:56474]    
2024-02-07T01:11:12.541457Z DEBUG mediasoup::worker::channel: request() [method:RouterCloseTransport, id:56475]    
double free or corruption (out)

satoren commented on May 29, 2024

@PaulOlteanu
Are you using SCTP (DataProducer/DataConsumer)?
This has been an issue since versions before 0.15, and there is an open issue where it crashes when SCTP is turned on. So maybe this is the cause.

https://mediasoup.discourse.group/t/rarely-crashes-on-sendsctpdata-when-enablesctp-is-enabled/5340/5

Since I stopped using DataProducer, the priority was lowered and I did not investigate further. I'm sorry.

PaulOlteanu commented on May 29, 2024

Yes, we are using data producers & consumers, but we are phasing them out (less than 2% usage at this point). If the issue is with SCTP, I would be surprised that we didn't run into it the first time we tried multiple workers (pre-0.15).

Since then we've stopped pinning mediasoup worker threads to specific CPU cores, so unless there's some way the pinning was masking that issue (I don't really see how that's possible), I'm not sure it's the same thing.

Edit: I may have misunderstood. If you're referring to the segfault, then I guess that's possible; I haven't really looked into that issue (or the original issue), but I think I'll have time to focus on this again now.

satoren commented on May 29, 2024

When I investigated, it appeared that SCTP was running on the thread of the first Worker that executed, and the other Workers were sharing it. So I think it is possible that a data race between threads could corrupt the flatbuffers data being sent. With JSON (pre-0.15), the transmission method may have been less prone to such conflicts.
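
To make that suspicion concrete, here is a rough sketch (hypothetical names, not mediasoup code) of the kind of failure mode I mean: a callback registered process-wide by the first Worker keeps writing into that Worker's outgoing channel buffer while the Worker's own thread also writes to it, with no synchronization between the two threads:

#include <cstddef>
#include <thread>
#include <vector>

struct Channel
{
	std::vector<char> buffer; // serialized flatbuffers messages, no mutex

	void Write(const char* data, std::size_t len)
	{
		// Deliberately not thread-safe, to model the suspected race.
		buffer.insert(buffer.end(), data, data + len);
	}
};

static Channel* globalCallbackTarget = nullptr; // set by the first Worker

// Analogous to the usrsctp send callback: registered once for the whole
// process, but it may be invoked by activity on any Worker's thread.
void OnSctpSendData(const char* data, std::size_t len)
{
	globalCallbackTarget->Write(data, len);
}

int main()
{
	Channel firstWorkerChannel;
	globalCallbackTarget = &firstWorkerChannel;

	// The first Worker's own thread writes notifications/logs to its channel...
	std::thread workerThread([&] {
		for (int i = 0; i < 100000; ++i)
			firstWorkerChannel.Write("notification", 12);
	});

	// ...while SCTP activity triggered by another Worker invokes the global
	// callback concurrently: two unsynchronized writers, corrupted messages.
	std::thread otherWorkerThread([&] {
		for (int i = 0; i < 100000; ++i)
			OnSctpSendData("sctp-data", 9);
	});

	workerThread.join();
	otherWorkerThread.join();
}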

ibc commented on May 29, 2024

Is this something that was not done properly when the Rust layer was created? In theory DepUsrSCTP.cpp should run a Checker singleton (which runs the usrsctp thread) per Worker even in Rust, rather than a single singleton shared by all threads when multiple Workers run in a Rust application. This was solved in many other areas using thread_local; perhaps it was not done here.

UPDATE: Looking at the code, it looks ok.

ibc commented on May 29, 2024

@nazar-pc perhaps you see something wrong in https://github.com/versatica/mediasoup/blob/v3/worker/src/DepUsrSCTP.cpp when running multiple mediasoup Worker threads in Rust?

The constructor in Worker.cpp creates a DepUsrSCTP::Checker singleton, which is defined as thread_local static Checker* checker in DepUsrSCTP.hpp.

The problem (not sure) is that we start that singleton when the first SctpAssociation instance is created (it calls DepUsrSCTP::RegisterSctpAssociation()) and stop it when the last running SctpAssociation instance is closed (it calls DepUsrSCTP::DeregisterSctpAssociation()). However, these methods are protected by mutex locks:

void DepUsrSCTP::DeregisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	auto found = DepUsrSCTP::mapIdSctpAssociation.erase(sctpAssociation->id);

	MS_ASSERT(found > 0, "SctpAssociation not found");
	MS_ASSERT(DepUsrSCTP::numSctpAssociations > 0u, "numSctpAssociations was not higher than 0");

	if (--DepUsrSCTP::numSctpAssociations == 0u)
	{
		DepUsrSCTP::checker->Stop();
	}
}

Then when a Worker is closed, lib.cpp "may" finish usrsctp stuff since it calls DepUsrSCTP::ClassDestroy(), but IMHO we are also protected here:

void DepUsrSCTP::ClassDestroy()
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);
	--GlobalInstances;

	if (GlobalInstances == 0)
	{
		usrsctp_finish();

		numSctpAssociations   = 0u;
		nextSctpAssociationId = 0u;

		DepUsrSCTP::mapIdSctpAssociation.clear();
	}
}

@satoren do you see something here that may produce the crash when running multiple Workers in Rust within the same process?

satoren commented on May 29, 2024

@ibc

usrsctp_init_nothreads(0, onSendSctpData, sctpDebug);

Although I cannot be sure because it has been some time since our investigation, it appeared that this callback (onSendSctpData) was being executed on the thread of the first Worker that was created.

UPDATE:
My guess is that this concern is related.

However, usrsctp lacks documentation and the code is hard to read and appears impossible to debug.

ibc commented on May 29, 2024

That makes sense. IMHO the problem is not related to usrsctp being multithreaded, since we use the single-threaded API, so that shouldn't be the problem.

The problem is that usrsctp uses global state (such as the send callback passed to the usrsctp_init_nothreads() function quoted above).

We set that callback within a Worker. When that Worker is closed (in this case, its thread in a multi-worker Rust app), we don't explicitly finish usrsctp since other Workers are still running (see the counter in DepUsrSCTP.cpp). However, the thread is closed anyway and usrsctp probably keeps calling the callback. I have no idea what happens in this scenario, but it doesn't look good.

We use the single-threaded version of usrsctp in mediasoup because originally (in Node apps) there is indeed a single thread, and if the app creates N Workers, each runs in a separate process with its own, separate usrsctp stack. However, in Rust apps we run N threads when we launch N Workers.

Given that, I wonder whether we should instead use the multi-threaded API of usrsctp when in Rust, creating a specific thread for usrsctp (that is never closed) and passing the send callback within it. Or we keep using the single-threaded API of usrsctp but create a separate thread (that we never close while there are running Workers) and pass the send callback to usrsctp within it.
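
As a rough sketch of the second option (hypothetical names, not actual mediasoup code, and omitting shutdown handling since the thread is intentionally never closed): a single long-lived thread owns the usrsctp global state, Workers post their usrsctp work to it, and the send callback is always invoked from it:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class UsrSctpThread
{
public:
	UsrSctpThread() : thread([this] { Run(); }) {}

	// Workers enqueue usrsctp-related work here instead of touching the
	// global usrsctp state from their own threads.
	void Post(std::function<void()> task)
	{
		{
			const std::lock_guard<std::mutex> lock(mutex);
			tasks.push(std::move(task));
		}
		cv.notify_one();
	}

private:
	void Run()
	{
		// usrsctp_init_nothreads(...) would be called here, so the usrsctp
		// global state and its send callback always belong to this thread,
		// which outlives individual Workers.
		for (;;)
		{
			std::unique_lock<std::mutex> lock(mutex);
			cv.wait(lock, [this] { return !tasks.empty(); });
			auto task = std::move(tasks.front());
			tasks.pop();
			lock.unlock();
			task();
		}
	}

	std::mutex mutex;
	std::condition_variable cv;
	std::queue<std::function<void()>> tasks;
	std::thread thread; // declared last so the queue/mutex exist before Run() starts
};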

@nazar-pc does it make sense?

nazar-pc commented on May 29, 2024

If there is a multi-threaded API in usrsctp, we probably should use it.
