Regression: crash on startup

facebook commented on April 19, 2024
Regression: crash on startup

from fboss.

Comments (13)

bluecmd commented on April 19, 2024

Config: https://gist.github.com/bluecmd/82298eabea8a7a4bf7d447e263f125b5

bluecmd commented on April 19, 2024

My working theory is that https://github.com/facebook/fboss/blob/master/fboss/agent/hw/bcm/oss/BcmControlPlane.cpp never initializes queueManager_.

EDIT:
We've tried initializing queueManager_ like this: dhtech@837bf7f, without success. The current theory is that controlPlane_ in BcmSwitch might not be set correctly. initTables, for example, seems to set it, but I cannot find any call to that function.
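
To make the suspected failure mode concrete, here is a minimal sketch of an owning class whose `std::unique_ptr` member is never created. The names `ControlPlaneSketch` and `QueueManagerSketch` are hypothetical stand-ins, not the real fboss types, and the value 8 is arbitrary. The only point is that a default-constructed `std::unique_ptr` is null, so calling through it is undefined behavior unless something initializes it first.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a queue manager; not the real fboss class.
struct QueueManagerSketch {
  int multicastQueueCount() const { return 8; }  // arbitrary illustrative value
};

// Hypothetical stand-in for an object that owns a queue manager.
class ControlPlaneSketch {
 public:
  // When initQueueManager is false the member stays default-constructed (null).
  explicit ControlPlaneSketch(bool initQueueManager = false) {
    if (initQueueManager) {
      queueManager_ = std::make_unique<QueueManagerSketch>();
    }
  }

  bool hasQueueManager() const { return queueManager_ != nullptr; }

  // Guarded accessor: returns -1 instead of dereferencing a null pointer.
  int multicastQueueCountOrMinusOne() const {
    return queueManager_ ? queueManager_->multicastQueueCount() : -1;
  }

 private:
  std::unique_ptr<QueueManagerSketch> queueManager_;
};

// Usage: the uninitialized variant is detectable without crashing.
void exampleNullMember() {
  ControlPlaneSketch missing;
  assert(!missing.hasQueueManager());
  assert(missing.multicastQueueCountOrMinusOne() == -1);

  ControlPlaneSketch ready(true);
  assert(ready.multicastQueueCountOrMinusOne() == 8);
}
```

The same check applies one level up: if the pointer to the owning object itself is null, no amount of initialization inside its constructor helps, because the constructor never ran.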

capveg commented on April 19, 2024

Thanks for the report - can you confirm before the update everything was working correctly? (sorry if it's a dumb question - I'm trying not to sound too surprised :-)

capveg commented on April 19, 2024

Also, unfortunately it's not trivial for me to test the OSS internally, so if you could compile with debugging options (similar to what's described here: https://stackoverflow.com/questions/7724569/debug-vs-release-in-cmake), we can get a stack trace with line numbers and can make a better guess as to what broke. Thanks for the interest!

bluecmd commented on April 19, 2024

We were running at the stated commit, or at least one around that commit, in June and it certainly worked. The same switch was unboxed from storage and the only thing we did was:

  1. Modify the config to set port speed
  2. Realize we have to patch fboss to support the transceiver we want to use
  3. Compile a new fboss from head
  4. Install libsodium23 from Debian stretch backports as that seems to be a new dependency
  5. Start fboss - boom

We can certainly compile a version at the previously stated commit and run with it to verify the bisection window, and we will change the config back to the one we ran successfully in June.

bluecmd commented on April 19, 2024

And we will get back to you with the result of the cmake -DCMAKE_BUILD_TYPE=Debug run.

bluecmd commented on April 19, 2024

This is the log for the startup based on:

dhtech@r1a0:~/fboss$ git status
HEAD detached at e440fcd
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   fboss/agent/ApplyThriftConfig.cpp
	modified:   getdeps.sh

no changes added to commit (use "git add" and/or "git commit -a")
dhtech@r1a0:~/fboss$ git diff
diff --git a/fboss/agent/ApplyThriftConfig.cpp b/fboss/agent/ApplyThriftConfig.cpp
index cc087e7..79f2bd8 100644
--- a/fboss/agent/ApplyThriftConfig.cpp
+++ b/fboss/agent/ApplyThriftConfig.cpp
@@ -328,7 +328,7 @@ shared_ptr<SwitchState> ThriftConfigApplier::run() {
     // Make sure there is a one-to-one map between vlan and interface
     // Remove this sanity check if multiple interfaces are allowed per vlans
     auto& entry = vlanInterfaces_[vlanInfo.first];
-    if (entry.interfaces.size() != 1) {
+    if (entry.interfaces.size() > 1) {
       auto cpu_vlan = newState->getDefaultVlan();
       if (vlanInfo.first != cpu_vlan) {
         throw FbossError("Vlan ", vlanInfo.first, " refers to ",
diff --git a/getdeps.sh b/getdeps.sh
index ec60709..19f8fc2 100755
--- a/getdeps.sh
+++ b/getdeps.sh
@@ -117,9 +117,9 @@ NPROC=$(grep -c processor /proc/cpuinfo)
     fi
     # iproute2 v4.4.0
     update https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git 7ca63aef7d1b0c808da0040c6b366ef7a61f38c1
-    update https://github.com/facebook/folly.git
-    update https://github.com/facebook/wangle.git
-    update https://github.com/facebook/fbthrift.git
+    update https://github.com/facebook/folly.git v2018.06.04.00
+    update https://github.com/facebook/wangle.git v2018.06.04.00
+    update https://github.com/facebook/fbthrift.git v2018.06.04.00
     update https://github.com/no1msd/mstch.git
     update https://github.com/facebook/zstd.git
     update https://github.com/google/googletest.git release-1.8.0

That should be the state where we ran it in June.
This is the startup log:

dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 17:48:54.393641 23401 WedgeProductInfo.cpp:136] json parse error on line 0: expected json value
E0928 17:48:54.393896 23401 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
E0928 17:49:08.767557 23439 QsfpCache.cpp:166] Exception talking to qsfp_service: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0928 17:49:08.795991 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache
E0928 17:49:08.901234 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache

The transceiver error is because I don't have qsfp_service running. I currently have a weird QSFP plugged in that makes qsfp_service crash, but I don't believe it's relevant here (and I'm working remotely, so I cannot unplug it).
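
A side note on reading these logs: `St13runtime_error` is the mangled (Itanium C++ ABI) form of a type name, and several frames in stack traces like the one further down are mangled in the same way. The sketch below shows one way to decode such names with `abi::__cxa_demangle`, which is available with GCC and Clang; the helper name `demangle` is just an illustration and is not part of fboss.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol or type name (GCC/Clang only).
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
  int status = 0;
  char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || raw == nullptr) {
    return mangled;
  }
  std::string readable(raw);
  std::free(raw);  // the buffer returned by __cxa_demangle is malloc-allocated
  return readable;
}

// Usage: the exception type from the WedgePort.cpp log line.
void exampleDemangle() {
  assert(demangle("St13runtime_error") == "std::runtime_error");
}
```

The `_ZSt...` and `_ZN...` frames in a trace can be passed through the same helper, or through the `c++filt` command-line tool, to get readable function names.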

Building from master:

# docker run --name fboss_build_phase2 -v `pwd`:/tmp/code fboss/tmp /bin/sh -c "cd /tmp/code && mkdir -p build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. && make -j`grep -c processor /proc/cpuinfo`"
dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 18:20:48.462908 23573 WedgeProductInfo.cpp:140] json parse error on line 0: expected json value
E0928 18:20:48.463168 23573 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
I0928 18:20:48.507074 23573 Main.cpp:338] serving on localhost on port 5909
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
I0928 18:21:02.347016 23593 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
*** Aborted at 1538158862 (unix time) try "date -d @1538158862" if you are using GNU date ***
PC: @     0x564a3693b2d4 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 23573 (TID 0x7f70048f3700) from PID 16; stack trace: ***
    @     0x7f701be5e0c0 (unknown)
    @     0x564a3693b2d4 std::unique_ptr<>::get()
    @     0x564a3693b2f2 std::unique_ptr<>::operator->()
    @     0x564a369366a8 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
    @     0x564a3692470a facebook::fboss::BcmSwitch::getColdBootSwitchState()
    @     0x564a36926ccc facebook::fboss::BcmSwitch::init()
    @     0x564a36aae464 facebook::fboss::SwSwitch::init()
    @     0x564a369e2a03 facebook::fboss::Initializer::initImpl()
    @     0x564a369e2322 facebook::fboss::Initializer::initThread()
    @     0x564a369eab32 _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x564a369eaabf _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
    @     0x564a369eaa70 _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
    @     0x564a369eaa17 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x564a369ea853 std::_Bind_simple<>::operator()()
    @     0x564a369ea61a std::thread::_State_impl<>::_M_run()
    @     0x7f7014f45e6f (unknown)
    @     0x7f701be54494 start_thread
    @     0x7f70146baacf clone
    @                0x0 (unknown)
Segmentation fault

Not sure why the debug build didn't take effect. Do I need to do anything more than pass that flag to cmake?
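
Even without line numbers, the trace carries one hint worth spelling out. The handler reports the faulting address as `0x10`, and the faulting frame is `std::unique_ptr<>::get()` itself. A null `unique_ptr` that lives at a valid address can be read without faulting, so one possible reading (an interpretation, not something confirmed in this thread) is that the `unique_ptr` object itself sits at address `0x10`, which is what you would see for a member at offset 16 of an object whose `this` pointer is null. The sketch below only illustrates that arithmetic; `ControlPlaneLayoutSketch` is a hypothetical layout, since the real `BcmControlPlane` layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: two pointer-sized members ahead of the smart-pointer slot.
// This is an assumption for illustration, not the real BcmControlPlane layout.
struct ControlPlaneLayoutSketch {
  void* first;
  void* second;
  void* queueManagerSlot;  // stands in for the single pointer inside a unique_ptr
};

// Address a member read touches when the object's base address is `base`.
constexpr std::uintptr_t memberAddress(std::uintptr_t base, std::size_t offset) {
  return base + offset;
}

// Offset of the slot in the sketch struct: two pointers in.
constexpr std::size_t kSlotOffset =
    offsetof(ControlPlaneLayoutSketch, queueManagerSlot);

// Usage: with a null base, the faulting address equals the member offset.
void exampleFaultAddress() {
  assert(memberAddress(0, 0x10) == 0x10);
  assert(memberAddress(0, kSlotOffset) == 2 * sizeof(void*));
}
```

If that reading is right, it points at the owner of the control-plane object (for example a null `controlPlane_`) rather than at `queueManager_` alone.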

capveg commented on April 19, 2024

Sorry for the slow reply to this. I'm also surprised that the debug symbols didn't pop up for this - let me figure that out in parallel. Looking at the code and the error, I think the patch below may help. I'm trying to test it locally, but our ability to emulate a non-facebook setup inside facebook is limited :-(

If you get a chance to test this before I do, please let me know if it fixes the problem.

diff --git a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
--- a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
+++ b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
@@ -20,8 +20,9 @@

 std::shared_ptr<PortQueue> BcmControlPlaneQueueManager::getCurrentQueueSettings(
     cfg::StreamType /*streamType*/,
-    opennsl_cos_queue_t /*cosQ*/) const {
-  return std::shared_ptr<PortQueue>{};
+    opennsl_cos_queue_t cosQ) const {
+  // stub implementation - depends on newer OpenNSL
+  return std::make_shared<PortQueue>(cosQ);
 }
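
For readers skimming the diff, the behavioral difference can be sketched in isolation. The code below uses a hypothetical `PortQueueSketch` type rather than the real fboss `PortQueue`, and the function names are made up for illustration. It shows that the old stub returns an empty `std::shared_ptr`, the patched stub returns a real object, and a caller that checks before dereferencing tolerates both.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for fboss's PortQueue; only a queue id is modeled.
struct PortQueueSketch {
  explicit PortQueueSketch(std::uint8_t queueId) : id(queueId) {}
  std::uint8_t id;
};

// Before the patch: the stub hands back an empty (null) shared_ptr.
std::shared_ptr<PortQueueSketch> stubBefore(std::uint8_t /*cosQ*/) {
  return std::shared_ptr<PortQueueSketch>{};
}

// After the patch: the stub hands back a real object for the queue.
std::shared_ptr<PortQueueSketch> stubAfter(std::uint8_t cosQ) {
  return std::make_shared<PortQueueSketch>(cosQ);
}

// A caller that checks the pointer before dereferencing it.
int queueIdOrMinusOne(const std::shared_ptr<PortQueueSketch>& queue) {
  return queue ? static_cast<int>(queue->id) : -1;
}

// Usage: only the unpatched stub yields the sentinel value.
void exampleStubs() {
  assert(queueIdOrMinusOne(stubBefore(3)) == -1);
  assert(queueIdOrMinusOne(stubAfter(3)) == 3);
}
```

A change of this shape only helps if the crash comes from a caller dereferencing the returned pointer; it cannot help if the fault happens before the stub is ever reached.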

capveg commented on April 19, 2024

Also, could you please create a new issue for the qsfp_service segfault? Even if it is because you have a funky optic, it still shouldn't segfault. I won't promise I can fix it promptly, but it's still good to track.

Thank you again!

bluecmd commented on April 19, 2024

Done: #77. I'm compiling with your patch right now. I had to pull in my OpenNSL 3.5.0.1 changes as well, but I guess they should be orthogonal to this issue.

bluecmd commented on April 19, 2024

@capveg Sadly, the patch does not seem to work: https://gist.github.com/bluecmd/bd16185170dff642de197e34349aa14c

I wish the stack trace were more useful. Granted, I did not build that particular binary with the cmake debug settings. I can redo that and spend some time trying to get line numbers if you feel it would be the next logical step.

cubic1271 commented on April 19, 2024

Quick note to say that I can reproduce this behavior (and this trace) on a Wedge 100, using master as of around a month ago. The stack trace is similarly not very useful. I did try applying the patch described above, but it made no difference.

In our case, we also don't have a functional FBOSS version to revert to, because the infrastructure for a Wedge 100 FBOSS on Open Network Linux seems to be pretty broken at the moment (for quite a few other reasons). I will try a build of e440fcd and see if that gets us going, though.

arnauddorgans commented on April 19, 2024

Still seeing a crash on startup in EventBase::bumpHandlingTime in 2021: facebook/flipper#2577
