OpenNSL 3.5.0.1 report (fboss issue, status: open)

bluecmd commented on April 19, 2024
OpenNSL 3.5.0.1 report

from fboss.

Comments (21)

shri-khare commented on April 19, 2024

Thank you for reporting this.

Could you please share more details about the crash dump, for example the stack trace?
Also, which SDK was the FBOSS agent built against when the crash was observed?
The post mentions 'reported crash in getdeps.sh'. Which crash is that? Could you please share more details about it?

bluecmd commented on April 19, 2024

Hi,

I'm reporting the absence of a crash and recommending that you either reconsider or provide more details about the crash referred to here:

# SIGSEV in opennsl_pkt_alloc()

Essentially "It works for us".

capveg commented on April 19, 2024

Hi @bluecmd,

Are you sure you didn't have to do anything else to get opennsl 3.5.0.1 working? I can definitely believe that the crash in opennsl_pkt_alloc() has been fixed but there were a number of other changes that were required to get opennsl 3.5.0.1 working -- trivially, opennsl_driver_init()'s prototype changed -- see my changes here to at least get it to compile: #65

And even after it compiled, it was my experience that all of the packet forwarding was broken because the initialization process was quite different.

If you have it working, we'd definitely appreciate understanding how, because if we could update to OpenNSL 3.5.0.1, we could unlock a bunch of previously unreleased changes (e.g., ACLs) that depend on newer versions.

Please confirm and let us know - thanks as always for the interest!

bluecmd commented on April 19, 2024

Hi @capveg. I admit it's a bit sneaky, but if you click on "OpenNSL 3.5.0.1" in my report you get the diff of the patch, and you'll see the actual code changes that we did.

Since we have Wedges, graciously donated by FB, running with ONL + FBOSS, we're more than happy to help you collect any data that you need to debug any issues. As far as we've seen, though, It Just Works(TM) with the somewhat trivial patch of essentially only changing the opennsl_driver_init call.

EDIT: Direct link to what I'm talking about here: https://github.com/dhtech/fboss/pull/4/files#diff-941e4fb204c29b957373093d97373880
EDIT x2: And we also needed to specify OPENNSL_CONFIG_FILE=/etc/config.wedge40 as the environment of course.
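
For anyone reproducing this, the environment requirement can be sketched as below. This is only an illustration, not how dhtech actually starts the agent: the helper names are hypothetical, the agent path is an assumed install location, and no agent command-line flags are shown because none are given in this thread.

```python
import os
import subprocess

# Assumed install location of the agent binary; adjust for your system.
AGENT_PATH = "/usr/local/bin/wedge_agent"
# Config file path quoted in the comment above.
CONFIG_PATH = "/etc/config.wedge40"


def agent_environment(base_env=None, config_path=CONFIG_PATH):
    """Return a copy of the environment with OPENNSL_CONFIG_FILE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENNSL_CONFIG_FILE"] = config_path
    return env


def launch_agent(agent_path=AGENT_PATH, extra_args=()):
    """Start the agent with the OpenNSL config path in its environment.

    Hypothetical helper: extra_args is left empty because the thread does
    not say which flags the agent was started with.
    """
    return subprocess.Popen([agent_path, *extra_args], env=agent_environment())
```

Building the environment in a separate function keeps the variable out of the parent shell and makes the setting easy to check without starting the agent.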

capveg commented on April 19, 2024

Hmm... so your patch looks effectively identical to my patch, so I'm wondering why yours works. I saw a "Status: not working" in one of the comments there. Can you clarify? Just because the FBOSS agent logs "sending lldp to X" doesn't necessarily mean it's happening. Are you seeing that packet received on the other side? Sorry if this seems pedantic, but we've been (admittedly, slowly) debugging this for a while...

bluecmd commented on April 19, 2024

The "Status: not working" refers to my quest to downrate the serdes to support 1G line rate (Broadcom-Switch/OpenNSL#37).

No worries, I also wouldn't trust strangers on the internet.
What I can give you in terms of proof is the neighbouring Cisco switch receiving the LLDP frames and accepting them:

Switch#show lldp neighbors
Capability codes:
    (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
    (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID           Local Intf     Hold-time  Capability      Port ID
wedge1              Te1/0/2        120        R               XE5

Total entries displayed: 1

bluecmd commented on April 19, 2024

Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?

EDIT:

dhtech@wedge1:~$ strings /lib/modules/4.14.48-OpenNetworkLinux/linux-kernel-bde.ko | grep OpenNSL | head -n1
/home/bluecmd/OpenNSL/sdk-6.5.12-gpl-modules/include/sal

capveg commented on April 19, 2024

Thanks for all the info.

The kernel API is fairly stable so I'm not surprised that the 3.5.0.1 kernel modules work for older versions of OpenNSL. I wouldn't run that way long term (it's definitely not a tested setup :-), but not surprised it works. We run a fairly new kernel internally... let me confirm some details with some other folks and see if we can come up with a theory.

In any case, glad to hear this is working for you.

bluecmd commented on April 19, 2024

Just to add more data to keep myself honest:

dhtech@wedge1:~$ ldd /usr/local/bin/wedge_agent  | grep libopennsl
        libopennsl.so.1 => /usr/local/lib/libopennsl.so.1 (0x00007fe4583cc000)
dhtech@wedge1:~$ sudo find / -name libopennsl.so.1 | xargs sha1sum
c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2  /usr/local/lib/libopennsl.so.1

That matches with the Dec-27 release that's current in https://github.com/Broadcom-Switch/OpenNSL/tree/master/bin/wedge. So I'm pretty sure I'm not messing up the versioning on my end.
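
The same check can be scripted so it is easy to repeat on other boxes. The sketch below is a minimal illustration: the function names are made up for this example, and the expected digest is simply the one quoted above.

```python
import hashlib

# Digest quoted in the comment above for /usr/local/lib/libopennsl.so.1.
EXPECTED_SHA1 = "c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA1):
    """True if the file at path has the expected SHA-1 digest."""
    return sha1_of_file(path) == expected.lower()
```

Calling `matches_expected("/usr/local/lib/libopennsl.so.1")` on the switch would then report whether the loaded library is the same binary as the one discussed here.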

sonoble commented on April 19, 2024

> Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?

Can you provide any more information about the hacks? While compiling OpenNSL 3.5.0.1 for the 4.14 kernel, I fixed pci_enable_msix, copy_to/from_user and dev->trans_start = jiffies; but FBOSS is still having issues:

I1010 20:50:13.945568 4058 BcmSwitch.cpp:560] Initializing BcmSwitch for unit 0
*** Aborted at 1539204614 (unix time) try "date -d @1539204614" if you are using GNU date ***
PC: @ 0x560640774da2 std::unique_ptr<>::get()
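
As a small aside for readers without GNU date, the abort timestamp in that message can be decoded as sketched below (the function name is just for illustration). The result is one second after the glog line above, which would be consistent with the box logging in UTC.

```python
from datetime import datetime, timezone


def abort_time_utc(unix_seconds):
    """Convert the Unix time from the abort banner to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


print(abort_time_utc(1539204614).isoformat())  # 2018-10-10T20:50:14+00:00
```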

bluecmd commented on April 19, 2024

It was a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.

A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.

bluecmd commented on April 19, 2024

Ah @sonoble, looking at the last line of your report you're probably hitting #74. Not sure without the full stack trace however.

You can try using our fork that is using FBOSS from May with some patches applied: https://github.com/dhtech/fboss if you need it up and running right now.

sonoble commented on April 19, 2024

> It was a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.
>
> A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.

No one runs knet that I know of. Looks like your changes are the same as mine. I build the entire OpenNSL from source, so I just set the KERNEL_SRC and LINUX_UAPI_SPLIT="1".

I don't need FBOSS running right now, I was just trying to confirm that your patch worked for me on the 40's. I have been working on getting everything working on the 100S but in a totally different way, by removing the init from OpenNSL and having FBOSS handle it.

I will build your fboss and see if I can get it working.

Thank you!

sonoble commented on April 19, 2024

I built your fboss + the modified OpenNSL and while everything is running, there are no interfaces at all using your config or mine. I will dig more into it later.

sonoble commented on April 19, 2024

@bluecmd I don't see it in this thread: have you been able to confirm that packets other than LLDP are passing? We have seen LLDP packets before but were unable to ping or send any other traffic between boxes.

bluecmd commented on April 19, 2024

Only LLDP so far as well as normal L2 switching.

sonoble commented on April 19, 2024

Hi @bluecmd, I can confirm L2 and LLDP on the Wedge 100S but no L3 (packets are not making it to the CPU), so no routing protocols can be run. Could you assign an IP to a port and check whether you can ping it?
Thank you!

bluecmd commented on April 19, 2024

@sonoble Sure. Do you have any configuration to share to make the time commitment shorter on my part? Also, did this work on 6.4? We only use L2 stuff so I'm not very aware of the state of L3 in FBOSS.

sonoble commented on April 19, 2024

bluecmd commented on April 19, 2024

So these are my observations, as requested by @sonoble. This is with 3.5.0.1 and our FBOSS fork from May/June. We have never tried running this with the old FBOSS, so I have no idea whether this is a regression.

I added an L3 interface like this:

    "interfaces": [
        {
              "intfID": 10,
              "routerID": 0,
              "vlanID": 552,
              "ipAddresses": [
                    "10.32.12.250/24"
              ]
        }
    ]
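
To sanity-check a fragment like this before loading it, something along the following lines may help. It is a sketch only: the fragment is wrapped in braces here so it parses as standalone JSON, the function name is hypothetical, and it does not validate against the real FBOSS config schema.

```python
import ipaddress
import json

# The interface fragment from above, wrapped in braces so it is valid JSON.
CONFIG_FRAGMENT = """
{
    "interfaces": [
        {
            "intfID": 10,
            "routerID": 0,
            "vlanID": 552,
            "ipAddresses": ["10.32.12.250/24"]
        }
    ]
}
"""


def summarize_interfaces(config_text):
    """Return (intfID, vlanID, address, network) for every configured address."""
    config = json.loads(config_text)
    summary = []
    for intf in config["interfaces"]:
        for addr in intf["ipAddresses"]:
            parsed = ipaddress.ip_interface(addr)  # raises ValueError if malformed
            summary.append(
                (intf["intfID"], intf["vlanID"], str(parsed.ip), str(parsed.network))
            )
    return summary


for row in summarize_interfaces(CONFIG_FRAGMENT):
    print(row)  # (10, 552, '10.32.12.250', '10.32.12.0/24')
```

The derived network, 10.32.12.0/24, contains 10.32.12.1, the address that shows up in the ARP requests in the log.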

This configured an fboss10 interface that does see the incoming packets:

I1017 21:55:00.147141 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:00.920305 10744 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:00.920405 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:01.147235 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:01.245810 10746 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:01.246166 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:01.246147 10745 SwSwitch.cpp:787] preparing state update add pending entry 10.32.12.1
V1017 21:55:01.246336 10745 NeighborCacheImpl-defs.h:137] Adding pending entry for 10.32.12.1 on interface 10
I1017 21:55:01.246403 10745 SwSwitch.cpp:921] Updating state: old_gen=6 new_gen=7
V1017 21:55:01.246495 10745 BcmSwitch.cpp:1048] updating VLAN 552: 0 ports added, 0 ports removed
V1017 21:55:01.246592 10745 BcmHost.cpp:394] created BcmHost: 10.32.12.1@vrf0. new ref count: 1
V1017 21:55:01.246645 10745 BcmSwitch.cpp:1259] adding pending neighbor entry to 10.32.12.1
V1017 21:55:01.246701 10745 BcmHost.cpp:149] Host entry for BcmHost: 10.32.12.1@vrf0 does not have an egress, create one.
V1017 21:55:01.246842 10745 BcmEgress.cpp:145] programmed L3 egress object 100005 for to CPU on unit 0 for ip: 10.32.12.1 @ brcmif 0 flags 8392704 towards port 0
V1017 21:55:01.246900 10745 BcmHost.cpp:594] insert egress 100005 into egress map
V1017 21:55:01.246962 10745 BcmHost.cpp:131] Adding host entry for : 10.32.12.1
V1017 21:55:01.247110 10745 BcmHost.cpp:135] created L3 host object for BcmHost: 10.32.12.1@vrf0 @egress 100005
V1017 21:55:01.247167 10745 BcmHost.cpp:177] Updating egress 100005 from physical port 0 to physical port 0
V1017 21:55:01.247382 10748 QsfpCache.cpp:101] All 64 ports up to date
V1017 21:55:01.247386 10745 SwSwitch.cpp:970] Update state took 981us
I1017 21:55:02.147334 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:02.247273 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:02.256384 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:02.256596 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:03.147433 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:03.247602 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:03.280399 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:03.280599 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:04.147526 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:04.247937 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:04.281910 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:04.282104 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:05.147614 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:05.248259 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:05.296386 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:05.296606 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
V1017 21:55:05.921184 10744 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:05.921291 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:06.147709 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:06.248940 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:06.320411 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:06.320631 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0

ICMP replies are also sent (as seen with tcpdump on fboss10), but they never arrive at the pinger.
The IP above is on the same subnet as the management network, so some ARP shortcuts may be taken there.

Using another IP address that is on its own subnet makes things break earlier. The fboss10 interface still shows some random IPv6 traffic that it captures, so packet capture works - however not much more than that.

FBOSS output:

V1017 22:09:51.115411 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 59
V1017 22:09:51.115458 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 60
V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP  on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP  on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1017 22:09:51.115864 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 63
V1017 22:09:51.115912 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 64
I1017 22:09:52.114615 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:53.114711 10992 FunctionScheduler.cpp:505] Now running updateStats
V1017 22:09:53.225328 10986 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 77.80.231.34
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
I1017 22:09:54.114811 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:55.114902 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:56.115000 10992 FunctionScheduler.cpp:505] Now running updateStats

Notice that it is sending out an ARP broadcast but never logs a "sendPacketOutOfPort" message. Following the code, this is because this path calls "sendPacketSwitched". See here.

Maybe sendPacketSwitched is broken while sendPacketOutOfPort works?

Next steps to confirm that could be:

  1. Test with OpenNSL 6.4 to make sure this used to work
  2. Add logging to the sendPacketSwitched call to see if it fails
  3. See if OpenNSL documentation of how to use the switched TX matches with what is done here for 3.5.0.1.
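
Before touching the agent source, the observation about the missing "sendPacketOutOfPort" message can be checked mechanically against a captured log. The sketch below is a rough heuristic under stated assumptions: the regular expression is fitted to the log lines shown in this thread, the function name is made up, and "adjacent line on the same thread" is only a guess at how a port-directed send would show up next to an ARP send. It does not show whether sendPacketSwitched itself succeeds; that still needs the extra logging from step 2.

```python
import re

# Prefix of the log lines in this thread: severity, MMDD, time, thread id, file:line]
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEFV])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<src>[\w.\-]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

# A few lines copied from the FBOSS output above.
SAMPLE = """\
V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP  on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
I1017 22:09:54.114811 10992 FunctionScheduler.cpp:505] Now running updateStats
"""


def arp_sends_without_port_tx(log_text):
    """Return ARP-send messages with no adjacent sendPacketOutOfPort line.

    Adjacent means the previous or next parsed line from the same thread.
    """
    by_thread = {}
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw)
        if match:
            by_thread.setdefault(match["tid"], []).append(match["msg"])

    unmatched = []
    for messages in by_thread.values():
        for i, msg in enumerate(messages):
            if "sending ARP request" not in msg:
                continue
            neighbours = messages[max(i - 1, 0):i] + messages[i + 1:i + 2]
            if not any("sendPacketOutOfPort" in n for n in neighbours):
                unmatched.append(msg)
    return unmatched


for msg in arp_sends_without_port_tx(SAMPLE):
    print("no sendPacketOutOfPort next to:", msg)
```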

EDIT: I have a hypothesis that this might also be related to L1 errors. I'll debug a bit and update.

bluecmd commented on April 19, 2024

Update: Yes, it was an L1 error. Having fixed the cabling, I can now see packets egressing as well. Ping doesn't work, but that is most likely FBOSS related.

I1019 12:05:26.602193  3250 FunctionScheduler.cpp:505] Now running updateStats
V1019 12:05:26.783669  3244 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 77.80.231.34
V1019 12:05:26.783776  3244 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a

tcpdump on computer:

14:05:26.681010 ARP, Request who-has 77.80.231.34 tell 77.80.231.34, length 50
