Comments (21)
Thank you for reporting this.
Could you please share more details about the crash dump? For example, the stack trace etc.?
Also which SDK was FBOSS agent built against when the crash was observed?
The post mentions 'reported crash in getdeps.sh'. Which crash? Could you please share more details about it?
from fboss.
Hi,
I'm reporting the absence of a crash and recommending that you reconsider, or provide more details of, the crash referred to here:
Line 118 in 346e30c
Essentially "It works for us".
Hi @bluecmd,
Are you sure you didn't have to do anything else to get opennsl 3.5.0.1 working? I can definitely believe that the crash in opennsl_pkt_alloc() has been fixed but there were a number of other changes that were required to get opennsl 3.5.0.1 working -- trivially, opennsl_driver_init()'s prototype changed -- see my changes here to at least get it to compile: #65
And even after it compiled, it was my experience that all of the packet forwarding was broken because the initialization process was quite different.
If you have it working, we'd definitely appreciate understanding how, because if we could update to OpenNSL 3.5.0.1, we could unlock a bunch of previously unreleased changes (e.g., ACLs) that depend on newer versions.
Please confirm and let us know - thanks as always for the interest!
Hi @capveg. I admit it's a bit sneaky, but if you click on "OpenNSL 3.5.0.1" in my report you get the diff of the patch, and you'll see the actual code changes that we made.
Since we have Wedges graciously donated by FB running ONL + FBOSS, we're more than happy to help you collect any data that you need to debug any issues, but as far as we've seen It Just Works(TM) with the somewhat trivial patch of essentially only changing the opennsl_driver_init call.
EDIT: Direct link to what I'm talking about here: https://github.com/dhtech/fboss/pull/4/files#diff-941e4fb204c29b957373093d97373880
EDIT x2: We also needed to set OPENNSL_CONFIG_FILE=/etc/config.wedge40 in the environment, of course.
Hmm... so your patch looks effectively identical to my patch... so I'm wondering why yours works. I saw in one of the comments there a "Status: not working" - can you clarify? Just because the FBOSS agent logs "sending lldp to X" doesn't necessarily mean it's happening. Are you seeing that packet received on the other side? Sorry if this seems pedantic - but we've been (admittedly, slowly) debugging this for a while...
The "Status: not working" refers to my quest to downrate the serdes to support 1G line rate (Broadcom-Switch/OpenNSL#37).
No worries, I also wouldn't trust strangers on the internet.
What I can give you in terms of proof is the neighbouring Cisco switch receiving the LLDP frames and accepting them:
Switch#show lldp neighbors
Capability codes:
(R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
(W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other
Device ID    Local Intf    Hold-time    Capability    Port ID
wedge1       Te1/0/2       120          R             XE5
Total entries displayed: 1
Just a random question: did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty much brand new kernel modules (I think we're even using the 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available to us. It required some hacks to get them to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?
EDIT:
dhtech@wedge1:~$ strings /lib/modules/4.14.48-OpenNetworkLinux/linux-kernel-bde.ko | grep OpenNSL | head -n1
/home/bluecmd/OpenNSL/sdk-6.5.12-gpl-modules/include/sal
Thanks for all the info.
The kernel API is fairly stable so I'm not surprised that the 3.5.0.1 kernel modules work for older versions of OpenNSL. I wouldn't run that way long term (it's definitely not a tested setup :-), but not surprised it works. We run a fairly new kernel internally... let me confirm some details with some other folks and see if we can come up with a theory.
In any case, glad to hear this is working for you.
Just to add more data to keep myself honest:
dhtech@wedge1:~$ ldd /usr/local/bin/wedge_agent | grep libopennsl
libopennsl.so.1 => /usr/local/lib/libopennsl.so.1 (0x00007fe4583cc000)
dhtech@wedge1:~$ sudo find / -name libopennsl.so.1 | xargs sha1sum
c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2 /usr/local/lib/libopennsl.so.1
That matches with the Dec-27 release that's current in https://github.com/Broadcom-Switch/OpenNSL/tree/master/bin/wedge. So I'm pretty sure I'm not messing up the versioning on my end.
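The same check can be scripted so it is easy to repeat after each rebuild. This is only a minimal sketch: the path and the expected digest are the ones quoted above, while the helper names `sha1_of_file` and `matches_release` are made up for illustration.

```python
import hashlib
import os

# Values taken from the shell output above.
LIB_PATH = "/usr/local/lib/libopennsl.so.1"
EXPECTED_SHA1 = "c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path, expected_sha1):
    """True if the file at `path` has the expected SHA-1 digest."""
    return sha1_of_file(path) == expected_sha1.lower()


if __name__ == "__main__":
    if os.path.exists(LIB_PATH):
        print(LIB_PATH, "matches:", matches_release(LIB_PATH, EXPECTED_SHA1))
    else:
        print(LIB_PATH, "not found on this machine")
```

This only confirms which binary is on disk; it says nothing about which kernel modules are loaded.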
> Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?
Can you provide any more information about the hacks? To compile OpenNSL 3.5.0.1 for the 4.14 kernel I have fixed pci_enable_msix, copy_to/from_user and dev->trans_start = jiffies; but FBOSS is still having issues:
I1010 20:50:13.945568 4058 BcmSwitch.cpp:560] Initializing BcmSwitch for unit 0
*** Aborted at 1539204614 (unix time) try "date -d @1539204614" if you are using GNU date ***
PC: @ 0x560640774da2 std::unique_ptr<>::get()
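For anyone without GNU date at hand, the abort timestamp can be converted with a few lines of Python. This is just a sketch of the conversion; reading the result against the log line above assumes the agent's log timestamps are in UTC, which the thread does not state.

```python
from datetime import datetime, timezone

# Unix time from the "*** Aborted at ..." line above.
ABORT_UNIX_TIME = 1539204614


def to_utc(unix_seconds):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(to_utc(ABORT_UNIX_TIME).isoformat())  # 2018-10-10T20:50:14+00:00
```

Under that assumption the abort comes well under a second after the "Initializing BcmSwitch for unit 0" line at 20:50:13.945568.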
It has been a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.
A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.
Ah @sonoble, looking at the last line of your report you're probably hitting #74. Not sure without the full stack trace however.
You can try using our fork that is using FBOSS from May with some patches applied: https://github.com/dhtech/fboss if you need it up and running right now.
> It was a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.
> A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.
No one runs knet that I know of. Looks like your changes are the same as mine. I build the entire OpenNSL from source, so I just set the KERNEL_SRC and LINUX_UAPI_SPLIT="1".
I don't need FBOSS running right now, I was just trying to confirm that your patch worked for me on the 40's. I have been working on getting everything working on the 100S but in a totally different way, by removing the init from OpenNSL and having FBOSS handle it.
I will build your fboss and see if I can get it working.
Thank you!
I built your fboss + the modified OpenNSL and while everything is running, there are no interfaces at all using your config or mine. I will dig more into it later.
@bluecmd I don't see it in this thread: have you been able to confirm that packets other than LLDP are passing? We have seen LLDP packets before but were unable to ping or send any other traffic between boxes.
Only LLDP so far, as well as normal L2 switching.
Hi @bluecmd, I am able to confirm L2 and LLDP on the Wedge 100S but no L3 (packets are not making it to the CPU), so no routing protocols can be run. If you assign an IP to a port, can you check whether you are able to ping it?
Thank you!
@sonoble Sure. Do you have any configuration to share to make the time commitment shorter on my part? Also, did this work on 6.4? We only use L2 stuff so I'm not very aware of the state of L3 in FBOSS.
So these are my observations, as requested by @sonoble. This is with 3.5.0.1 and our FBOSS fork from May/June. We have never tried running this with the old FBOSS, so I have no idea if this is a regression.
I added an L3 interface like this:
"interfaces": [
  {
    "intfID": 10,
    "routerID": 0,
    "vlanID": 552,
    "ipAddresses": [
      "10.32.12.250/24"
    ]
  }
]
This configured an fboss10 interface that does see the incoming packets:
I1017 21:55:00.147141 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:00.920305 10744 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:00.920405 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:01.147235 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:01.245810 10746 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:01.246166 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:01.246147 10745 SwSwitch.cpp:787] preparing state update add pending entry 10.32.12.1
V1017 21:55:01.246336 10745 NeighborCacheImpl-defs.h:137] Adding pending entry for 10.32.12.1 on interface 10
I1017 21:55:01.246403 10745 SwSwitch.cpp:921] Updating state: old_gen=6 new_gen=7
V1017 21:55:01.246495 10745 BcmSwitch.cpp:1048] updating VLAN 552: 0 ports added, 0 ports removed
V1017 21:55:01.246592 10745 BcmHost.cpp:394] created BcmHost: 10.32.12.1@vrf0. new ref count: 1
V1017 21:55:01.246645 10745 BcmSwitch.cpp:1259] adding pending neighbor entry to 10.32.12.1
V1017 21:55:01.246701 10745 BcmHost.cpp:149] Host entry for BcmHost: 10.32.12.1@vrf0 does not have an egress, create one.
V1017 21:55:01.246842 10745 BcmEgress.cpp:145] programmed L3 egress object 100005 for to CPU on unit 0 for ip: 10.32.12.1 @ brcmif 0 flags 8392704 towards port 0
V1017 21:55:01.246900 10745 BcmHost.cpp:594] insert egress 100005 into egress map
V1017 21:55:01.246962 10745 BcmHost.cpp:131] Adding host entry for : 10.32.12.1
V1017 21:55:01.247110 10745 BcmHost.cpp:135] created L3 host object for BcmHost: 10.32.12.1@vrf0 @egress 100005
V1017 21:55:01.247167 10745 BcmHost.cpp:177] Updating egress 100005 from physical port 0 to physical port 0
V1017 21:55:01.247382 10748 QsfpCache.cpp:101] All 64 ports up to date
V1017 21:55:01.247386 10745 SwSwitch.cpp:970] Update state took 981us
I1017 21:55:02.147334 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:02.247273 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:02.256384 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:02.256596 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:03.147433 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:03.247602 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:03.280399 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:03.280599 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:04.147526 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:04.247937 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:04.281910 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:04.282104 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:05.147614 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:05.248259 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:05.296386 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:05.296606 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
V1017 21:55:05.921184 10744 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:05.921291 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:06.147709 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:06.248940 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:06.320411 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:06.320631 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
ICMP replies are also sent (looking at tcpdump fboss10) but they never arrive at the pinger.
The IP above is on the same subnet as the management network, so there are some ARP shortcuts that can be taken there.
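To make the subnet relationship concrete, here is a small sketch using Python's `ipaddress` module. The interface fragment is the one from the config above, wrapped in braces so it parses as JSON; the peers are addresses that appear in the logs in this comment, and the helper names `same_subnet` and `interfaces_reaching` are made up for illustration.

```python
import ipaddress
import json

# The "interfaces" fragment from the config above, wrapped in an object.
CONFIG_FRAGMENT = """
{
  "interfaces": [
    {"intfID": 10, "routerID": 0, "vlanID": 552,
     "ipAddresses": ["10.32.12.250/24"]}
  ]
}
"""


def same_subnet(interface_cidr, peer):
    """True if `peer` falls inside the network of `interface_cidr`."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(peer) in network


def interfaces_reaching(config, peer):
    """List (intfID, cidr) pairs whose subnet directly contains `peer`."""
    hits = []
    for intf in config.get("interfaces", []):
        for cidr in intf.get("ipAddresses", []):
            if same_subnet(cidr, peer):
                hits.append((intf["intfID"], cidr))
    return hits


if __name__ == "__main__":
    config = json.loads(CONFIG_FRAGMENT)
    for peer in ("10.32.12.1", "77.80.231.34"):
        print(peer, "->", interfaces_reaching(config, peer))
```

A peer that matches no interface is not on-link for that config, so reaching it depends on a route and a resolved next hop rather than on a direct ARP lookup.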
Using another IP address that is on its own subnet makes things break earlier. The fboss10 interface still shows some random IPv6 traffic that it captures, so packet capture works - however not much more than that.
FBOSS output:
V1017 22:09:51.115411 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 59
V1017 22:09:51.115458 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 60
V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1017 22:09:51.115864 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 63
V1017 22:09:51.115912 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 64
I1017 22:09:52.114615 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:53.114711 10992 FunctionScheduler.cpp:505] Now running updateStats
V1017 22:09:53.225328 10986 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 77.80.231.34
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
I1017 22:09:54.114811 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:55.114902 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:56.115000 10992 FunctionScheduler.cpp:505] Now running updateStats
Notice that it is sending out an ARP broadcast but never logs a "sendPacketOutOfPort" message. Following the code, this is because this path calls "sendPacketSwitched". See here.
Maybe sendPacketSwitched is broken while sendPacketOutOfPort works?
Next steps to confirm that could be:
- Test with OpenNSL 6.4 to make sure this used to work
- Add logging to the sendPacketSwitched call to see if it fails
- Check whether the OpenNSL documentation on how to use switched TX matches what is done here for 3.5.0.1.
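Before adding logging, the existing agent output can already be tallied to see which TX paths leave a trace. Below is a rough sketch that parses the log line format seen above and counts three message types; the regular expression and the marker strings are inferred from the sample lines in this comment, not from the FBOSS source, so treat them as assumptions.

```python
import re
from collections import Counter

# Line format inferred from the samples above:
# <level><MMDD> <HH:MM:SS.uuuuuu> <thread> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<level>[IWEFV])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

# Substrings copied from the log messages quoted in this comment.
MARKERS = {
    "arp_request": "sending ARP request",
    "out_of_port": "sendPacketOutOfPort",
    "lldp_sent": "sent LLDP on port",
}


def count_tx_events(lines):
    """Count how often each marker appears in well-formed log lines."""
    counts = Counter()
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # skip anything that is not a log line
        for name, needle in MARKERS.items():
            if needle in match.group("msg"):
                counts[name] += 1
    return counts


SAMPLE = [
    "V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61",
    "V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552",
    "V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62",
    "V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552",
    "V1017 22:09:53.225328 10986 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 77.80.231.34",
    "V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a",
]

if __name__ == "__main__":
    for name, count in sorted(count_tx_events(SAMPLE).items()):
        print(name, count)
```

On the sample, each LLDP send is paired with a sendPacketOutOfPort line while the ARP request has none, which is the asymmetry described above. It does not by itself show whether the switched path failed.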
EDIT: I have a theory that this might also be related to L1 errors. I'll debug a bit and update.
Update: Yes, it was an L1 error. Having fixed the cabling, I can now see packets egressing as well. Ping doesn't work, but that is most likely FBOSS related.
I1019 12:05:26.602193 3250 FunctionScheduler.cpp:505] Now running updateStats
V1019 12:05:26.783669 3244 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 77.80.231.34
V1019 12:05:26.783776 3244 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
tcpdump on computer:
14:05:26.681010 ARP, Request who-has 77.80.231.34 tell 77.80.231.34, length 50