Comments (29)
Hi @brian-brt , I also got side-tracked with #46 where I was getting nowhere. But earlier in this ticket you said
The other computer still reproduces it pretty frequently though
So I probably took that as meaning "still has issues, WIP" ?
And also @elliotwoods stating " It still has the same issue." which I assume he tested with PR62.
I have 3 devices on my bench ready to run tests. Is there a foolproof (i.e. me-proof) test I can run to witness the exact issue addressed by PR62 ? We're in a tricky situation with 2-3 problems possibly at play here .
from candlelight_fw.
Hi @brian-brt , I also got side-tracked with #46 where I was getting nowhere. But earlier in this ticket you said
The other computer still reproduces it pretty frequently though
So I probably took that as meaning "still has issues, WIP" ?
And also @elliotwoods stating " It still has the same issue." which I assume he tested with PR62.
Ah, sorry for the confusion. That was before the next comment, where I concluded I had indeed been chasing ghosts but then found a real problem. #62 definitely fixes that problem, although it looks like there are still others which I haven't succeeded in reproducing, like what @elliotwoods is seeing.
I have 3 devices on my bench ready to run tests. Is there a foolproof (i.e. me-proof) test I can run to witness the exact issue addressed by PR62 ? We're in a tricky situation with 2-3 problems possibly at play here .
I reproduced it by sending frames while flooding error frames to completely overwhelm the queues. I was using in-progress bit error reporting code that sent way more error frames than it should each time I shorted the CAN bus, so it reproduced the problem faster. You could hack up the candleLight code to just send a bunch of fake receive frames occasionally (fill up the candleLight-side queue), or see if you can reproduce this with just USB bus load and/or slow USB drivers. Changing the logic around should_send in can_parse_error_status to report every up or down count of the error counters might work too.
It's fairly easy to tell if you're hitting the issue I fixed. Transmitted frames will go unacknowledged until the host-side Linux gs_usb driver runs out of transmit slots and stops. You can monitor the flow by either adding prints to the kernel driver or using wireshark to look at the USB transfers. The pattern is it rotates among all the echo_id values reserved for transmitted frames at first, and then loses them one at a time due to the bug, and eventually has none left and stops submitting OUT URBs. Setting GS_MAX_TX_URBS to 1 would also make it really obvious when a single buffer is lost.
Agreed on this being tricky to test due to the overlapping issues with similar symptoms.
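The slot-loss pattern described in this comment can be illustrated with a small host-side model. This is only a sketch, not the gs_usb driver code: `MAX_TX_SLOTS`, `struct tx_model`, and all function names are hypothetical, and the model simply assumes that a frame dropped by the device never produces an echo.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the host's pool of transmit slots
 * (the real driver sizes it with GS_MAX_TX_URBS; 10 is an assumed value). */
#define MAX_TX_SLOTS 10u

struct tx_model {
    bool in_flight[MAX_TX_SLOTS]; /* true while the host waits for the echo */
};

/* Claim the first free slot; returns its echo_id, or -1 if none is left. */
int tx_model_send(struct tx_model *m)
{
    for (uint32_t i = 0; i < MAX_TX_SLOTS; i++) {
        if (!m->in_flight[i]) {
            m->in_flight[i] = true;
            return (int)i;
        }
    }
    return -1;
}

/* The device echoed the frame back: release the slot. */
void tx_model_echo(struct tx_model *m, int echo_id)
{
    if (echo_id >= 0 && (uint32_t)echo_id < MAX_TX_SLOTS)
        m->in_flight[echo_id] = false;
}

/* Count the slots still usable for new frames. */
uint32_t tx_model_free_slots(const struct tx_model *m)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i < MAX_TX_SLOTS; i++)
        n += m->in_flight[i] ? 0u : 1u;
    return n;
}

/* Try to send `frames` frames. Every `drop_every`-th frame is dropped by
 * the device and never echoed (0 means nothing is dropped).
 * Returns how many frames were submitted before the host ran out of slots. */
uint32_t tx_model_run(struct tx_model *m, uint32_t frames, uint32_t drop_every)
{
    uint32_t sent = 0;
    for (uint32_t i = 1; i <= frames; i++) {
        int id = tx_model_send(m);
        if (id < 0)
            break; /* no slot left: the host stops transmitting */
        sent++;
        if (drop_every == 0 || i % drop_every != 0)
            tx_model_echo(m, id);
    }
    return sent;
}
```

In this model a single missing echo permanently removes one slot, so the transmit side stalls after `MAX_TX_SLOTS` losses no matter how many frames were echoed correctly in between.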
from candlelight_fw.
Ok, I think I finally got it... ID=0 was a good idea, as well as the 2ms gap.
I think this works (two candles on same machine, 500kbps)
cangen can0 -I 0 -g 2 -i : to send ID=0 every 2ms, ignoring buffer-fulls
cangen can1 -I 333 -g 100 -i : to send ID=333 every 100ms
candump can1,0~0xfff : to show only the ID=333 frames
After shorting the bus for a few seconds, can1 still receives the ID=0 frames, but isn't sending anymore.
Bringing interface down+up clears the condition.
from candlelight_fw.
Hmm... well I presume it can't be a buffer overflow error since there wouldn't be enough RAM to store all those failed messages.
So perhaps overflowing a counter.
Being 'optimistic', I could be sending 1/4 of the count which would cause an int32_t overflow. I might be off by that order of magnitude though.
from candlelight_fw.
I'm not surprised, and I've had similar issues too. I think it has to do with the bxCAN periph going into bus off state. But the firmware should probably just drop packets, not hang... Have you a debugger (stlink etc) to check what the firmware is up to when it hangs ?
Can you try ethtool -p can0 (should blink the lights) to see if part of the USB stack at least is still running ?
Some observations:
- at main.c:111, when can_send() fails due to no mailbox available, then queue_push_front() is used to retry the frame later, which will fail if the q_from_host queue is already full. This should otherwise not cause any problems, just drop the frame.
- there is a uint32 counter "hcan->out_requests++;" that is incremented, but never checked (?) so also shouldn't cause problems
- BUS OFF condition (CAN_ESR_BOFF) seems to be checked only in can.c:249 to set fields in the frame sent to the host. I didn't find any other special handling
- the bxCAN periph should normally recover from bus-off automatically after "128 occurrences of 11 consecutive recessive bits monitored on CANRX", since ABOM (CAN_MCR_ABOM) is set.
When it hangs, have you tried disconnecting the CAN bus side for a few seconds to see if it recovers ? (assuming CANRX will be pulled up to recessive with no bus connected)
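As a rough check of the automatic bus-off recovery figure quoted above: 128 sequences of 11 recessive bits amount to 1408 bit times. A minimal sketch of that arithmetic follows; the function name is hypothetical and not part of the firmware, and it only gives the lower bound on an otherwise idle bus.

```c
#include <stdint.h>

/* Minimum bus-off recovery time in microseconds, rounded down, for a
 * nominal bitrate in bit/s, assuming the bus stays recessive throughout.
 * Returns 0 for an invalid bitrate of 0. */
uint32_t busoff_recovery_us(uint32_t bitrate)
{
    const uint64_t bits = 128u * 11u; /* 1408 bit times */

    if (bitrate == 0)
        return 0;
    return (uint32_t)((bits * 1000000u) / bitrate);
}
```

At 500 kbit/s this gives 2816 microseconds, so leaving the bus disconnected for a few seconds is far longer than the minimum the peripheral needs to leave bus-off.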
from candlelight_fw.
Thank you for the response. It sounds like we're getting closer to fixing this
To note, the firmware does not hang. I can close and open connections to it.
It's just that no frames are sent (and currently that means that no frames are received either since in my system the main has to send for the secondary to respond).
I'm going to try your suggestion here (disconnect the CAN H/L for a few seconds to see if it recovers).
I do have an STLink, but the Canable Pro isn't delivered with a method to attach it (I think those pins are snapped off after provisioning). So hopefully I can debug without it.
I have Ubuntu running on WSL (Windows Subsystem for Linux) as a build system, but this doesn't always work directly with hardware devices (e.g. serial is OK, but I presume native USB doesn't work)
Elliot
from candlelight_fw.
Ok first test results.
I have a torture test sending messages just about as quickly as the driver can deliver them
On first try, the bus stops sending messages after just under 240,000 messages were sent.
1/256 messages were responded to (acknowledged)
Disconnecting the CANH/CANL for 10 seconds does not fix it
Restarting the application on PC side does not fix it
Power cycling the Canable still does fix it
from candlelight_fw.
Trying again here. It seems that it behaves a little differently each time still.
I am in a state where the device is stuck (doesn't send messages).
I am able to get a timestamp from the device with code:
    bool candle_ctrl_get_timestamp(candle_device_t *dev, uint32_t *current_timestamp)
    {
        bool rc = usb_control_msg(
            dev->winUSBHandle,
            CANDLE_TIMESTAMP_GET,
            USB_DIR_IN|USB_TYPE_VENDOR|USB_RECIP_INTERFACE,
            1,
            dev->interfaceNumber,
            current_timestamp,
            sizeof(*current_timestamp)
        );
        dev->last_error = rc ? CANDLE_ERR_OK : CANDLE_ERR_GET_TIMESTAMP;
        return rc;
    }
e.g. I get a value of 3848724056.
So the device hasn't completely hung.
However, whenever I call send with code:
    bool CALL_TYPE DLL candle_frame_send(candle_handle hdev, uint8_t ch, candle_frame_t *frame)
    {
        // TODO ensure device is open, check channel count..
        candle_device_t *dev = (candle_device_t*)hdev;
        unsigned long bytes_sent = 0;
        frame->echo_id = 0;
        frame->channel = ch;
        bool rc = WinUsb_WritePipe(
            dev->winUSBHandle,
            dev->bulkOutPipe,
            (uint8_t*)frame,
            sizeof(*frame),
            &bytes_sent,
            0
        );
        dev->last_error = rc ? CANDLE_ERR_OK : CANDLE_ERR_SEND_FRAME;
        return rc;
    }
The device does not respond. (i.e. the driver remains stuck there)
from candlelight_fw.
Ah, interesting. So the bulk transfer fails ? that (should) narrow down possibilities...
from candlelight_fw.
Can you try this mod to usbd_gs_can.c:622 (add the retval = USBD_OK; line)
    else{
        // Discard current packet from host if we have no place
        // to put the next one
        retval = USBD_OK;
    }
That's not really a fix, just trying to isolate the problem. Returning USBD_FAIL there may explain the bulk transfers failing; if that's the case then we need to check why the queue isn't purged when interface is brought down + up.
from candlelight_fw.
Testing today, it seems that it doesn't actually get stuck there.
It returns success. The symptom is only that a message is not sent on the CAN bus (and no indicator lights activity)
I'm sorry for the confusion. It is possible that I'm seeing a different issue also, but in my testing today it always returns success even after the device stops sending messages.
from candlelight_fw.
Ok then. So the USB layer is not the problem. Next thing I would check is some of the bxCAN registers, like CAN_TSR and CAN_ESR . Not sure what would be the best way for you to check those... It's on my list of things to check here.
from candlelight_fw.
@elliotwoods have you tried using another device on the network to send something when the candleLight device is in this state, to see if it receives frames? I've observed similar behavior (it won't send anything, but in my case I know it still receives) when I intermittently short the CAN bus (banana plug across the bus is an easy way to generate errors). I'm wondering if we're experiencing the same problem or not.
from candlelight_fw.
Today I received some other USB CAN devices to test with (GCAN devices)
Over the next week I'll compare the GCAN devices against canable/candlelight on the same CAN network and see if the issues exist on both platforms. Also GCAN might help to debug the issue (e.g. detect if there are network issues).
I'll share more results when I have them
Thanks all for advice so far
from candlelight_fw.
@fenugrec I'm actually suspecting the USB code. If I comment out sending error frames, then I can't reproduce it any more.
I tried adding a bunch of interrupts-disabled sections, and now I can't reproduce it on one computer. The other computer still reproduces it pretty frequently though. The two computers are completely different hardware (different CPU architecture even), and completely different kernels, so it's hard to say what the relevant difference is.
I can have 2 candleLight devices running the exact same code plugged into the same bus, and one of them consistently locks up but the other one is fine. They're both doing occasional cansend calls (same frequency, different source addresses so they don't collide). That seems like evidence against it being a problem with the interaction with the CAN peripheral.
Even just decreasing the frequency of error frames makes the problem "go away" (or at least become much less frequent, I'm not trying long-duration tests yet). I suspect I'm only running into it because I changed things to send error frames way more frequently, while @elliotwoods saw it because he left it running for a long time.
from candlelight_fw.
I think I found the problem with gs_usb on the host: it waits for each transmitted packet to be ACKed, so when the CANable drops them it eventually fills up the queue and refuses to transmit anything more. Normally that doesn't happen because the gs_usb transmit queue is shorter than the CANable's packet pool, and received packets are transferred quickly. However, when you get errors, you get error frames, which can potentially be generated fast enough to fill up the CANable's pool and break things. I think the correct behavior is to stall the endpoint instead. I'm going to try that.
I'm not sure how that Windows driver tracks echoed packets, so I'm not sure if this applies to @elliotwoods 's original problem or not.
from candlelight_fw.
I'm not sure if GitHub notified you @elliotwoods: I sent out a potential fix in #62 if you're interested in testing.
from candlelight_fw.
Great!
I haven't built / updated myself so far but I will try today here
(Previously I used canable's online updater tool)
from candlelight_fw.
OK built and uploaded. I will do some tests with flooding messages
from candlelight_fw.
It still has the same issue. I have captured some video, I will upload a link when I get back to my desk.
from candlelight_fw.
things. I think the correct behavior is to stall the endpoint instead.
Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).
I think we should at least be clearing the queues in can_disable() - I figure if that func is called, there's no point in trying to keep queued frames for the next can_enable()
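The idea of clearing the queues on disable, together with the earlier observation that a push to the front fails once the queue is full, can be sketched with a small bounded queue. This is not the firmware's queue implementation: `struct demo_queue`, `DEMO_QUEUE_CAP`, and the function names are hypothetical, and the flush assumes pending frames are simply handed back to a free pool.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_CAP 4u /* hypothetical capacity */

struct demo_queue {
    void *items[DEMO_QUEUE_CAP];
    size_t head;  /* index of the oldest item */
    size_t count; /* number of stored items */
};

/* Append at the tail; fails (returns false) when the queue is full. */
bool demo_queue_push_back(struct demo_queue *q, void *item)
{
    if (q->count == DEMO_QUEUE_CAP)
        return false;
    q->items[(q->head + q->count) % DEMO_QUEUE_CAP] = item;
    q->count++;
    return true;
}

/* Re-insert at the head (retry later); also fails when the queue is full. */
bool demo_queue_push_front(struct demo_queue *q, void *item)
{
    if (q->count == DEMO_QUEUE_CAP)
        return false;
    q->head = (q->head + DEMO_QUEUE_CAP - 1u) % DEMO_QUEUE_CAP;
    q->items[q->head] = item;
    q->count++;
    return true;
}

/* Remove and return the oldest item, or NULL when the queue is empty. */
void *demo_queue_pop_front(struct demo_queue *q)
{
    if (q->count == 0)
        return NULL;
    void *item = q->items[q->head];
    q->head = (q->head + 1u) % DEMO_QUEUE_CAP;
    q->count--;
    return item;
}

/* Hand every pending item back to the pool, as a disable hook might do.
 * Stops early if the pool is full so that no item is lost.
 * Returns the number of items moved. */
size_t demo_queue_flush_to(struct demo_queue *pending, struct demo_queue *pool)
{
    size_t moved = 0;
    while (pending->count > 0 && pool->count < DEMO_QUEUE_CAP) {
        demo_queue_push_back(pool, demo_queue_pop_front(pending));
        moved++;
    }
    return moved;
}
```

Whether stale frames should be returned to a pool or discarded outright is a design choice for the firmware; the sketch only shows that a flush on disable leaves nothing queued for the next enable.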
from candlelight_fw.
things. I think the correct behavior is to stall the endpoint instead.
Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).
Yea, I tried it and couldn't get it to unstall. Ended up just letting the STM32-side hardware NACK it instead.
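The NACK approach mentioned here can be pictured as back-pressure: the OUT endpoint only accepts a packet while a buffer is reserved for it, and otherwise the host's transfer is refused and retried instead of being silently dropped. The model below is a sketch under that assumption, not the code from #62; `struct dev_model` and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side model of a frame pool shared between
 * frames coming from the host and error frames going to the host. */
struct dev_model {
    uint32_t free_bufs; /* buffers left in the shared pool */
    bool out_armed;     /* true: a buffer is reserved and OUT is accepted */
    uint32_t accepted;  /* OUT packets taken from the host */
    uint32_t naks;      /* OUT attempts refused (the host will retry) */
};

/* Arm the OUT endpoint only if a buffer can be reserved for it. */
void dev_try_arm(struct dev_model *d)
{
    if (!d->out_armed && d->free_bufs > 0) {
        d->free_bufs--;
        d->out_armed = true;
    }
}

/* The host attempts an OUT transfer. Returns true if it was accepted. */
bool dev_host_out(struct dev_model *d)
{
    if (!d->out_armed) {
        d->naks++; /* nothing is lost: the host still owns the frame */
        return false;
    }
    d->out_armed = false; /* the reserved buffer now holds the frame */
    d->accepted++;
    dev_try_arm(d);
    return true;
}

/* An error frame wants a buffer; it is simply skipped if none is free. */
bool dev_alloc_error_frame(struct dev_model *d)
{
    if (d->free_bufs == 0)
        return false;
    d->free_bufs--;
    return true;
}

/* A buffer was fully processed and returns to the pool. */
void dev_release(struct dev_model *d)
{
    d->free_bufs++;
    dev_try_arm(d);
}
```

In this model a flood of error frames can delay host frames but cannot make the device accept a frame it has no room for, so no echo goes missing on the host side.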
from candlelight_fw.
things. I think the correct behavior is to stall the endpoint instead.
Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).
Yea, I tried it and couldn't get it to unstall. Ended up just letting the STM32-side hardware NACK it instead.
I'm having similar problems. I have a linux kernel 4.9.140 with Canable pro. It stops transmitting at some point when the CAN bus load is high. The higher the load, the sooner it stops transmitting. Receiving still works.
@brian-brt Did you experience the same and did you fix it?
from candlelight_fw.
@mfalkjensen that does sound like the same problem. #62 is my fixes. With those fixes, I can't reproduce the problem any more.
Sorry, I lost track of that PR and forgot it wasn't merged yet. @fenugrec are you waiting for anything from me on fixing this?
from candlelight_fw.
Ok, testing this right now. Currently on master, changed #define CAN_ERRCOUNT_THRESHOLD 1, started cangen can0 -g 0 -i, and shorted the bus. Wireshark capture shows over 19000 CAN error frames per second while the bus was shorted; communications resumed (at 99%+ bus load) immediately after.
Also tried running cangen from the other connected interface; no change: both devices recover when I unshort the bus.
Going to try maybe a different USB host/hub to see if I can stress this differently...
from candlelight_fw.
I'm putting this on hold until we update the USB library. 1.11.1 fixed significant issues that could explain some of the behaviour we're seeing.
from candlelight_fw.
Whoops the commit message auto-closed this, but I want it still opened pending more tests
from candlelight_fw.
Ok, I can't reproduce anymore as of e6b9724 . Hopefully because it's fixed and not because of current astrological conjecture.
I'm taking a chance in closing this, hope these issues are over. Anyone, feel free to reopen (if so, I will need the commit hash used and exact steps to reproduce).
from candlelight_fw.