
Comments (29)

fenugrec avatar fenugrec commented on July 17, 2024 1

Hi @brian-brt, I also got side-tracked with #46, where I was getting nowhere. But earlier in this ticket you said:

The other computer still reproduces it pretty frequently though

So I probably took that as meaning "still has issues, WIP"?
And also @elliotwoods stating "It still has the same issue.", which I assume he tested with PR62.

I have 3 devices on my bench ready to run tests. Is there a foolproof (i.e. me-proof) test I can run to witness the exact issue addressed by PR62? We're in a tricky situation with 2-3 problems possibly at play here.

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024 1

Hi @brian-brt, I also got side-tracked with #46, where I was getting nowhere. But earlier in this ticket you said:

The other computer still reproduces it pretty frequently though

So I probably took that as meaning "still has issues, WIP"?
And also @elliotwoods stating "It still has the same issue.", which I assume he tested with PR62.

Ah, sorry for the confusion. That was before the next comment, where I concluded I had indeed been chasing ghosts but then found a real problem. #62 definitely fixes that problem, although it looks like there are still others which I haven't succeeded in reproducing, like what @elliotwoods is seeing.

I have 3 devices on my bench ready to run tests. Is there a foolproof (i.e. me-proof) test I can run to witness the exact issue addressed by PR62? We're in a tricky situation with 2-3 problems possibly at play here.

I reproduced it by sending frames while flooding error frames to completely overwhelm the queues. I was using in-progress bit error reporting code that sent way more error frames than it should each time I shorted the CAN bus, so it reproduced the problem faster. You could hack up the candleLight code to just send a bunch of fake receive frames occasionally (fill up the candleLight-side queue), or see if you can reproduce this with just USB bus load and/or slow USB drivers. Changing the logic around should_send in can_parse_error_status to report every up or down count of the error counters might work too.

It's fairly easy to tell if you're hitting the issue I fixed. Transmitted frames will go unacknowledged until the host-side Linux gs_usb driver runs out of transmit slots and stops. You can monitor the flow by either adding prints to the kernel driver or using wireshark to look at the USB transfers. The pattern is it rotates among all the echo_id values reserved for transmitted frames at first, and then loses them one at a time due to the bug, and eventually has none left and stops submitting OUT URBs. Setting GS_MAX_TX_URBS to 1 would also make it really obvious when a single buffer is lost.
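
In case it helps observe this on a bench, below is a rough host-side sketch (not part of candleLight_fw or the gs_usb driver; the interface name can0, the CAN ID and the build command are assumptions) that floods a socketCAN interface and reports when the kernel stops accepting frames. With gs_usb, write() on a raw CAN socket should start returning ENOBUFS once the transmit path backs up, so an ENOBUFS condition that never clears would match the "no echo slots left, stops submitting OUT URBs" state described above.

    /* Bench sketch (not from the candleLight_fw tree): flood a socketCAN
     * interface and report when the kernel stops accepting frames.
     * The interface name "can0" and CAN ID 0x123 are arbitrary choices.
     * Build with: gcc -O2 -o canflood canflood.c
     */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/can.h>
    #include <linux/can/raw.h>

    int main(void)
    {
        int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
        if (s < 0) { perror("socket"); return 1; }

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "can0", IFNAMSIZ - 1);
        if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) { perror("SIOCGIFINDEX"); return 1; }

        struct sockaddr_can addr = {
            .can_family = AF_CAN,
            .can_ifindex = ifr.ifr_ifindex,
        };
        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind"); return 1;
        }

        struct can_frame frame = { .can_id = 0x123, .can_dlc = 8 };
        unsigned long sent = 0;

        for (;;) {
            if (write(s, &frame, sizeof(frame)) == sizeof(frame)) {
                sent++;
            } else if (errno == ENOBUFS) {
                /* TX queue full. Transient ENOBUFS is normal under load;
                 * if it never clears, the adapter has stopped acknowledging
                 * frames and the host has run out of echo_id slots. */
                fprintf(stderr, "ENOBUFS after %lu frames\n", sent);
                usleep(100 * 1000);
            } else {
                perror("write");
                return 1;
            }
        }
    }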

Agreed on this being tricky to test due to the overlapping issues with similar symptoms.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024 1

Ok, I think I finally got it... ID=0 was a good idea, as well as the 2ms gap.
I think this works (two candles on the same machine, 500 kbps):
  • cangen can0 -I 0 -g 2 -i to send ID=0 every 2 ms, ignoring buffer-full errors
  • cangen can1 -I 333 -g 100 -i to send ID=333 every 100 ms
  • candump can1,0~0xfff to show only the ID=333 frames

After shorting the bus for a few seconds, can1 still receives the ID=0 frames, but isn't sending anymore.
Bringing interface down+up clears the condition.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Hmm... well I presume it can't be a buffer overflow error since there wouldn't be enough RAM to store all those failed messages.
So perhaps a counter is overflowing.
Being 'optimistic', I could be sending about 1/4 of the count that would cause an int32_t overflow. I might be off by an order of magnitude though.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

I'm not surprised, and I've had similar issues too. I think it has to do with the bxCAN peripheral going into bus-off state. But the firmware should probably just drop packets, not hang... Do you have a debugger (ST-Link etc.) to check what the firmware is up to when it hangs?
Can you try ethtool -p can0 (it should blink the LEDs) to see if at least part of the USB stack is still running?

Some observations:

  • at main.c:111, when can_send() fails because no mailbox is available, queue_push_front() is used to retry the frame later; this will fail if the q_from_host queue is already full. That should otherwise not cause any problems, just drop the frame.
  • there is a uint32 counter "hcan->out_requests++;" that is incremented but never checked (?), so it also shouldn't cause problems
  • the BUS OFF condition (CAN_ESR_BOFF) seems to be checked only in can.c:249, to set fields in the frame sent to the host; I didn't find any other special handling
  • the bxCAN peripheral should normally recover from bus-off automatically after "128 occurrences of 11 consecutive recessive bits monitored on CANRX", since ABOM (CAN_MCR_ABOM) is set

When it hangs, have you tried disconnecting the CAN bus side for a few seconds to see if it recovers? (assuming CANRX will be pulled up to recessive with no bus connected)
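
If someone does have a debugger attached, a minimal firmware-side sketch like the following (not in the current tree; it assumes the STM32F0 CMSIS device header and the CAN_ESR layout from the reference manual, TEC in bits 23:16 and REC in bits 31:24) could latch the error state so it can be inspected when the device stops transmitting:

    #include <stdint.h>
    #include "stm32f0xx.h"  /* CMSIS device header; assumes an STM32F0-based board */

    /* Latched copies of the bxCAN error state, updated from the main loop
     * so they can be read with a debugger (or reported to the host). */
    volatile uint8_t dbg_tec;       /* transmit error counter, CAN_ESR[23:16] */
    volatile uint8_t dbg_rec;       /* receive error counter,  CAN_ESR[31:24] */
    volatile uint8_t dbg_bus_off;   /* set once CAN_ESR_BOFF has been seen    */

    void dbg_poll_can_esr(void)
    {
        uint32_t esr = CAN->ESR;

        dbg_tec = (uint8_t)(esr >> 16);
        dbg_rec = (uint8_t)(esr >> 24);
        if (esr & CAN_ESR_BOFF)
            dbg_bus_off = 1;
    }

Whether TEC saturates, BOFF stays set, or the counters keep moving would help tell a stuck peripheral apart from a stuck firmware/USB path.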

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Thank you for the response. It sounds like we're getting closer to fixing this

To note, the firmware does not hang. I can close and open connections to it.
It's just that no frames are sent (and currently that means that no frames are received either since in my system the main has to send for the secondary to respond).

I'm going to try your suggestion here (disconnect the CAN H/L for a few seconds to see if it recovers).

I do have an STLink, but the Canable Pro isn't delivered with a method to attach it (I think those pins are snapped off after provisioning). So hopefully I can debug without it.

I have Ubuntu running on WSL (Windows Subsystem for Linux) as a build system, but this doesn't always work directly with hardware devices (e.g. serial is OK, but I presume native USB doesn't work).

Elliot

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Ok, first test results.
I have a torture test sending messages just about as quickly as the driver can deliver them.

On the first try, the bus stops sending messages after just under 240,000 messages were sent.
1/256 messages were responded to (acknowledged).
Disconnecting the CANH/CANL for 10 seconds does not fix it.
Restarting the application on the PC side does not fix it.
Power cycling the Canable does still fix it.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Trying again here. It seems that it still behaves a little differently each time.

I am in a state where the device is stuck (doesn't send messages).
I am able to get a timestamp from the device with this code:

bool candle_ctrl_get_timestamp(candle_device_t *dev, uint32_t *current_timestamp)
{
    bool rc = usb_control_msg(
        dev->winUSBHandle,
        CANDLE_TIMESTAMP_GET,
        USB_DIR_IN|USB_TYPE_VENDOR|USB_RECIP_INTERFACE,
        1,
        dev->interfaceNumber,
        current_timestamp,
        sizeof(*current_timestamp)
    );

    dev->last_error = rc ? CANDLE_ERR_OK : CANDLE_ERR_GET_TIMESTAMP;
    return rc;
}

e.g. I get a value of 3848724056.
So the device hasn't completely hung.

However, whenever I call send with this code:

bool CALL_TYPE DLL candle_frame_send(candle_handle hdev, uint8_t ch, candle_frame_t *frame)
{
    // TODO ensure device is open, check channel count..
    candle_device_t *dev = (candle_device_t*)hdev;

    unsigned long bytes_sent = 0;

    frame->echo_id = 0;
    frame->channel = ch;

    bool rc = WinUsb_WritePipe(
        dev->winUSBHandle,
        dev->bulkOutPipe,
        (uint8_t*)frame,
        sizeof(*frame),
        &bytes_sent,
        0
    );

    dev->last_error = rc ? CANDLE_ERR_OK : CANDLE_ERR_SEND_FRAME;
    return rc;

}

The device does not respond. (i.e. the driver remains stuck there)

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Ah, interesting. So the bulk transfer fails? That (should) narrow down the possibilities...

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Can you try this mod to usbd_gs_can.c:622 (add the retval = USBD_OK; line):

		else{
			// Discard current packet from host if we have no place
			// to put the next one
			retval = USBD_OK;
		}

That's not really a fix, just an attempt to isolate the problem. Returning USBD_FAIL there may explain the bulk transfers failing; if that's the case, then we need to check why the queue isn't purged when the interface is brought down and up.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Testing today, it seems that it doesn't actually get stuck there.
It returns success. The symptom is only that a message is not sent on the CAN bus (and there is no indicator-light activity).
I'm sorry for the confusion. It is possible that I'm also seeing a different issue, but in my testing today it always returns success, even after the device stops sending messages.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Ok then, so the USB layer is not the problem. The next thing I would check is some of the bxCAN registers, like CAN_TSR and CAN_ESR. Not sure what would be the best way for you to check those... It's on my list of things to check here.
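
One way to check without a debugger, though it only reads back what the driver reports rather than CAN_TSR/CAN_ESR themselves, would be a small libsocketcan query from the host (a sketch; the interface name can0 is an assumption, and the error-counter call only returns data if the driver implements it):

    /* Query the CAN controller state the kernel reports over netlink.
     * Requires libsocketcan; build with: gcc -o canstate canstate.c -lsocketcan
     * The interface name "can0" is an assumption. */
    #include <stdio.h>
    #include <libsocketcan.h>
    #include <linux/can/netlink.h>

    int main(void)
    {
        int state;
        struct can_berr_counter bc;

        if (can_get_state("can0", &state) == 0)
            printf("state=%d (%s)\n", state,
                   state == CAN_STATE_BUS_OFF ? "BUS-OFF" :
                   state == CAN_STATE_ERROR_PASSIVE ? "ERROR-PASSIVE" :
                   state == CAN_STATE_ERROR_WARNING ? "ERROR-WARNING" :
                   state == CAN_STATE_ERROR_ACTIVE ? "ERROR-ACTIVE" : "other");

        /* Not every driver implements the berr-counter query; on failure
         * this simply prints nothing. */
        if (can_get_berr_counter("can0", &bc) == 0)
            printf("TEC=%d REC=%d\n", bc.txerr, bc.rxerr);

        return 0;
    }

Roughly the same information should be visible with ip -details -statistics link show can0.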

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

@elliotwoods have you tried using another device on the network to send something when the candleLight device is in this state, to see if it receives frames? I've observed similar behavior (it won't send anything, but in my case I know it still receives) when I intermittently short the CAN bus (banana plug across the bus is an easy way to generate errors). I'm wondering if we're experiencing the same problem or not.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Today I received some other USB CAN devices to test with (GCAN devices).
Over the next week I'll compare the GCAN devices against canable/candlelight on the same CAN network and see if the issues exist on both platforms. The GCAN devices might also help to debug the issue (e.g. detect if there are network issues).
I'll share more results when I have them.
Thanks all for the advice so far.

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

@fenugrec I'm actually suspecting the USB code. If I comment out sending error frames, then I can't reproduce it any more.

I tried adding a bunch of interrupts-disabled sections, and now I can't reproduce it on one computer. The other computer still reproduces it pretty frequently though. The two computers are completely different hardware (different CPU architecture even), and completely different kernels, so it's hard to say what the relevant difference is.

I can have 2 candleLight devices running the exact same code plugged into the same bus, and one of them consistently locks up but the other one is fine. They're both doing occasional cansend calls (same frequency, different source addresses so they don't collide). That seems like evidence against it being a problem with the interaction with the CAN peripheral.

Even just decreasing the frequency of error frames makes the problem "go away" (or at least become much less frequent, I'm not trying long-duration tests yet). I suspect I'm only running into it because I changed things to send error frames way more frequently, while @elliotwoods saw it because he left it running for a long time.

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

I think I found the problem with gs_usb on the host: it waits for each transmitted packet to be ACKed, so when the CANable drops them it eventually fills up the queue and refuses to transmit anything more. Normally that doesn't happen because the gs_usb transmit queue is shorter than the CANable's packet pool, and received packets are transferred quickly. However, when you get errors, you get error frames, which can potentially be generated fast enough to fill up the CANable's pool and break things. I think the correct behavior is to stall the endpoint instead. I'm going to try that.
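
To make that failure mode concrete, here is a toy model (ordinary C, not the gs_usb driver's actual code; the pool size and the drop pattern are made-up numbers) of the echo-slot bookkeeping just described. A slot is only freed when the device echoes the transmitted frame back, so every echo the firmware drops leaks one slot, and once they're all gone the host stops transmitting:

    #include <stdbool.h>
    #include <stdio.h>

    #define TX_SLOTS 10   /* stands in for the driver's echo-slot pool */

    static bool slot_busy[TX_SLOTS];

    /* Claim a free echo slot for an outgoing frame; -1 means the
     * transmit queue has to be stopped. */
    static int claim_slot(void)
    {
        for (int i = 0; i < TX_SLOTS; i++) {
            if (!slot_busy[i]) {
                slot_busy[i] = true;
                return i;
            }
        }
        return -1;
    }

    /* Runs when the device echoes the transmitted frame back. If the
     * device drops the frame instead, the slot is never released. */
    static void release_slot(int slot)
    {
        slot_busy[slot] = false;
    }

    int main(void)
    {
        for (unsigned long sent = 0; ; sent++) {
            int slot = claim_slot();
            if (slot < 0) {
                printf("all %d slots leaked after %lu frames: TX stalls\n",
                       TX_SLOTS, sent);
                return 0;
            }
            if (sent % 4 != 0)          /* pretend every 4th echo is lost */
                release_slot(slot);
        }
    }

Bringing the interface down and back up recreates the pool, which would line up with the down+up recovery noted earlier in the thread.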

I'm not sure how that Windows driver tracks echoed packets, so I'm not sure if this applies to @elliotwoods 's original problem or not.

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

I'm not sure if GitHub notified you @elliotwoods: I sent out a potential fix in #62 if you're interested in testing.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

Great!
I haven't built / updated the firmware myself so far, but I will try today.
(Previously I used canable's online updater tool.)

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

OK, built and uploaded. I will do some tests with flooding messages.

from candlelight_fw.

elliotwoods avatar elliotwoods commented on July 17, 2024

It still has the same issue. I have captured some video, I will upload a link when I get back to my desk.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

[...] I think the correct behavior is to stall the endpoint instead.

Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).

I think we should at least be clearing the queues in can_disable(). I figure that if that function is called, there's no point in trying to keep queued frames for the next can_enable().

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

[...] I think the correct behavior is to stall the endpoint instead.

Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).

Yea, I tried it and couldn't get it to unstall. Ended up just letting the STM32-side hardware NACK it instead.

from candlelight_fw.

mfalkjensen avatar mfalkjensen commented on July 17, 2024

[...] I think the correct behavior is to stall the endpoint instead.

Hmm, that may be a bit extreme - the only way to then un-stall it would be via EP0 (not sure ifdown + up would be sufficient).

Yea, I tried it and couldn't get it to unstall. Ended up just letting the STM32-side hardware NACK it instead.

I'm having similar problems. I have a Linux kernel 4.9.140 with a Canable Pro. It stops transmitting at some point when the CAN bus load is high. The higher the load, the sooner it stops transmitting. Receiving still works.

@brian-brt Did you experience the same and did you fix it?

from candlelight_fw.

brian-brt avatar brian-brt commented on July 17, 2024

@mfalkjensen that does sound like the same problem. #62 contains my fixes. With those fixes, I can't reproduce the problem any more.

Sorry, I lost track of that PR and forgot it wasn't merged yet. @fenugrec are you waiting for anything from me on fixing this?

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Ok, testing this right now. Currently on master, I changed #define CAN_ERRCOUNT_THRESHOLD 1, started cangen can0 -g 0 -i, and shorted the bus. A Wireshark capture shows over 19000 CAN error frames per second while the bus was shorted; communications resumed (at 99%+ bus load) immediately after.
Also tried running cangen from the other connected interface; no change: both devices recover when I unshort the bus.

Going to try maybe a different USB host/hub to see if I can stress this differently...

from candlelight_fw.

mfalkjensen avatar mfalkjensen commented on July 17, 2024

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

I'm putting this on hold until we update the USB library. 1.11.1 fixed significant issues that could explain some of the behaviour we're seeing.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Whoops, the commit message auto-closed this, but I want it kept open pending more tests.

from candlelight_fw.

fenugrec avatar fenugrec commented on July 17, 2024

Ok, I can't reproduce it anymore as of e6b9724. Hopefully because it's fixed and not because of current astrological conjecture.

I'm taking a chance in closing this, hope these issues are over. Anyone, feel free to reopen (if so, I will need the commit hash used and exact steps to reproduce).

from candlelight_fw.
