Intermittent unit test failure, about cryostatio/cryostat

Comments (18)

andrewazores commented on May 29, 2024

Here's another one:

[ERROR] Failures: 
[ERROR]   MessagingServerTest.shouldHandleRemovedConnections:145 
crw2.readLine();
Wanted 1 time:
-> at com.redhat.rhjmc.containerjfr.tui.ws.WsClientReaderWriter.readLine(WsClientReaderWriter.java:65)
But was 2 times:
-> at com.redhat.rhjmc.containerjfr.tui.ws.MessagingServer.lambda$addConnection$6(MessagingServer.java:123)
-> at com.redhat.rhjmc.containerjfr.tui.ws.MessagingServer.lambda$addConnection$6(MessagingServer.java:123)

from cryostat.

vic-ma commented on May 29, 2024

Ah, I've experienced this for some time now and was planning to ask you about it, since I wasn't sure if it was just me. Now I see it is already a known issue.

But I also started getting this same problem with a list-command test, right after I made the changes in #168. Do you get this as well sometimes?

[ERROR] Failures: 
[ERROR]   ListCommandTest.shouldPrintDownloadAndReportURL:192 
Verification in order failure
Wanted but not invoked:
cw.println(
    contains("	getDownloadUrl		http://example.com:1234/api/v1/targets/fooHost:1/recordings/foo
	getReportUrl		http://example.com:1234/api/v1/targets/fooHost:1/reports/foo")
);
-> at com.redhat.rhjmc.containerjfr.core.tui.ClientWriter.println(ClientWriter.java:54)
Wanted anywhere AFTER following interaction:
cw.println("Available recordings:");
-> at com.redhat.rhjmc.containerjfr.commands.internal.ListCommand.lambda$execute$0(ListCommand.java:86)

andrewazores commented on May 29, 2024

Yea, I've seen that a couple of times too. I don't know the exact root cause of these intermittent failures, although like I said initially it seems to only have started after we switched from Gradle back down to Maven and then configured Maven for parallel test execution. So, there's either something wrong with our parallel execution configuration, or we have some badly written asynchronous tests that can occasionally break. The problem with that second scenario is that often the test that gets marked as failing is actually not the one that is async and broken - it's just one that was running at the same time as the real async broken test that failed in another thread/process.

It'll slow down test executions a fair amount, but perhaps the best way to start troubleshooting this is to remove the parallel execution configuration so that everything runs serially. If any more intermittent failures occur after this change, it should be easier to trace exactly which async test(s) actually fail, and then correct the test from there.

vic-ma commented on May 29, 2024

I did some testing for this.

First, I ran mvn surefire:test 50 times to check the current behaviour. There were 30 successes and 20 failures. There were 18 failures of ListCommandTest.shouldPrintDownloadAndReportURL and three failures of MessagingServerTest.shouldHandleRemovedConnections (for one test run, both of these tests failed).

Then I went into the pom and changed junit.jupiter.execution.parallel.enabled to false. I wasn't sure if there were any other steps I needed to do to disable parallel execution, but from the new test output, it seemed like it worked.

Out of the 50 serial mvn surefire:test I ran, 35 were successes and 15 were failures. There were 15 failures of ListCommandTest.shouldPrintDownloadAndReportURL and no failures of any other tests.

Afterwards, I disabled the ListCommandTest.shouldPrintDownloadAndReportURL test and ran about 20 serial mvn surefire:test to see if I could get MessagingServerTest.shouldHandleRemovedConnections to fail in serial execution, but all the tests succeeded.

Then, still with the ListCommand test disabled, I re-enabled parallel execution and ran 20 mvn surefire:test to make sure that the MessagingServerTest.shouldHandleRemovedConnections still intermittently fails when running tests in parallel. I was able to get one failure of the test.

Finally, I wanted to properly check if MessagingServerTest.shouldHandleRemovedConnections ever fails in serial execution, since I only ran 20 the last time and it seems to fail pretty rarely. So, I ran 50 more mvn surefire:test in series (still with the ListCommand test disabled), and the result was that every command succeeded.
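
As a rough check on how much weight 50 clean serial runs carry, here is a minimal sketch. It assumes the runs are independent and that the serial failure rate would match the 3-in-50 rate seen in the first parallel batch; both are assumptions, and the class and method names are made up for illustration.

```java
public class CleanRunOdds {

    // Probability of seeing zero failures in `runs` independent runs when
    // each run fails with probability `failureRate`.
    public static double probabilityAllPass(double failureRate, int runs) {
        return Math.pow(1.0 - failureRate, runs);
    }

    public static void main(String[] args) {
        // 3 failures in 50 parallel runs, taken from the first batch above.
        double parallelRate = 3.0 / 50.0;
        System.out.println(probabilityAllPass(parallelRate, 50));
    }
}
```

Under those assumptions the chance of 50 clean runs is roughly 0.045, so a clean serial batch is decent evidence that this test only fails in parallel, though not proof.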

One last thing that may be worth noting is that any time I got a failure for MessagingServerTest.shouldHandleRemovedConnections, it was about Expected: "another message" but: was "hello world" rather than the second type that you found. I think in the past as well I haven't ever encountered this second type, although I guess it might just be rarer.

So, to summarize, it looks like:

MessagingServerTest.shouldHandleRemovedConnections fails pretty rarely
MessagingServerTest.shouldHandleRemovedConnections fails in parallel execution but not serial execution
ListCommandTest.shouldPrintDownloadAndReportURL fails quite frequently (~1/3 chance)
ListCommandTest.shouldPrintDownloadAndReportURL fails in both parallel and serial execution

I think from this we could say that the failures of these two tests are unrelated? And could we also say that the problem with ListCommandTest.shouldPrintDownloadAndReportURL is probably with the test itself (or more generally something in #168)? Since it's the only test that fails in serial execution.

andrewazores commented on May 29, 2024

Awesome, thanks for doing all of that investigation. I'm not convinced, though, that ListCommandTest.shouldPrintDownloadAndReportURL is actually the test at fault: it's a pretty simple, synchronous test, so it isn't clear how it could be faulty. The MessagingServer and its tests do contain some asynchronous execution, so to me that seems like a much more likely source of problems. I think it's likely that the ListCommandTest is actually fine and its reported failures are really false negatives caused by a faulty async test, whether that's the MessagingServerTest or another, which would explain why the failures can occur even in serial test executions.

vic-ma commented on May 29, 2024

Yep, it looks like you were right about ListCommandTest.shouldPrintDownloadAndReportURL not being the source of the problem, since when I deleted all the other tests I wasn't able to get an error anymore.

But I'm just confused as to why the test that was actually faulty did not cause any issues with any other tests beforehand, and now only breaks ListCommandTest.shouldPrintDownloadAndReportURL. Is this just some concurrency wizardry? Or is there maybe something in shouldPrintDownloadAndReportURL that other tests do not have, that makes it specifically vulnerable to the issue at hand?

Anyway, after this, I tried to look for the test that's causing the problem. The results for much of the testing I did on this are a bit confused now, but here is what I can say for sure, from the tests that I ran just now.

First, if DeleteSavedRecordingCommandTest and ListCommandTest are the only two tests that get run, the error will sometimes occur. Furthermore, the likelihood of the error occurring with just these two tests seems to be a bit lower than the likelihood when all tests are present.

Second, when every test is present except for DeleteSavedRecordingCommandTest, the error still sometimes occurs. However, in this case, the likelihood of the error is significantly lower than when all tests are present. In my most recent attempt, it took me around 50 test runs to get a single error. I made sure that it was indeed a failure from ListCommandTest.shouldPrintDownloadAndReportURL and not the other one.

So it seems like there are multiple tests that are each capable of causing the error, and I would guess that the likelihood of the error occurring in any given test run is equal to the sum of the independent likelihoods of each error-causing test that is present.
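
For what it's worth, the "sum of likelihoods" guess is a close approximation when the individual rates are small; if the causes really were independent, the exact figure would be one minus the product of the per-cause pass rates. A minimal sketch of the difference, using hypothetical rates that were not measured in this thread:

```java
public class CombinedFailureRate {

    // Probability that at least one of several independent causes triggers
    // a failure in a single run: 1 - product of (1 - p_i).
    public static double atLeastOne(double... rates) {
        double allQuiet = 1.0;
        for (double p : rates) {
            allQuiet *= (1.0 - p);
        }
        return 1.0 - allQuiet;
    }

    public static void main(String[] args) {
        // Hypothetical per-cause rates, chosen only for illustration.
        double[] rates = {0.25, 0.02};
        double sum = 0.0;
        for (double p : rates) {
            sum += p;
        }
        System.out.println("plain sum:            " + sum);
        System.out.println("independent combined: " + atLeastOne(rates));
    }
}
```

With rates of 0.25 and 0.02 the plain sum is 0.27 and the independent combination is 0.265, so the two only drift apart noticeably when the individual rates are large.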

One last thing to note is that I did all this testing with serial execution. The behaviour is different in parallel execution, though I'm not exactly sure to what extent, since I didn't test it much. At the very least, when I run every test except for DeleteSavedRecordingCommandTest in parallel execution, the likelihood of error is about as high as it is with all tests on, in contrast to in serial execution where the likelihood is greatly reduced. (Also, all-tests error rate seems to be consistent between parallel and serial execution.)

I'll stop working on this for now, since running all these tests takes a lot of time heh, but if there's any other testing that you think might be useful, feel free to let me know.

Alexjsenn commented on May 29, 2024

So as I was writing a similar test for list-saved, I also did some extensive testing and I believe I figured out why it is failing. For some reason the toString function will sometimes reverse the order of the two URLs it prints. At least that's what it looks like from the error logs. As I'm still learning some of the intricacies of Java, I'm not sure where this could be happening. I think it's this line: Method[] methods = ArrayUtils.addAll(descriptor.getClass().getDeclaredMethods());, and more specifically that getDeclaredMethods() does not necessarily preserve order, which is why the URL order is sometimes switched when printed. I don't know if this should be fixed when it's being printed, or if we should fix the tests.

Sorry, I was specifically referring to the ListCommandTest.shouldPrintDownloadAndReportURL error (and the new one my test would introduce), just to clarify.

Alexjsenn commented on May 29, 2024

On further thought, it would probably make the most sense to print fields in an explicit order instead of iterating through the list that sometimes changes order.

vic-ma commented on May 29, 2024

Yeah, I think that's probably the cause of the issue. I did 20 test runs with only ListCommandTest just now and was able to get a failure. I think the likelihood of it failing just decreases a lot when it's the only test run, which is why I thought it didn't break when it's the only test.

Using run.sh and creating recordings and calling list a bunch of times, I wasn't able to get a case of getDownloadUrl and getReportUrl swapping order, though that's probably beside the point, since it seems like the likelihood of this swapping occurring is heavily dependent on the runtime environment.

andrewazores commented on May 29, 2024

It's feasible I suppose, but I don't see how or why the order would not be preserved in such a case. Seems like this should be a very deterministic code path. It should be an easy enough hypothesis to test - just sort the resulting list in the implementation and set the test to expect the sorted order, then run your battery of repeated tests again and see if you can still trip this failure.
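
A minimal sketch of that experiment, for reference. The Javadoc for Class.getDeclaredMethods() says the returned array is not sorted and is not in any particular order, so sorting by name removes any dependence on what the JVM returns. The Descriptor class and the sortedMethodNames helper here are hypothetical stand-ins, not the project's actual code.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortedMethods {

    // Hypothetical stand-in for the real descriptor type; not taken from the project.
    public static class Descriptor {
        public String getReportUrl() { return "http://example.com/reports/foo"; }
        public String getDownloadUrl() { return "http://example.com/recordings/foo"; }
        public String getName() { return "foo"; }
    }

    // Returns the declared method names sorted alphabetically, so the result
    // does not depend on the order in which the JVM happens to report them.
    public static List<String> sortedMethodNames(Class<?> type) {
        return Arrays.stream(type.getDeclaredMethods())
                .filter(m -> !m.isSynthetic()) // skip compiler/tool-generated methods
                .sorted(Comparator.comparing(Method::getName))
                .map(Method::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sortedMethodNames(Descriptor.class));
    }
}
```

The test would then assert against the sorted order (getDownloadUrl before getReportUrl in this sketch) rather than whatever order reflection happened to produce on the author's machine.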

vic-ma commented on May 29, 2024

Yup, looks like that was it. Nice work figuring it out.

I'm guessing all the differences in failure frequency due to other tests run and parallel vs. serial just come down to how likely those changes were to cause getDeclaredMethods() to return the "incorrect" order.

############ Logging method invocation #2 on mock/spy ########
cw.println(
    "   getDuration     0
    getName     foo
    getId       1
    getState        STOPPED
    getStartTime        0
    isContinuous        false
    getToDisk       false                                                                                         
    getMaxSize      0
    getMaxAge       0
    getReportUrl        http://example.com:1234/api/v1/targets/fooHost:1/reports/foo
    getDownloadUrl      http://example.com:1234/api/v1/targets/fooHost:1/recordings/foo
"
);
   invoked: -> at com.redhat.rhjmc.containerjfr.commands.internal.ListCommand.lambda$execute$0(ListCommand.java:98)
   has returned: "null"

Side-note: I only just realized that you can debug mockito like this :x

Edit: Just saw your comment. Yeah, it does seem weird that the ordering of methods can change like that, and especially how the way we run the tests (other tests present, serial vs. parallel) can affect the probability of it changing so much. But I guess the method says it doesn't guarantee any order, so probably just some weirdness going on underneath?

What's also interesting is that the test has two recordings, foo and bar, and when this swapping failure happens, both foo and bar have their getReportUrl and getDownloadUrl swapped.

andrewazores commented on May 29, 2024

#330 does some rewriting of these tests and introduces a synchronous, test-controlled executor, so that should also correct the remaining async-related deficiencies with these tests.
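
For readers unfamiliar with the idea, a test-controlled executor might look roughly like the sketch below. This is not the implementation from #330; ManualExecutor and runAll are hypothetical names used only to illustrate the technique of queueing work and running it on the test's own thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical test helper: queues submitted tasks and runs them only when
// the test asks, always on the calling thread.
public class ManualExecutor implements Executor {

    private final Queue<Runnable> pending = new ArrayDeque<>();

    @Override
    public void execute(Runnable task) {
        pending.add(task);
    }

    // Runs every queued task, including tasks queued by other tasks,
    // and returns how many were run.
    public int runAll() {
        int count = 0;
        Runnable next;
        while ((next = pending.poll()) != null) {
            next.run();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ManualExecutor executor = new ManualExecutor();
        List<String> seen = new ArrayList<>();
        executor.execute(() -> seen.add("hello world"));
        executor.execute(() -> seen.add("another message"));
        System.out.println(seen.isEmpty()); // nothing has run yet
        executor.runAll();
        System.out.println(seen);
    }
}
```

Because nothing runs until the test calls runAll(), the order of side effects is fixed by the test itself rather than by thread scheduling, which is what removes the intermittent behaviour.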

andrewazores commented on May 29, 2024

Using the script in the linked PR I just opened, I ran 50 test runs and had 0 failures.
cjfr-unittests-2020-11-18T12:53-05:00.log

andrewazores commented on May 29, 2024

@Alexjsenn and @vic-ma could you run some test runs with that script on your machines and see what results turn up?

Alexjsenn commented on May 29, 2024

It seems to not fail anymore! I ran it three times and it's all good!

vic-ma commented on May 29, 2024

Works for me as well, 0/50 failures.

andrewazores commented on May 29, 2024

Awesome. Pretty sure that fix comes from #330's cleanup of the MessagingServerTests, so I'll close this now.
