All jobs for a couple of the testq* users are stuck i

all jobs stuck in DEPEND:max-jobs-limit on fluke (flux-framework/flux-accounting)

Comments (19)

grondo commented on August 11, 2024 1

I meant to cancel all jobs for the testqc user as an experiment, but accidentally did testqd. After that jobs now run for user testqd:

[testqd@fluke1:~]$ flux job cancelall -f --user=testqd
flux-job: Canceled 162 jobs (0 errors)
[testqd@fluke1:~]$ flux mini run -vvv hostname
jobid: ƒ7sbBer5bXV
0.000s: job.submit {"userid":61493,"urgency":16,"flags":0,"version":1}
0.015s: job.jobspec-update {"attributes.system.duration":60.0}
0.015s: job.validate
0.027s: job.depend
0.027s: job.priority {"priority":1084000}
0.033s: job.alloc {"annotations":{"sched":{"queue":"default"}}}
0.033s: job.prolog-start {"description":"job-manager.prolog"}
0.490s: job.prolog-finish {"description":"job-manager.prolog","status":0}
0.497s: job.start
0.492s: exec.init
0.495s: exec.starting
0.553s: exec.shell.init {"leader-rank":64,"size":1,"service":"61493-shell-51044583512997888"}
0.586s: exec.shell.start {"task-count":1}
fluke67
0.589s: exec.shell.task-exit {"localid":0,"rank":0,"state":"Exited","pid":699877,"wait_status":0,"signaled":0,"exitcode":0}
0.591s: exec.complete {"status":0}
0.591s: exec.done
0.591s: job.finish {"status":0}

from flux-accounting.

cmoussa1 commented on August 11, 2024 1

Interesting. If this is on fluke (and flux-accounting v0.19.0 is installed), we should be able to query the internal state of the plugin now to look at job counts for the testqc and testqd users using the flux jobtap query command. I can try that to see if anything looks off.

grondo commented on August 11, 2024 1

Good catch Chris! Maybe for now you could look at the code where the bank entry is assigned and ensure the "DNE" entry is never used, logging an error when it is and skipping to the next non-DNE bank as the default. This would let us catch when this problem is occurring, while at the same time possibly fixing the issue.
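A minimal sketch of that suggestion, assuming a simplified `BankEntry` struct and a `pick_default_bank ()` helper (both hypothetical names, not the plugin's actual types or functions):

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one of the plugin's bank entries.
struct BankEntry {
    std::string bank;
    int active;
};

// Return the first bank that is not the "DNE" placeholder, or nullptr when
// the user has no real bank yet. Each skipped placeholder is logged so the
// condition shows up instead of passing silently.
const BankEntry *pick_default_bank (const std::vector<BankEntry> &banks)
{
    const BankEntry *choice = nullptr;
    for (const BankEntry &b : banks) {
        if (b.bank == "DNE") {
            std::cerr << "mf_priority: DNE entry skipped as default bank\n";
            continue;
        }
        choice = &b;
        break;
    }
    // Postcondition: the placeholder is never handed out as a default.
    assert (choice == nullptr || choice->bank != "DNE");
    return choice;
}
```

A caller that gets nullptr back would still have to decide whether to hold or reject the job; the sketch only covers the selection step.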

grondo commented on August 11, 2024 1

If the user continues to submit jobs without having valid user/bank information loaded in the plugin, it will continue to use the "DNE" information it already has when the first job was submitted. But because the max_active_jobs limit for the "DNE" entry is set to 0, the max_active_jobs limit is never triggered because of the following check in the job.validate callback:

Since the user/bank hasn't been configured with a limit yet, it seems ok to allow the user to submit as many jobs as they want until their bank entry has been created by the external plugin update. I'm not sure there is much else we can do to work around this race condition except either configuring a built-in default limit as you suggest, or by rejecting all jobs for users that do not have a configured bank (though this last one might need special handling on a restart)

However, I think what is perhaps missing from the mf_priority.so plugin is that the plugin needs to search for jobs using the DNE bank entry after the bank entries are externally updated, and properly account for those jobs. If there are more active jobs than the newly configured bank allows, then those jobs should perhaps have an exception raised on them. Meanwhile, the jobs should have their mf_priority:bank_info pointer updated to the correct bank.

Once this is done, it should be an invariant that any job in any state beyond PRIORITY should not have a "DNE" bank entry.

grondo commented on August 11, 2024 1

Ok, I think @cmoussa1 and I got to the bottom of at least one issue here.

Currently, the count of running jobs for a bank is being updated in the mf_priority priority_cb() function. This function is called for both the job.state.priority and job.priority.get callbacks, so it actually can be called multiple times per job. In fact, when the accounting data is updated in mf_priority.so via flux account-priority-update all jobs are reprioritized, so every pending job will end up incrementing cur_run_jobs each time this command is used (once per hour I think normally?)

To reproduce, I ran a job that used all the resources on fluke, then submitted 4 more jobs. A query of my bank shows:

# flux jobtap query mf_priority.so | jq -S '.mf_priority_map[] | select(.userid == 6885) | .banks[1]'
{
  "active": 1,
  "bank": "lc",
  "cur_active_jobs": 5,
  "cur_run_jobs": 5,
  "fairshare": 0.25,
  "held_jobs": [],
  "max_active_jobs": 4000,
  "max_run_jobs": 100
}

Note that cur_run_jobs is 5, showing that indeed, this counter is being incremented for pending jobs instead of only running jobs.

I then ran flux account-priority-update to force a reprioritization of jobs, and queried mf_priority.so again:

# flux jobtap query mf_priority.so | jq -S '.mf_priority_map[] | select(.userid == 6885) | .banks[1]'
{
  "active": 1,
  "bank": "lc",
  "cur_active_jobs": 5,
  "cur_run_jobs": 9,
  "fairshare": 0.25,
  "held_jobs": [],
  "max_active_jobs": 4000,
  "max_run_jobs": 100
}

Note that cur_run_jobs has been incremented by 4, the number of pending jobs, and is now greater than cur_active_jobs which should never happen.

The fix will be to move the increment of cur_run_jobs to the proper callback(s).
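The arithmetic in the two query outputs can be replayed with a small model. This is only a sketch: `BankCounts`, `priority_cb_buggy ()`, `run_cb ()` and `replay ()` are illustrative names, and it assumes the run-state callback fires once per job.

```cpp
#include <cassert>

// Minimal stand-in for the per-bank counters shown in the query output.
struct BankCounts {
    int cur_active_jobs = 0;
    int cur_run_jobs = 0;
};

// Buggy placement: the priority callback can fire several times per job
// (job.state.priority, job.priority.get, and every reprioritization).
void priority_cb_buggy (BankCounts &b)
{
    b.cur_run_jobs++;
}

// Intended placement: count a job as running only when it enters RUN.
void run_cb (BankCounts &b)
{
    b.cur_run_jobs++;
}

// Replay the reproducer: 5 submitted jobs, 1 running and 4 pending,
// followed by one reprioritization of the pending jobs.
BankCounts replay (bool buggy)
{
    const int submitted = 5, running = 1, pending = 4;
    assert (running + pending == submitted);

    BankCounts b;
    b.cur_active_jobs = submitted;
    if (buggy) {
        for (int i = 0; i < submitted; i++)
            priority_cb_buggy (b);   // initial priority assignment
        for (int i = 0; i < pending; i++)
            priority_cb_buggy (b);   // flux account-priority-update
    } else {
        for (int i = 0; i < running; i++)
            run_cb (b);
    }
    return b;
}
```

With `replay (true)` the model ends at cur_run_jobs = 9 with cur_active_jobs = 5, matching the second query; with `replay (false)` it ends at cur_run_jobs = 1.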

cmoussa1 commented on August 11, 2024 1

Thanks for helping me narrow this down @grondo. As I see it and as we've discussed, there are a couple of fixes to the plugin that might fix some of these issues:

The plugin should not increment cur_run_jobs in job.state.priority because the job will not necessarily enter RUN state after it receives a priority. Jobs enter SCHED state after receiving a priority and could stay there while waiting for the requested resources before actually running. So, a callback should be added to the plugin for job.state.run that increments cur_run_jobs there instead of in job.state.priority.
The inactive_cb() should check that the job that transitioned to INACTIVE state actually ran, with:

if (flux_jobtap_job_event_posted (p, FLUX_JOBTAP_CURRENT_JOB, "alloc"))

If true, then it is known that the job ran, and so both cur_active_jobs and cur_run_jobs should be decremented. If false, then only cur_active_jobs should be decremented and the check to see if a currently held job should be released can be skipped.

A reproducer of the issue should also be added to the testsuite, consisting of submitting a job that consumes all resources, submitting more jobs that will be held in SCHED state, and ensuring that cur_run_jobs and cur_active_jobs are correct.
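As a rough sketch of the INACTIVE accounting described above, the `ran` flag below stands in for the result of the "alloc" event check. `BankCounts` and `account_inactive ()` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Simplified stand-in for the two per-bank counters discussed here.
struct BankCounts {
    int cur_run_jobs;
    int cur_active_jobs;
};

// Update the counters for a job that just became INACTIVE.
// Returns true when the caller should go on to check whether a held job
// can be released, and false when that check can be skipped.
bool account_inactive (BankCounts &b, bool ran)
{
    assert (b.cur_active_jobs > 0);
    b.cur_active_jobs--;
    if (!ran)
        return false;   // never ran: the running count is left alone
    assert (b.cur_run_jobs > 0);
    b.cur_run_jobs--;
    return true;
}
```

The asserts are there to make counter underflow loud in a test build; a production plugin would more likely log and recover.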

grondo commented on August 11, 2024 1

A reproducer of the issue should also be added to the testsuite, consisting of submitting a job that consumes all resources, submitting more jobs that will be held in SCHED state, and ensuring that cur_run_jobs and cur_active_jobs are correct.

Also

run flux account-priority-update and ensure cur_run_jobs and cur_active_jobs do not change
change job priorities with flux job urgency ID value and ensure cur_run_jobs and cur_active_jobs do not change
cancel at least one job that did not run and ensure active jobs count is correct
allow all jobs to complete and ensure cur_run_jobs and cur_active_jobs are 0
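Each of these checks boils down to a few conditions on the two counters. A possible shape for shared test predicates, assuming a simplified `BankCounts` struct (the helper names are made up for this sketch):

```cpp
#include <cassert>

// Simplified stand-in for the counters returned by the plugin query.
struct BankCounts {
    int cur_run_jobs;
    int cur_active_jobs;
};

// Running jobs are a subset of active jobs, and neither count may go negative.
bool counts_consistent (const BankCounts &b)
{
    return b.cur_run_jobs >= 0
        && b.cur_active_jobs >= 0
        && b.cur_run_jobs <= b.cur_active_jobs;
}

// Expected once every job has completed or been canceled.
bool counts_drained (const BankCounts &b)
{
    return b.cur_run_jobs == 0 && b.cur_active_jobs == 0;
}

// Helper a test could call after each step of the checklist.
void expect_consistent (const BankCounts &b)
{
    assert (counts_consistent (b));
}
```

Both inconsistent snapshots reported in this thread (cur_run_jobs 9 with 5 active jobs, and 111 with 17) fail `counts_consistent ()`.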

cmoussa1 commented on August 11, 2024

Just to follow up on this, @garlick brought up a great point in our weekly meeting today: perhaps this issue was caused by mishandling of some active jobs in the PENDING state after a broker restart following a segfault on Fluke. When the system instance started up again, perhaps the priority plugin miscounted the already active jobs and incorrectly added dependencies on subsequently submitted jobs.

FWIW, the priority plugin currently has sharness tests for the scenario where a plugin is loaded "late" (i.e., after jobs are submitted), but not for the case where jobs are pending/active and then the instance is restarted. Is that possible to recreate in a sharness test? If so, I'd be curious to see if I can actually reproduce the issue reported above...

@garlick: let me know if my above description didn't actually reflect what you mentioned this morning and there are things to change or add.

ryanday36 commented on August 11, 2024

Looks like some more jobs got stuck in a DEPEND state over the weekend:

[day36@fluke108:~]$ flux jobs -A -o "{id.f58:>12} {username:<8.8} {status_abbrev:>2.2} {ntasks:>6} {nnodes:>6h} {t_submit!D} {t_depend!D}"
       JOBID USER     ST NTASKS NNODES T_SUBMIT T_DEPEND
 ƒ6Mmwz6sB6w testqc    D      2      1 2022-08-07T18:00:13 2022-08-07T18:00:13
 ƒ6Mmx5PkcSK testqc    D      2      1 2022-08-07T18:00:13 2022-08-07T18:00:13
 ƒ6MmxAge3mh testqc    D      2      1 2022-08-07T18:00:13 2022-08-07T18:00:13
 ƒ6MmxFvZWYP testqc    D      2      1 2022-08-07T18:00:13 2022-08-07T18:00:13
 ƒCfSiqWk9W7 testqd    D      2      1 2022-09-09T18:00:15 2022-09-09T18:00:15
 ƒCfSiwBsQ4f testqd    D      2      1 2022-09-09T18:00:15 2022-09-09T18:00:15
 ƒCfSj2uxdBu testqd    D      2      1 2022-09-09T18:00:16 2022-09-09T18:00:16
 ƒCfSj8fXqbV testqd    D      2      1 2022-09-09T18:00:16 2022-09-09T18:00:16
 ƒCfSjE7KCcw testqd    D      2      1 2022-09-09T18:00:16 2022-09-09T18:00:16
 ƒCfSjKf2Wmm testqd    D      2      1 2022-09-09T18:00:16 2022-09-09T18:00:16
 ƒCfSjR6osoD testqd    D      2      1 2022-09-09T18:00:16 2022-09-09T18:00:16
 ƒCfSjWd3Cfh testqd    D      2      1 2022-09-09T18:00:17 2022-09-09T18:00:17
 ƒCfSjc6JYyV testqd    D      2      1 2022-09-09T18:00:17 2022-09-09T18:00:17
 ƒCfSjhcXsqy testqd    D      2      1 2022-09-09T18:00:17 2022-09-09T18:00:17
 ƒCrmCZaTsmZ testqc    D      2      1 2022-09-10T18:00:12 2022-09-10T18:00:12
 ƒCrmCf5DDMh testqc    D      2      1 2022-09-10T18:00:12 2022-09-10T18:00:12
 ƒCrmCkWzaP9 testqc    D      2      1 2022-09-10T18:00:12 2022-09-10T18:00:12
 ƒCrmCqzFvgw testqc    D      2      1 2022-09-10T18:00:12 2022-09-10T18:00:12
 ƒCrmCwTXGzj testqc    D      2      1 2022-09-10T18:00:13 2022-09-10T18:00:13
 ƒCrmD2xGcas testqc    D      2      1 2022-09-10T18:00:13 2022-09-10T18:00:13
 ƒCrmD8Q3ycK testqc    D      2      1 2022-09-10T18:00:13 2022-09-10T18:00:13
 ƒCrmDDqqLdm testqc    D      2      1 2022-09-10T18:00:13 2022-09-10T18:00:13
[day36@fluke108:~]$

cmoussa1 commented on August 11, 2024

To cap off some of the discussion from today's coffee call, the internal fetch of the plugin state revealed some information, but probably not all that's required to track down this issue. The internal plugin fetch revealed that both testqc and testqd have DNE entries, which means that for some reason the plugin did not have the user account information for both of these users at the time of job submission, so it assigned BANK_INFO_MISSING information for these users. What is odd, though, is that there exist entries for both of these users in the flux-accounting database as revealed by flux account view-user and flux account-shares. Some more digging into the plugin is required.

@grondo had made a good suggestion of adding a debug mode for the plugin where it essentially just records all of the actions it can about submitted jobs, limit assignment, receiving updates, and more, and adding capability to turn it on or off.

@grondo: could you let me know what adding this support to the plugin would look like? Sorry that I am unfamiliar with it. I'm not sure if it just means adding a bunch of print statements throughout the plugin. Is there an example somewhere in flux-core or flux-sched that does this?

EDIT: Just realized that I think that the issue is that the testqc jobs are in fact being submitted under the DNE bank after looking at the plugin's internal info:

{
  "bank": "DNE",
  "fairshare": 0.1,
  "max_run_jobs": -9,
  "cur_run_jobs": 0,
  "max_active_jobs": 0,
  "cur_active_jobs": 11, <--
  "held_jobs": [],
  "active": 0
},

which aligns with the 11 testqc jobs that are in DEPEND state above.

cmoussa1 commented on August 11, 2024

OK! That sounds like a good plan. I'll try to make some progress on this. Maybe in the process I'll actually be able to figure out how to reproduce this issue in the first place too.

cmoussa1 commented on August 11, 2024

Just want to note a couple of things after an hour or two of digging at this:

I started looking more closely at the flow of submitted jobs when an association (i.e., user/bank) does not have an entry in the plugin's internal map. In this case, the plugin is supposed to add a "DNE" entry to the internal map for the user that submitted the job with some arbitrary info regarding job count limits. This information is set as follows:

user gets an entry in internal map under the bank "DNE"
user's default bank gets set to "DNE"

fairshare = 0.1
max_run_jobs = BANK_INFO_MISSING <-- set to -9
cur_run_jobs = 0
max_active_jobs = 0
cur_active_jobs = 0
active = 1
held_jobs = list of long int job ID's

When this first job transitions to job.state.priority, the callback checks for the value of max_run_jobs, and if it sees the BANK_INFO_MISSING value (-9), it returns flux_jobtap_priority_unavail () and the job is held in PRIORITY state. This is where I found a couple of issues with the plugin, but I think they are only somewhat related to this issue:

If the user continues to submit jobs without having valid user/bank information loaded in the plugin, it will continue to use the "DNE" information it already has when the first job was submitted. But because the max_active_jobs limit for the "DNE" entry is set to 0, the max_active_jobs limit is never triggered because of the following check in the job.validate callback:

// if a user/bank has reached their max_active_jobs limit, subsequently
// submitted jobs will be rejected
if (max_active_jobs > 0 && cur_active_jobs >= max_active_jobs)
    return flux_jobtap_reject_job (p, args, "user has max active jobs");

So, above, this check is never satisfied because of the first half of this if-condition. I think this means that as long as a user does not have valid user/bank information loaded to the plugin, they can submit as many jobs as they want. 🤦‍♂️ The same thing goes for the check in job.state.depend:

// if user has already hit their max running jobs count, add a job
// dependency to hold job until an already running job has finished
if ((b->max_run_jobs > 0) && (b->cur_run_jobs == b->max_run_jobs)) {
    if (flux_jobtap_dependency_add (p,
                                    id,
                                    "max-running-jobs-user-limit") < 0) {
        flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                     "mf_priority", 0, "failed to add " \
                                     "job dependency");

        return -1;
    }
    b->held_jobs.push_back (id);
}

So the values for the "DNE" entry will need to be reworked. I'm not sure what would be a good value for a max active jobs limit though. Should the user be able to submit up to, say, 10 jobs or something? Or just 1 job?
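To see why neither limit can fire for the placeholder, the two conditions quoted above can be evaluated against the "DNE" values on their own. In this sketch `BankLimits` and the two predicate names are illustrative; only the boolean expressions are taken from the snippets above.

```cpp
#include <cassert>

// Simplified stand-in holding just the four fields the checks read.
struct BankLimits {
    int max_run_jobs;
    int cur_run_jobs;
    int max_active_jobs;
    int cur_active_jobs;
};

// Mirrors the job.validate condition quoted above.
bool rejects_at_validate (const BankLimits &b)
{
    return b.max_active_jobs > 0 && b.cur_active_jobs >= b.max_active_jobs;
}

// Mirrors the job.state.depend condition quoted above.
bool holds_at_depend (const BankLimits &b)
{
    return b.max_run_jobs > 0 && b.cur_run_jobs == b.max_run_jobs;
}

// With the "DNE" values (max_run_jobs = -9, max_active_jobs = 0) the first
// half of each condition is false, so no job is rejected or held.
void dne_example ()
{
    const BankLimits dne = {-9, 0, 0, 11};
    assert (!rejects_at_validate (dne));
    assert (!holds_at_depend (dne));
}
```

Any positive placeholder limit would make the first half of each condition true again, which is the design question raised here.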

What I am still trying to figure out is why the jobs stuck in DEPEND state show up under the "DNE" entry in the plugin's internal data returned by flux jobtap query, since just above, I don't think a dependency should be added at all to the submitted jobs:

    {
      "userid": 61490,
      "banks": [
        {
          "bank": "DNE",
          "fairshare": 0.10000000000000001,
          "max_run_jobs": -9,
          "cur_run_jobs": 0,
          "max_active_jobs": 0,
          "cur_active_jobs": 11,
          "held_jobs": [],
          "active": 0
        },
        {
          "bank": "dev",
          "fairshare": 0.98742099999999999,
          "max_run_jobs": 100,
          "cur_run_jobs": 0,
          "max_active_jobs": 1000,
          "cur_active_jobs": 0,
          "held_jobs": [],
          "active": 1
        }
      ]
    },

cmoussa1 commented on August 11, 2024

However, I think what is perhaps missing from the mf_priority.so plugin is that the plugin needs to search for jobs using the DNE bank entry after the bank entries are externally updated, and properly account for those jobs. If there are more active jobs than the newly configured bank allows, then those jobs should perhaps have an exception raised on them. Meanwhile, the jobs should have their mf_priority:bank_info pointer updated to the correct bank.

Ah, okay. I think I see what you are saying. Currently, when flux_jobtap_reprioritize_all () is called and as every job re-enters job.state.priority, each job is checked to see if it is associated with a "DNE" entry. If it is, then it performs a lookup in the internal map to see if it can find updated information on the user and any bank information. If it still does not have the information it needs, it continues to return flux_jobtap_priority_unavail () until the next flux_jobtap_reprioritize_all (). If it does find information, then it will assign the user's correct bank information to the job using flux_jobtap_aux_set (). Let me know if this is what you were thinking in your comment or if you were thinking of something in addition (sorry in advance if I misunderstood!).

What the plugin does not do currently is look at the job counts for the newly configured bank information after a flux account-priority-update, so that would be some functionality to add in priority_cb (), along with raising a job exception if those limits are hit (as you mentioned above), since a dependency cannot be added on the job in that state.
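One possible shape for that extra step, written as a standalone model rather than against the jobtap API. `BankInfo`, `Outcome` and `resolve_job_bank ()` are hypothetical, and the policy shown (report over-limit jobs to the caller, otherwise move the job's count from "DNE" to the real bank) is an assumption based on this discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a user's bank entry.
struct BankInfo {
    std::string bank;
    int max_active_jobs;
    int cur_active_jobs;
};

enum class Outcome { PriorityUnavailable, Reassigned, OverLimit };

// `job_bank` is the index of the "DNE" entry the job currently points at.
// On success it is updated to the index of the real bank.
Outcome resolve_job_bank (std::vector<BankInfo> &banks, std::size_t &job_bank)
{
    assert (job_bank < banks.size () && banks[job_bank].bank == "DNE");
    for (std::size_t i = 0; i < banks.size (); i++) {
        BankInfo &real = banks[i];
        if (real.bank == "DNE")
            continue;
        if (real.max_active_jobs > 0
            && real.cur_active_jobs >= real.max_active_jobs)
            return Outcome::OverLimit;      // caller would raise a job exception
        banks[job_bank].cur_active_jobs--;  // job leaves the placeholder
        real.cur_active_jobs++;             // and is counted under the real bank
        job_bank = i;
        return Outcome::Reassigned;
    }
    return Outcome::PriorityUnavailable;    // still no real bank: keep holding
}
```

In the plugin, the `Reassigned` branch would correspond to the point where the job's bank information is replaced with flux_jobtap_aux_set ().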

cmoussa1 commented on August 11, 2024

After some more digging in the dependencies part of the plugin, I think I've stumbled across what I think might be a cause of the issue:

The root problem listed at the top of this thread was that a number of jobs were stuck in DEPEND state due to a max-running-jobs-limit: a limit that determines how many running jobs a user can have at any given time. Jobs submitted after this limit is hit have a dependency placed on them (max-running-jobs-limit) until a currently running job transitions to job.state.inactive. There, both currently_running_jobs and currently_active_jobs are decremented by 1, and a check is performed on the user/bank combo to see whether 1) there are any held jobs, and 2) there is room to have another running job (i.e., currently_running_jobs < max_running_jobs).

Where there is potentially incorrect functionality is when a job that is held in DEPEND state gets cancelled and transitions to job.state.inactive before running. Currently, this still decrements currently_running_jobs and currently_active_jobs, which is wrong because the job never was in RUN state. This ultimately results in an incorrect running jobs count, and could be a reason why a number of jobs got stuck in DEPEND. In this case, the plugin should instead just decrement the user/bank's current_active_jobs count, remove itself from the list of held_jobs, and return.

Another potentially incorrect scenario arises when a job held in DEPEND transitions to job.state.inactive but is not the first held job in the user/bank combo's list of held jobs. This results in a held job being released when there are already a max number of running jobs. In this case, the plugin should instead just decrement the user/bank's current_active_jobs count, remove itself from the list of held_jobs, and then return.
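Both scenarios end with the same three steps, which might look roughly like this. It is a sketch with a simplified `BankCounts` struct and a hypothetical `drop_held_job ()` helper, not the plugin's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a bank's counters and its list of held job IDs.
struct BankCounts {
    int cur_run_jobs;
    int cur_active_jobs;
    std::vector<std::uint64_t> held_jobs;
};

// A job held in DEPEND went INACTIVE without ever running: decrement only
// the active count and remove the job from held_jobs, wherever it sits in
// the list. cur_run_jobs is untouched and no other held job is released.
void drop_held_job (BankCounts &b, std::uint64_t id)
{
    assert (b.cur_active_jobs > 0);
    b.cur_active_jobs--;
    auto it = std::find (b.held_jobs.begin (), b.held_jobs.end (), id);
    if (it != b.held_jobs.end ())
        b.held_jobs.erase (it);
}
```

Searching by ID rather than popping the front of the list is what covers the case where the cancelled job is not the first held job.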

cmoussa1 commented on August 11, 2024

During our coffee call on Friday I mentioned that I believed I found a potential reproducer to the issue described at the top of this thread. I'll write it here for some thoughts on how the current plugin might need to add or change functionality to support jobs in DEPEND state after a restart:

In my interactive Docker container (and inside a Flux instance), I created a flux-accounting database and added just one user with a low max_running_jobs limit but a high max_active_jobs limit and sent it to a loaded multi-factor priority plugin. I submitted the max number of jobs to the queue, which, when listed with flux jobs -a, showed a couple of running jobs and then a number of jobs in DEPEND (for more clarity, I added a print statement to notify me when a job had a dependency added to it because of the max_running_jobs limit).

While the jobs were still in the queue, I shut down the Flux instance with flux shutdown. The jobs that were running transitioned to inactive, and the jobs in DEPEND remained in DEPEND. However, the inactive jobs removed dependencies for the held jobs in the queue. These jobs did not transition to RUN, though.

I then started up a Flux instance again and specified a path to the rundir from the last instance. I loaded the multi-factor priority plugin. Once the plugin was loaded, the job-manager output a message saying something like:

BUG: job id didn't return priority <-- I think these were the jobs that had their dependencies removed by the running jobs from the last instance.

I then sent the data from the flux-accounting DB to the plugin (which also calls reprioritize_all () on all of the current jobs). 2 jobs ran successfully, and the rest of the jobs in DEPEND state remained with a max-running-jobs-user-limit dependency indefinitely.

I'll try to work on a sharness test that could potentially recreate this scenario, but I just wanted to post this to maybe get some thoughts on where the plugin is mishandling/not handling at all a restarted instance with jobs in DEPEND state.

grondo commented on August 11, 2024

Unsure if this is helpful or not, but a set of jobs were submitted Mar 09 at 18:00 and several hit this issue. Since Flux has not been restarted in 2d, I don't think a restart fully explains this issue.

[root@fluke108:~]# flux jobs -Ao long
       JOBID QUEUE    USER     NAME          STATUS NTASKS NNODES     T_SUBMIT  T_REMAINING     TIME INFO
 ƒno37shjxwy batch    testqb   ./buildan+    DEPEND     16     16  Mar09 18:00            -     3.1h depends:max-running-jobs-user-limit
 ƒno37yGwGP9 batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38512VWP batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38Akbhuy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 ƒno38GQEyCB batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 ƒno38MySGdM batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38TpwSAK batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38ZjQaFy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 ƒno38ffMhdy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 ƒno38mTttcF batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38sHv4rs batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno38y7wF7V batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 ƒno394tzSoR batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 ƒno39Ag3eVM batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno39GQ8scb batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 ƒno39MyLB3m batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 ƒno39TSbXMZ batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
[root@fluke108:~]# flux uptime
 12:32:31 run 2d,  owner flux,  depth 0,  size 101,  5 drained,  6 offline

Here's the output of flux jobtap query mf_priority.so for user testqb:

{
      "userid": 61489,
      "banks": [
        {
          "bank": "guests",
          "fairshare": 0.012500000000000001,
          "max_run_jobs": 100,
          "cur_run_jobs": 111,
          "max_active_jobs": 1000,
          "cur_active_jobs": 17,
          "held_jobs": [
            340140222702422016,
            340140226359855104,
            340140230117951488,
            340140233892825088,
            340140237600589824,
            340140241258022912,
            340140245100005376,
            340140248975542272,
            340140252867856384,
            340140256676284416,
            340140260501489664,
            340140264326694912,
            340140268118345728,
            340140271909996544,
            340140275668092928,
            340140279325526016,
            340140282915850240
          ],
          "active": 1
        }
      ]
    }
}

Is there an explanation why cur_run_jobs is 111 when active jobs is 17?

cmoussa1 commented on August 11, 2024

The information returned by flux jobtap query makes me think that for some reason the cur_run_jobs count is not getting decremented when jobs are completed from testqb. It's just a guess at this point, but what comes to mind just off the top of my head is the cur_run_jobs count for testqb is not getting decremented when the instance is shut down and currently running jobs are getting cleaned up and transitioning to inactive. I'm also not sure why it is greater than max_run_jobs (100)?

grondo commented on August 11, 2024

Does the mf_priority plugin carry any state over a shutdown? When an instance is shut down, any jobtap plugins are unloaded (I suppose that is obvious), then reloaded when the instance starts again. Since Flux does not support recovery of running jobs at this time, it is guaranteed that there were no running jobs when the plugin was loaded.

What is also interesting is that 1 of the jobs submitted by user testqb on Mar 09 did get scheduled, and it appears all the jobs from the previous day also ran:

$ flux jobs -a --user=testqb --since=-2.1d -o long
       JOBID QUEUE    USER     NAME          STATUS NTASKS NNODES     T_SUBMIT  T_REMAINING     TIME INFO
 fno37shjxwy batch    testqb   ./buildan+    DEPEND     16     16  Mar09 18:00            -     3.1h depends:max-running-jobs-user-limit
 fno37yGwGP9 batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38512VWP batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38Akbhuy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 fno38GQEyCB batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 fno38MySGdM batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38TpwSAK batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38ZjQaFy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 fno38ffMhdy batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 fno38mTttcF batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38sHv4rs batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno38y7wF7V batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 fno394tzSoR batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 fno39Ag3eVM batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno39GQ8scb batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     6.1h depends:max-running-jobs-user-limit
 fno39MyLB3m batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     2.1h depends:max-running-jobs-user-limit
 fno39TSbXMZ batch    testqb   ./buildan+    DEPEND      1      1  Mar09 18:00            -     1.1h depends:max-running-jobs-user-limit
 fno37myejpj batch    testqb   ./buildan+ COMPLETED     54     54  Mar09 18:00            -   1.785h fluke[6-16,18-23,25-60,62]
 fnbichWxzFh batch    testqb   ./buildan+ COMPLETED     54     54  Mar08 18:00            -   1.784h fluke[6-16,18-23,25-60,62]
 fnbicoC6EpF batch    testqb   ./buildan+ COMPLETED     16     16  Mar08 18:00            -   46.12s fluke[63-65,67-71,73-78,80-81]
 fnbieLQB2Kh batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   43.97s fluke87
 fnbieEyseab batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   43.95s fluke83
 fnbie9Y6HZ9 batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   1.026m fluke87
 fnbie46JvXh batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   44.15s fluke83
 fnbidxbZawZ batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   14.55s fluke87
 fnbids5LG55 batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   44.24s fluke83
 fnbidmW8xdu batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -    43.8s fluke87
 fnbidfuTfvP batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   43.93s fluke83
 fnbidaELRMq batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   45.27s fluke87
 fnbidUFRKR9 batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   15.41s fluke83
 fnbidNbn48w batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   44.72s fluke87
 fnbidH16mRR batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -    46.6s fluke83
 fnbid5ug97m batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   44.74s fluke87
 fnbidBTPTGb batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   32.94s fluke83
 fnbiczSQnoy batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   45.55s fluke83
 fnbictmHYFR batch    testqb   ./buildan+ COMPLETED      1      1  Mar08 18:00            -   45.05s fluke87

Maybe there were some jobs still active after the recent flux restart, but only a few (much less than 100) should have been able to run since the restart, which makes the cur_run_jobs value for this user suspect.

cmoussa1 commented on August 11, 2024

I've checked in on fluke a couple of times now since #325 has landed and haven't seen that set of jobs get stuck in DEPEND state since. I think I might close this for now, and if it ever pops back up again, we can always re-open this.
