Comments (19)
I meant to cancel all jobs for the `testqc` user as an experiment, but accidentally did `testqd`. After that, jobs now run for user `testqd`:
[testqd@fluke1:~]$ flux job cancelall -f --user=testqd
flux-job: Canceled 162 jobs (0 errors)
[testqd@fluke1:~]$ flux mini run -vvv hostname
jobid: ƒ7sbBer5bXV
0.000s: job.submit {"userid":61493,"urgency":16,"flags":0,"version":1}
0.015s: job.jobspec-update {"attributes.system.duration":60.0}
0.015s: job.validate
0.027s: job.depend
0.027s: job.priority {"priority":1084000}
0.033s: job.alloc {"annotations":{"sched":{"queue":"default"}}}
0.033s: job.prolog-start {"description":"job-manager.prolog"}
0.490s: job.prolog-finish {"description":"job-manager.prolog","status":0}
0.497s: job.start
0.492s: exec.init
0.495s: exec.starting
0.553s: exec.shell.init {"leader-rank":64,"size":1,"service":"61493-shell-51044583512997888"}
0.586s: exec.shell.start {"task-count":1}
fluke67
0.589s: exec.shell.task-exit {"localid":0,"rank":0,"state":"Exited","pid":699877,"wait_status":0,"signaled":0,"exitcode":0}
0.591s: exec.complete {"status":0}
0.591s: exec.done
0.591s: job.finish {"status":0}
from flux-accounting.
Interesting. If this is on fluke (and flux-accounting v0.19.0 is installed), we should be able to query the internal state of the plugin now to look at job counts for the `testqc` and `testqd` users using the `flux jobtap query` command. I can try that to see if anything looks off.
Good catch, Chris! Maybe for now you could look at the code where the bank entry is assigned and ensure the "DNE" entry is never used as the default, logging an error when it is and skipping to the next non-DNE bank. This would let us catch when this problem occurs, while possibly fixing the issue at the same time.
If the user continues to submit jobs without having valid user/bank information loaded in the plugin, it will continue to use the "DNE" information it already has from when the first job was submitted. But because the `max_active_jobs` limit for the "DNE" entry is set to 0, the `max_active_jobs` limit is never triggered, because of the following check in the `job.validate` callback:
Since the user/bank hasn't been configured with a limit yet, it seems OK to allow the user to submit as many jobs as they want until their bank entry has been created by the external plugin update. I'm not sure there is much else we can do to work around this race condition, except either configuring a built-in default limit as you suggest, or rejecting all jobs for users that do not have a configured bank (though this last one might need special handling on a restart).
However, I think what is perhaps missing from the `mf_priority.so` plugin is that it needs to search for jobs using the DNE bank entry after the bank entries are externally updated, and properly account for those jobs. If there are more active jobs than the newly configured bank allows, then those jobs should perhaps have an exception raised on them. Meanwhile, the jobs should have their `mf_priority:bank_info` pointer updated to the correct bank.
Once this is done, it should be an invariant that no job in any state beyond PRIORITY has a "DNE" bank entry.
Ok, I think @cmoussa1 and I got to the bottom of at least one issue here.
Currently, the count of running jobs for a bank is updated in the `mf_priority` `priority_cb()` function. This function is called for both the `job.state.priority` and `job.priority.get` callbacks, so it can actually be called multiple times per job. In fact, when the accounting data is updated in `mf_priority.so` via `flux account-priority-update`, all jobs are reprioritized, so every pending job will end up incrementing `cur_run_jobs` each time this command is used (once per hour, I think, normally?).
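To make the over-count concrete, here is a minimal standalone model of the bug (the struct and function names are hypothetical, not the plugin's actual symbols): a counter incremented in a priority callback grows once per reprioritization pass per pending job, not once per run.

```cpp
#include <cassert>

// Hypothetical stand-in for the plugin's per-bank running-job counter.
struct BankCounts {
    int cur_run_jobs = 0;
};

// Buggy behavior: counts a "run" on every priority callback invocation.
void priority_cb_buggy (BankCounts &b)
{
    b.cur_run_jobs++;
}

// Simulate N pending jobs getting their initial priority callback,
// followed by some number of full reprioritization passes.
int count_after_reprioritize (int pending_jobs, int reprioritize_passes)
{
    BankCounts b;
    for (int i = 0; i < pending_jobs; i++)
        priority_cb_buggy (b);
    for (int pass = 0; pass < reprioritize_passes; pass++)
        for (int i = 0; i < pending_jobs; i++)
            priority_cb_buggy (b);
    return b.cur_run_jobs;
}
```

With 4 pending jobs, one `flux account-priority-update`-style pass doubles the count, matching the jump observed below.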
To reproduce, I ran a job that used all the resources on fluke, then submitted 4 more jobs. A query of my bank shows:
# flux jobtap query mf_priority.so | jq -S '.mf_priority_map[] | select(.userid == 6885) | .banks[1]'
{
  "active": 1,
  "bank": "lc",
  "cur_active_jobs": 5,
  "cur_run_jobs": 5,
  "fairshare": 0.25,
  "held_jobs": [],
  "max_active_jobs": 4000,
  "max_run_jobs": 100
}
Note that `cur_run_jobs` is 5, showing that indeed, this counter is being incremented for pending jobs instead of only running jobs.
I then ran `flux account-priority-update` to force a reprioritization of jobs, and queried `mf_priority.so` again:
# flux jobtap query mf_priority.so | jq -S '.mf_priority_map[] | select(.userid == 6885) | .banks[1]'
{
  "active": 1,
  "bank": "lc",
  "cur_active_jobs": 5,
  "cur_run_jobs": 9,
  "fairshare": 0.25,
  "held_jobs": [],
  "max_active_jobs": 4000,
  "max_run_jobs": 100
}
Note that `cur_run_jobs` has been incremented by 4, the number of pending jobs, and is now greater than `cur_active_jobs`, which should never happen.
The fix will be to move the increment of `cur_run_jobs` to the proper callback(s).
Thanks for helping me narrow this down @grondo. As I see it, and as we've discussed, there are a couple of changes to the plugin that might fix some of these issues:
- The plugin should not increment `cur_run_jobs` in `job.state.priority`, because the job will not necessarily enter RUN state after it receives a priority. Jobs enter SCHED state after receiving a priority and could stay there waiting for the requested resources before actually running. So, a callback should be added to the plugin for `job.state.run` that increments `cur_run_jobs` there instead of in `job.state.priority`.
- The `inactive_cb()` should check that a job transitioning to INACTIVE state actually ran, with:

if (flux_jobtap_job_event_posted (p, FLUX_JOBTAP_CURRENT_JOB, "alloc"))

If true, then it is known that the job ran, and so both `cur_active_jobs` and `cur_run_jobs` should be decremented. If false, then only `cur_active_jobs` should be decremented, and the check to see if a currently held job should be released can be skipped.
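The proposed inactive-state accounting can be sketched as follows. This is a standalone model: the `flux_jobtap_job_event_posted ()` check on the "alloc" event is replaced by a plain boolean, and `BankCounts` is a hypothetical stand-in for the plugin's per-bank bookkeeping.

```cpp
#include <cassert>

// Hypothetical per-bank counters (illustrative names only).
struct BankCounts {
    int cur_run_jobs = 0;
    int cur_active_jobs = 0;
};

// Proposed logic: returns true if a held job may now be considered
// for release (i.e. the inactive job had actually been running).
bool on_job_inactive (BankCounts &b, bool job_ran)
{
    b.cur_active_jobs--;
    if (!job_ran)
        return false;   // job never ran: leave cur_run_jobs alone
    b.cur_run_jobs--;
    return true;        // a run slot opened up
}

// Scenario helper: start from (run0, active0), process one inactive
// event, and report the resulting cur_run_jobs.
int run_jobs_after_inactive (int run0, int active0, bool job_ran)
{
    BankCounts b;
    b.cur_run_jobs = run0;
    b.cur_active_jobs = active0;
    on_job_inactive (b, job_ran);
    return b.cur_run_jobs;
}
```

The key property is that a job which never posted "alloc" leaves `cur_run_jobs` untouched.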
A reproducer of the issue should also be added to the testsuite, consisting of submitting a job that consumes all resources, submitting more jobs that will be held in SCHED state, and ensuring that `cur_run_jobs` and `cur_active_jobs` are correct.
A reproducer of the issue should also be added to the testsuite, consisting of submitting a job that consumes all resources, submitting more jobs that will be held in SCHED state, and ensuring that cur_run_jobs and cur_active_jobs are correct.
Also:
- run `flux account-priority-update` and ensure `cur_run_jobs` and `cur_active_jobs` do not change
- change job priorities with `flux job urgency ID value` and ensure `cur_run_jobs` and `cur_active_jobs` do not change
- cancel at least one job that did not run and ensure the active jobs count is correct
- allow all jobs to complete and ensure `cur_run_jobs` and `cur_active_jobs` are 0
Just to follow up on this, @garlick brought up a great point in our weekly meeting today: perhaps the cause of this issue was some mishandling of active jobs in the PENDING state after a broker restart caused by a segfault on Fluke. When the system instance started up again, perhaps the priority plugin miscounted the already active jobs and incorrectly added dependencies on subsequently submitted jobs.
FWIW, the priority plugin currently has sharness tests for the scenario where a plugin is loaded "late" (i.e. after jobs are submitted), but not for the case where jobs are pending/active and then the instance is restarted. Is that possible to recreate in a sharness test? If so, I'd be curious to see if I can actually reproduce the issue reported above...
@garlick: let me know if my above description didn't actually reflect what you mentioned this morning and there are things to change or add.
Looks like some more jobs got stuck in a DEPEND state over the weekend:
[day36@fluke108:~]$ flux jobs -A -o "{id.f58:>12} {username:<8.8} {status_abbrev:>2.2} {ntasks:>6} {nnodes:>6h} {t_submit!D} {t_depend!D}"
JOBID USER ST NTASKS NNODES T_SUBMIT T_DEPEND
ƒ6Mmwz6sB6w testqc D 2 1 2022-08-07T18:00:13 2022-08-07T18:00:13
ƒ6Mmx5PkcSK testqc D 2 1 2022-08-07T18:00:13 2022-08-07T18:00:13
ƒ6MmxAge3mh testqc D 2 1 2022-08-07T18:00:13 2022-08-07T18:00:13
ƒ6MmxFvZWYP testqc D 2 1 2022-08-07T18:00:13 2022-08-07T18:00:13
ƒCfSiqWk9W7 testqd D 2 1 2022-09-09T18:00:15 2022-09-09T18:00:15
ƒCfSiwBsQ4f testqd D 2 1 2022-09-09T18:00:15 2022-09-09T18:00:15
ƒCfSj2uxdBu testqd D 2 1 2022-09-09T18:00:16 2022-09-09T18:00:16
ƒCfSj8fXqbV testqd D 2 1 2022-09-09T18:00:16 2022-09-09T18:00:16
ƒCfSjE7KCcw testqd D 2 1 2022-09-09T18:00:16 2022-09-09T18:00:16
ƒCfSjKf2Wmm testqd D 2 1 2022-09-09T18:00:16 2022-09-09T18:00:16
ƒCfSjR6osoD testqd D 2 1 2022-09-09T18:00:16 2022-09-09T18:00:16
ƒCfSjWd3Cfh testqd D 2 1 2022-09-09T18:00:17 2022-09-09T18:00:17
ƒCfSjc6JYyV testqd D 2 1 2022-09-09T18:00:17 2022-09-09T18:00:17
ƒCfSjhcXsqy testqd D 2 1 2022-09-09T18:00:17 2022-09-09T18:00:17
ƒCrmCZaTsmZ testqc D 2 1 2022-09-10T18:00:12 2022-09-10T18:00:12
ƒCrmCf5DDMh testqc D 2 1 2022-09-10T18:00:12 2022-09-10T18:00:12
ƒCrmCkWzaP9 testqc D 2 1 2022-09-10T18:00:12 2022-09-10T18:00:12
ƒCrmCqzFvgw testqc D 2 1 2022-09-10T18:00:12 2022-09-10T18:00:12
ƒCrmCwTXGzj testqc D 2 1 2022-09-10T18:00:13 2022-09-10T18:00:13
ƒCrmD2xGcas testqc D 2 1 2022-09-10T18:00:13 2022-09-10T18:00:13
ƒCrmD8Q3ycK testqc D 2 1 2022-09-10T18:00:13 2022-09-10T18:00:13
ƒCrmDDqqLdm testqc D 2 1 2022-09-10T18:00:13 2022-09-10T18:00:13
[day36@fluke108:~]$
To cap off some of the discussion from today's coffee call: the internal fetch of the plugin state revealed some information, but probably not all that's required to track down this issue. It showed that both `testqc` and `testqd` have `DNE` entries, which means that for some reason the plugin did not have the account information for these users at the time of job submission, so it assigned `BANK_INFO_MISSING` information for them. What is odd, though, is that entries exist for both of these users in the flux-accounting database, as revealed by `flux account view-user` and `flux account-shares`. Some more digging into the plugin is required.
@grondo made a good suggestion of adding a debug mode for the plugin where it essentially just records everything it can about submitted jobs, limit assignment, receiving updates, and more, with the capability to turn it on or off.
@grondo: could you let me know what adding this support to the plugin would look like? Sorry that I am unfamiliar with it. I'm not sure if it just means adding a bunch of `print` statements throughout the plugin. Is there an example somewhere in flux-core or flux-sched that does this?
EDIT: Just realized that I think the issue is that the `testqc` jobs are in fact being submitted under the DNE bank, after looking at the plugin's internal info:
{
  "bank": "DNE",
  "fairshare": 0.1,
  "max_run_jobs": -9,
  "cur_run_jobs": 0,
  "max_active_jobs": 0,
  "cur_active_jobs": 11,   <--
  "held_jobs": [],
  "active": 0
},
which aligns with the 11 `testqc` jobs that are in DEPEND state above.
OK! That sounds like a good plan. I'll try to make some progress on this. Maybe in the process I'll actually be able to figure out how to reproduce this issue in the first place too.
Just want to note a couple of things after an hour or two of digging at this:
I started looking more closely at the flow of submitted jobs when an association (i.e. user/bank) does not have an entry in the plugin's internal map. In this case, the plugin is supposed to add a "DNE" entry to the internal map for the user that submitted the job, with some arbitrary info regarding job count limits. This information is set as follows:
user gets an entry in internal map under the bank "DNE"
user's default bank gets set to "DNE"
fairshare = 0.1
max_run_jobs = BANK_INFO_MISSING <-- set to -9
cur_run_jobs = 0
max_active_jobs = 0
cur_active_jobs = 0
active = 1
held_jobs = list of long int job ID's
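The placeholder entry described above can be sketched as a struct. This is an illustrative stand-in, not the plugin's actual source; `BANK_INFO_MISSING` and the field names mirror the description in this comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sentinel meaning "no real bank data yet", per the description above.
constexpr int BANK_INFO_MISSING = -9;

// Illustrative per-bank record (field names follow the comment above).
struct bank_info {
    std::string bank;
    double fairshare;
    int max_run_jobs;
    int cur_run_jobs;
    int max_active_jobs;
    int cur_active_jobs;
    std::vector<long int> held_jobs;
    int active;
};

// Build the placeholder "DNE" entry with the values listed above.
bank_info make_dne_entry ()
{
    return bank_info{"DNE", 0.1, BANK_INFO_MISSING, 0, 0, 0, {}, 1};
}
```

Note the combination that matters later: `max_run_jobs` is the sentinel -9 and `max_active_jobs` is 0, which disables both limit checks.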
When this first job transitions to `job.state.priority`, the callback checks the value of `max_run_jobs`, and if it sees the `BANK_INFO_MISSING` value (-9), it returns `flux_jobtap_priority_unavail ()` and the job is held in PRIORITY state. This is where I found a couple of issues with the plugin, though I think they are only somewhat related to this issue:
If the user continues to submit jobs without having valid user/bank information loaded in the plugin, it will continue to use the "DNE" information it already has from when the first job was submitted. But because the `max_active_jobs` limit for the "DNE" entry is set to 0, the `max_active_jobs` limit is never triggered, because of the following check in the `job.validate` callback:
// if a user/bank has reached their max_active_jobs limit, subsequently
// submitted jobs will be rejected
if (max_active_jobs > 0 && cur_active_jobs >= max_active_jobs)
    return flux_jobtap_reject_job (p, args, "user has max active jobs");
So, above, this check is never satisfied because of the first half of the if-condition. I think this means that as long as a user does not have valid user/bank information loaded in the plugin, they can submit as many jobs as they want. 🤦♂️ The same thing goes for the check in `job.state.depend`:
// if user has already hit their max running jobs count, add a job
// dependency to hold job until an already running job has finished
if ((b->max_run_jobs > 0) && (b->cur_run_jobs == b->max_run_jobs)) {
    if (flux_jobtap_dependency_add (p,
                                    id,
                                    "max-running-jobs-user-limit") < 0) {
        flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                     "mf_priority", 0, "failed to add " \
                                     "job dependency");
        return -1;
    }
    b->held_jobs.push_back (id);
}
So the values for the "DNE" entry will need to be reworked. I'm not sure what would be a good value for a max active jobs limit though. Should the user be able to submit up to, say, 10 jobs or something? Or just 1 job?
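One way to see why the "DNE" values disable the check: pull the `job.validate` condition out as a standalone predicate and give the "DNE" entry a small positive default limit. The constant name and the value 3 here are assumptions for illustration, not a decided design.

```cpp
#include <cassert>

// Hypothetical small default limit for the "DNE" entry (assumption).
constexpr int DNE_DEFAULT_MAX_ACTIVE_JOBS = 3;

// Same shape as the check in the job.validate callback quoted above.
bool should_reject (int cur_active_jobs, int max_active_jobs)
{
    return max_active_jobs > 0 && cur_active_jobs >= max_active_jobs;
}
```

With a limit of 0 the predicate can never fire, no matter how many active jobs pile up; any positive default restores the guard.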
What I am still trying to figure out is why the jobs stuck in DEPEND state show up under the "DNE" entry in the plugin's internal data returned by `flux jobtap query`, since, per the check just above, I don't think a dependency should be added at all to the submitted jobs:
{
  "userid": 61490,
  "banks": [
    {
      "bank": "DNE",
      "fairshare": 0.10000000000000001,
      "max_run_jobs": -9,
      "cur_run_jobs": 0,
      "max_active_jobs": 0,
      "cur_active_jobs": 11,
      "held_jobs": [],
      "active": 0
    },
    {
      "bank": "dev",
      "fairshare": 0.98742099999999999,
      "max_run_jobs": 100,
      "cur_run_jobs": 0,
      "max_active_jobs": 1000,
      "cur_active_jobs": 0,
      "held_jobs": [],
      "active": 1
    }
  ]
},
However, I think what is perhaps missing from the `mf_priority.so` plugin is that the plugin needs to search for jobs using the DNE bank entry after the bank entries are externally updated, and properly account for those jobs. If there are more active jobs than the newly configured bank allows, then those jobs should perhaps have an exception raised on them. Meanwhile, the jobs should have their `mf_priority:bank_info` pointer updated to the correct bank.
Ah, okay. I think I see what you are saying. Currently, when `flux_jobtap_reprioritize_all ()` is called and every job re-enters `job.state.priority`, each job is checked to see if it is associated with a "DNE" entry. If it is, the plugin performs a lookup in the internal map to see if it can find updated information on the user and any bank information. If it still does not have the information it needs, it continues to return `flux_jobtap_priority_unavail ()` until the next `flux_jobtap_reprioritize_all ()`. If it does find information, it will assign the user's correct bank information to the job using `flux_jobtap_aux_set ()`. Let me know if this is what you were thinking in your comment or if you were thinking of something in addition (sorry in advance if I misunderstood!).
What the plugin does not do currently is look at the job counts for the newly configured bank information after a `flux account-priority-update`, so that would be some functionality to add in `priority_cb ()`, along with raising a job exception if those limits are hit (as you mentioned above), since a dependency cannot be added on the job in that state.
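The reconciliation step being discussed could look roughly like this model: re-home each job that was admitted under "DNE" onto the real bank's counts, and report how many exceed the newly configured `max_active_jobs` (those would get an exception raised). All names here are illustrative; the plugin would use `flux_jobtap_aux_set ()` and `flux_jobtap_raise_exception ()` rather than a return value.

```cpp
#include <cassert>

// Hypothetical stand-in for a configured bank's limit and count.
struct Bank {
    int max_active_jobs;
    int cur_active_jobs;
};

// Absorb dne_job_count jobs from the "DNE" entry; return how many
// exceed the bank's limit and would need an exception raised.
int reconcile_dne_jobs (Bank &real_bank, int dne_job_count)
{
    int excess = 0;
    for (int i = 0; i < dne_job_count; i++) {
        if (real_bank.cur_active_jobs >= real_bank.max_active_jobs)
            excess++;                        // over the new limit
        else
            real_bank.cur_active_jobs++;     // counted on real bank
    }
    return excess;
}

// Scenario helper: bank with limit `max` and `cur` jobs already
// active, absorbing `dne` jobs from the DNE entry.
int excess_for (int max, int cur, int dne)
{
    Bank b{max, cur};
    return reconcile_dne_jobs (b, dne);
}
```

In the scenario above (11 DNE jobs, a real bank limit of 1000), no exceptions would be needed; a small limit would flag the overflow.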
After some more digging in the dependencies part of the plugin, I think I've stumbled across what might be a cause of the issue:
The root problem listed at the top of this thread was that a number of jobs were stuck in DEPEND state due to a `max-running-jobs-user-limit`: a limit that determines how many running jobs a user can have at any given time. Jobs submitted after this limit is hit have a dependency (`max-running-jobs-user-limit`) placed on them until a currently running job transitions to `job.state.inactive`. There, both `currently_running_jobs` and `currently_active_jobs` are decremented by 1, and a check is performed on the user/bank combo to see if 1) there are any held jobs, and 2) there is room for another running job (i.e. `currently_running_jobs < max_running_jobs`).
Where there is potentially incorrect functionality is when a job held in DEPEND state gets cancelled and transitions to `job.state.inactive` before running. Currently, this still decrements `currently_running_jobs` and `currently_active_jobs`, which is wrong because the job was never in RUN state. This ultimately results in an incorrect running jobs count, and could be a reason why a number of jobs got stuck in DEPEND. In this case, the plugin should instead just decrement the user/bank's `current_active_jobs` count, remove the job from the list of `held_jobs`, and return.
Another potentially incorrect scenario arises when a job held in DEPEND transitions to `job.state.inactive` but is not the first held job in the user/bank combo's list of held jobs. This can result in a held job being released when there are already the maximum number of running jobs. In this case, too, the plugin should instead just decrement the user/bank's `current_active_jobs` count, remove the job from the list of `held_jobs`, and then return.
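The corrected handling for both scenarios can be sketched as a small model (hypothetical names, not the plugin's actual symbols): decrement only the active count, remove the specific job from `held_jobs`, and leave the running count and the other held jobs alone.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-user/bank bookkeeping.
struct UserBank {
    int cur_run_jobs = 0;
    int cur_active_jobs = 0;
    std::vector<long int> held_jobs;
};

// A held (DEPEND) job became inactive without ever running.
void on_held_job_canceled (UserBank &b, long int id)
{
    b.cur_active_jobs--;
    b.held_jobs.erase (
        std::remove (b.held_jobs.begin (), b.held_jobs.end (), id),
        b.held_jobs.end ());
    // cur_run_jobs is deliberately untouched: the job never ran,
    // and no other held job should be released here.
}

// Scenario helper: two running jobs, two held jobs; cancel one held
// job and report how many held jobs remain.
int held_after_cancel ()
{
    UserBank b;
    b.cur_run_jobs = 2;
    b.cur_active_jobs = 4;
    b.held_jobs = {101, 102};
    on_held_job_canceled (b, 101);
    assert (b.cur_run_jobs == 2);      // unchanged
    assert (b.cur_active_jobs == 3);   // only active count drops
    return (int) b.held_jobs.size ();
}
```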
During our coffee call on Friday I mentioned that I believed I found a potential reproducer to the issue described at the top of this thread. I'll write it here for some thoughts on how the current plugin might need to add or change functionality to support jobs in DEPEND state after a restart:
In my interactive Docker container (and inside a Flux instance), I created a flux-accounting database, added just one user with a low `max_running_jobs` limit but a high `max_active_jobs` limit, and sent it to a loaded multi-factor priority plugin. I submitted the max number of jobs to the queue, which, when listed with `flux jobs -a`, showed a couple of running jobs and then a number of jobs in DEPEND (for more clarity, I added a `print` statement to notify me when a job had a dependency added to it because of the `max_running_jobs` limit).
While the jobs were still in the queue, I shut down the Flux instance with `flux shutdown`. The jobs that were running transitioned to inactive, and the jobs in DEPEND remained in DEPEND. However, the inactive jobs removed dependencies for the held jobs in the queue. These jobs did not transition to RUN, though.
I then started up a Flux instance again, specifying a path to the rundir from the last instance, and loaded the multi-factor priority plugin. Once the plugin was loaded, the job-manager output a message saying something like `BUG: job id didn't return priority`; I think these were the jobs that had their dependencies removed by the running jobs from the last instance.
I then sent the data from the flux-accounting DB to the plugin (which also calls `reprioritize_all ()` on all of the current jobs). 2 jobs ran successfully, and the rest of the jobs in DEPEND state remained with a `max-running-jobs-user-limit` dependency indefinitely.
I'll try to work on a sharness test that could potentially recreate this scenario, but I just wanted to post this to maybe get some thoughts on where the plugin is mishandling (or not handling at all) a restarted instance with jobs in DEPEND state.
Unsure if this is helpful or not, but a set of jobs was submitted Mar 09 at 18:00 and several hit this issue. Since Flux has not been restarted in 2 days, I don't think a restart fully explains this issue:
[root@fluke108:~]# flux jobs -Ao long
JOBID QUEUE USER NAME STATUS NTASKS NNODES T_SUBMIT T_REMAINING TIME INFO
ƒno37shjxwy batch testqb ./buildan+ DEPEND 16 16 Mar09 18:00 - 3.1h depends:max-running-jobs-user-limit
ƒno37yGwGP9 batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38512VWP batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38Akbhuy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
ƒno38GQEyCB batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
ƒno38MySGdM batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38TpwSAK batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38ZjQaFy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
ƒno38ffMhdy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
ƒno38mTttcF batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38sHv4rs batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno38y7wF7V batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
ƒno394tzSoR batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
ƒno39Ag3eVM batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno39GQ8scb batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
ƒno39MyLB3m batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
ƒno39TSbXMZ batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
[root@fluke108:~]# flux uptime
12:32:31 run 2d, owner flux, depth 0, size 101, 5 drained, 6 offline
Here's the output of `flux jobtap query mf_priority.so` for user `testqb`:
{
  "userid": 61489,
  "banks": [
    {
      "bank": "guests",
      "fairshare": 0.012500000000000001,
      "max_run_jobs": 100,
      "cur_run_jobs": 111,
      "max_active_jobs": 1000,
      "cur_active_jobs": 17,
      "held_jobs": [
        340140222702422016,
        340140226359855104,
        340140230117951488,
        340140233892825088,
        340140237600589824,
        340140241258022912,
        340140245100005376,
        340140248975542272,
        340140252867856384,
        340140256676284416,
        340140260501489664,
        340140264326694912,
        340140268118345728,
        340140271909996544,
        340140275668092928,
        340140279325526016,
        340140282915850240
      ],
      "active": 1
    }
  ]
}
Is there an explanation why `cur_run_jobs` is 111 when `cur_active_jobs` is 17?
The information returned by `flux jobtap query` makes me think that for some reason the `cur_run_jobs` count is not getting decremented when jobs from `testqb` complete. It's just a guess at this point, but what comes to mind off the top of my head is that the `cur_run_jobs` count for `testqb` is not getting decremented when the instance is shut down and currently running jobs are cleaned up and transition to inactive. I'm also not sure why it is greater than `max_run_jobs` (100)?
Does the `mf_priority` plugin carry any state over a shutdown? When an instance is shut down, any jobtap plugins are unloaded (I suppose that is obvious), then reloaded when the instance starts again. Since Flux does not support recovery of running jobs at this time, it is guaranteed that there were no running jobs when the plugin was loaded.
What is also interesting is that 1 of the jobs submitted by user `testqb` on Mar 09 did get scheduled, and it appears all the jobs from the previous day also ran:
$ flux jobs -a --user=testqb --since=-2.1d -o long
JOBID QUEUE USER NAME STATUS NTASKS NNODES T_SUBMIT T_REMAINING TIME INFO
fno37shjxwy batch testqb ./buildan+ DEPEND 16 16 Mar09 18:00 - 3.1h depends:max-running-jobs-user-limit
fno37yGwGP9 batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38512VWP batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38Akbhuy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
fno38GQEyCB batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
fno38MySGdM batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38TpwSAK batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38ZjQaFy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
fno38ffMhdy batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
fno38mTttcF batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38sHv4rs batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno38y7wF7V batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
fno394tzSoR batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
fno39Ag3eVM batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno39GQ8scb batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 6.1h depends:max-running-jobs-user-limit
fno39MyLB3m batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 2.1h depends:max-running-jobs-user-limit
fno39TSbXMZ batch testqb ./buildan+ DEPEND 1 1 Mar09 18:00 - 1.1h depends:max-running-jobs-user-limit
fno37myejpj batch testqb ./buildan+ COMPLETED 54 54 Mar09 18:00 - 1.785h fluke[6-16,18-23,25-60,62]
fnbichWxzFh batch testqb ./buildan+ COMPLETED 54 54 Mar08 18:00 - 1.784h fluke[6-16,18-23,25-60,62]
fnbicoC6EpF batch testqb ./buildan+ COMPLETED 16 16 Mar08 18:00 - 46.12s fluke[63-65,67-71,73-78,80-81]
fnbieLQB2Kh batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 43.97s fluke87
fnbieEyseab batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 43.95s fluke83
fnbie9Y6HZ9 batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 1.026m fluke87
fnbie46JvXh batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 44.15s fluke83
fnbidxbZawZ batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 14.55s fluke87
fnbids5LG55 batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 44.24s fluke83
fnbidmW8xdu batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 43.8s fluke87
fnbidfuTfvP batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 43.93s fluke83
fnbidaELRMq batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 45.27s fluke87
fnbidUFRKR9 batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 15.41s fluke83
fnbidNbn48w batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 44.72s fluke87
fnbidH16mRR batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 46.6s fluke83
fnbid5ug97m batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 44.74s fluke87
fnbidBTPTGb batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 32.94s fluke83
fnbiczSQnoy batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 45.55s fluke83
fnbictmHYFR batch testqb ./buildan+ COMPLETED 1 1 Mar08 18:00 - 45.05s fluke87
Maybe there were some jobs still active after the recent Flux restart, but only a few (far fewer than 100) should have been able to run since the restart, which makes the `cur_run_jobs` value for this user suspect.
I've checked in on fluke a couple of times now since #325 has landed and haven't seen that set of jobs get stuck in DEPEND state since. I think I might close this for now, and if it ever pops back up again, we can always re-open this.