
Re: [HTCondor-users] jobs with undefined JobStatus in limbo



Hi Thomas,  

A reboot of the machine is not likely to help here. What is needed is some surgery on the job_queue.

You might try using condor_qedit to give the jobs a JobStatus attribute, so you can then remove them using condor_rm. 

condor_qedit 10738840.0 JobStatus=5

This command will need to be run by the job's Owner or by one of the QUEUE_SUPER_USERS. Running
the command as root will probably work, depending on your config.
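
Put together, the two steps could look roughly like the sketch below. The function name fix_and_remove and the DRY_RUN variable are made up for illustration and are not part of HTCondor. With DRY_RUN=1 (the default here) the function only prints the commands, so they can be reviewed before anything touches the queue. JobStatus=5 corresponds to Held.

```shell
#!/bin/sh
# Sketch only: give a job a JobStatus, then remove it.
# fix_and_remove and DRY_RUN are hypothetical helpers, not HTCondor features.
fix_and_remove() {
    job_id="$1"
    for cmd in "condor_qedit $job_id JobStatus=5" "condor_rm $job_id"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed.
            echo "$cmd"
        else
            # Real run: stop at the first failing command so condor_rm
            # is not attempted if condor_qedit did not succeed.
            $cmd || return 1
        fi
    done
}
```

Calling fix_and_remove 10738840.0 prints the two commands. Setting DRY_RUN=0 would run them for real, which assumes you are the job's Owner or one of the QUEUE_SUPER_USERS.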

Hope this helps.
-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Thomas Hartmann
Sent: Wednesday, July 22, 2020 8:00 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] jobs with undefined JobStatus in limbo

Hi all,

we have a number of jobs where the JobStatus is undefined, as in [1].

These jobs were apparently submitted during a short window in which we had
deployed a broken config (nothing 'major', merely a broken bracket). The
jobs might be late materialization jobs, but not necessarily AFAIS.

The thing is that we cannot remove or release them from their undefined
limbo, as the schedd(?) seems not to know about them at that point. In the
logs, only a job transform for the schedd mentions the job ID [2].
A restart of the daemons has not affected these jobs.

Next step would be a reboot of the machine, but maybe somebody has an
idea how to get rid of these jobs?

Cheers,
  Thomas

[1]
> condor_q -l  10738840.0
...
JobStatus = undefined
LastJobStatus = 1
...

[2]
/var/log/condor/SchedLog:07/22/20 12:14:33 (pid:528138) job_transforms for 10738840.0: 12 considered, 10 applied (T01SysDefaultProject,T02JobDefaults,T03JobValues,T04JobEnhance,T05JobClasses,T07AccountingStatusHold,T08DefaultToOS,T10BirdResource,T11ShellEnvironment,T12JobHistory)