
Re: [Condor-users] Condor claims jobs running forever (never terminate)



On Wed, Mar 07, 2007 at 03:24:10PM +0000, Thorsten Lampe wrote:
> Amendment: The job finally came back this time.
> 
> But out of curiosity: does anyone know if that was a known problem in the Condor 6.6 series? I've experienced it many times; otherwise I would not have mailed the condor-users list immediately.
> 
> Sorry for spamming!
[...]
> 
> We're running a Windows pool using a Windows 2003 Server as Central Manager, another Windows 2003 Server as a dedicated submit node and a bunch of XP boxes for job execution. Condor version is 6.8.4 throughout all nodes.
> 
> Now I have submitted 351 jobs, each of which should take about 50 minutes. 350 of them executed and terminated properly, while the last one has been kind of "stuck" for over two hours now. The execution node is still in the "claimed" state and the job is marked as executing, although all output data has already been transferred back to the submit node and the process is no longer running on the execute node! It seems as if Condor just loves the job and doesn't want to release it :-)


This seems to be a feature of Condor: once an execute node has been claimed,
it will continue to work on jobs by the same user until the lease expires.

Also, CLAIM_WORKLIFE might be the culprit.

Can you extract the config lines in effect (the non-empty ones that don't
start with a hash sign) so we can have a closer look?
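
If it helps, here is a minimal sketch of such a filter in Python. It is not
part of Condor; the function name active_config_lines and the idea of passing
the config file paths on the command line are just assumptions for
illustration. It drops empty lines and lines whose first non-blank character
is a hash sign, and prints everything else unchanged apart from trimmed
whitespace. Backslash-continued settings are not joined, so they simply show
up as several lines.

```python
import sys
from pathlib import Path


def active_config_lines(text):
    """Return the lines of text that are neither empty nor '#' comments."""
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        # Keep only lines with content that do not start with a hash sign.
        if stripped and not stripped.startswith("#"):
            result.append(stripped)
    return result


if __name__ == "__main__":
    # Each command-line argument is taken to be the path of a config file
    # (hypothetical usage; adjust to wherever your config files live).
    for path in sys.argv[1:]:
        text = Path(path).read_text(errors="replace")
        for line in active_config_lines(text):
            print(line)
```

Run it once on each node type (central manager, submit node, one execute
node) with the global and any local config file as arguments, and paste the
output.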

> Does anyone have a clue?

Not really yet...

S

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html