
[Condor-users] condor_wait running forever



Hi there,

We have an application that, on a typical day, calls condor_submit hundreds of times. Each call usually submits 20 jobs, and the application has to wait for those 20 jobs to finish before proceeding to the next step, and so on.

So I use condor_wait to monitor condor.log.

What I have noticed lately, especially since I moved our application to a NAS disk (no longer the internal disk), is the following: when I check condor.log myself, everything looks fine and all 20 jobs seem to have finished, but I still see, usually, just one miserable 'condor_wait' in ps or top that can run for days. If I simply kill this condor_wait process, my jobs continue nicely, usually until the end, unless another 'condor_wait' hangs again.
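
To illustrate the manual workaround described above (killing a condor_wait that never returns), here is a minimal Python sketch that puts an upper bound on how long any command may run. The function name wait_with_timeout and the one-hour limit are assumptions made for illustration; this does not explain or fix the hang, it only automates the kill.

```python
import subprocess

def wait_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising this exception.
        return None

# Hypothetical usage: give condor_wait one hour, then fall back to checking
# the log by other means if it did not return on its own.
# rc = wait_with_timeout(["condor_wait", "condor.log"], 3600)
# if rc is None:
#     print("condor_wait timed out and was killed")
```

A return value of None only says that the command was killed, not that the jobs finished, so the caller would still need to confirm completion from the log.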

This is very difficult for me to debug, starting with the fact that it may or may not happen in a given simulation. And when it does happen, it is never in the same place as before.

Besides, I discussed this with another user of our Condor pools who has a similar problem with condor_wait in his application. He told me he wrote his own monitoring script that oversees condor_wait, but his implementation only works well when running one simulation at a time. Mine may have several users calling the application for a simulation at the same time, each simulation using Condor many times.
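
For reference, a monitoring check of this kind could be sketched as below. This is not the other user's script, which I have not seen. It assumes the classic text user-log format, where each event starts with a three-digit code and a job id such as "005 (123.000.000)", with 000 meaning submitted, 005 terminated and 009 aborted. The name unfinished_jobs is made up for the example.

```python
import re

# Assumed event header, e.g. "005 (123.000.000) 01/31 12:00:00 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")

SUBMIT = "000"           # job submitted
DONE = {"005", "009"}    # job terminated, job aborted

def unfinished_jobs(log_path):
    """Return the set of (cluster, proc) ids submitted in log_path that have no terminal event yet."""
    pending = set()
    with open(log_path) as log:
        for line in log:
            m = EVENT_RE.match(line)
            if not m:
                continue  # event body or "..." separator line
            code = m.group(1)
            job = (int(m.group(2)), int(m.group(3)))
            if code == SUBMIT:
                pending.add(job)
            elif code in DONE:
                pending.discard(job)
    return pending
```

An empty result would mean every job recorded in that log has terminated or been aborted. Since the function only reads the one file it is given, using it with several simulations at once would require each simulation to write its own log file, which is an assumption about the setup and not something we do today.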

In the end, I suspect there is some problem between condor_wait and the NAS disk (the other user I mentioned above doesn't know where his files are physically located; we are checking that), and I would like to know whether others here have faced a similar problem with condor_wait.

condor_version 
$CondorVersion: 7.2.5 Dec 16 2009 BuildID: 204104 $
$CondorPlatform: X86_64-LINUX_DEBIAN50 $

Thanks,

Alan

--
Alan Wilter S. da Silva, D.Sc. - CCPN Research Associate
Department of Biochemistry, University of Cambridge.
80 Tennis Court Road, Cambridge CB2 1GA, UK.
>>http://www.bio.cam.ac.uk/~awd28<<