
Re: [Condor-users] condor_wait running forever



Thank you very much.

On Mon, Jun 28, 2010 at 13:43, Cathrin Weiss <cweiss@xxxxxxxxxxx> wrote:
Alan,

thanks for your report. This is a known problem in condor_wait and has
received a fix that will be included in Condor version 7.4.3.
Unfortunately this bug is present in all previous versions.

Thanks,
Cathrin
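
Since the fix is slated for 7.4.3, a script that depends on condor_wait
could check the installed version before trusting it. The following is a
minimal sketch in Python, not part of Condor: the function names and the
FIXED_IN constant are hypothetical, and the input is assumed to be the
"$CondorVersion: ... $" line that condor_version prints. The plain tuple
comparison treats every version number at or above 7.4.3 as fixed, which
is an assumption.

```python
import re

# First version announced in this thread as containing the condor_wait fix.
FIXED_IN = (7, 4, 3)


def parse_condor_version(banner):
    """Extract (major, minor, patch) from condor_version output."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no $CondorVersion line found")
    return tuple(int(part) for part in match.groups())


def has_condor_wait_fix(banner):
    """Return True if the version number is at or above FIXED_IN (assumption)."""
    return parse_condor_version(banner) >= FIXED_IN
```

The banner text could be captured by running condor_version and passing
its standard output to has_condor_wait_fix.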


Alan wrote:
> Hi there,
>
> We have an application that, on a typical day, calls condor_submit
> hundreds of times. Each call usually submits 20 jobs, and the application
> has to wait for these 20 jobs to finish before proceeding to the next
> step, and so on.
>
> So I use condor_wait to monitor condor.log.
>
> What I have noticed lately, especially since I moved our application to a
> NAS disk (no longer the internal disk), is this: when I check condor.log
> myself, everything looks fine and the 20 jobs seem to have finished, but I
> still see, usually, just one miserable 'condor_wait' in ps or top that can
> run for days. If I simply kill this condor_wait process, my run continues
> nicely, usually until the end, unless another 'condor_wait' hangs again.
>
> This is very difficult for me to debug, starting with the fact that it may
> or may not happen in a given simulation. And when it does happen, it is
> never in the same place as before.
>
> Besides, I discussed this with another user of our Condor pools who has a
> similar problem with condor_wait in his application. He told me he wrote
> his own monitoring script that oversees condor_wait, but his implementation
> works fine only as long as one simulation runs at a time. Mine may have
> several users calling the application for a simulation at the same time,
> each simulation using Condor many times.
>
> In the end, I suspect there is some interaction between condor_wait and the
> NAS disk (the other user I mentioned above doesn't know where his files are
> physically stored; we are checking that), and I would like to know whether
> others here have faced a similar problem with condor_wait.
>
> condor_version
> $CondorVersion: 7.2.5 Dec 16 2009 BuildID: 204104 $
> $CondorPlatform: X86_64-LINUX_DEBIAN50 $
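
For versions without the fix, one possible workaround along the lines of
the monitoring script mentioned in the quoted message is to poll the user
log directly rather than rely on condor_wait. The following is a minimal
sketch in Python, not the other user's script: finished_jobs and
wait_for_jobs are hypothetical names, and it assumes each batch writes to
its own log file and that log events start with a three-digit code followed
by "(cluster.proc.subproc)", with 005 meaning "job terminated" and 009
meaning "job aborted".

```python
import re
import time

# Assumed user-log event codes: 005 = job terminated, 009 = job aborted.
FINAL_EVENT = re.compile(r"^(?:005|009) \((\d+)\.(\d+)\.\d+\)", re.MULTILINE)


def finished_jobs(log_text):
    """Return the set of (cluster, proc) pairs that have a final event."""
    return {(int(m.group(1)), int(m.group(2)))
            for m in FINAL_EVENT.finditer(log_text)}


def wait_for_jobs(log_path, expected, timeout=3600.0, poll=10.0):
    """Poll log_path until `expected` jobs have finished or `timeout` passes.

    Returns True if all jobs finished, False on timeout. The file is
    re-opened and re-read in full on every poll.
    """
    deadline = time.monotonic() + timeout
    while True:
        with open(log_path, "r", errors="replace") as handle:
            done = len(finished_jobs(handle.read()))
        if done >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A call such as wait_for_jobs("condor.log", expected=20) would then stand in
for condor_wait after each batch of 20 jobs. If several simulations share
one log file, the sets returned by finished_jobs would need to be filtered
by cluster number first.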

--
Cathrin Weiss
Condor Project
mail: cweiss@xxxxxxxxxxx


--
Alan Wilter S. da Silva, D.Sc. - CCPN Research Associate
Department of Biochemistry, University of Cambridge.
80 Tennis Court Road, Cambridge CB2 1GA, UK.
>>http://www.bio.cam.ac.uk/~awd28<<