
Re: [HTCondor-users] Help with condor_wait (stuck in Windows using HTCondor 8.4.5 and 8.4.6)



Hello,

We have tested further and found that if we call "condor_wait log.condor" in a loop, once for each job, condor_wait works as expected every time. To do that, we edited the submit file:

from: Log    = ../log.condor
to:   Log    = log.condor
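
For illustration, the per-job wait loop could be sketched in C++ roughly as follows. This is only a sketch under assumptions: it assumes that with "initialdir = condorInd$(Process)" each job's log ends up at condorInd<N>/log.condor, and the function names (perJobLogPath, waitCommand, waitForAllJobs) are hypothetical, not part of HTCondor.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>

// Path of one job's log file. Assumes initialdir = condorInd$(Process)
// and Log = log.condor, so each job writes its own log in its own directory.
std::string perJobLogPath(int process) {
    assert(process >= 0);
    return "condorInd" + std::to_string(process) + "/log.condor";
}

// Command line that waits on a single job's log file.
std::string waitCommand(int process) {
    return "condor_wait " + perJobLogPath(process);
}

// Run condor_wait once per job, in order. Returns how many calls
// came back with a non-zero status from std::system.
int waitForAllJobs(int numJobs) {
    int failures = 0;
    for (int p = 0; p < numJobs; ++p) {
        const std::string cmd = waitCommand(p);
        if (std::system(cmd.c_str()) != 0) {
            std::cerr << "condor_wait reported a problem for: " << cmd << '\n';
            ++failures;
        }
    }
    return failures;
}
```

With "Queue 24" in the submit file, the caller would use waitForAllJobs(24) after condor_submit returns.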

We suspect there may be a bug in condor_wait on Windows when the log file contains events from more than one job.

Could someone tell us the proper way to address this issue?

Thanks,

Marti Coma
CIMNE Aerospace Engineering Group

Date: Tue, 17 May 2016 18:05:52 +0200
From: Martí Coma Company <mcoma@xxxxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Help with condor_wait (stuck in Windows
	using HTCondor 8.4.5 and 8.4.6)
Message-ID: <ef2d4c4d7008e02601b6cd289c58dbcf@xxxxxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hello,

We are new to HTCondor and have been configuring a small pool since the
last course in Barcelona. Our tests submitting jobs manually have been
successful, and now we are trying to interact with HTCondor from an
in-house optimization code written in C++. We have encountered problems
running on Windows 7 (so far, on Linux it works as expected).

The problem is that after the "condor_submit submit.condor" system call
from our code, we call "condor_wait log.condor", and SOMETIMES
condor_wait never returns (we call the submit and wait commands in a
loop). The log.condor shows that all jobs have terminated, condor_q
returns no jobs at all and the results of all calculations are there and
correct. If we kill the condor_wait from the task manager the process
continues without problems, until it gets stuck again in another loop
iteration. We have been waiting for several hours for condor_wait to
return.
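
To make the loop concrete, here is a minimal C++ sketch of one submit-and-wait iteration. It is an illustration and not our actual code: the function names (buildWaitCommand, submitAndWait) are hypothetical. It also shows the documented "-wait seconds" option of condor_wait, which bounds how long the call may block; any timeout value passed in is an arbitrary assumption, and a timeout would only keep the caller from hanging, not explain why condor_wait stalls.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Build the condor_wait command line.
// timeoutSeconds <= 0 means no "-wait" option, i.e. wait indefinitely.
std::string buildWaitCommand(const std::string& logFile, int timeoutSeconds) {
    std::string cmd = "condor_wait ";
    if (timeoutSeconds > 0) {
        cmd += "-wait " + std::to_string(timeoutSeconds) + " ";
    }
    return cmd + logFile;
}

// One loop iteration: submit the jobs, then wait on the shared log.
// Returns true only if both commands report status 0 via std::system.
bool submitAndWait(const std::string& submitFile,
                   const std::string& logFile,
                   int timeoutSeconds) {
    assert(!submitFile.empty() && !logFile.empty());
    const std::string submitCmd = "condor_submit " + submitFile;
    if (std::system(submitCmd.c_str()) != 0) {
        return false;  // submission itself failed
    }
    const std::string waitCmd = buildWaitCommand(logFile, timeoutSeconds);
    return std::system(waitCmd.c_str()) == 0;
}
```

A caller would invoke submitAndWait("submit.condor", "log.condor", 0) once per optimization iteration to reproduce the unbounded wait described above, or pass a positive timeout to get control back if condor_wait stalls.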

We use initialdirs and relative paths in the submit file, so all jobs
are logged in the same file:

################################
#       Condor submit file     #

Universe   = vanilla

Executable = mathcasesexe_windows.exe

Log    = ../log.condor
Output = out.condor
Error  = err.condor

initialdir = condorInd$(Process)

should_transfer_files = YES
transfer_input_files = Eval.DVs, ../prob.dat
when_to_transfer_output = ON_EXIT
transfer_output_files = Eval.individual, Cons.individual

Queue 24
################################

We have tested both HTCondor 8.4.5 and 8.4.6 on the submit node, with
the same issue in each. We have also tried deleting the log.condor file
between loop iterations, but the problem remains.

We found a report of the same problem from a Linux user on version 7.2.5,
which was solved in version 7.4.3 in 2010
(https://lists.cs.wisc.edu/archive/htcondor-users/2010-June/msg00221.shtml).
Could this be a bug in condor_wait for Windows?

How can we solve this problem? Any help would be much appreciated.

Thanks in advance,

Martí Coma
CIMNE Aerospace Engineering Group
