
[HTCondor-users] Help with condor_wait (stuck in Windows using HTCondor 8.4.5 and 8.4.6)



Hello,

We are new to HTCondor and have been setting up a small pool since the last course in Barcelona. Our tests submitting jobs manually have been successful, and we are now trying to drive HTCondor from an in-house optimization code written in C++. We have run into problems on Windows 7 (so far, everything works as expected on Linux).

The problem is this: our code makes a "condor_submit submit.condor" system call followed by a "condor_wait log.condor" call, both inside a loop, and SOMETIMES condor_wait never returns. When that happens, log.condor shows that all jobs have terminated, condor_q returns no jobs at all, and the results of all calculations are present and correct. If we kill condor_wait from the Task Manager, our process continues without problems until it gets stuck again in a later loop iteration. We have waited several hours for condor_wait to return.
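A minimal sketch of the kind of submit/wait loop described above, written so that condor_wait is called with its -wait option (a maximum number of seconds to wait) instead of waiting indefinitely. This is an illustration, not our actual code: the helper names (make_wait_command, run_iteration, run_with_system, CommandRunner), the timeout, and the retry count are all assumptions, and whether the timeout actually fires when condor_wait is stuck on Windows has not been verified.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <string>

// Hypothetical helper: builds a condor_wait command line with a time limit.
// "-wait <seconds>" tells condor_wait to give up after that many seconds.
std::string make_wait_command(const std::string& log_file, int timeout_seconds) {
    assert(timeout_seconds > 0);
    return "condor_wait -wait " + std::to_string(timeout_seconds) + " " + log_file;
}

// A command runner returns 0 on success and nonzero otherwise.
// Passing it in as a parameter lets the loop be exercised without HTCondor.
using CommandRunner = std::function<int(const std::string&)>;

// Hypothetical default runner: hands the command to the system shell.
int run_with_system(const std::string& command) {
    return std::system(command.c_str());
}

// One loop iteration: submit the jobs, then call condor_wait up to
// max_attempts times. Returns true only if condor_wait reported success.
bool run_iteration(const CommandRunner& run,
                   const std::string& submit_file,
                   const std::string& log_file,
                   int timeout_seconds,
                   int max_attempts) {
    if (run("condor_submit " + submit_file) != 0) {
        return false;  // submission failed, nothing to wait for
    }
    const std::string wait_command = make_wait_command(log_file, timeout_seconds);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (run(wait_command) == 0) {
            return true;  // condor_wait saw all jobs in the log finish
        }
        // Nonzero status: the time limit expired or an error occurred; retry.
    }
    return false;
}
```

Inside the optimization loop this could be called as run_iteration(run_with_system, "submit.condor", "log.condor", 600, 6), where 600 seconds and 6 attempts are arbitrary example values. A false return only means condor_wait did not report completion; the caller would still need to check log.condor or the result files itself.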

We use initialdirs and relative paths in the submit file, so all jobs are logged in the same file:

################################
#       Condor submit file     #

Universe   = vanilla

Executable = mathcasesexe_windows.exe

Log    = ../log.condor
Output = out.condor
Error  = err.condor

initialdir = condorInd$(Process)

should_transfer_files = YES
transfer_input_files = Eval.DVs, ../prob.dat
when_to_transfer_output = ON_EXIT
transfer_output_files = Eval.individual, Cons.individual

Queue 24
################################

We have tested both HTCondor 8.4.5 and 8.4.6 on the submit node and see the same issue with each. We have also tried deleting the log.condor file between loop iterations, but the problem remains.

We found a report of the same problem from a Linux user on version 7.2.5, which was solved in version 7.4.3 in 2010 (https://lists.cs.wisc.edu/archive/htcondor-users/2010-June/msg00221.shtml). Could this be a bug in condor_wait for Windows?

How can we solve this problem? Any help would be much appreciated.

Thanks in advance,

Martí Coma
CIMNE Aerospace Engineering Group