
[HTCondor-users] not reproducible run

Dear Condor users,

I am a new user of HTCondor, and after a lot of tutorials I still cannot understand the problem I am currently facing.

I am using the following command :

condor_submit jobs_desc_test_condor.cfg

for this condor version :
$CondorVersion: 8.6.6 Sep 12 2017 BuildID: 416237 $
$CondorPlatform: x86_64_RedHat6 $

The config file is very simple (the default Universe is Vanilla, from what I understand) :

Executable = $(Chunk)/./batchScript.sh
Log        = $(Chunk)/condor_job_$(ProcId).log
Output     = $(Chunk)/condor_job_$(ProcId).out
Error      = $(Chunk)/condor_job_$(ProcId).error
queue Chunk matching dirs test_condor/*_Chunk*

My Python work environment builds the necessary directories test_condor/*_Chunk*, and a batchScript.sh is placed in each of these directories.
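A quick sanity check of that layout could look like the sketch below. It is only an illustration under stated assumptions: the function name check_chunk_dirs is hypothetical, and the directory pattern and script name are simply taken from the submit file above.

```python
import glob
import os


def check_chunk_dirs(pattern="test_condor/*_Chunk*", script="batchScript.sh"):
    """Return a list of (directory, problem) pairs; empty if all look fine.

    Hypothetical helper: it only checks that every directory matching the
    pattern holds the script and that the script is executable.
    """
    problems = []
    for d in sorted(glob.glob(pattern)):
        if not os.path.isdir(d):
            continue  # ignore plain files that happen to match the pattern
        path = os.path.join(d, script)
        if not os.path.isfile(path):
            problems.append((d, "missing " + script))
        elif not os.access(path, os.X_OK):
            problems.append((d, script + " is not executable"))
    return problems


if __name__ == "__main__":
    for directory, problem in check_chunk_dirs():
        print(directory, "->", problem)
```

Run from the submit directory before condor_submit, it prints nothing when every Chunk directory is complete.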

This batchScript.sh mainly builds a list of input files to be read by an executable that generates some output log files; it also does the proper setup and retrieves the output files.

I am confident that the executable is working fine interactively and on the batch system (I have even tried to run the remote command locally and it runs nicely).

This executable can have a lot of input files, which is why I split the job into Chunks to speed up the process. For my test I use 10 Chunks.

What I am seeing is that if I run a batch job with the command :

condor_submit jobs_desc_tttt_condor.cfg

I never get all 10 Chunks (sub-jobs) succeeding. And if I rerun this exact command, a different set of Chunks succeeds ...

Every Chunk can succeed, but never all of them in the same run. Here is the list of succeeded Chunks for each test.

test -> succeeded Chunks :

1 -> 0, 2, 3, 7

2 -> 1, 6, 7

3 -> 0, 1, 3, 4, 5, 6, 8, 9

4 -> 1, 4, 8

5 -> 3, 5

6 -> 1, 2, 4, 9

7 -> 8

So I can see that each Chunk has the possibility to succeed !! So I conclude that my executable and the input files are safe.
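The claim that every Chunk succeeded at least once can be checked mechanically from the list above. The following is a small sketch; the names succeeded, N_CHUNKS and success_counts are made up for the illustration, and the data is copied from the seven tests.

```python
# Succeeded Chunks per test, copied from the list above.
succeeded = {
    1: [0, 2, 3, 7],
    2: [1, 6, 7],
    3: [0, 1, 3, 4, 5, 6, 8, 9],
    4: [1, 4, 8],
    5: [3, 5],
    6: [1, 2, 4, 9],
    7: [8],
}
N_CHUNKS = 10  # the test uses Chunks 0..9


def success_counts(results, n_chunks):
    """Count, for each Chunk index, in how many tests it succeeded."""
    counts = [0] * n_chunks
    for chunks in results.values():
        for c in chunks:
            counts[c] += 1
    return counts


counts = success_counts(succeeded, N_CHUNKS)
# Chunks that never succeeded in any test (empty list means none).
never = [c for c, n in enumerate(counts) if n == 0]
print("successes per Chunk:", counts)
print("Chunks that never succeeded:", never)
```

The list of Chunks that never succeeded comes out empty, and only test 3 got as far as 8 Chunks out of 10.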

Now I was wondering whether there is a problem with time or CPU limits, so I have tried to play with :

RequestCpus=4 and/or JobFlavour = "longlunch" or "microcentury" or "espresso", but with none of these combinations do all the Chunks complete successfully.

(I know that each Chunk can run locally in 2 minutes).

And when I use longlunch, the jobs stay idle for a very long time (more than 1 hour).

I cannot believe that HTCondor is unable to run such easy tasks reproducibly. So are there any tips I am missing to get all my Chunks to complete successfully ?

Cheers, David Jamin.