
[HTCondor-users] [8.6.x] sometimes runs twice the same job ?



Hello,

Since 8.6.0 (and still with 8.6.1, which we just tested), some jobs are run twice:

(some grep output on a given cluster node)

startd_history:Arguments = "sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '3"
startd_history:Arguments = "sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '17"
startd_history:Arguments = "sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '29"
startd_history:Arguments = "sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '35"
startd_history:Arguments = "sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '35"
StarterLog.slot1_1:03/03/17 15:43:33 (pid:236572) About to
exec /tmp/condor/execute/dir_236572/condor_exec.exe sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '3
StarterLog.slot1_1:03/03/17 15:43:35 (pid:236581) About to
exec /tmp/condor/execute/dir_236581/condor_exec.exe sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '17
StarterLog.slot1_1:03/03/17 15:43:36 (pid:236590) About to
exec /tmp/condor/execute/dir_236590/condor_exec.exe sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '29
StarterLog.slot1_1:03/03/17 15:43:38 (pid:236599) About to
exec /tmp/condor/execute/dir_236599/condor_exec.exe sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '35
StarterLog.slot1_1:03/03/17 15:54:33 (pid:236753) About to
exec /tmp/condor/execute/dir_236753/condor_exec.exe sh -c '
'/home/applis/anaconda/envs/py3v4.3/bin/python' '-m' 'runlib.condor'
'PYRO:obj_b9aa420a2c7c4522a6737a8ef5bf6bf5@xxxxxxxxxx:5553' 'C' '35

It happened on another node too, with the 36th job of the queue. Most jobs
are not duplicated.
We have checked our submit file and there is no duplicate entry at all.
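
For anyone who wants to look for the same symptom on their own nodes, here
is a minimal sketch of how the duplicates could be counted instead of
reading the grep output by eye. It is only an illustration: the function
name duplicated_arguments is hypothetical, and it assumes that each
Arguments attribute sits on a single "Arguments = ..." line in the
startd_history file (the wrapping above comes from the mail client). Note
that it would also flag identical arguments coming from separate
submissions, so a hit is a hint, not a proof.

```python
from collections import Counter


def duplicated_arguments(lines):
    """Return {arguments_value: count} for Arguments values seen more than once.

    `lines` is any iterable of text lines, e.g. an open startd_history file.
    Only lines that start with "Arguments = " are considered.
    """
    counts = Counter(
        # Keep everything after the first "=", without surrounding whitespace.
        line.split("=", 1)[1].strip()
        for line in lines
        if line.startswith("Arguments = ")
    )
    return {args: n for args, n in counts.items() if n > 1}


# Possible usage (the path is an assumption, adjust to the local setup):
#
#     with open("startd_history") as history:
#         for args, n in duplicated_arguments(history).items():
#             print(n, args)
```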

Any idea where to start digging into this problem?

Thanks,
-- 
Laurent Wandrebeck
HYGEOS, Earth Observation Department / Observation de la Terre
Euratechnologies
165 Avenue de Bretagne
59000 Lille, France
tel: +33 3 20 08 24 98
https://www.hygeos.com