
[Condor-users] condor problem : shadow unable to transmit output file



Hello, we are building a test grid to harness the processing power of all our lab computers,
and we ran into a problem we are unable to solve.

Our pool currently consists of the following machines (distribution noted after each host):
licinfo10.uni LINUX       INTEL  Owner      Idle       0.000   502  0+00:10:02 - ubuntu edgy eft
licinfo11.uni LINUX       INTEL  Owner      Idle       0.000   502  0+00:10:01 - ubuntu edgy eft
vm1@moua      LINUX       INTEL  Owner      Idle       0.060   504  0+00:08:24 - RH FC 6
vm2@moua      LINUX       INTEL  Owner      Idle       0.000   504  0+00:08:25
vm1@nocte     LINUX       INTEL  Owner      Idle       0.270   506  0+00:10:09 - debian sid
vm2@nocte     LINUX       INTEL  Owner      Idle       0.000   506  0+00:10:10
vm1@nous      LINUX       INTEL  Owner      Idle       0.070   505  0+00:10:09 - ubuntu feisty fawn
vm2@nous      LINUX       INTEL  Owner      Idle       0.000   505  0+00:10:10

I tried a test submit file that I found on this mailing list:

executable = /bin/hostname
universe = vanilla
TransferExecutable = true
transfer_output_files= true
output=results.output.$(Process)
error=results.error.$(Process)
log=results.log.$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5

Our problem is that all our jobs go quickly from the idle state to the held state, and every job log reports:


000 (001.003.000) 06/05 14:07:02 Job submitted from host: < 10.9.185.29:38947>
...
001 (001.003.000) 06/05 14:17:11 Job executing on host: <10.9.185.211:42641>
...
007 (001.003.000) 06/05 14:17:11 Shadow exception!
        Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from < 10.9.185.211:53966>
        0  -  Run Bytes Sent By Job
        8572  -  Run Bytes Received By Job
...
012 (001.003.000) 06/05 14:17:11 Job was held.
        Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from < 10.9.185.211:53966>
        Code 13 Subcode 2
...

I currently have
LOCAL_DIR        = /condor/$(HOSTNAME)
and previously had
#LOCAL_DIR        = $(RELEASE_DIR)/hosts/$(HOSTNAME)

I changed it so that the local dir is local to each node, because I read on the mailing list that a local dir on a shared filesystem can cause problems if the machines are not correctly time-synchronised (our /home/condor is NFS-shared among all our nodes).

Additional info: all our UIDs are shared among our hosts.

Apparently Condor does not manage to create the directories in $(LOCAL_DIR)/execute (which I chmod'ed to be world-writable) in order to send the files back.
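
One other detail that may be worth checking: the path in the error above ends in "true", which is exactly the value given to transfer_output_files in the submit file. As far as we understand, that command takes a comma-separated list of file names rather than a boolean, so the starter may be looking for an output file literally named "true". A minimal sketch of the same test with that line dropped, assuming the job produces nothing beyond stdout and stderr (this is a guess, not something we have confirmed on our pool):

```
# Sketch only: same test job, without transfer_output_files.
# Assumption: /bin/hostname writes only to stdout/stderr, so no
# extra output files need to be listed for transfer.
executable = /bin/hostname
universe = vanilla
# transfer_executable is the spelling documented for condor_submit
transfer_executable = true
output = results.output.$(Process)
error = results.error.$(Process)
log = results.log.$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5
```

If the jobs still go on hold with this file, the LOCAL_DIR and execute directory permissions described above would remain the next suspects.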

Hope somebody can help :)

The USTV Condor Task Force