
[HTCondor-users] Executable Fails to Transfer



Hi Everyone,

   I have one condor submit host that runs a simple test job successfully and
another on which the same test job fails. The failing submit host is running
this version of condor because that's what comes with ROCKS:

$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
$CondorPlatform: x86_64_rhap_6.3 $

The job returns this set of messages from condor:

000 (096.000.000) 04/17 13:55:20 Job submitted from host:
<129.79.157.90:11015?sock=2507_a415_3>
...
018 (096.000.000) 04/17 14:00:30 Globus job submission failed!
    Reason: 43 the job manager failed to stage the executable
...
009 (096.000.000) 04/17 14:00:30 Job was aborted by the user.
	Globus error 43: the job manager failed to stage the executable
...

The working submit host is running a newer version of condor:

$CondorVersion: 8.1.1 Sep 11 2013 BuildID: 171174 $
$CondorPlatform: x86_64_RedHat6 $

The working job returns these messages from condor:

009 (096.000.000) 04/17 14:00:30 Job was aborted by the user.
	Globus error 43: the job manager failed to stage the executable
...
000 (096.000.000) 04/17 15:11:11 Job submitted from host:
<129.79.157.89:11015?sock=2742_2cf7_4>
...
017 (096.000.000) 04/17 15:11:20 Job submitted to Globus
    RM-Contact: gate04.aglt2.org/jobmanager-condor
    JM-Contact: gate04.aglt2.org/jobmanager-condor
    Can-Restart-JM: 1
...
027 (096.000.000) 04/17 15:11:20 Job submitted to grid resource
    GridResource: gt5 gate04.aglt2.org/jobmanager-condor
    GridJobId: gt5 gate04.aglt2.org/jobmanager-condor
https://gate04.aglt2.org:59832/16361969724494590991/6276480034496635811/
...
001 (096.000.000) 04/17 15:11:55 Job executing on host: gt5
gate04.aglt2.org/jobmanager-condor
...
005 (096.000.000) 04/17 15:12:10 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...

The jobs are submitted from the same NFS-mounted directory on both submit
hosts. The submit file commands are:

grid_resource=gt5 gate04.aglt2.org/jobmanager-condor
globusrsl=(jobtype=single)(queue=Tier3Test)
copy_to_spool = True
+Nonessential = True
universe=grid
notify_user=luehring@xxxxxxxxxxx
+MATCH_APF_QUEUE="ANALY_AGLT2_TIER3_TEST"
x509userproxy=$ENV(HOME)/x509_Proxy

executable=foo.sh

Dir=/s/luehring/panda_wrapper
output=$(Dir)/$(Cluster).$(Process).log
error=$(Dir)/$(Cluster).$(Process).log
log=$(Dir)/$(Cluster).log

stream_output=False
stream_error=False
notification=Error
transfer_executable = True
Should_Transfer_Files   = Yes
queue 1

where foo.sh contains this trivial payload:

#!/bin/zsh

/bin/env
/bin/ls -l
/usr/bin/voms-proxy-info -all
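
Since Globus error 43 concerns staging the executable, one basic thing to compare between the two submit hosts is whether foo.sh is present, readable, and executable as seen from each host. The following is only a minimal sketch of such a check; the function name check_exec is hypothetical, and passing this check does not rule out other causes of the staging failure:

```shell
#!/bin/sh
# Hypothetical helper (not part of condor or Globus): report whether a job
# executable looks usable from the host this is run on.
check_exec() {
    f="$1"
    if [ ! -f "$f" ]; then
        echo "missing"            # no regular file at this path
    elif [ ! -r "$f" ]; then
        echo "unreadable"         # file exists but the current user cannot read it
    elif [ ! -x "$f" ]; then
        echo "not-executable"     # execute bit is not set for the current user
    else
        echo "ok"
    fi
}
```

Running `check_exec foo.sh` from the submit directory on each host would show whether the two hosts see the file differently.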


Any advice would be appreciated.

Thanks greatly!

Fred

-- 
Fred Luehring Indiana U. HEP mailto:luehring@xxxxxxxxxxx  +1 812 855 1025 IU
http://cern.ch/Fred.Luehring mailto:Fred.Luehring@xxxxxxx +41 22 767 1166 CERN
http://cern.ch/Fred.Luehring/Luehring_pub.asc             +1 812 391 0225 GSM