
Re: [HTCondor-users] Executable Fails to Transfer



Hi Brian,

   Thanks for the suggestion, but I forgot to say that the job still fails
with the firewall turned completely off. My system manager wants to stay with
the older version of condor because it is what is included in the ROCKS
distribution that he uses to install SL6.
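
In case it is useful for the comparison you suggest below, the port-related
settings I know to check on a submit host are roughly these (just a sketch;
the range in the comment is an example, not our actual configuration):

# Port range the HTCondor daemons on the submit host are restricted to (if any)
condor_config_val LOWPORT
condor_config_val HIGHPORT

# The GT5 client code used by the gridmanager honors these environment
# variables for callback/data connections,
# e.g. GLOBUS_TCP_PORT_RANGE=20000,25000 (example value only)
echo "$GLOBUS_TCP_PORT_RANGE"
echo "$GLOBUS_TCP_SOURCE_RANGE"

If the two hosts differ in any of these, that would presumably explain the
different behaviour.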

Fred

On 4/20/14, 10:36 PM, Brian Bockelman wrote:
> Hi Fred,
> 
> Judging from the failure message:
> 
>> 018 (096.000.000) 04/17 14:00:30 Globus job submission failed!
>>    Reason: 43 the job manager failed to stage the executable
> 
> it appears to have nothing to do with the age of HTCondor (although, ahem, time to upgrade ;).  Instead, it looks like a firewall issue on that submit host.
> 
> Are you aware of the various considerations for combining a firewall with an HTCondor-G submit host?  Can you do a quick comparison between the working and the non-working host?
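
Replying inline: I think so -- beyond turning the firewall off entirely, the
only GRAM sanity checks I know of from the submit-host side are along these
lines (a sketch against the gate04 gatekeeper from the logs below; I don't
believe they exercise the GASS staging path that error 43 points at, so they
are only a partial check):

# Authenticate to the gatekeeper without submitting anything
globusrun -a -r gate04.aglt2.org

# Run a trivial command through the fork jobmanager
globus-job-run gate04.aglt2.org/jobmanager-fork /bin/hostname

If those succeed from the failing host as well, then basic connectivity and
authentication at least are not the problem.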
> 
> Thanks,
> 
> Brian
> 
> On Apr 17, 2014, at 2:27 PM, Frederick Luehring <luehring@xxxxxxxxxxx> wrote:
> 
>> Hi Everyone,
>>
>>   I have a condor submit host that succeeds in running a simple test job and
>> a second condor submit host on which the same test job fails. The failing
>> submit host is running this version of condor because that is what comes
>> with ROCKS:
>>
>> $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
>> $CondorPlatform: x86_64_rhap_6.3 $
>>
>> On the failing host, the job returns these messages from condor:
>>
>> 000 (096.000.000) 04/17 13:55:20 Job submitted from host:
>> <129.79.157.90:11015?sock=2507_a415_3>
>> ...
>> 018 (096.000.000) 04/17 14:00:30 Globus job submission failed!
>>    Reason: 43 the job manager failed to stage the executable
>> ...
>> 009 (096.000.000) 04/17 14:00:30 Job was aborted by the user.
>> 	Globus error 43: the job manager failed to stage the executable
>> ...
>>
>> The working submit host is running a newer version of condor:
>>
>> $CondorVersion: 8.1.1 Sep 11 2013 BuildID: 171174 $
>> $CondorPlatform: x86_64_RedHat6 $
>>
>> The working job returns these messages from condor:
>>
>> 000 (096.000.000) 04/17 15:11:11 Job submitted from host:
>> <129.79.157.89:11015?sock=2742_2cf7_4>
>> ...
>> 017 (096.000.000) 04/17 15:11:20 Job submitted to Globus
>>    RM-Contact: gate04.aglt2.org/jobmanager-condor
>>    JM-Contact: gate04.aglt2.org/jobmanager-condor
>>    Can-Restart-JM: 1
>> ...
>> 027 (096.000.000) 04/17 15:11:20 Job submitted to grid resource
>>    GridResource: gt5 gate04.aglt2.org/jobmanager-condor
>>    GridJobId: gt5 gate04.aglt2.org/jobmanager-condor
>> https://gate04.aglt2.org:59832/16361969724494590991/6276480034496635811/
>> ...
>> 001 (096.000.000) 04/17 15:11:55 Job executing on host: gt5
>> gate04.aglt2.org/jobmanager-condor
>> ...
>> 005 (096.000.000) 04/17 15:12:10 Job terminated.
>> 	(1) Normal termination (return value 0)
>> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>> 	0  -  Run Bytes Sent By Job
>> 	0  -  Run Bytes Received By Job
>> 	0  -  Total Bytes Sent By Job
>> 	0  -  Total Bytes Received By Job
>> ...
>>
>> The jobs are submitted from the same NFS-mounted directory on both submit
>> hosts. The submit description is:
>>
>> grid_resource=gt5 gate04.aglt2.org/jobmanager-condor
>> globusrsl=(jobtype=single)(queue=Tier3Test)
>> copy_to_spool = True
>> +Nonessential = True
>> universe=grid
>> notify_user=luehring@xxxxxxxxxxx
>> +MATCH_APF_QUEUE="ANALY_AGLT2_TIER3_TEST"
>> x509userproxy=$ENV(HOME)/x509_Proxy
>>
>> executable=foo.sh
>>
>> Dir=/s/luehring/panda_wrapper
>> output=$(Dir)/$(Cluster).$(Process).log
>> error=$(Dir)/$(Cluster).$(Process).log
>> log=$(Dir)/$(Cluster).log
>>
>> stream_output=False
>> stream_error=False
>> notification=Error
>> transfer_executable = True
>> Should_Transfer_Files   = Yes
>> queue 1
>>
>> where foo.sh contains this trivial payload:
>>
>> #!/bin/zsh
>>
>> /bin/env
>> /bin/ls -l
>> /usr/bin/voms-proxy-info -all
>>
>>
>> Any advice would be appreciated.
>>
>> Thanks greatly!
>>
>> Fred
>>
>> -- 
>> Fred Luehring Indiana U. HEP mailto:luehring@xxxxxxxxxxx  +1 812 855 1025 IU
>> http://cern.ch/Fred.Luehring mailto:Fred.Luehring@xxxxxxx +41 22 767 1166 CERN
>> http://cern.ch/Fred.Luehring/Luehring_pub.asc             +1 812 391 0225 GSM
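
P.S. In case it helps with the diagnosis: my understanding is that the
gridmanager log on the failing submit host records the underlying GRAM/GASS
error behind the single "error 43" line in the user log, and that the job ad
keeps the grid attributes after the job leaves the queue. Roughly (a sketch;
96.0 is the cluster from the log above, and I am assuming the default log
location that condor_config_val reports):

# Locate and inspect the gridmanager log on the failing submit host
condor_config_val GRIDMANAGER_LOG
tail -n 200 "$(condor_config_val GRIDMANAGER_LOG)"

# The job has already left the queue (it was aborted), so pull its ad
# from the history and look at the grid-related attributes
condor_history -l 96.0 | egrep -i 'grid|globus'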


-- 
Fred Luehring Indiana U. HEP mailto:luehring@xxxxxxxxxxx  +1 812 855 1025 IU
http://cern.ch/Fred.Luehring mailto:Fred.Luehring@xxxxxxx +41 22 767 1166 CERN
http://cern.ch/Fred.Luehring/Luehring_pub.asc             +1 812 391 0225 GSM