
Re: [Condor-users] gt4 grid universe problem




On 27 Sep 2006, at 16:37, Jaime Frey wrote:

On Sep 26, 2006, at 8:30 AM, Andrew Walker wrote:

Having recently upgraded from condor 6.6 to 6.8(.0), I'm trying to submit a grid universe gt4 job to a remote gatekeeper in front of a condor pool. Currently my job is failing with the error "Failed to create proxy delegation" (which is Code 0 Subcode 0 in the user log file). Does anybody have any idea how to debug this?

The gatekeeper is running globus 4.0.1 and I can successfully submit jobs using the pre-WS gram (using both the gt2 grid universe and the globus universe). At the moment I have pre-staged the executable and am not attempting to recover the output back to the submit machine - all I want to do is run a shell script on a condor node and return the output to the gatekeeper. I think my problem is with the condor-g submit machine, but I have access to log and configuration files at both ends.


snip...

One possibility is that gridftp is not correctly traversing the firewalls between the gatekeeper and the condor submit machine (I have two firewalls to worry about - both filter traffic in both directions). What are the network requirements for a gt4 resource? I guess the gatekeeper has to connect back to the submitting machine on TCP port 2811. However, I don't think this is the immediate problem as I'm not seeing any activity (or failing outbound network connections) from the gatekeeper.

The problem is not with the gridftp server, but with delegating your proxy to the Delegation service on the gatekeeper machine. The best way to debug this is to use Globus' WS GRAM client to submit an equivalent job. Try this:

globusrun-ws -submit -job-delegate -factory cartman.niees.group.cam.ac.uk -factory-type Condor -job-command /bin/date

This will delegate a credential, then submit a job that uses that credential. If this fails, then you know that the problem is not related to Condor-G.

A couple of other notes:

The 'globus_rsl' attribute doesn't work for WS GRAM jobs. Instead, there's a 'globus_xml' attribute, for use with WS GRAM's XML-based RSL description.
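
For example, a submit file carrying some extra WS GRAM job description XML might look something like this (just a sketch - the maxWallTime element is only an illustration; substitute whatever elements your job actually needs):

Universe        = grid
grid_resource   = gt4 cartman.niees.group.cam.ac.uk Condor
globus_xml      = <maxWallTime>10</maxWallTime>
Executable      = /bin/hostname
Queue 1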

The gridftp server Condor-G starts up for WS GRAM file transfers listens on a dynamic port, not 2811. If you have a hole in your firewall and LOWPORT/HIGHPORT set appropriately in your Condor config file, then the gridftp server shouldn't have any problems.
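
In case it helps, the corresponding condor_config entries look something like this (the port range is only an example - use whatever range actually matches the hole in your firewall):

LOWPORT  = 20000
HIGHPORT = 20050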




Jaime, 

Thanks for the info - it turned out that this was a firewall issue, resolved by moving my tests to a new pair of machines. However, I have now run up against a new problem. (I'm now submitting from a 6.8.1 condor machine to a gatekeeper running globus 4.0.2 in front of a 6.8.1 condor pool; the firewalls between the two machines have been configured to allow all traffic in either direction.)

I have also simplified my test job to try to work out what is going on - all I want to see is the hostname of the execute node on the remote condor pool:

Universe        = grid
grid_resource   = gt4 cete.niees.group.cam.ac.uk Condor
Executable      = /bin/hostname
Notification    = NEVER
Output          = host_$(PROCESS).out
Error           = host.err
Log             = host.log
Queue 1


Again the job enters the local queue, the gridftp server starts up, and then the job fails and enters the held state. This time I have a different error in the user log (Globus error: Staging error for RSL element fileStageIn):


000 (192.000.000) 09/29 16:31:55 Job submitted from host: <131.111.20.163:9661>
...
017 (192.000.000) 09/29 16:32:50 Job submitted to Globus
    RM-Contact: cete.niees.group.cam.ac.uk
    JM-Contact: https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
    Can-Restart-JM: 0
...
027 (192.000.000) 09/29 16:32:50 Job submitted to grid resource
    GridResource: gt4 cete.niees.group.cam.ac.uk Condor
    GridJobId: gt4 https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
...
012 (192.000.000) 09/29 16:32:53 Job was held.
        Globus error: Staging error for RSL element fileStageIn.
        Code 0 Subcode 0
...
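
(For what it's worth, the full hold reason can also be pulled out of the local queue with something like 'condor_q -l 192.0 | grep HoldReason'.)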


However, running the equivalent command using the globus client works (and the returned output file shows that the job ran on a condor execute node): 

globusrun-ws -streaming -stdout-file testout -submit -job-delegate -factory cete.niees.group.cam.ac.uk -factory-type Condor -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3cc015da-4faa-11db-8c27-00042388e7a7
Termination time: 09/30/2006 11:04 GMT
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.


Using condor's GT2 interface also works as expected:

Universe        = grid
grid_resource   = gt2 cete.niees.group.cam.ac.uk/jobmanager-condor
Executable      = /bin/hostname
Notification    = NEVER
Output          = host_$(PROCESS).out
Error           = host.err
Log             = host.log
Queue 1



And I see exactly the same behavior if I replace the Condor jobmanager/factory type with Fork throughout. Again, I'm after some help finding a starting point for debugging - does anybody have any idea where to start?


Cheers,

Andrew






Dr Andrew Walker

Department of Earth Sciences
University of Cambridge
Downing Street
Cambridge 
CB2 3EQ
UK

phone +44 (0)1223 333432