
Re: [Condor-users] gt4 grid universe problem



On Sep 26, 2006, at 8:30 AM, Andrew Walker wrote:

Having recently upgraded from Condor 6.6 to 6.8(.0), I'm trying to submit a grid universe gt4 job to a remote gatekeeper in front of a Condor pool. Currently my job is failing with the error "Failed to create proxy delegation" (which is Code 0 Subcode 0 in the user log file). Does anybody have any idea how to debug this?

The gatekeeper is running Globus 4.0.1 and I can successfully submit jobs using the pre-WS GRAM (using both the gt2 grid universe and the globus universe). At the moment I have pre-staged the executable and am not attempting to retrieve the output to the submit machine; all I want to do is run a shell script on a Condor node and return the output to the gatekeeper. I think my problem is with the Condor-G submit machine, but I have access to log and configuration files at both ends.

Using the following submit file:


Universe        = grid
grid_resource = gt4 cartman.niees.group.cam.ac.uk Condor
Executable      = /home/andreww/globus_tests/test_9_mins.sh
Notification    = NEVER
GlobusRSL       = (condorsubmit=(initialdir /home/andreww/globus_tests)(transfer_files always))
Transfer_Executable = false
Transfer_Output = false

Stream_Output   = false
Stream_Error    = false

Output          = /home/andreww/globus_tests/task_$(PROCESS).out
Error           = job.err
Log             = job.log

Queue 1


Once I submit the job I see it sit idle for a couple of minutes and a gridftp server starts locally (also visible in condor's queue). After three minutes or so the main job fails and goes into a held state with the following in job.log:


000 (292.000.000) 09/26 13:47:22 Job submitted from host: <193.62.125.72:45828>
...
012 (292.000.000) 09/26 13:50:41 Job was held.
        Failed to create proxy delegation
        Code 0 Subcode 0
...  


One possibility is that gridftp is not correctly traversing the firewalls between the gatekeeper and the condor submit machine (I have two firewalls to worry about - both filter traffic in both directions). What are the network requirements for a gt4 resource? I guess the gatekeeper has to connect back to the submitting machine on TCP port 2811. However, I don't think this is the immediate problem as I'm not seeing any activity (or failing outbound network connections) from the gatekeeper.

The problem is not with the gridftp server, but with delegating your proxy to the Delegation service on the gatekeeper machine. The best way to debug this is to try Globus' WS GRAM client to submit an equivalent job. Try this:

globusrun-ws -submit -job-delegate -factory cartman.niees.group.cam.ac.uk -factory-type Condor -job-command /bin/date

This will delegate a credential, then submit a job that uses that credential. If this fails too, then you know that the problem is not related to Condor-G.
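
As a rough sketch, the test could be wrapped in a small script like the one below. The globusrun-ws command line is the one quoted in this message; the function names and the messages are made up for illustration, and the "success" branch is only an inference (if the plain WS GRAM client works, Condor-G becomes the more likely suspect), not something the tools report themselves.

```shell
#!/bin/sh
# Illustrative wrapper around the WS GRAM delegation test.
# Function names and messages are hypothetical, not part of Condor or Globus.

FACTORY="cartman.niees.group.cam.ac.uk"

# Run the same test as in the message: delegate a credential, then
# submit /bin/date through the Condor factory type.
run_delegation_test() {
    globusrun-ws -submit -job-delegate \
        -factory "$1" -factory-type Condor \
        -job-command /bin/date
}

# Turn the client's exit status into a hint about where to look next.
interpret_result() {
    if [ "$1" -eq 0 ]; then
        # Assumption: success outside Condor-G points back at the Condor-G side.
        echo "WS GRAM client worked; look at the Condor-G side next"
    else
        echo "WS GRAM client failed; the problem is not related to Condor-G"
    fi
}

if command -v globusrun-ws >/dev/null 2>&1; then
    run_delegation_test "$FACTORY"
    interpret_result $?
else
    echo "globusrun-ws not found in PATH; set up the Globus environment first" >&2
fi
```

Run it on the submit machine as the same user, and with the same proxy, that Condor-G uses, so that the two submissions are comparable.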

A couple of other notes:

The 'globus_rsl' attribute doesn't work for WS GRAM jobs. Instead, there's a globus_xml attribute, for use with WS GRAM's XML-based RSL description.
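
For illustration only, here is a sketch that writes a gt4 submit file using globus_xml instead of the RSL line. The file name gt4_test.sub and the maxWallTime element are assumptions chosen as an example of an XML job-description fragment; this is not a translation of the condorsubmit clause in the original submit file, and the value is passed through as-is, so check it against the WS GRAM job description schema on the gatekeeper.

```shell
#!/bin/sh
# Write a minimal gt4 submit file that uses globus_xml rather than GlobusRSL.
# The file name and the <maxWallTime> value are hypothetical examples.

SUBMIT_FILE="${SUBMIT_FILE:-gt4_test.sub}"

# Quoted heredoc delimiter so that $(PROCESS) reaches Condor unexpanded.
cat > "$SUBMIT_FILE" <<'EOF'
Universe            = grid
grid_resource       = gt4 cartman.niees.group.cam.ac.uk Condor
Executable          = /home/andreww/globus_tests/test_9_mins.sh
Transfer_Executable = false
Notification        = NEVER

# Assumed example: an XML fragment added to the WS GRAM job description.
globus_xml          = <maxWallTime>10</maxWallTime>

Output              = /home/andreww/globus_tests/task_$(PROCESS).out
Error               = job.err
Log                 = job.log

Queue 1
EOF

echo "Wrote $SUBMIT_FILE"
```

The file can then be handed to condor_submit as usual.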

The gridftp server Condor-G starts up for WS GRAM file transfers listens on a dynamic port, not 2811. If you open a range of ports in your firewall and set LOWPORT/HIGHPORT to match in your Condor config file, then the gridftp server shouldn't have any problems.
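
A quick way to see which range would need opening is to ask Condor for the two settings. This is a minimal sketch: the port_range_ok helper is made up for illustration, and 9600-9700 in the comment is only an example range, not a recommendation.

```shell
#!/bin/sh
# Report the LOWPORT/HIGHPORT range configured on the submit machine.
# Example config-file entries (the values are an assumption):
#   LOWPORT  = 9600
#   HIGHPORT = 9700

# Hypothetical helper: succeeds only if both arguments are plain
# decimal port numbers with low <= high.
port_range_ok() {
    case "$1$2" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ -n "$1" ] && [ -n "$2" ] &&
        [ "$1" -ge 1 ] && [ "$2" -le 65535 ] && [ "$1" -le "$2" ]
}

if command -v condor_config_val >/dev/null 2>&1; then
    # condor_config_val exits non-zero when a setting is not defined.
    low=$(condor_config_val LOWPORT 2>/dev/null) || low=""
    high=$(condor_config_val HIGHPORT 2>/dev/null) || high=""
    if port_range_ok "$low" "$high"; then
        echo "Firewall needs to allow TCP ports $low-$high to the submit machine"
    else
        echo "LOWPORT/HIGHPORT unset or invalid: '$low' '$high'"
    fi
else
    echo "condor_config_val not found in PATH" >&2
fi
```

With two firewalls in the path, the same range would have to be allowed through both of them.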

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+