
[Condor-users] Condor-G and Globus-RSL



I have two questions concerning jobs submitted in the Globus (grid) universe via condor_submit:

GT: 4.0.2
Condor: 6.7.19


1) Is there a way to influence the Globus RSL that Condor-G generates and to change some of the settings it creates, rather than just appending additional name-value pairs to the end of the XML job description via globus_xml?
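
For illustration, appending is straightforward; something like the following line in the submit description should add an element to the generated job description (a sketch only: the maxTime element and the namespace are just an example I picked, not what I actually need to change):

globus_xml = <ns1:maxTime xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job/description">120</ns1:maxTime>

What this cannot do, as far as I can tell, is override an element that Condor-G itself generates, such as the maxAttempts inside the fileStageIn block quoted below.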

Why I ask:
I am doing throughput testing with WS-GRAM and submitting 3500 jobs via condor_submit to a GT4 container.

Condor Job description:
####################
Universe        = grid
Grid_Type       = gt4
Jobmanager_Type = Condor
GlobusScheduler = osg-test1.unl.edu:9443
Executable      = mysleep
Arguments       = test_output$(Process) test_input$(Process)
when_to_transfer_output = ON_EXIT
transfer_input_files = test_input
transfer_output_files = test_output$(Process)
Output          = job_sleep_io.output$(Process)
Error           = job_sleep_io.error$(Process)
Log             = job_sleep_io.log
Queue 3500



Sometimes (!) 1-5 of the 3500 jobs hang in state StageInResponse and never continue, and quite often there are errors during the stage-in process. I had a look at the RSL created by Condor-G and found that it contains four transfers during stage-in:

Globus Job Description (stage-in part):
###############################
...
<ns2:fileStageIn>
<ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">5</ns13:maxAttempts>
<ns14:transferCredentialEndpoint...>
   ...
</ns14:transferCredentialEndpoint>
<ns18:transfer ...>
<ns18:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns18:sourceUrl>
<ns18:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch</ns18:destinationUrl>
</ns18:transfer>

<ns19:transfer ...>
<ns19:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns19:sourceUrl>
<ns19:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/</ns19:destinationUrl>
</ns19:transfer>

<ns20:transfer ...>
<ns20:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/mysleep</ns20:sourceUrl>
<ns20:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep</ns20:destinationUrl>
</ns20:transfer>

<ns21:transfer ...>
<ns21:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/test_input</ns21:sourceUrl>
<ns21:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/test_input</ns21:destinationUrl>
</ns21:transfer>

</ns2:fileStageIn>
...

I assume the following happens and is responsible for the errors:
when the container is busy (and it is busy with 3500 jobs), the first two transfers are sometimes not yet finished when the third one starts. In that case the directory job_8ff6c880-0cc4-11db-b248-a9807d8bba43 (created by the second transfer) does not exist yet, and this results in a staging exception in the GT4 container. The same error occurs for the fourth transfer, but much less frequently.
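
To spell out the suspected race (the GridFTP operations are my interpretation of the URLs above, in particular of the trailing slash on the empty source directory; I have not traced the actual commands):

transfer 2: effectively MKD  .../scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/
transfer 3: effectively STOR .../scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep

If transfer 3 starts before transfer 2 has completed, the STOR fails with "no such file or directory", i.e. exactly the kind of staging exception I am seeing.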

To reproduce jobs that stay in state StageInResponse without continuing in the GT4 container, or, even better, to help all jobs finish reliably, I would like to try setting maxAttempts to a different value and see whether it has any impact on these jobs.
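
Concretely, instead of the generated value of 5, I would like the stage-in block to contain something like this (a sketch; the value 20 is arbitrary):

<ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">20</ns13:maxAttempts>

but I have not found a submit-description attribute that would let me override it.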


2) How does Condor-G create the Globus job ID (e.g. 8ff6c880-0cc4-11db-b248-a9807d8bba43)?

I found such an ID in each generated Globus RSL, and it appears to correspond to the Globus job ID. Occasionally two of the 3500 Condor jobs are mapped to the same Globus ID, and one of those two jobs then does not finish successfully.
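
One observation, in case it helps: the IDs parse as version-1 (time-based) UUIDs. Here is how I checked (Python; this is only my way of inspecting the ID, I do not know how Condor-G actually generates it):

import uuid

# The example ID from the generated RSL parses as a version-1
# (time-based) UUID.
u = uuid.UUID("8ff6c880-0cc4-11db-b248-a9807d8bba43")
print(u.version)   # prints: 1

# A freshly generated time-based UUID for comparison. If Condor-G does
# something equivalent (my assumption, not verified), duplicate IDs
# would hint at a clock or node-ID problem on the submit machine.
print(uuid.uuid1())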

Thanks in advance for any explanation or advice!

Martin