[HTCondor-users] Condor-C Grid Resource - multiple grid resources - one resource down



Hi All

Some more testing of Condor-C grid resource stuff.

I can specify multiple grid resources OK, as well as limit the number
of jobs submitted to each resource.

Submit file (excerpt) on originating schedd:

universe = grid
grid_resource = condor $RANDOM_CHOICE(condorsubmit1.csiro.au, condorsubmit2.csiro.au, \
                                      condorsubmit3.csiro.au, condorsubmit4.csiro.au, \
                                      condorsubmit5.csiro.au, condorsubmit6.csiro.au) \
                                      condor-centralmanager.csiro.au

Config_file (excerpt) on originating schedd:

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5

With this in place, the jobs submitted from the originating schedd are spread
evenly across the (in this example) 6 grid resources, which are 6 other
remote schedds. With the job limit also set, the jobs are fed to each remote
schedd gradually and kept at a maximum of 5 per schedd (in this example).

If I deliberately disable one of the 6 remote schedds, though, the gridmanager
notices and logs that the resource is down. How can I tell it to retry the
affected jobs on another grid resource?

I thought of using periodic_hold and periodic_release for a job that has been in
the Idle state for more than, say, 30 minutes, but this will not work because the
GridResource attribute in the job ClassAd has already been fixed at submit time
by $RANDOM_CHOICE.

I was hoping for something a bit more elegant and simple than having to run a
separate script that calls condor_q with a constraint to find jobs idle for more
than 30 minutes, extracts the cluster.process numbers, and loops through each one
using condor_qedit to modify the GridResource attribute in the job ClassAd.
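
As a rough sketch only, that workaround script could look something like the
Python below. The function names are made up for illustration, the hostnames
are the ones from the submit file excerpt, and the constraint assumes that
EnteredCurrentStatus is a good enough measure of how long a job has been Idle.
It does not check that the replacement schedd is actually up, it cannot tell a
job stuck behind a down resource from one that is simply waiting on the
per-resource job limit, and it assumes (untested) that the gridmanager picks
up an edited GridResource for a job not yet handed to a remote schedd.

```python
import random
import subprocess

# Hostnames taken from the submit file excerpt above.
SCHEDDS = [f"condorsubmit{i}.csiro.au" for i in range(1, 7)]
CENTRAL_MANAGER = "condor-centralmanager.csiro.au"
IDLE_LIMIT_SECS = 30 * 60


def idle_constraint(limit_secs=IDLE_LIMIT_SECS):
    """ClassAd expression matching grid jobs idle longer than limit_secs."""
    # JobUniverse 9 is the grid universe, JobStatus 1 is Idle.
    return ("JobUniverse == 9 && JobStatus == 1 && "
            f"(time() - EnteredCurrentStatus) > {limit_secs}")


def parse_jobs(condor_q_output):
    """Turn 'ClusterId ProcId GridResource' lines into (job_id, schedd) pairs."""
    jobs = []
    for line in condor_q_output.splitlines():
        fields = line.split()
        # Expected layout: <cluster> <proc> condor <schedd> <central manager>
        if len(fields) >= 4 and fields[2] == "condor":
            jobs.append((f"{fields[0]}.{fields[1]}", fields[3]))
    return jobs


def pick_other_schedd(current, schedds=SCHEDDS):
    """Pick a random schedd other than the one the job is assigned to."""
    return random.choice([s for s in schedds if s != current])


def qedit_command(job_id, schedd, central_manager=CENTRAL_MANAGER):
    """Build the condor_qedit call that rewrites GridResource for one job."""
    # condor_qedit expects string values wrapped in double quotes.
    return ["condor_qedit", job_id, "GridResource",
            f'"condor {schedd} {central_manager}"']


def reassign_idle_jobs():
    """Query long-idle grid jobs and point each at a different schedd."""
    result = subprocess.run(
        ["condor_q", "-constraint", idle_constraint(),
         "-af", "ClusterId", "ProcId", "GridResource"],
        check=True, capture_output=True, text=True)
    for job_id, schedd in parse_jobs(result.stdout):
        subprocess.run(qedit_command(job_id, pick_other_schedd(schedd)),
                       check=True)
```

Something like this would have to run periodically (for example from cron,
calling reassign_idle_jobs()) on the originating schedd, which is exactly the
extra moving part I would like to avoid.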

Thanks for any info/help.

Cheers

Greg