Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[HTCondor-users] Condor-C Grid Resource - multiple grid resources - one resource down

Date: Tue, 19 Aug 2014 01:32:16 +0000
From: <Greg.Hitchen@xxxxxxxx>
Subject: [HTCondor-users] Condor-C Grid Resource - multiple grid resources - one resource down

Hi All

Some more testing of Condor-C grid resource stuff.

I can specify multiple grid resources OK, as well as limit the number
of jobs submitted to each resource.

Submit file (excerpt) on originating schedd:

universe = grid
grid_resource = condor $RANDOM_CHOICE(condorsubmit1.csiro.au, condorsubmit2.csiro.au, \
                                      condorsubmit3.csiro.au, condorsubmit4.csiro.au, \
                                      condorsubmit5.csiro.au, condorsubmit6.csiro.au) \
                                      condor-centralmanager.csiro.au

Config file (excerpt) on originating schedd:

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5

With this in place, jobs submitted from the originating schedd are spread
across the grid resources (in this example, 6 other remote schedds). With the
job limit also set, jobs are fed to each remote schedd and capped at 5 per
resource (in this example).

If I deliberately disable one of the 6 remote schedds though, the gridmanager
notices and logs that the resource is down, but how can I tell it to retry
on another grid resource?

I thought of using periodic_hold and periodic_release for a job that has been
in the Idle state for more than, say, 30 minutes, but this will not work because
the GridResource attribute in the job ClassAd was already generated at submit
time by $RANDOM_CHOICE.

I was hoping for something a bit more elegant/simple than having to run a
separate script that calls condor_q with a constraint matching jobs idle for
more than 30 minutes, extracts the cluster.process IDs, and loops through them
using condor_qedit to modify the GridResource attribute in each job ClassAd.
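
For reference, the script-based workaround described above could be sketched
roughly as follows. This is only an illustration under assumptions: the host
names and the 30-minute threshold come from the example in this post, the
function names are hypothetical, and it has not been verified that the
gridmanager picks up an edited GridResource without the job being held and
released.

```python
import random
import subprocess

# Values taken from the example above; adjust for your own pool.
REMOTE_SCHEDDS = [f"condorsubmit{i}.csiro.au" for i in range(1, 7)]
POOL = "condor-centralmanager.csiro.au"
IDLE_SECONDS = 30 * 60


def idle_constraint(idle_seconds):
    """ClassAd expression matching grid universe jobs idle too long."""
    # JobUniverse 9 is the grid universe; JobStatus 1 is Idle.
    return ("JobUniverse == 9 && JobStatus == 1 && "
            f"(time() - EnteredCurrentStatus) > {idle_seconds}")


def parse_queue(text):
    """Parse 'condor_q -af ClusterId ProcId GridResource' output.

    Returns a list of (cluster.proc, remote_schedd) pairs.
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Expected fields: ClusterId ProcId condor <schedd> <pool>
        if len(fields) < 5 or fields[2] != "condor":
            continue
        jobs.append((f"{fields[0]}.{fields[1]}", fields[3]))
    return jobs


def pick_other(current, schedds, rng=random):
    """Choose a remote schedd other than the one the job is stuck on."""
    return rng.choice([s for s in schedds if s != current])


def qedit_command(job_id, schedd, pool):
    """Build the condor_qedit call that rewrites GridResource."""
    # String values given to condor_qedit must carry their own quotes.
    return ["condor_qedit", job_id, "GridResource",
            f'"condor {schedd} {pool}"']


def requeue_idle_jobs():
    """Point every long-idle grid job at a different remote schedd."""
    out = subprocess.run(
        ["condor_q", "-constraint", idle_constraint(IDLE_SECONDS),
         "-af", "ClusterId", "ProcId", "GridResource"],
        check=True, capture_output=True, text=True,
    ).stdout
    for job_id, schedd in parse_queue(out):
        new_schedd = pick_other(schedd, REMOTE_SCHEDDS)
        subprocess.run(qedit_command(job_id, new_schedd, POOL), check=True)
```

Calling requeue_idle_jobs() periodically (for example from cron) would
implement the loop. Note that the constraint also matches jobs that are idle
only because the per-resource job limit has been reached, not just jobs whose
resource is down, and the random pick may land on another unavailable schedd.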

Thanks for any info/help.

Cheers

Greg
