
[Condor-users] Condor-G fun

I have run a few jobs using Condor-G now, but even though the jobs
run OK, I frequently get the following error. When the error occurs, the
jobs never seem to recover (although I give up after about 40 mins):

020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
    RM-Contact: <grid-resource>/jobmanager-fork
026 (023.000.000) 11/09 11:08:03 Detected Down Grid Resource
    GridResource: gt2 <grid-resource>/jobmanager-fork
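
To see how often this happens across a longer log, something like the following could be used. It is only a minimal sketch in Python, assuming the event header layout seen in the excerpt above (three-digit code, job id in parentheses, date, time, text). The function name find_down_events and the choice of codes 020 and 026 are illustrative, taken from this excerpt only and not from the Condor documentation.

```python
import re

# Event header layout as seen in the excerpt, e.g.
# "020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

# Assumption: only the two codes that appear in the excerpt are treated
# as "resource down" events.
DOWN_CODES = {"020", "026"}


def find_down_events(lines):
    """Return (job_id, date, time, text) for each 'resource down' event header."""
    hits = []
    for line in lines:
        m = EVENT_RE.match(line.rstrip("\n"))
        if m and m.group("code") in DOWN_CODES:
            # "023.000" becomes "23.0", the usual cluster.proc form
            job_id = "{}.{}".format(int(m.group("cluster")), int(m.group("proc")))
            hits.append((job_id, m.group("date"), m.group("time"), m.group("text")))
    return hits
```

Calling find_down_events on the open log file (glob.log in the submit file below) would list each time a job was flagged this way. Indented detail lines such as RM-Contact are skipped because they do not match the header pattern.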

The 2 most obvious reasons for this are:
a) Machine is down
b) Machine never existed (i.e. name spelled wrong)

Since I can cut and paste the machine name and successfully run Grid
jobs on that machine, any ideas what else it can be?

I have ruled out the following (or believe I have):
1. Firewall issue (the firewall has now been opened), since this would
   prevent globus-job-run from working, and in any case I'd get an error
   about inability to transfer files.
2. No valid proxy, since globus commands work.
3. Condor daemons can't see the firewall settings, since I re-ran with
   them in the environment, and some jobs do run.

Are there any DEBUG settings I can use to get further info?

condor_q -analyze doesn't help for non-matchmaking Condor-G.

Below is the submit file (hopefully there is a bug in there somewhere)

# Maybe I should try with "old" syntax, using globus universe
universe = grid

# Just try with fork for now
grid_resource = gt2 <grid-resource>/jobmanager-fork

notification = never

# This exists in /bin on all grid resources I use
executable = /bin/hostname
transfer_executable = false

# No common storage

# Do these make sense in combination with the previous 2 file transfer settings?
stream_input = false
stream_error = false
stream_output = false

output = glob$(PROCESS).out
error = glob$(PROCESS).err
log = glob.log

# Is something like this needed or should I just omit it?
# String literals must be quoted; an unquoted LINUX is read as an attribute name
REQUIREMENTS = (OpSys == "LINUX" && (Arch != "Windows51"))