Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] Condor-G fun

Date: Fri, 9 Nov 2007 11:36:32 -0000
From: "Kewley, J (John)" <j.kewley@xxxxxxxx>
Subject: [Condor-users] Condor-G fun

I have run a few jobs using Condor-G now, but even though the jobs sometimes
run OK, I frequently get the following error. When the error occurs, the jobs
never seem to recover (although I give up after about 40 mins):

---------------
020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
    RM-Contact: <grid-resource>/jobmanager-fork
...
026 (023.000.000) 11/09 11:08:03 Detected Down Grid Resource
    GridResource: gt2 <grid-resource>/jobmanager-fork
---------------
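
To see how often and when the resource is flagged as down, the user log can be scanned for these events. The following is only a rough Python sketch: it assumes event header lines have the layout shown in the excerpt above, and the names EVENT_RE and find_down_events are made up for illustration.

```python
import re

# Event header lines in the user log excerpt look like:
#   020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
# The pattern below assumes that layout (an assumption based on the excerpt).
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def find_down_events(log_text):
    """Return (job_id, timestamp, description) for each 'Detected Down' event."""
    hits = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m and "Detected Down" in m.group("text"):
            # Drop the zero padding: cluster 023, proc 000 becomes "23.0"
            job_id = f"{int(m.group('cluster'))}.{int(m.group('proc'))}"
            when = f"{m.group('date')} {m.group('time')}"
            hits.append((job_id, when, m.group("text")))
    return hits


# The log excerpt from this message, used as sample input
SAMPLE = """\
020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
    RM-Contact: <grid-resource>/jobmanager-fork
...
026 (023.000.000) 11/09 11:08:03 Detected Down Grid Resource
    GridResource: gt2 <grid-resource>/jobmanager-fork
"""

if __name__ == "__main__":
    for job_id, when, text in find_down_events(SAMPLE):
        print(job_id, when, text)
```

Running it against the real glob.log instead of SAMPLE would show whether the down events cluster around particular times or hit every job.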

The 2 most obvious reasons for this are:
a) Machine is down
b) Machine never existed (i.e. name spelled wrong)

Since I can cut and paste the machine name and successfully run Grid jobs
on that machine, any ideas what else it could be?

I have ruled out the following (or believe I have):
1. Firewall issue (this has now been opened), since this would prevent the
   globus-job-run commands from running, and in any case I'd get an error
   about inability to transfer files.
2. No valid proxy, since the globus commands work.
3. Condor daemons can't see the firewall settings, since I re-ran with them
   in the environment, and some jobs do run.
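
One cheap way to double-check item 1 from the submit machine is a plain TCP connection test against the gatekeeper. This is a minimal Python sketch, not a Condor or Globus tool: can_connect is a hypothetical helper, and port 2119 is only the conventional GT2 gatekeeper default, so it is an assumption about this particular site.

```python
import socket

# Conventional default port for a GT2 gatekeeper; an assumption, the site
# may have configured a different one.
GATEKEEPER_PORT = 2119


def can_connect(host, port=GATEKEEPER_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False
```

Calling can_connect with the grid resource's host name repeatedly over a few minutes would show whether the gatekeeper port is intermittently unreachable. It only tests outbound TCP reachability; it says nothing about authentication or about connections the remote side opens back to the submit machine.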

Are there any DEBUG settings I can use to get further info?

condor_q -analyze doesn't help for non-matchmaking Condor-G.

Below is the submit file (hopefully there is a bug in there somewhere)

Cheers

JK

---------------------

# Maybe I should try with "old" syntax, using globus universe
universe = grid

# Just try with fork for now
grid_resource = gt2 <grid-resource>/jobmanager-fork

notification = never

# This exists in /bin on all grid resources I use
executable = /bin/hostname
transfer_executable = false

# No common storage
SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT

# Do these make sense in combination with previous 2 file transfer settings?
stream_input = false
stream_error = false
stream_output = false

output = glob$(PROCESS).out
error = glob$(PROCESS).err
log = glob.log

# Is something like this needed or should I just omit it?
REQUIREMENTS = (OpSys == "LINUX" && (Arch != "Windows51"))

queue

Follow-Ups:
- Re: [Condor-users] Condor-G fun
  - From: Steven Timm

References:
- Re: [Condor-users] Condor errno 10061
  - From: Finch, Ralph
