
Re: [Condor-users] Condor-G fun



John, this error sometimes happens when your condor-g client
does not have an up-to-date list of Certificate Authorities
and Certificate Revocation Lists.  If these are missing or stale,
it can't complete the GSS handshake with the remote grid resource,
and the resource will appear to condor-g to be down.
If you give a couple more details about how your machine is set
up, maybe we can give you some ideas on how to get a correct
set of CAs and CRLs.  In particular, are you using some VDT-based
version of Condor-G and Globus, or did you roll your own?
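A quick way to sanity-check that side of things, assuming the usual
/etc/grid-security/certificates layout (or wherever your X509_CERT_DIR
points), is something along these lines; the exact paths and whether
you have fetch-crl will depend on how the client side was installed:

  # Trusted CAs are the <hash>.0 files, CRLs the <hash>.r0 files.
  ls ${X509_CERT_DIR:-/etc/grid-security/certificates} | head

  # An expired CRL is enough to break the GSS handshake; check the
  # dates on one of them (substitute a real hash for <hash>):
  openssl crl -in ${X509_CERT_DIR:-/etc/grid-security/certificates}/<hash>.r0 \
      -noout -lastupdate -nextupdate

  # If your install ships fetch-crl (VDT-based installs normally do),
  # re-running it and re-checking the dates above refreshes the CRLs.
  fetch-crl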

Steve Timm


On Fri, 9 Nov 2007, Kewley, J (John) wrote:

I have run a few jobs using condor-g now, but even though the jobs
sometimes run OK, I frequently get the following error. When the error
occurs, the jobs never seem to recover (although I give up after about
40 mins):

---------------
020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
   RM-Contact: <grid-resource>/jobmanager-fork
...
026 (023.000.000) 11/09 11:08:03 Detected Down Grid Resource
   GridResource: gt2 <grid-resource>/jobmanager-fork
---------------

The 2 most obvious reasons for this are:
a) Machine is down
b) Machine never existed (i.e. name spelled wrong)

Since I can cut and paste the machine name and successfully run Grid
jobs to that machine, any ideas what else it can be?
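(By "successfully run Grid jobs" I just mean the basic fork test with
the same contact string as in the submit file below, e.g.

  globus-job-run <grid-resource>/jobmanager-fork /bin/hostname

and that returns the hostname as expected.)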

I have ruled out the following (or believe I have):
1. Firewall issue (this has now been opened) - since this would prevent
   the globus-job-runs from running, and in any case I'd get an error
   about an inability to transfer files.
2. Don't have a valid proxy - since globus commands work (quick check
   shown below this list).
3. Condor daemons can't see the firewall settings - since I re-ran with
   them in the environment, and some jobs do run.
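(The proxy check in 2 is nothing fancier than, say,

  # exits 0 if a valid proxy exists; -timeleft prints seconds remaining
  grid-proxy-info -exists && grid-proxy-info -timeleft

which shows the proxy is still valid.)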

Are there any DEBUG settings I can use to get further info?

condor_q -analyze doesn't help for non-matchmaking condor-g jobs.
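I'm guessing the relevant knob is something along the lines of the
following in the condor config (turning up the gridmanager's logging
and then reading its log), but I don't know whether that's the right
daemon to be looking at:

  # a guess: more verbose gridmanager logging, with a bigger log cap
  GRIDMANAGER_DEBUG = D_FULLDEBUG
  MAX_GRIDMANAGER_LOG = 10000000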

Below is the submit file (hopefully there is a bug in there somewhere)

Cheers

JK

---------------------

# Maybe I should try with "old" syntax, using globus universe
universe = grid

# Just try with fork for now
grid_resource = gt2 <grid-resource>/jobmanager-fork

notification = never

# This exists in /bin on all grid resources I use
executable = /bin/hostname
transfer_executable = false

# No common storage
SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT

# Do these make sense in combination with the previous 2 file transfer settings?
stream_input = false
stream_error = false
stream_output = false

output = glob$(PROCESS).out
error = glob$(PROCESS).err
log = glob.log

# Is something like this needed or should I just omit it?
REQUIREMENTS = (OpSys == "LINUX" && (Arch != "Windows51"))

queue



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.