Re: [Condor-users] Problem with periodic_release and globus_resubmit
- Date: Thu, 12 Feb 2009 12:23:59 -0600 (CST)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] Problem with periodic_release and globus_resubmit
Patrick--we are doing almost the same thing here at FermiGrid.
On the jobs that are idle:
1) What does the UserLog say?
2) What does condor_q -ana say?
3) What does condor_q -globus say?
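The three checks above can be collected into a small helper. This is only a sketch: the function name diagnostic_commands is hypothetical, it only builds the command lines and runs nothing, it spells out -ana as -analyze, and it assumes the user log path can be read from the job ad's UserLog attribute with condor_q -format. The job ID is one of the idle jobs from the quoted message.

```python
import shlex


def diagnostic_commands(job_id):
    """Build (but do not run) the condor_q command lines for one idle job.

    diagnostic_commands is a hypothetical helper name, not part of Condor.
    """
    return [
        # Print the path of the job's user log from its UserLog attribute.
        ["condor_q", "-format", "%s\\n", "UserLog", job_id],
        # Ask the schedd why the job is not running.
        ["condor_q", "-analyze", job_id],
        # Show the Globus-specific view of the job.
        ["condor_q", "-globus", job_id],
    ]


if __name__ == "__main__":
    for cmd in diagnostic_commands("2269.0"):
        # shlex.quote makes each argument safe to paste into a shell.
        print(" ".join(shlex.quote(part) for part in cmd))
```

The printed lines can be pasted into a shell on the submit host; the log file named by the first command then has to be read separately.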
In our experience, for all Grid universe jobs,
GridJobStatus is never undefined. If it is undefined, something is very
wrong. I just did a query on my queue with more than 1000 jobs and it
is defined for all. I don't see how the GlobusResubmit statement
you have below could ever be true.
Also, you mentioned a periodic_hold statement below, but you
did not actually show us what it was.
Steve Timm
On Thu, 12 Feb 2009, Patrick Armstrong wrote:
Hi there.
I've been using Condor to submit jobs to GT4 resources, and I'd like
Condor to resubmit jobs to a different resource when they fail. To
test this, I set up three Globus resources. One is deliberately
broken, so jobs sent there will always fail, and the other two are good.
I've been using the Condor-G documentation as a guide, and I've got it
working for the most part with a combination of periodic_release,
globus_resubmit, and lastmatchname, but I always seem to have one or
two jobs get stuck in the idle state. I can give the last stuck job a
nudge by submitting another job.
My periodic_hold and globus_resubmit expressions are as follows:
PeriodicRelease = (NumSystemHolds >= NumJobMatches) && (NumGlobusSubmits < 4) && (HoldReason != "via condor_hold (by user (USER))") && ((CurrentTime - EnteredCurrentStatus) > 60)
GlobusResubmit = (GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)
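To make the logic of these two expressions easier to follow, here is a rough Python model of them. It is a sketch, not how Condor evaluates ClassAds: UNDEFINED, meta_equals, globus_resubmit and periodic_release are hypothetical names, the sample attribute values (including the "Globus error" hold reason) are invented, every attribute except GridJobStatus is assumed to be defined, and string comparison is simplified to Python's case-sensitive !=. The HoldReason string is copied as it appears above, with the wrapped line joined by a space.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value

# Copied from the expression above; the wrapped line is joined with a space.
USER_HOLD = "via condor_hold (by user (USER))"


def meta_equals(a, b):
    """Model of =?=: true only if both sides are identical.

    Unlike ==, the result is always True or False, never UNDEFINED.
    """
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def globus_resubmit(ad):
    # True only when GridJobStatus is missing from the job ad
    # and the job has been held more often than it has been matched.
    status = ad.get("GridJobStatus", UNDEFINED)
    return meta_equals(status, UNDEFINED) and ad["NumSystemHolds"] > ad["NumJobMatches"]


def periodic_release(ad, now):
    # Assumes all four attributes are defined in the ad.
    return (ad["NumSystemHolds"] >= ad["NumJobMatches"]
            and ad["NumGlobusSubmits"] < 4
            and ad["HoldReason"] != USER_HOLD
            and (now - ad["EnteredCurrentStatus"]) > 60)


# Invented sample ad with no GridJobStatus attribute.
ad = {"NumSystemHolds": 2, "NumJobMatches": 1, "NumGlobusSubmits": 2,
      "HoldReason": "Globus error", "EnteredCurrentStatus": 1000}

print(globus_resubmit(ad))                                # True
print(globus_resubmit({**ad, "GridJobStatus": "ACTIVE"}))  # False
print(periodic_release(ad, now=1100))                     # True
print(periodic_release(ad, now=1030))                     # False
```

Note that the release condition uses >= while the resubmit condition uses a strict >, so a job with equal hold and match counts can be released without being eligible for resubmission.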
Now, having submitted 20 jobs, all but two have completed
successfully. These two jobs are in the idle state, but they seem to
match my GlobusResubmit ClassAd expression:
[root@ms-gavia-testing ~]# condor_q -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)"
-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:64055> : ms-gavia-testing.phys.UVic.CA
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2269.0 dev07 2/10 16:38 0+00:00:00 I 0 0.0 run-run.sh
2273.0 dev07 2/10 16:38 0+00:00:00 I 0 0.0 run-run.sh
Here is an example of the logs the Negotiator is printing while in
this state:
2/11 10:09:20 ---------- Started Negotiation Cycle ----------
2/11 10:09:20 Phase 1: Obtaining ads from collector ...
2/11 10:09:20 Getting all public ads ...
2/11 10:09:20 Sorting 8 ads ...
2/11 10:09:20 Getting startd private ads ...
2/11 10:09:20 Got ads: 8 public and 1 private
2/11 10:09:20 Public ads include 1 submitter, 4 startd
2/11 10:09:20 Phase 2: Performing accounting ...
2/11 10:09:20 Phase 3: Sorting submitter ads by priority ...
2/11 10:09:20 Phase 4.1: Negotiating with schedds ...
2/11 10:09:20 Negotiating with dev07@xxxxxxxxxxxx at <142.104.63.16:52155>
2/11 10:09:20 0 seconds so far
2/11 10:09:20 Got NO_MORE_JOBS; done negotiating
2/11 10:09:20 ---------- Finished Negotiation Cycle ----------
Why aren't these two jobs being rescheduled, and why does submitting
another job get them scheduled? I've also attached an example of a
full job description here: http://pastie.org/386200.txt
Any pointers would be very helpful.
--patrick
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.