
Re: [Condor-users] Problem with periodic_release and globus_resubmit




Patrick, we are doing almost the same thing here at FermiGrid.
For the jobs that are idle:

1) What does the UserLog say?
2) What does condor_q -ana say?
3) What does condor_q -globus say?

In our experience, GridJobStatus is never undefined for Grid universe
jobs. If it is undefined, something is very wrong. I just queried my
queue, which has more than 1000 jobs, and it is defined for all of
them. I don't see how the GlobusResubmit statement you have below
could ever be true.

Also, you mentioned a periodic_hold statement below, but you
did not actually show us what it was.

Steve Timm



On Thu, 12 Feb 2009, Patrick Armstrong wrote:

Hi there.

I've been using Condor to submit jobs to GT4 resources, and I'd like
Condor to resubmit jobs to a different resource when they fail. To
test this, I set up three Globus resources: one is deliberately
broken, so jobs sent there will always fail, and the other two are good.

I've been using the Condor-G documentation as a guide, and I've got it
working for the most part with a combination of periodic_release,
globus_resubmit, and lastmatchname, but one or two jobs always seem to
get stuck in the idle state. I can give the final job a nudge
by submitting another job.

My periodic_hold and globus_resubmit expressions are as follows:

	PeriodicRelease = (NumSystemHolds >= NumJobMatches) && (NumGlobusSubmits < 4) && (HoldReason != "via condor_hold (by user (USER))") && ((CurrentTime - EnteredCurrentStatus) > 60)

	GlobusResubmit = (GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)
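
For illustration only, here is a minimal Python sketch of how the GlobusResubmit expression above would evaluate against a job ad. It is a deliberately simplified model of ClassAd semantics, not Condor code: the UNDEFINED sentinel, the is_undefined and globus_resubmit helpers, and the sample attribute values are all hypothetical. It assumes only that `=?=` always yields true or false, that `>` with a missing operand yields UNDEFINED, and that false && anything is false.

```python
# Sentinel standing in for the ClassAd UNDEFINED value (illustrative only).
UNDEFINED = object()


def is_undefined(value):
    """Model of `value =?= UNDEFINED`: always True or False, never UNDEFINED."""
    return value is UNDEFINED


def globus_resubmit(ad):
    """Evaluate (GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches).

    `ad` is a plain dict standing in for a job ClassAd; attributes that are
    absent from the dict are treated as UNDEFINED.
    """
    if not is_undefined(ad.get("GridJobStatus", UNDEFINED)):
        return False  # false && anything is false

    holds = ad.get("NumSystemHolds", UNDEFINED)
    matches = ad.get("NumJobMatches", UNDEFINED)
    if holds is UNDEFINED or matches is UNDEFINED:
        return UNDEFINED  # `>` with a missing operand is UNDEFINED
    return holds > matches


# Hypothetical ads; the values are made up for the example.
no_status_ad = {"NumSystemHolds": 2, "NumJobMatches": 1}
with_status_ad = {"GridJobStatus": "ACTIVE", "NumSystemHolds": 2, "NumJobMatches": 1}

print(globus_resubmit(no_status_ad))    # True
print(globus_resubmit(with_status_ad))  # False
```

In this model the expression can only be true for an ad that has no GridJobStatus attribute at all, and that is also the only kind of job the condor_q -constraint query can select.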


Of the 20 jobs I submitted, all but two have completed
successfully. These two jobs are in the idle state, yet they seem to
satisfy my GlobusResubmit ClassAd expression:

[root@ms-gavia-testing ~]#  condor_q  -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)"

-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:64055> : ms-gavia-testing.phys.UVic.CA
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
2269.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh
2273.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh


Here is an example of the logs the Negotiator is printing while in
this state:

2/11 10:09:20 ---------- Started Negotiation Cycle ----------
2/11 10:09:20 Phase 1:  Obtaining ads from collector ...
2/11 10:09:20   Getting all public ads ...
2/11 10:09:20   Sorting 8 ads ...
2/11 10:09:20   Getting startd private ads ...
2/11 10:09:20 Got ads: 8 public and 1 private
2/11 10:09:20 Public ads include 1 submitter, 4 startd
2/11 10:09:20 Phase 2:  Performing accounting ...
2/11 10:09:20 Phase 3:  Sorting submitter ads by priority ...
2/11 10:09:20 Phase 4.1:  Negotiating with schedds ...
2/11 10:09:20   Negotiating with dev07@xxxxxxxxxxxx at <142.104.63.16:52155>
2/11 10:09:20 0 seconds so far
2/11 10:09:20     Got NO_MORE_JOBS;  done negotiating
2/11 10:09:20 ---------- Finished Negotiation Cycle ----------


Why aren't these two jobs being rescheduled, and why does submitting
another job get them scheduled? I've also attached an example of a
full job description here: http://pastie.org/386200.txt

Any pointers would be very helpful.

--patrick

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.