
[Condor-users] Problem with periodic_release and globus_resubmit



Hi there.

I've been using Condor to submit jobs to GT4 resources, and I'd like
Condor to resubmit jobs to a different resource when they fail. To
test this, I set up three Globus resources: one is deliberately
broken, so jobs sent there always fail, and the other two are good.

I've been using the Condor-G documentation as a guide, and I've got it
working for the most part with a combination of periodic_release,
globus_resubmit, and lastmatchname, but one or two jobs always seem to
get stuck in the idle state. I can nudge the stuck jobs along by
submitting another job.

My periodic_release and globus_resubmit expressions are as follows:

	PeriodicRelease = (NumSystemHolds >= NumJobMatches) &&
	    (NumGlobusSubmits < 4) &&
	    (HoldReason != "via condor_hold (by user (USER))") &&
	    ((CurrentTime - EnteredCurrentStatus) > 60)

	GlobusResubmit = (GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)
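
To make the intent of the two expressions concrete, here is a rough
Python model of them. It is only a sketch: the function names and the
sample values are made up for illustration, UNDEFINED is modeled as
None, and real ClassAd evaluation (undefined attributes, string
comparison rules) is more subtle than this.

```python
from typing import Optional

# Hold reason text copied from the PeriodicRelease expression above.
USER_HOLD_REASON = "via condor_hold (by user (USER))"


def periodic_release(num_system_holds: int, num_job_matches: int,
                     num_globus_submits: int, hold_reason: str,
                     seconds_in_status: int) -> bool:
    """Simplified model of the PeriodicRelease expression.

    seconds_in_status stands in for (CurrentTime - EnteredCurrentStatus).
    The string comparison here is exact, which is a simplification.
    """
    return (num_system_holds >= num_job_matches
            and num_globus_submits < 4
            and hold_reason != USER_HOLD_REASON
            and seconds_in_status > 60)


def globus_resubmit(grid_job_status: Optional[str],
                    num_system_holds: int,
                    num_job_matches: int) -> bool:
    """Simplified model of the GlobusResubmit expression.

    None plays the role of an UNDEFINED GridJobStatus.
    """
    return grid_job_status is None and num_system_holds > num_job_matches


if __name__ == "__main__":
    # Hypothetical attribute values, not taken from a real job ad.
    print(periodic_release(1, 1, 0, "some Globus error", 120))
    print(globus_resubmit(None, 2, 1))
    print(globus_resubmit(None, 1, 1))
```

Note that the two expressions compare NumSystemHolds with
NumJobMatches differently (>= in one, > in the other), so a job whose
two counters are equal can satisfy the release test without
satisfying the resubmit test.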


Of the 20 jobs I submitted, all but two have completed successfully.
Those two are in the idle state, even though they satisfy my
GlobusResubmit expression:

[root@ms-gavia-testing ~]#  condor_q  -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)"

-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:64055> : ms-gavia-testing.phys.UVic.CA
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
2269.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh
2273.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh
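
For looking at the individual attributes behind that constraint, the
full ad of a stuck job can be dumped with "condor_q -long 2269.0" and
the relevant lines picked out. The Python below is a small sketch of
that second step. The helper name is invented, and the sample text is
hypothetical, not real output from these jobs.

```python
def parse_long_ad(text: str) -> dict:
    """Parse 'Attr = value' lines, as printed by condor_q -long, into a dict.

    Values are kept as raw strings; lines without ' = ' are skipped.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


# Hypothetical excerpt of a job ad, for illustration only.
SAMPLE = """\
NumSystemHolds = 2
NumJobMatches = 1
NumGlobusSubmits = 1
JobStatus = 1
"""

ad = parse_long_ad(SAMPLE)

# Attributes referenced by the two expressions; a missing attribute is
# reported as UNDEFINED.
for attr in ("GridJobStatus", "NumSystemHolds", "NumJobMatches",
             "NumGlobusSubmits", "HoldReason"):
    print(attr, "=", ad.get(attr, "UNDEFINED"))
```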


Here is an example of the logs the Negotiator is printing while in
this state:

2/11 10:09:20 ---------- Started Negotiation Cycle ----------
2/11 10:09:20 Phase 1:  Obtaining ads from collector ...
2/11 10:09:20   Getting all public ads ...
2/11 10:09:20   Sorting 8 ads ...
2/11 10:09:20   Getting startd private ads ...
2/11 10:09:20 Got ads: 8 public and 1 private
2/11 10:09:20 Public ads include 1 submitter, 4 startd
2/11 10:09:20 Phase 2:  Performing accounting ...
2/11 10:09:20 Phase 3:  Sorting submitter ads by priority ...
2/11 10:09:20 Phase 4.1:  Negotiating with schedds ...
2/11 10:09:20   Negotiating with dev07@xxxxxxxxxxxx at <142.104.63.16:52155>
2/11 10:09:20 0 seconds so far
2/11 10:09:20     Got NO_MORE_JOBS;  done negotiating
2/11 10:09:20 ---------- Finished Negotiation Cycle ----------


Why aren't these two jobs being rescheduled, and why does submitting
another job get them scheduled? I've also attached an example of a
full job description here: http://pastie.org/386200.txt

Any pointers would be very helpful.

--patrick