
Re: [Condor-users] Negotiator problem? Jobs not assigned to idle machines.



Hi,

Setting NEGOTIATOR_CONSIDER_PREEMPTION = True seems to work. However, at
first jobs would begin to run, then some of them would get stuck as
"match but reject the job for unknown reasons" for about 15 minutes before
starting to run. Now jobs have been stuck for 2 hours. I've attached
SchedLog and NegotiatorLog excerpts below.

8/1 22:06:02       Rejected 93.0 malikr@xxxxxxxx <172.26.30.23:3179>: no match found

The line above is strange because the previous jobs had identical submit
files, apart from the file paths.

We are running Condor v6.8 on Windows XP machines.

condor_config

PREEMPT = False
PREEMPTION_REQUIREMENTS = False
RANK = 0
NEGOTIATOR_CONSIDER_PREEMPTION = True
CLAIM_WORKLIFE = 300


SchedLog

8/1 22:06:01 (pid:176) Activity on stashed negotiator socket
8/1 22:06:01 (pid:176) Negotiating for owner: malikr@xxxxxxxx
8/1 22:06:01 (pid:176) Checking consistency running and runnable jobs
8/1 22:06:01 (pid:176) Tables are consistent
8/1 22:06:01 (pid:176) Out of servers - 0 jobs matched, 7 jobs idle, 1 jobs rejected
8/1 22:06:01 (pid:176) Activity on stashed negotiator socket
8/1 22:06:01 (pid:176) Negotiating for owner: malikr@xxxxxxxx
8/1 22:06:01 (pid:176) Checking consistency running and runnable jobs
8/1 22:06:01 (pid:176) Tables are consistent
8/1 22:06:01 (pid:176) Out of servers - 0 jobs matched, 7 jobs idle, 1 jobs rejected


NegotiatorLog

8/1 22:06:01 ---------- Started Negotiation Cycle ----------
8/1 22:06:01 Phase 1:  Obtaining ads from collector ...
8/1 22:06:01   Getting all public ads ...
8/1 22:06:01 Trying to query collector <172.26.21.99:9618>
8/1 22:06:02   Sorting 99 ads ...
8/1 22:06:02   Getting startd private ads ...
8/1 22:06:02 Trying to query collector <172.26.21.99:9618>
8/1 22:06:02 Got ads: 99 public and 50 private
8/1 22:06:02 Public ads include 2 submitter, 50 startd
8/1 22:06:02 Entering compute_signficant_attrs()
8/1 22:06:02 Leaving compute_signficant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts
8/1 22:06:02 Phase 2:  Performing accounting ...
8/1 22:06:02 Phase 3:  Sorting submitter ads by priority ...
8/1 22:06:02 Phase 4.1:  Negotiating with schedds ...
8/1 22:06:02     NumStartdAds = 50
8/1 22:06:02     NormalFactor = 4.424248
8/1 22:06:02     MaxPrioValue = 3.453689
8/1 22:06:02     NumScheddAds = 2
8/1 22:06:02   Negotiating with agarwam@xxxxxxxx skipped because no idle jobs
8/1 22:06:02   Schedd agarwam@xxxxxxxx got all it wants; removing it.
8/1 22:06:02   Negotiating with malikr@xxxxxxxx at <172.26.30.23:3179>
8/1 22:06:02 0 seconds so far
8/1 22:06:02   Calculating schedd limit with the following parameters
8/1 22:06:02     ScheddPrio       = 3.453689
8/1 22:06:02     ScheddPrioFactor = 1.000000
8/1 22:06:02     scheddShare      = 0.226027
8/1 22:06:02     scheddAbsShare   = 0.500000
8/1 22:06:02     ScheddUsage      = 23
8/1 22:06:02     scheddLimit      = 0
8/1 22:06:02     MaxscheddLimit   = 27
8/1 22:06:02 Socket to <172.26.30.23:3179> already in cache, reusing
8/1 22:06:02     Over submitter resource limit (0) ... only consider startd ranks
8/1 22:06:02     Sending SEND_JOB_INFO/eom
8/1 22:06:02     Getting reply from schedd ...
8/1 22:06:02     Got JOB_INFO command; getting classad/eom
8/1 22:06:02     Request 00093.00000:
8/1 22:06:02       Rejected 93.0 malikr@xxxxxxxx <172.26.30.23:3179>: no match found
8/1 22:06:02     Sending SEND_JOB_INFO/eom
8/1 22:06:02     Getting reply from schedd ...
8/1 22:06:02     Got NO_MORE_JOBS;  done negotiating
8/1 22:06:02   This schedd hit its scheddlimit.
8/1 22:06:02 Phase 4.2:  Negotiating with schedds ...
8/1 22:06:02     NumStartdAds = 50
8/1 22:06:02     NormalFactor = 1.000000
8/1 22:06:02     MaxPrioValue = 3.453689
8/1 22:06:02     NumScheddAds = 1
8/1 22:06:02   Negotiating with malikr@xxxxxxxx at <172.26.30.23:3179>
8/1 22:06:02 0 seconds so far
8/1 22:06:02   Calculating schedd limit with the following parameters
8/1 22:06:02     ScheddPrio       = 3.453689
8/1 22:06:02     ScheddPrioFactor = 1.000000
8/1 22:06:02     scheddShare      = 1.000000
8/1 22:06:02     scheddAbsShare   = 1.000000
8/1 22:06:02     ScheddUsage      = 23
8/1 22:06:02     scheddLimit      = 27
8/1 22:06:02     MaxscheddLimit   = 27
8/1 22:06:02 Socket to <172.26.30.23:3179> already in cache, reusing
8/1 22:06:02     Sending SEND_JOB_INFO/eom
8/1 22:06:02     Getting reply from schedd ...
8/1 22:06:02     Got JOB_INFO command; getting classad/eom
8/1 22:06:02     Request 00093.00000:
8/1 22:06:02 Attempting to use cached MatchList: Failed (MatchList length: 0, Autocluster: 0, Schedd Name: malikr@xxxxxxxx, Schedd Address: <172.26.30.23:3179>)
8/1 22:06:02       Rejected 93.0 malikr@xxxxxxxx <172.26.30.23:3179>: no match found
8/1 22:06:02     Sending SEND_JOB_INFO/eom
8/1 22:06:02     Getting reply from schedd ...
8/1 22:06:02     Got NO_MORE_JOBS;  done negotiating
8/1 22:06:02   Schedd malikr@xxxxxxxx got all it wants; removing it.
8/1 22:06:02 ---------- Finished Negotiation Cycle ---------- 
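
To compare the two negotiation passes in the NegotiatorLog above, a short script can pull out the scheddLimit and any rejections per phase. This is only a sketch: the function name and regular expressions are hypothetical, it assumes wrapped log lines have been rejoined onto one line, and it recognizes only the line shapes that appear in this excerpt.

```python
import re

# Line shapes taken from the excerpt above; other Condor versions may log differently.
PHASE_RE = re.compile(r"Phase (\d+\.\d+):\s+Negotiating with schedds")
# \b keeps "MaxscheddLimit" from matching, since there is no word boundary inside it.
LIMIT_RE = re.compile(r"\bscheddLimit\s*=\s*(\d+)")
REJECT_RE = re.compile(r"Rejected (\d+\.\d+) \S+ <[^>]+>: (.+)$")


def summarize_negotiation(log_text):
    """Map each negotiation phase to its scheddLimit and rejected jobs."""
    summary = {}
    phase = None
    for line in log_text.splitlines():
        m = PHASE_RE.search(line)
        if m:
            phase = m.group(1)
            summary[phase] = {"schedd_limit": None, "rejected": []}
            continue
        if phase is None:
            continue  # ignore everything before the first "Phase 4.x" line
        m = LIMIT_RE.search(line)
        if m:
            summary[phase]["schedd_limit"] = int(m.group(1))
            continue
        m = REJECT_RE.search(line)
        if m:
            summary[phase]["rejected"].append((m.group(1), m.group(2).strip()))
    return summary


# Lines copied from the NegotiatorLog above, with the wrapped lines rejoined.
SAMPLE = """\
8/1 22:06:02 Phase 4.1:  Negotiating with schedds ...
8/1 22:06:02     scheddLimit      = 0
8/1 22:06:02     MaxscheddLimit   = 27
8/1 22:06:02       Rejected 93.0 malikr@xxxxxxxx <172.26.30.23:3179>: no match found
8/1 22:06:02 Phase 4.2:  Negotiating with schedds ...
8/1 22:06:02     scheddLimit      = 27
8/1 22:06:02     MaxscheddLimit   = 27
8/1 22:06:02       Rejected 93.0 malikr@xxxxxxxx <172.26.30.23:3179>: no match found
"""

if __name__ == "__main__":
    for phase, info in summarize_negotiation(SAMPLE).items():
        print(phase, info)
```

On this excerpt the summary shows job 93.0 rejected with "no match found" in both phases, even though Phase 4.2 has a scheddLimit of 27.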


Thanks,
Rick

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: Monday, July 31, 2006 10:44 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Negotiator problem? Jobs not assigned to
idle machines.

Rick,

It appears to me that there is a bug in the handling of
NEGOTIATOR_CONSIDER_PREEMPTION=False.  I recommend turning it back to
True (or just commenting it out) until we work this out.

--Dan
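
The scheddLimit values logged in this thread are consistent with a simple reading: the submitter's share of the startd ads the negotiator is considering, minus the slots that submitter already uses, floored at zero. That formula is an assumption inferred from the logged numbers, not taken from the Condor source, and the function name below is hypothetical. Under that reading, trimming the 19 claimed startd ads in the 7/29 log leaves 13 ads against a usage of 19, so the limit drops to 0 even though 13 machines are idle.

```python
def assumed_schedd_limit(schedd_share, num_startd_ads, schedd_usage):
    """Assumed formula (inferred from the logs, not from Condor source):
    share of the considered startd ads minus current usage, floored at 0."""
    return max(0, round(schedd_share * num_startd_ads) - schedd_usage)


# (scheddShare, NumStartdAds, ScheddUsage, logged scheddLimit) from the thread's logs.
LOGGED_CASES = [
    (1.000000, 13, 19, 0),   # 7/29, NEGOTIATOR_CONSIDER_PREEMPTION = False
    (0.226027, 50, 23, 0),   # 8/1, Phase 4.1
    (1.000000, 50, 23, 27),  # 8/1, Phase 4.2
]

if __name__ == "__main__":
    for share, ads, usage, logged in LOGGED_CASES:
        computed = assumed_schedd_limit(share, ads, usage)
        print(f"share={share} ads={ads} usage={usage} "
              f"computed={computed} logged={logged}")
```

The assumed formula reproduces all three logged scheddLimit values; it does not attempt to explain MaxscheddLimit.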

On Jul 29, 2006, at 12:33 PM, Rick Lan wrote:

> Hi all,
>
> We have v6.8 on Windows XP machines. Preemption is disabled (section
> 3.6.10.5 of the manual):
>
> PREEMPT = False
> PREEMPTION_REQUIREMENTS = False
> RANK = 0
> NEGOTIATOR_CONSIDER_PREEMPTION = False
>
> and
>
> CLAIM_WORKLIFE = 300
>
>
> We have 32 machines and submitted about 40 cpusoak (from v6.6) jobs,
> all from simulation@xxxxxxxxx. Only 19 of them run and the rest sit
> idle in the queue. Why do those jobs sit idle when no other users have
> jobs and there are idle machines? How do we fill up the pool?
>
> We have NEGOTIATOR_DEBUG = D_FULLDEBUG. Why does the NegotiatorLog
> (below) say that, of the 13 remaining idle startds, none is assigned
> to run simulation's jobs, i.e. "This schedd hit its scheddlimit."?
> condor_q -analyze says that the jobs "match but reject the job
> for unknown reasons".
>
>
> Thanks,
> Rick
>
>
>
> 7/29 10:18:01 ---------- Started Negotiation Cycle ----------
> 7/29 10:18:01 Phase 1:  Obtaining ads from collector ...
> 7/29 10:18:01   Getting all public ads ...
> 7/29 10:18:01 Trying to query collector <172.25.4.150:9618>
> 7/29 10:18:01   Sorting 70 ads ...
> 7/29 10:18:01   Getting startd private ads ...
> 7/29 10:18:01 Trying to query collector <172.25.4.150:9618>
> 7/29 10:18:01 Got ads: 70 public and 32 private
> 7/29 10:18:01 Public ads include 1 submitter, 32 startd
> 7/29 10:18:01 Entering compute_signficant_attrs()
> 7/29 10:18:01 Leaving compute_signficant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts
> 7/29 10:18:01 Phase 2:  Performing accounting ...
> 7/29 10:18:01 Trimmed out 19 startd ads not Unclaimed
> 7/29 10:18:01 Phase 3:  Sorting submitter ads by priority ...
> 7/29 10:18:01 Phase 4.1:  Negotiating with schedds ...
> 7/29 10:18:01     NumStartdAds = 13
> 7/29 10:18:01     NormalFactor = 1.000000
> 7/29 10:18:01     MaxPrioValue = 0.990605
> 7/29 10:18:01     NumScheddAds = 1
> 7/29 10:18:01   Negotiating with simulation@xxxxxxxx at <172.25.4.150:4557>
> 7/29 10:18:01 0 seconds so far
> 7/29 10:18:01   Calculating schedd limit with the following parameters
> 7/29 10:18:01     ScheddPrio       = 0.990605
> 7/29 10:18:01     ScheddPrioFactor = 1.000000
> 7/29 10:18:01     scheddShare      = 1.000000
> 7/29 10:18:01     scheddAbsShare   = 1.000000
> 7/29 10:18:01     ScheddUsage      = 19
> 7/29 10:18:01     scheddLimit      = 0
> 7/29 10:18:01     MaxscheddLimit   = 0
> 7/29 10:18:01 Socket to <172.25.4.150:4557> already in cache, reusing
> 7/29 10:18:01     Reached submitter resource limit: 0 ... stopping
> 7/29 10:18:01   This schedd hit its scheddlimit.
> 7/29 10:18:01 ---------- Finished Negotiation Cycle ----------
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at either
> https://lists.cs.wisc.edu/archive/condor-users/
> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
