
Re: [Condor-users] job evicted



negotiator.log

-----------
3/31 16:21:08       Rejected 3511.21 micro01@clap64 <10.143.64.1:50415>: insufficient priority
3/31 16:21:08     Got NO_MORE_JOBS;  done negotiating
3/31 16:21:08 ---------- Finished Negotiation Cycle ----------
3/31 16:22:05 ---------- Started Negotiation Cycle ----------
3/31 16:22:05 Phase 1:  Obtaining ads from collector ...
3/31 16:22:05   Getting all public ads ...
3/31 16:22:05   Sorting 132 ads ...
3/31 16:22:05   Getting startd private ads ...
3/31 16:22:05 Got ads: 132 public and 89 private
3/31 16:22:05 Public ads include 3 submitter, 89 startd
3/31 16:22:05 Phase 2:  Performing accounting ...
3/31 16:22:05 Phase 3:  Sorting submitter ads by priority ...
3/31 16:22:05 Phase 4.1:  Negotiating with schedds ...
3/31 16:22:05   Negotiating with micro01@clap64 at <10.143.64.1:50415>
3/31 16:22:05 0 seconds so far
3/31 16:22:05     Request 03512.00000:
3/31 16:22:05       Matched 3512.0 micro01@clap64 <10.143.64.1:50415> preempting none <10.143.65.1:55186> slot1@xxxxxxxxxxxxxx
3/31 16:22:05       Successfully matched with slot1@xxxxxxxxxxxxxx
3/31 16:22:05     Got NO_MORE_JOBS;  done negotiating
3/31 16:22:05   Negotiating with lyrasce01@clap64 at <10.143.64.1:50415>
3/31 16:22:05 0 seconds so far
3/31 16:22:05     Request 03388.00000:
3/31 16:22:05       Matched 3388.0 lyrasce01@clap64 <10.143.64.1:50415> preempting none <10.143.65.3:57033> slot1@xxxxxxxxxxxxx
3/31 16:22:05       Successfully matched with slot1@xxxxxxxxxxxxx
3/31 16:22:05     Request 03389.00000:
3/31 16:22:05       Rejected 3389.0 lyrasce01@clap64 <10.143.64.1:50415>: no match found
3/31 16:22:05     Got NO_MORE_JOBS;  done negotiating
3/31 16:22:05 ---------- Finished Negotiation Cycle ----------
3/31 16:22:34 ---------- Started Negotiation Cycle ----------
3/31 16:22:34 Phase 1:  Obtaining ads from collector ...

---------------
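
Both `Matched` lines in the negotiator excerpt above read `preempting none`, so in that cycle the negotiator did not take a slot away from a running job. To check a longer NegotiatorLog around the time of an eviction, a small Python sketch like the one below can pull out the `Matched` lines and flag any whose `preempting` field is not `none`. It assumes the line layout shown in the excerpt (other Condor versions may format it differently). The function names `find_matches`, `preemptions` and `report` are hypothetical helpers, not part of Condor.

```python
import re
from typing import NamedTuple

# Assumed layout, taken from the excerpt above:
# "Matched <job> <submitter> <schedd-addr> preempting <victim> <startd-addr> <slot>"
MATCH_RE = re.compile(
    r"Matched (?P<job>\S+) (?P<submitter>\S+) <[^>]+> "
    r"preempting (?P<victim>\S+) <(?P<startd>[^>]+)> (?P<slot>\S+)"
)


class Match(NamedTuple):
    job: str        # cluster.proc of the matched job
    submitter: str  # user@domain that owns the job
    victim: str     # "none" when no existing claim is displaced
    startd: str     # address of the execute machine
    slot: str       # slot name on that machine


def find_matches(lines):
    """Parse every 'Matched' line into a Match record."""
    found = []
    for line in lines:
        m = MATCH_RE.search(line)
        if m:
            found.append(Match(m["job"], m["submitter"], m["victim"], m["startd"], m["slot"]))
    return found


def preemptions(lines):
    """Keep only the matches whose 'preempting' field is not 'none'."""
    return [m for m in find_matches(lines) if m.victim != "none"]


def report(path):
    """Print one line per preempting match found in the log file at `path`."""
    with open(path) as log:
        for m in preemptions(log):
            print(f"{m.job} ({m.submitter}) preempted {m.victim} on {m.slot} at {m.startd}")
```

Usage would be something like `report("/var/log/condor/NegotiatorLog")`; that path is only an example, since the real location depends on the `LOG` setting of the central manager.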

startd log


-------------
3/31 16:21:28 slot4: State change: IS_OWNER is false
3/31 16:21:28 slot4: Changing state: Owner -> Unclaimed
3/31 16:22:05 slot1: match_info called
3/31 16:22:05 slot1: Received match <10.143.65.3:57033>#1267466586#520#...
3/31 16:22:06 slot1: State change: match notification protocol successful
3/31 16:22:06 slot1: Changing state: Unclaimed -> Matched
3/31 16:22:06 slot1: Request accepted.
3/31 16:22:06 slot1: Remote owner is lyrasce01@clap64
3/31 16:22:06 slot1: State change: claiming protocol successful
3/31 16:22:06 slot1: Changing state: Matched -> Claimed
3/31 16:22:06 slot1: Got activate_claim request from shadow (<10.143.64.1:36432>)
3/31 16:22:06 slot1: Remote job ID is 3388.0
3/31 16:22:06 slot1: Got universe "VANILLA" (5) from request classad
3/31 16:22:06 slot1: State change: claim-activation protocol successful
3/31 16:22:06 slot1: Changing activity: Idle -> Busy
3/31 16:26:20 slot2: match_info called
3/31 16:26:20 slot2: Received match <10.143.65.3:57033>#1267466586#522#...
3/31 16:26:20 slot2: State change: match notification protocol successful
3/31 16:26:20 slot2: Changing state: Unclaimed -> Matched
3/31 16:26:20 slot3: match_info called
3/31 16:26:20 slot3: Received match <10.143.65.3:57033>#1267466586#519#...
--------------
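
In the startd excerpt above, slot1 goes Unclaimed -> Matched -> Claimed and starts job 3388.0 without any sign of eviction. When scanning a full StartLog on the execute machine, it can help to collect each slot's `Changing state` and `Changing activity` lines and report any move into the `Preempting` state, which is the state a slot passes through while its job is being vacated or killed. The sketch below assumes the line format shown in the excerpt; `slot_history` and `preempting_events` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed format, e.g. "3/31 16:22:06 slot1: Changing state: Unclaimed -> Matched"
CHANGE_RE = re.compile(
    r"^(?P<stamp>\S+ \S+) (?P<slot>slot\w+): Changing (?P<kind>state|activity): "
    r"(?P<old>\w+) -> (?P<new>\w+)"
)


def slot_history(lines):
    """Map each slot to its list of (timestamp, kind, old, new) transitions."""
    history = defaultdict(list)
    for line in lines:
        m = CHANGE_RE.match(line.strip())
        if m:
            history[m["slot"]].append((m["stamp"], m["kind"], m["old"], m["new"]))
    return dict(history)


def preempting_events(history):
    """Return (slot, timestamp) for every transition into the Preempting state."""
    return [
        (slot, stamp)
        for slot, events in history.items()
        for stamp, kind, _old, new in events
        if kind == "state" and new == "Preempting"
    ]
```

Lines that are not state or activity changes (such as `Remote job ID is 3388.0`) are simply skipped, so the whole log can be fed in unfiltered.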



On Mar 24, 2010, at 12:33 PM, Matt Hope wrote:

> Do you have a machine-based RANK? That will trigger preemption regardless (the only recourse is to use retirement time to keep it from kicking the job).
> 
> You should look in the negotiator logs for the reason; they should tell you why. If they don't, the startd logs on the executing machine will help.
> 
> Matt
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nicola Frignani
> Sent: 24 March 2010 08:21
> To: condor-users@xxxxxxxxxxx
> Cc: Antonella Cirigliano
> Subject: [Condor-users] job evicted
> 
> I configured my cluster as follows:
> 
> WANT_SUSPEND 		= FALSE
> SUSPEND			= FALSE
> PREEMPT			= FALSE
> KILL			= FALSE
> 
> so I assumed that once a job starts on a machine it will never be evicted and will run until it ends normally.
> Most jobs run without problems, but in one specific case Condor produced the following log file:
> 
> 
> 006 (3300.000.000) 03/22 11:57:48 Image size of job updated: 22067772
> ...
> 004 (3300.000.000) 03/23 14:41:18 Job was evicted.
> 	(0) Job was not checkpointed.
> 		Usr 11 12:02:45, Sys 1 05:47:21  -  Run Remote Usage
> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
> 	0  -  Run Bytes Sent By Job
> 	0  -  Run Bytes Received By Job
> ...
> 001 (3300.000.000) 03/23 14:51:15 Job executing on host: <10.143.65.2:39003>
> 
> 
> What should I look for to find the problem?
> 
> Thank you in advance
> 
> Nicola
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
> 
> 
> 
> 

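Condor's user log identifies each event by a three-digit code; in the log quoted above, `006` is an image-size update, `004` is an eviction and `001` is the job starting to execute. For a job that runs for days, a small sketch like the one below can list every eviction with its timestamp, so the matching time window can then be looked up in the NegotiatorLog and in the execute machine's StartLog. It assumes the event header layout shown in the quoted log; `events` and `evictions` are hypothetical helper names.

```python
import re

# Assumed header layout, e.g. "004 (3300.000.000) 03/23 14:41:18 Job was evicted."
HEADER_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) (?P<text>.*)$"
)

EVICTED = 4  # user log event code for "Job was evicted."


def events(lines):
    """Yield (code, job_id, timestamp, text) for each event header line."""
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            job = f"{int(m['cluster'])}.{int(m['proc'])}"
            yield int(m["code"]), job, m["stamp"], m["text"].strip()


def evictions(lines):
    """Return (job_id, timestamp) for every eviction event in a user log."""
    return [(job, stamp) for code, job, stamp, _ in events(lines) if code == EVICTED]
```

Detail lines such as `(0) Job was not checkpointed.` and the `...` event separators do not match the header pattern and are ignored.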
--
Nicola Frignani
nfrignani@xxxxxxxxxxxxxx
tel. +39 051 2095432
fax. +39 051 2095410
mob. +39 335 6204576

