Re: [Condor-users] Tracing why nodes reject jobs?



Try setting SCHEDD_DEBUG, then watch the schedd log closely for the
job's cluster.proc ID (e.g. 78745.005). Also check whether you have
preemption turned on.






http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#param:SubsysDebug
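
As a rough sketch of what that could look like (an illustration, not a
tested recipe for your pool): the fragment below assumes you edit the
local config file on the submit machine and that D_FULLDEBUG is the
verbosity level you want. The file location is an assumption, so adjust
it for your install.

```
# Local Condor config on the submit machine (location is an assumption).
# Raise schedd logging to full debug output.
SCHEDD_DEBUG = D_FULLDEBUG

# After saving, run condor_reconfig so the schedd picks up the change.
# "condor_config_val SCHEDD_LOG" prints the path of the log to watch;
# search it for the job ID, e.g. 78745.005.
# "condor_config_val PREEMPT" shows the preemption expression on a node.
```

Full debug logging is noisy, so it is probably worth switching it back
once you have captured a negotiation cycle for the stuck jobs.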



On Fri, Jul 16, 2010 at 10:03 AM, Jonathan D. Proulx <jon@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 16, 2010 at 11:10:03AM +0100, Ian Cottam wrote:
> :Is it memory? Try adding the Requirement Memory > 0
> :and also
> :Rank = Memory
>
> Nope, memory was my first guess too, but the job size is small enough
> to fit on any of our nodes and some have 8x enough for them...
>
> -Jon
>
> :
> :-Ian
> :
> :[currently out of office]
> :
> :On 15 Jul 2010, at 16:52, "Jonathan D. Proulx" <jon@xxxxxxxxxxxxx> wrote:
> :
> :> Hi All,
> :>
> :> I have a user who queued a couple hundred identical Standard Universe
> :> jobs (well the parameters were a little different but the class ads
> :> were the same), most completed but 15 are hanging around in idle state
> :> after having accumulated some runtime, but will no longer match any
> :> execute nodes:
> :>
> :> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> :> ---
> :> 78745.005:  Run analysis summary.  Of 429 machines,
> :>      19 are rejected by your job's requirements
> :>     410 reject your job because of their own requirements
> :>       0 match but are serving users with a better priority in the pool
> :>       0 match but reject the job for unknown reasons
> :>       0 match but will not currently preempt their existing job
> :>       0 match but are currently offline
> :>       0 are available to run your job
> :>     Last successful match: Tue Jul  6 17:33:30 2010
> :>     Last failed match: Thu Jul 15 11:46:30 2010
> :>     Reason for last match failure: no match found
> :>
> :> The 19 rejected for Job requirements are clear (wrong ARCH), the 410
> :> for node requirements is odd in several ways:
> :>
> :> 1) there are 410 total systems available and 344 are currently claimed
> :> so I'd expect those to be either "match but are serving users with a
> :> better priority in the pool" or "match but will not currently preempt
> :> their existing job"
> :>
> :> 2) clearly they used to match some or the job wouldn't have had runtime
> :>
> :> Where/how can I see why a specific node rejects a given job?
> :>
> :> -Jon
> :> _______________________________________________
> :> Condor-users mailing list
> :> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> :> subject: Unsubscribe
> :> You can also unsubscribe by visiting
> :> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> :>
> :> The archives can be found at:
> :> https://lists.cs.wisc.edu/archive/condor-users/