
Re: [Condor-users] Why does machine reject job for unknown reasons



Hi,


On 5/15/07, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
Alex,

What does your requirements expression for your jobs look like? And your ImageSize and DiskUsage values?


Here is the output of condor_q; I do not set any special requirements:
> condor_q -better-analyze 1082626.0


-- Submitter: XXXX : <192.168.101.214:32776> : XXXX
---
1082626.000:  Run analysis summary.  Of 152 machines,
      2 are rejected by your job's requirements
      0 reject your job because of their own requirements
      8 match but are serving users with a better priority in the pool
    142 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue May 15 16:13:43 2007
        Reason for last match failure: no match found

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( target.Disk >= 10000 )          150
2   ( target.Arch == "X86_64" )       152
3   ( target.OpSys == "LINUX" )       152
4   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )
                                      152
5   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )
                                      152
6   ( ( 1024 * target.Memory ) >= 10000 )
                                      152

So the requirements look fine to me...

Cheers
  Alex
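
The counts in the summary above can be tallied mechanically. The following is a minimal Python sketch, not part of Condor: it assumes the summary text is pasted in as shown in this message, and the function name parse_summary is made up for illustration. It only confirms that the categories add up to the 152 machines and reports how many fall under "unknown reasons".

```python
import re

# Summary text copied from the condor_q -better-analyze output above.
ANALYSIS = """\
1082626.000:  Run analysis summary.  Of 152 machines,
      2 are rejected by your job's requirements
      0 reject your job because of their own requirements
      8 match but are serving users with a better priority in the pool
    142 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {category: count}) from a run analysis summary."""
    total = None
    counts = {}
    for line in text.splitlines():
        header = re.search(r"Of (\d+) machines", line)
        if header:
            total = int(header.group(1))
            continue
        # Category rows look like "<count> <description>".
        row = re.match(r"\s*(\d+)\s+(\S.*)$", line)
        if row:
            counts[row.group(2).strip()] = int(row.group(1))
    return total, counts


total, counts = parse_summary(ANALYSIS)
unknown = counts["match but reject the job for unknown reasons"]
accounted = sum(counts.values())
print(f"{unknown} of {total} machines match but reject for unknown reasons")
print(f"{accounted} of {total} machines accounted for by the categories")
```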

If you run a condor_status -const "<blah>" command, where <blah> is replaced with your job's requirements expression, does it match more machines than you're getting?
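
One way to build such a command is sketched below in Python. This is an illustration under stated assumptions, not a verified recipe: it assumes condor_status evaluates the constraint against each machine ad, so the target. prefix is dropped and the job-side attributes DiskUsage and ImageSize are replaced by the literal value 10000 that the analysis output reports. The two checkpoint clauses are left out because they matched all 152 machines anyway. The function name build_status_command is hypothetical.

```python
import shlex


def build_status_command(arch, opsys, disk_usage_kb, image_size_kb):
    """Return an argv list for condor_status with a machine-side constraint.

    Assumption: attribute names are written without the "target." prefix
    because the constraint is evaluated against the machine ad itself.
    """
    constraint = (
        f'Arch == "{arch}" && OpSys == "{opsys}" && '
        f"Disk >= {disk_usage_kb} && (Memory * 1024) >= {image_size_kb}"
    )
    return ["condor_status", "-constraint", constraint]


# Values taken from the analysis output quoted in this thread.
argv = build_status_command("X86_64", "LINUX", 10000, 10000)

# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(argv))
```

If the number of machines listed by the resulting command is close to 150, the requirements expression itself is unlikely to be what keeps the jobs idle.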

 

- Ian

On 5/15/07, Alexander Dietz <Alexander.Dietz@xxxxxxxxxxxxxx > wrote:
Hi,

I have finally located the log files, and here are two lines from the SchedLog:

5/15 15:45:11 Job 1082422.0 rejected: no match found
5/15 15:45:11 Out of servers - 0 jobs matched, 127 jobs idle, 1 jobs rejected

So it looks like no machine is able to run the job. BUT when running condor_status I get the following output:

              Total Owner Claimed Unclaimed Matched Preempting Backfill

 X86_64/LINUX   152     0      10       142       0          0        0

        Total   152     0      10       142       0          0        0

There are 142 unclaimed machines in this table! I found out that the cluster is almost full of idle jobs. So when new jobs are submitted, do the jobs that are already idle have higher priority?
But the same problem (jobs not being executed) persists even when every other job is held. Shouldn't the jobs then start almost immediately? Also, the SchedLog is not updated for several minutes, and I still have no idea why these jobs are not starting...


Cheers
  Alex




On 5/15/07, Kewley, J (John) < j.kewley@xxxxxxxx> wrote:
Apart from the log file referred to in the submit file (which isn't the one I was meaning),
you need to look in (depending on configuration of course):
 
$CONDOR_CONFIG
 
and look for LOCAL_DIR
 
this may be redefined in LOCAL_CONFIG_FILE, again in $CONDOR_CONFIG
 
this should be something along the lines of
 
$CONDOR_CONFIG/local.${HOSTNAME}
 
and then look in the log directory
 
JK
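
The lookup described above amounts to reading LOCAL_DIR and LOG from the configuration and expanding the $(NAME) macros. Below is a minimal Python sketch of that expansion. The paths in SAMPLE_CONFIG are hypothetical and only for illustration, and the parser handles plain NAME = value lines only. On a real installation, running condor_config_val LOG should print the resolved log directory directly.

```python
import re

# Hypothetical configuration values, for illustration only.
SAMPLE_CONFIG = """\
RELEASE_DIR = /opt/condor
LOCAL_DIR = $(RELEASE_DIR)/local.node1
LOG = $(LOCAL_DIR)/log
"""

MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def parse_config(text):
    """Collect NAME = value pairs; a later definition overrides an earlier one."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def expand(value, settings, max_depth=10):
    """Replace $(NAME) references repeatedly until nothing changes."""
    for _ in range(max_depth):
        expanded = MACRO.sub(
            lambda m: settings.get(m.group(1).upper(), m.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value


settings = parse_config(SAMPLE_CONFIG)
log_dir = expand(settings["LOG"], settings)
print(log_dir)
```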
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Alexander Dietz
Sent: Tuesday, May 15, 2007 3:29 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Why does machine reject job for unknown reasons

Hi,

thanks for the quick reply, but this problem is not related to network issues; I am submitting the DAG directly on the condor pool. The jobs eventually get executed on the pool, but the wait until this happens can be very long (on the order of hours).

Also, where can I find the log files to see what's going on?

Thanks
  Alex

On 5/15/07, Johan Bengtsson < johan.bengtsson@xxxxxxxxxxxxx> wrote:
On tis, 2007-05-15 at 14:53 +0100, Alexander Dietz wrote:
> Hi,
>
> sorry to bother you again with my question, but this problem still
> persists. So far I have received no suggestions on how to find out why
> condor-jobs are rejected ...

Hi Alex,
Have you checked that both forward and reverse name resolution work for
the machines in your cluster? I think that every time this problem has
occurred in my pool, name resolution has been the cause.

        / Johan
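
A check of this kind can be sketched in Python with the standard socket module. This is a minimal sketch, not a Condor tool: the function name check_name_resolution is made up, and the forward and reverse parameters exist only so that the lookups can be swapped out for testing. It resolves a host name to an address, resolves that address back to a name, and reports whether the two agree.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward plus reverse lookup of hostname."""
    try:
        address = forward(hostname)
    except OSError as exc:
        return False, f"forward lookup failed for {hostname}: {exc}"
    try:
        # gethostbyaddr returns (primary name, alias list, address list).
        reverse_name, aliases, _addresses = reverse(address)
    except OSError as exc:
        return False, f"reverse lookup failed for {address}: {exc}"
    known_names = {reverse_name.lower(), *(alias.lower() for alias in aliases)}
    if hostname.lower() not in known_names:
        return False, f"{hostname} -> {address} -> {reverse_name} (mismatch)"
    return True, f"{hostname} -> {address} -> {reverse_name}"
```

Calling check_name_resolution once for the submit machine and once for each execute node, from both sides, would show whether any host fails to resolve in either direction.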


> Cheers
>   Alex
>
> On 5/14/07, Alexander Dietz <Alexander.Dietz@xxxxxxxxxxxxxx> wrote:
>         Hi,
>
>         thanks for this suggestion, but the output really does not
>         help me further (see below). It looks like 150 machines
>         are able to run the jobs, but the jobs are still rejected for
>         unknown reasons! I need them to start immediately because of a
>         time-limited online demonstration of the work I am doing.
>         Any other suggestions?
>
>         Cheers
>           Alex
>
>         > condor_q -better-analyze 1082109.0
>
>         1082109.000:  Run analysis summary.  Of 152 machines,
>               2 are rejected by your job's requirements
>               0 reject your job because of their own requirements
>               0 match but are serving users with a better priority in the pool
>             150 match but reject the job for unknown reasons
>               0 match but will not currently preempt their existing job
>               0 are available to run your job
>
>         The Requirements expression for your job is:
>
>         ( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
>         ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
>         ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
>         ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
>
>             Condition                         Machines Matched    Suggestion
>             ---------                         ----------------    ----------
>         1   ( target.Disk >= 10000 )          150
>         2   ( target.Arch == "X86_64" )       152
>         3   ( target.OpSys == "LINUX" )       152
>         4   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )
>                                               152
>         5   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )
>                                               152
>         6   ( ( 1024 * target.Memory ) >= 10000 )
>                                               152
>
>
>
>
>
>         On 5/14/07, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
>
>
>                 On 5/14/07, Alexander Dietz
>                 <Alexander.Dietz@xxxxxxxxxxxxxx> wrote:
>                         Hi,
>
>                         I have a problem when submitting a DAG to
>                         condor; before any of the jobs get executed
>                         they are rejected for unknown reasons, as
>                         the following messages suggest:
>
>                         > condor_q -analyze 1076700.0
>
>                 Alex,
>
>                 If you're running 6.8.x on Linux you can use the
>                 -better-analyze option which is infinitely more
>                 helpful than -analyze:
>
>                 condor_q -better-analyze 1076700.0
>
>                 - Ian
>
>
>
>
>                 _______________________________________________
>                 Condor-users mailing list
>                 To unsubscribe, send a message to
>                 condor-users-request@xxxxxxxxxxx with a
>                 subject: Unsubscribe
>                 You can also unsubscribe by visiting
>                 https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>                 The archives can be found at either
>                 https://lists.cs.wisc.edu/archive/condor-users/
>                 http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>
>