
Re: [Condor-users] my parallel universe error , 4 match but reject the job for unknown reasons



Dear All,


With special thanks to Steve Timm for his attention,

I restarted both mpi0 and mpi1 and removed their leftover jobs.
Then I submitted exactly the same parallel job on mpi1 again.
Below you can see the output of
condor_q -better -analyze
and
condor_q -l -ana
respectively.
The first one seems to give no additional information, but in the second
one the line

  "vm4@xxxxxxxxxxx Failed rank condition: MY.Rank > MY.CurrentRank"

looks strange.
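
As a rough illustration of how a comparison like MY.Rank > MY.CurrentRank can come out false, here is a minimal Python sketch. It rests on assumptions that are not confirmed anywhere in this message: that the machines use the RANK expression the Condor manual suggests for dedicated resources (Scheduler =?= $(DedicatedScheduler)), that an unclaimed machine has a CurrentRank of 0.0, and that the job's Scheduler attribute names the schedd it was submitted to. The host names are placeholders, and the helper functions are hypothetical, not part of Condor.

```python
# Toy model of the comparison MY.Rank > MY.CurrentRank on a machine ad.
# Assumptions (not taken from the logs in this message):
#   - the machine's RANK is  Scheduler =?= $(DedicatedScheduler)
#   - an unclaimed machine has CurrentRank 0.0
#   - the host names below are placeholders


def machine_rank(job_ad, machine_ad):
    """Evaluate (Scheduler =?= DedicatedScheduler) as 1.0 or 0.0."""
    # dict.get returns None for a missing key, which stands in for UNDEFINED.
    # =?= never yields UNDEFINED: two UNDEFINED values compare as equal,
    # UNDEFINED against a string compares as not equal.
    same = job_ad.get("Scheduler") == machine_ad.get("DedicatedScheduler")
    return 1.0 if same else 0.0


def passes_rank_condition(job_ad, machine_ad):
    """Return True when the machine ranks this job above its current rank."""
    return machine_rank(job_ad, machine_ad) > machine_ad.get("CurrentRank", 0.0)


machine = {
    "DedicatedScheduler": "DedicatedScheduler@mpi0.example.org",
    "CurrentRank": 0.0,
}
job_via_mpi0 = {"Scheduler": "DedicatedScheduler@mpi0.example.org"}
job_via_mpi1 = {"Scheduler": "DedicatedScheduler@mpi1.example.org"}

print(passes_rank_condition(job_via_mpi0, machine))  # True
print(passes_rank_condition(job_via_mpi1, machine))  # False
```

Under these assumptions the check only succeeds when the job's Scheduler value is identical to the DedicatedScheduler string the machine advertises, so comparing those two strings in the real ads may be a useful first step.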

In addition, I have attached the last part of all my log files from mpi0
and mpi1 for more information.
___________________________________________________________________________


mpi1:condor:~/log> condor_q -better -analyze


-- Submitter: mpi1.y.y.y : <x.x.x.55:49320> : mpi1.y.y.y
---
013.000:  Run analysis summary.  Of 19 machines,
     15 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

The Requirements expression for your job is:

( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.FileSystemDomain == "mpi1.y.y.y" )
                                      4
2   ( target.OpSys == "LINUX" )       14
3   ( target.Arch == "INTEL" )        19
4   ( target.Disk >= 10000 )          19
5   ( ( 1024 * target.Memory ) >= 10000 )19


___________________________________________________________________________

mpi1:condor:~/log> q -l -ana


-- Submitter: mpi1.y.y.y : <x.x.x.55:49320> : mpi1.y.y.y
vm4@xxxxxxxxxxx Failed rank condition: MY.Rank > MY.CurrentRank
---
013.000:  Run analysis summary.  Of 19 machines,
     15 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job


Regards,
Arash


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: Monday, January 14, 2008 6:03 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] my parallel universe error , 4 match but reject
the job for unknown reasons

Hi Arash--
condor_q -better-analyze
may tell you more information,
so will
condor_q -l -ana

You say that you set up mpi0 as the dedicated scheduler and
mpi0 and mpi1 as dedicated resources. What is the value of the START
macro for those two machines, and what is the value of the
Requirements for your job? Are you sure that the DedicatedScheduler
attribute is in your job classad?
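
One way to check for attributes like these is to save the output of condor_q -l for the job and scan it. The following is a minimal Python sketch, assuming the long listing consists of "Name = value" lines; the helper names and the sample text are hypothetical and are not real output from this pool.

```python
def parse_long_ad(text):
    """Parse 'Attr = value' lines (condor_q -l style) into a dict of strings."""
    ad = {}
    for line in text.splitlines():
        # Split on the first " = " only, so "==" inside expressions is untouched.
        name, sep, value = line.partition(" = ")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def missing_attributes(ad, wanted):
    """Return the wanted attribute names absent from the ad (case-insensitive)."""
    present = {name.lower() for name in ad}
    return [name for name in wanted if name.lower() not in present]


# Invented sample text, standing in for a saved `condor_q -l` listing.
sample = """Cmd = "/bin/sleep"
Args = "30"
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")
"""

ad = parse_long_ad(sample)
print(ad["Requirements"])
print(missing_attributes(ad, ["Requirements", "DedicatedScheduler"]))
```

Attribute names are compared without regard to case because ClassAd attribute names are case-insensitive; values are kept as raw strings, quotes included.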
Steve Timm

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group
Leader.

On Mon, 14 Jan 2008, Arash noorghorbani wrote:

> Dear All,
>
> I am a new user of Condor. I have two quad-core computers
> with Ubuntu Linux 7.10 and Condor 6.8.8, and I am trying to add them to a
> Condor pool to run parallel jobs. I set both of them up as dedicated
> resources (called mpi0 and mpi1), and mpi0 is, in addition, the dedicated
> scheduler. (Our last Condor pool has no dedicated scheduler.)
>
> But parallel jobs only run on the dedicated scheduler (mpi0),
> and on mpi1 I get the error:
>
> "4 match but reject the job for unknown reasons"
>
> I think this problem may appear because my scheduler
> is a quad-core machine, but I don't know how to fix it.
>
> Below you can see some details of one of my attempts:
>
> submitted file:
> _________________________________________________________
> universe = parallel
> executable =/bin/sleep
> arguments = 30
> machine_count = 3
> log    = logfile
> error  = err
> queue
> ________________________________________________________
>
>
> mpi1@.....$ condor_q -analyze
>
> -- Submitter: mpi1.x.x.x : <x.x.x.x:46536> : mpi1.x.x.x
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 008.000:  Run analysis summary.  Of 25 machines,
>     21 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match but are serving users with a better priority in the pool
>      4 match but reject the job for unknown reasons
>      0 match but will not currently preempt their existing job
>      0 are available to run your job
>
> 1 jobs; 1 idle, 0 running, 0 held
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>

Logs from the host whose command sockets are at x.x.x.8 (presumably mpi0):

StartLog:

1/15 12:17:24 ******************************************************
1/15 12:17:25 ** condor_startd (CONDOR_STARTD) STARTING UP
1/15 12:17:25 ** /usr/local/condor/sbin/condor_startd
1/15 12:17:25 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:17:25 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:17:25 ** PID = 6610
1/15 12:17:25 ** Log last touched 1/14 11:08:19
1/15 12:17:25 ******************************************************
1/15 12:17:25 Using config source: /home/condor/condor_config
1/15 12:17:25 Using local config sources:
1/15 12:17:25    /home/condor/condor_config.local
1/15 12:17:25 DaemonCore: Command Socket at <x.x.x.8:46053>
1/15 12:17:36 vm1: New machine resource allocated
1/15 12:17:36 vm2: New machine resource allocated
1/15 12:17:36 vm3: New machine resource allocated
1/15 12:17:36 vm4: New machine resource allocated
1/15 12:17:36 About to run initial benchmarks.
1/15 12:17:43 Completed initial benchmarks.
1/15 12:17:43 vm1: State change: IS_OWNER is false
1/15 12:17:43 vm1: Changing state: Owner -> Unclaimed
1/15 12:17:43 vm2: State change: IS_OWNER is false
1/15 12:17:43 vm2: Changing state: Owner -> Unclaimed
1/15 12:17:43 vm3: State change: IS_OWNER is false
1/15 12:17:43 vm3: Changing state: Owner -> Unclaimed
1/15 12:17:43 vm4: State change: IS_OWNER is false
1/15 12:17:43 vm4: Changing state: Owner -> Unclaimed

___________________________________________________________________________________



SchedLog :

1/15 12:42:30 (pid:6614) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:42:30 (pid:6614) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:42:32 (pid:6614) Inserting new attribute Scheduler into non-active cluster cid=12 acid=-1
1/15 12:42:32 (pid:6614) Trying to satisfy job with group scheduling
1/15 12:42:32 (pid:6614) Job requested parallel scheduling groups, but no groups found

___________________________________________________________________________________



MasterLog:

1/15 09:46:40 ******************************************************
1/15 09:46:40 ** condor_master (CONDOR_MASTER) STARTING UP
1/15 09:46:40 ** /usr/local/condor/sbin/condor_master
1/15 09:46:40 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 09:46:40 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 09:46:40 ** PID = 5382
1/15 09:46:40 ** Log last touched 1/14 11:07:59
1/15 09:46:40 ******************************************************
1/15 09:46:40 Using config source: /home/condor/condor_config
1/15 09:46:40 Using local config sources:
1/15 09:46:40    /home/condor/condor_config.local
1/15 09:46:40 Failed to bind to command ReliSock
1/15 09:46:40 (Make sure your IP address is correct in /etc/hosts.)
1/15 09:46:40 ERROR "BindAnyCommandPort failed" at line 7057 in file daemon_core.C
1/15 12:17:23 ******************************************************
1/15 12:17:23 ** condor_master (CONDOR_MASTER) STARTING UP
1/15 12:17:23 ** /usr/local/condor/sbin/condor_master
1/15 12:17:23 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:17:23 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:17:23 ** PID = 6608
1/15 12:17:23 ** Log last touched 1/15 09:46:40
1/15 12:17:23 ******************************************************
1/15 12:17:23 Using config source: /home/condor/condor_config
1/15 12:17:23 Using local config sources:
1/15 12:17:23    /home/condor/condor_config.local
1/15 12:17:23 DaemonCore: Command Socket at <x.x.x.8:60795>
1/15 12:17:24 Started DaemonCore process "/usr/local/condor/sbin/condor_startd", pid and pgroup = 6610
1/15 12:17:24 Started DaemonCore process "/usr/local/condor/sbin/condor_schedd", pid and pgroup = 6614


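Logs from the host whose command sockets are at x.x.x.55 (the address shown for the mpi1 submitter above):

StartLog: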

1/15 12:29:48 ******************************************************
1/15 12:29:48 ** condor_startd (CONDOR_STARTD) STARTING UP
1/15 12:29:48 ** /usr/local/condor/sbin/condor_startd
1/15 12:29:48 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:29:48 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:29:48 ** PID = 6441
1/15 12:29:48 ** Log last touched 1/14 11:08:19
1/15 12:29:48 ******************************************************
1/15 12:29:48 Using config source: /home/condor/condor_config
1/15 12:29:48 Using local config sources:
1/15 12:29:48    /home/condor/condor_config.local
1/15 12:29:48 DaemonCore: Command Socket at <x.x.x.55:42095>
1/15 12:29:53 vm1: New machine resource allocated
1/15 12:29:53 vm2: New machine resource allocated
1/15 12:29:53 vm3: New machine resource allocated
1/15 12:29:53 vm4: New machine resource allocated
1/15 12:29:53 About to run initial benchmarks.
1/15 12:29:57 Completed initial benchmarks.
1/15 12:29:57 vm1: State change: IS_OWNER is false
1/15 12:29:57 vm1: Changing state: Owner -> Unclaimed
1/15 12:29:57 vm2: State change: IS_OWNER is false
1/15 12:29:57 vm2: Changing state: Owner -> Unclaimed
1/15 12:29:57 vm3: State change: IS_OWNER is false
1/15 12:29:57 vm3: Changing state: Owner -> Unclaimed
1/15 12:29:57 vm4: State change: IS_OWNER is false
1/15 12:29:57 vm4: Changing state: Owner -> Unclaimed
_____________________________________________________________________



SchedLog:

1/15 12:29:48 (pid:6445) ******************************************************
1/15 12:29:48 (pid:6445) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
1/15 12:29:48 (pid:6445) ** /usr/local/condor/sbin/condor_schedd
1/15 12:29:48 (pid:6445) ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:29:48 (pid:6445) ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:29:48 (pid:6445) ** PID = 6445
1/15 12:29:48 (pid:6445) ** Log last touched 1/14 11:08:44
1/15 12:29:48 (pid:6445) ******************************************************
1/15 12:29:48 (pid:6445) Using config source: /home/condor/condor_config
1/15 12:29:48 (pid:6445) Using local config sources:
1/15 12:29:48 (pid:6445)    /home/condor/condor_config.local
1/15 12:29:48 (pid:6445) DaemonCore: Command Socket at <x.x.x.55:49320>
1/15 12:29:48 (pid:6445) History file rotation is enabled.
1/15 12:29:48 (pid:6445)   Maximum history file size is: 20971520 bytes
1/15 12:29:48 (pid:6445)   Number of rotated history files is: 2
1/15 12:29:54 (pid:6445) Cleaning job queue...
1/15 12:29:54 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:29:54 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:29:56 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=12 acid=-1
1/15 12:33:07 (pid:6445) DaemonCore: Command received via TCP from host <x.x.x.55:43251>
1/15 12:33:07 (pid:6445) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/15 12:33:45 (pid:6445) DaemonCore: Command received via UDP from host <x.x.x.55:32777>
1/15 12:33:45 (pid:6445) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
1/15 12:33:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:33:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:33:45 (pid:6445) Called reschedule_negotiator()
1/15 12:33:48 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 12:38:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:38:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:38:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 12:43:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:43:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:43:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 12:48:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:48:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:48:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 12:53:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:53:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:53:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 12:58:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 12:58:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 12:58:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 13:03:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 13:03:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 13:03:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1
1/15 13:08:45 (pid:6445) Sent ad to central manager for condor@xxxxxxxxxx
1/15 13:08:45 (pid:6445) Sent ad to 1 collectors for condor@xxxxxxxxxx
1/15 13:08:47 (pid:6445) Inserting new attribute Scheduler into non-active cluster cid=13 acid=-1


______________________________________________________________________


MasterLog:

1/15 09:46:55 ******************************************************
1/15 09:46:55 ** condor_master (CONDOR_MASTER) STARTING UP
1/15 09:46:55 ** /usr/local/condor/sbin/condor_master
1/15 09:46:55 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 09:46:55 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 09:46:55 ** PID = 5475
1/15 09:46:55 ** Log last touched 1/14 11:08:14
1/15 09:46:55 ******************************************************
1/15 09:46:55 Using config source: /home/condor/condor_config
1/15 09:46:55 Using local config sources:
1/15 09:46:55    /home/condor/condor_config.local
1/15 09:46:55 Failed to bind to command ReliSock
1/15 09:46:55 (Make sure your IP address is correct in /etc/hosts.)
1/15 09:46:55 ERROR "BindAnyCommandPort failed" at line 7057 in file daemon_core.C
1/15 12:29:27 ******************************************************
1/15 12:29:27 ** condor_master (CONDOR_MASTER) STARTING UP
1/15 12:29:27 ** /usr/local/condor/sbin/condor_master
1/15 12:29:27 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:29:27 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:29:27 ** PID = 5458
1/15 12:29:27 ** Log last touched 1/15 09:46:54
1/15 12:29:27 ******************************************************
1/15 12:29:27 Using config source: /home/condor/condor_config
1/15 12:29:27 Using local config sources:
1/15 12:29:27    /home/condor/condor_config.local
1/15 12:29:27 Failed to bind to command ReliSock
1/15 12:29:27 (Make sure your IP address is correct in /etc/hosts.)
1/15 12:29:27 ERROR "BindAnyCommandPort failed" at line 7057 in file daemon_core.C
1/15 12:29:46 ******************************************************
1/15 12:29:46 ** condor_master (CONDOR_MASTER) STARTING UP
1/15 12:29:46 ** /usr/local/condor/sbin/condor_master
1/15 12:29:46 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/15 12:29:46 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/15 12:29:46 ** PID = 6439
1/15 12:29:46 ** Log last touched 1/15 12:29:26
1/15 12:29:46 ******************************************************
1/15 12:29:46 Using config source: /home/condor/condor_config
1/15 12:29:46 Using local config sources:
1/15 12:29:46    /home/condor/condor_config.local
1/15 12:29:46 DaemonCore: Command Socket at <x.x.x.55:50086>
1/15 12:29:46 Started DaemonCore process "/usr/local/condor/sbin/condor_startd", pid and pgroup = 6441
1/15 12:29:47 Started DaemonCore process "/usr/local/condor/sbin/condor_schedd", pid and pgroup = 6445