
[Condor-users] scheduling problem?



Can anyone help....  Please?

Condor was running. I can't remember why I tried to restart it but I did and now I'm in trouble.

condor_status says all the nodes are claimed and idle.
condor_userprio -all says user ann has all of the 16 resources.

'condor_q -analyze 84901.59 -l'   gives

"""
-- Submitter: galaxy : <192.168.0.40:52329> : galaxy
vm1@leo6        Insufficient priority to preempt ann@localdomain localhost
---
84901.059:  Run analysis summary.  Of 16 machines,
     0 are rejected by your job's requirements
     0 reject your job because of their own requirements
    16 match, but are serving users with a better priority in the pool
     0 match, but reject the job for unknown reasons
     0 match, but will not currently preempt their existing job
     0 are available to run your job
"""
But there are no other users in the queue, and I even reset my priority to 0.5 without effect.

Some jobs run, about 50 in the last 12 hours. Each job should take about 2 minutes, so with 16 processors over 12 hours I'd expect roughly 5,000 jobs to have run.
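
To make the estimate above concrete, here is a minimal sketch of the arithmetic in Python. It assumes every processor stays busy for the whole window, which is an idealisation; the function name is only illustrative.

```python
# Back-of-the-envelope throughput check using the figures quoted above.
MINUTES_PER_JOB = 2   # approximate runtime of one job
PROCESSORS = 16       # processors in the pool
HOURS = 12            # observation window
OBSERVED_JOBS = 50    # roughly what actually ran


def expected_jobs(minutes_per_job, processors, hours):
    """Upper bound on completed jobs if every processor is always busy."""
    jobs_per_processor = (hours * 60) // minutes_per_job
    return jobs_per_processor * processors


expected = expected_jobs(MINUTES_PER_JOB, PROCESSORS, HOURS)
print(expected)                             # 5760
print(round(OBSERVED_JOBS / expected, 4))   # 0.0087
```

So the pool is delivering under 1% of the throughput I would expect.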

Below are the logs off the head node. It would be great if someone could tell me what's happened. Failing that, is there a list where I can look up what "died due to signal #" and "EXITING WITH STATUS ###" mean?

Thanks in advance.
Ann

The MasterLog (end of)
5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25
5/24 06:08:08 Sending obituary for "/home/condor/condor/sbin/condor_schedd"
5/24 06:08:08 restarting /home/condor/condor/sbin/condor_schedd in 10 seconds
5/24 06:08:18 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 30916
5/24 06:31:19 The SCHEDD (pid 30916) died due to signal 25
5/24 06:31:19 Sending obituary for "/home/condor/condor/sbin/condor_schedd"
5/24 06:31:19 restarting /home/condor/condor/sbin/condor_schedd in 10 seconds
5/24 06:31:29 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 31000
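
For the "died due to signal #" lines in the excerpt above, here is a minimal sketch that pulls out each death and asks the local platform what the signal number is called. It is not a Condor tool: the regular expression is only an assumption based on the lines shown here, and the helper names are hypothetical. Signal numbers are platform-specific, so the lookup is only meaningful when run on the same kind of machine as the head node.

```python
import re
import signal

# Sample lines copied from the MasterLog excerpt above.
MASTER_LOG = """\
5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25
5/24 06:08:18 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 30916
5/24 06:31:19 The SCHEDD (pid 30916) died due to signal 25
"""

# Pattern inferred from the lines above; other Condor versions may word this differently.
DEATH_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) The (?P<daemon>\w+) \(pid (?P<pid>\d+)\) "
    r"died due to signal (?P<sig>\d+)$"
)


def signal_name(number):
    """Return this platform's name for a signal number, or None if it has none."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None


def find_deaths(log_text):
    """Return (timestamp, daemon, pid, signal number) for every death line."""
    deaths = []
    for line in log_text.splitlines():
        m = DEATH_RE.match(line.strip())
        if m:
            deaths.append((m["stamp"], m["daemon"], int(m["pid"]), int(m["sig"])))
    return deaths


for stamp, daemon, pid, sig in find_deaths(MASTER_LOG):
    print(stamp, daemon, pid, sig, signal_name(sig))
```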

The SchedLog
5/24 06:44:38 Sent ad to central manager for ann@localdomain localhost
5/24 06:49:24 DaemonCore: Command received via TCP from host <192.168.0.40:52416>
5/24 06:49:24 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
5/24 06:49:24 Negotiating for owner: ann@localdomain localhost
5/24 06:49:24 Checking consistency running and runnable jobs
5/24 06:49:24 Tables are consistent
5/24 06:49:24 Out of servers - 0 jobs matched, 341 jobs idle, 1 jobs rejected
5/24 06:49:38 Sent ad to central manager for ann@localdomain localhost
5/24 06:54:24 Activity on stashed negotiator socket
5/24 06:54:24 Negotiating for owner: ann@localdomain localhost
5/24 06:54:24 Checking consistency running and runnable jobs
5/24 06:54:24 Tables are consistent
5/24 06:54:24 Out of servers - 0 jobs matched, 341 jobs idle, 1 jobs rejected
5/24 06:54:38 Sent ad to central manager for ann@localdomain localhost
5/24 06:59:38 Sent ad to central manager for ann@localdomain localhost

The ShadowLog (end of)
5/24 06:44:28 ******************************************************
5/24 06:44:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
5/24 06:44:28 ** /home/condor/condor/sbin/condor_shadow
5/24 06:44:28 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/24 06:44:28 ** $CondorPlatform: I386-LINUX_RH9 $
5/24 06:44:28 ** PID = 31040
5/24 06:44:28 ******************************************************
5/24 06:44:28 Using config file: /users/condor/condor_config
5/24 06:44:28 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/24 06:44:28 DaemonCore: Command Socket at <192.168.0.40:52323>
5/24 06:44:28 (84901.58) (31039): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
5/24 06:44:31 getpeername failed so connect must have failed
5/24 06:49:29 Connect failed for 300 seconds; returning FALSE
5/24 06:49:29 Can't connect to queue manager
CEDAR:6001:Failed to connect to <192.168.0.40:52226>
5/24 06:49:29 ERROR "Failed to connect to schedd!" at line 102 in file shadow_initializer.C

The StartLog (end of)
5/23 11:06:19 ******************************************************
5/23 11:06:19 ** condor_startd (CONDOR_STARTD) STARTING UP
5/23 11:06:19 ** /home/condor/condor/sbin/condor_startd
5/23 11:06:19 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/23 11:06:19 ** $CondorPlatform: I386-LINUX_RH9 $
5/23 11:06:19 ** PID = 25311
5/23 11:06:19 ******************************************************
5/23 11:06:19 Using config file: /users/condor/condor_config
5/23 11:06:19 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/23 11:06:19 DaemonCore: Command Socket at <192.168.0.40:44638>
5/23 11:06:19 **** condor_startd (condor_STARTD) EXITING WITH STATUS 1
5/23 11:07:08 ******************************************************
5/23 11:07:08 ** condor_startd (CONDOR_STARTD) STARTING UP
5/23 11:07:08 ** /home/condor/condor/sbin/condor_startd
5/23 11:07:08 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/23 11:07:08 ** $CondorPlatform: I386-LINUX_RH9 $
5/23 11:07:08 ** PID = 25315
5/23 11:07:08 ******************************************************
5/23 11:07:08 Using config file: /users/condor/condor_config
5/23 11:07:08 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/23 11:07:08 DaemonCore: Command Socket at <192.168.0.40:44639>
5/23 11:07:16 vm1: New machine resource allocated
5/23 11:07:16 vm2: New machine resource allocated
5/23 11:07:16 About to run initial benchmarks.
5/23 11:07:22 Completed initial benchmarks.
5/23 11:07:22 vm1: State change: IS_OWNER is false
5/23 11:07:22 vm1: Changing state: Owner -> Unclaimed
5/23 11:07:22 vm2: State change: IS_OWNER is false
5/23 11:07:22 vm2: Changing state: Owner -> Unclaimed

The StarterLog (end of)
2/16 12:32:51 ******************************************************
2/16 12:32:51 ** condor_starter (CONDOR_STARTER) STARTING UP
2/16 12:32:51 ** /home/condor/condor/sbin/condor_starter
2/16 12:32:51 ** $CondorVersion: 6.6.6 Jul 26 2004 $
2/16 12:32:51 ** $CondorPlatform: I386-LINUX_RH9 $
2/16 12:32:51 ** PID = 2773
2/16 12:32:51 ******************************************************
2/16 12:32:51 Using config file: /users/condor/condor_config
2/16 12:32:51 Using local config files: /users/condor/hosts/galaxy/condor_config.local
2/16 12:32:51 DaemonCore: Command Socket at <192.168.0.40:32960>
2/16 12:32:51 argc = 1
2/16 12:32:51 argv[0] = /users/condor/condor/sbin/condor_starter
2/16 12:32:51 usage: condor_starter initiating_host
2/16 12:32:51    or: condor_starter -job-keyword keyword
2/16 12:32:51                       -job-input-ad path
2/16 12:32:51                       -job-cluster number
2/16 12:32:51                       -job-proc    number
2/16 12:32:51                       -job-subproc number
2/16 12:32:51 **** condor_starter (condor_STARTER) EXITING WITH STATUS 1


The NegotiatorLog (end of)
5/24 07:07:33 Public ads include 1 submitter, 16 startd
5/24 07:07:33 Phase 2:  Performing accounting ...
5/24 07:07:33 Phase 3:  Sorting submitter ads by priority ...
5/24 07:07:33 Phase 4.1:  Negotiating with schedds ...
5/24 07:07:33 Negotiating with ann@localdomain localhost at <192.168.0.40:52329>
5/24 07:07:33     Request 84901.00059:
5/24 07:07:33 Matched 84901.59 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.10:32773>
5/24 07:07:33       Successfully matched with vm1@leo1
5/24 07:07:33     Request 84901.00060:
5/24 07:07:33 Matched 84901.60 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.10:32773>
5/24 07:07:33       Successfully matched with vm2@leo1
5/24 07:07:33     Request 84901.00061:
5/24 07:07:33 Matched 84901.61 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.11:32887>
5/24 07:07:33       Successfully matched with vm1@leo2
5/24 07:07:33     Request 84901.00062:
5/24 07:07:33 Matched 84901.62 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.11:32887>
5/24 07:07:33       Successfully matched with vm2@leo2
5/24 07:07:33     Request 84901.00063:
5/24 07:07:33 Matched 84901.63 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.12:39038>
5/24 07:07:33       Successfully matched with vm1@leo3
5/24 07:07:33     Request 84901.00064:
5/24 07:07:33 Matched 84901.64 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.12:39038>
5/24 07:07:33       Successfully matched with vm2@leo3
5/24 07:07:33     Request 84901.00065:
5/24 07:07:33 Matched 84901.65 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.13:32771>
5/24 07:07:33       Successfully matched with vm1@leo4
5/24 07:07:33     Request 84901.00066:
5/24 07:07:33 Matched 84901.66 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.13:32771>
5/24 07:07:33       Successfully matched with vm2@leo4
5/24 07:07:33     Request 84901.00067:
5/24 07:07:33 Matched 84901.67 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.14:32775>
5/24 07:07:33       Successfully matched with vm1@leo5
5/24 07:07:33     Request 84901.00068:
5/24 07:07:33 Matched 84901.68 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.14:32775>
5/24 07:07:33       Successfully matched with vm2@leo5
5/24 07:07:34     Request 84901.00069:
5/24 07:07:34 Matched 84901.69 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.15:32776>
5/24 07:07:34       Successfully matched with vm1@leo6
5/24 07:07:34     Request 84901.00070:
5/24 07:07:34 Matched 84901.70 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.15:32776>
5/24 07:07:34       Successfully matched with vm2@leo6
5/24 07:07:34 Over submitter resource limit (12) ... only consider startd ranks
5/24 07:07:34     Request 84901.00071:
5/24 07:07:34 Rejected 84901.71 ann@localdomain localhost <192.168.0.40:52329>: no match found
5/24 07:07:34     Got NO_MORE_JOBS;  done negotiating
5/24 07:07:34 Phase 4.2:  Negotiating with schedds ...
5/24 07:07:34 Negotiating with ann@localdomain localhost at <192.168.0.40:52329>
5/24 07:07:34 ---------- Finished Negotiation Cycle ----------
5/24 07:12:34 ---------- Started Negotiation Cycle ----------
5/24 07:12:34 Phase 1:  Obtaining ads from collector ...
5/24 07:12:34   Getting all public ads ...
5/24 07:12:34   Sorting 35 ads ...
5/24 07:12:34   Getting startd private ads ...
5/24 07:12:34 Got ads: 35 public and 16 private
5/24 07:12:34 Public ads include 1 submitter, 16 startd
5/24 07:12:34 Phase 2:  Performing accounting ...
5/24 07:12:34 Phase 3:  Sorting submitter ads by priority ...
5/24 07:12:34 Phase 4.1:  Negotiating with schedds ...
5/24 07:12:34 Negotiating with ann@localdomain localhost at <192.168.0.40:52450>
5/24 07:12:34 Over submitter resource limit (0) ... only consider startd ranks
5/24 07:12:34     Request 84901.00059:
5/24 07:12:34 Rejected 84901.59 ann@localdomain localhost <192.168.0.40:52450>: no match found
5/24 07:12:34     Got NO_MORE_JOBS;  done negotiating
5/24 07:12:34 ---------- Finished Negotiation Cycle ----------
5/24 07:17:34 ---------- Started Negotiation Cycle ----------
5/24 07:17:34 Phase 1:  Obtaining ads from collector ...
5/24 07:17:34   Getting all public ads ...
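
To compare negotiation cycles in the NegotiatorLog above, here is a rough sketch that counts "Matched" and "Rejected" lines per cycle and records the submitter resource limit reported in each. The line patterns are assumptions taken from this excerpt only, and the function name is hypothetical.

```python
import re

# A few lines copied from the NegotiatorLog excerpt above.
NEGOTIATOR_LOG = """\
5/24 07:07:33 Matched 84901.59 ann@localdomain localhost <192.168.0.40:52329> preempting none <192.168.0.10:32773>
5/24 07:07:33       Successfully matched with vm1@leo1
5/24 07:07:34 Over submitter resource limit (12) ... only consider startd ranks
5/24 07:07:34 Rejected 84901.71 ann@localdomain localhost <192.168.0.40:52329>: no match found
5/24 07:07:34 ---------- Finished Negotiation Cycle ----------
5/24 07:12:34 ---------- Started Negotiation Cycle ----------
5/24 07:12:34 Over submitter resource limit (0) ... only consider startd ranks
5/24 07:12:34 Rejected 84901.59 ann@localdomain localhost <192.168.0.40:52450>: no match found
5/24 07:12:34 ---------- Finished Negotiation Cycle ----------
"""

# Strip the leading "M/D HH:MM:SS" timestamp and any indentation.
LINE_RE = re.compile(r"^\d+/\d+ \d+:\d+:\d+\s+(?P<body>.*)$")
LIMIT_RE = re.compile(r"Over submitter resource limit \((\d+)\)")


def summarize_cycles(log_text):
    """Return one dict per finished cycle with match/reject counts and limits."""
    cycles = []
    current = {"matched": 0, "rejected": 0, "limits": []}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        body = m["body"]
        if body.startswith("Matched "):
            current["matched"] += 1
        elif body.startswith("Rejected "):
            current["rejected"] += 1
        elif "Finished Negotiation Cycle" in body:
            cycles.append(current)
            current = {"matched": 0, "rejected": 0, "limits": []}
        else:
            limit = LIMIT_RE.search(body)
            if limit:
                current["limits"].append(int(limit.group(1)))
    return cycles


for number, cycle in enumerate(summarize_cycles(NEGOTIATOR_LOG), start=1):
    print(number, cycle)
```

Run over the full log, this should show at a glance which cycles handed out matches and which hit a resource limit of 0 straight away.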
