
[Condor-users] Problem getting condor up and running




Hi all,

I am having some trouble getting Condor up and running on a couple of machines running RHEL 4 with kernel 2.6.9-22.0.1.ELsmp. Each has 8 CPUs and 16 GB of memory. Due to a limitation with 2.6 kernels, I had to define the total amount of memory in my local config files. I also added RESERVED_SWAP = 0, as suggested by a message in a log file when I first attempted a run. I added the LOCK setting to give each machine a local lock file to work with, since all the other files are on NFS:

NUM_CPUS = 8
MEMORY = 16000
LOCK = /tmp/condor
RESERVED_SWAP = 0

I am able to match a simple test job:

Successfully matched with vm2@xxxxxxxxxxxxxxxxxxxxxxxxxx
The Negotiation Cycle:
6/26 09:46:32 ---------- Started Negotiation Cycle ----------
6/26 09:46:32 Phase 1:  Obtaining ads from collector ...
6/26 09:46:32   Getting all public ads ...
6/26 09:46:32   Sorting 24 ads ...
6/26 09:46:32   Getting startd private ads ...
6/26 09:46:32 Got ads: 24 public and 19 private
6/26 09:46:32 Public ads include 1 submitter, 19 startd
6/26 09:46:32 Phase 2:  Performing accounting ...
6/26 09:46:32 Phase 3:  Sorting submitter ads by priority ...
6/26 09:46:32 Phase 4.1:  Negotiating with schedds ...
6/26 09:46:32   Negotiating with asgdev@mydomainname at <x.x.x.x:40628>
6/26 09:46:32     Request 00008.00000:
6/26 09:46:32       Matched 8.0 asgdev@xxxxxxxxxxxxxxxxx <x.x.x.x:40628> preempting none <x.x.x.x:36098>
6/26 09:46:32       Successfully matched with vm3@xxxxxxxxxxxxxxxxxxxxxxxxxx
6/26 09:46:32     Got NO_MORE_JOBS;  done negotiating
6/26 09:46:32 ---------- Finished Negotiation Cycle ----------

But the job goes from running to idle in a fraction of a second and then just sits there. Below is some of the relevant output. I have put 'mydomainname' in place of my real domain and x'ed out my IPs. Has anyone had a similar issue, or any ideas to try?  -Ali

In ShadowLog:
6/26 09:41:34 Using local config files: /public/murex/home/asgdev/condor/etc/snycmfnedad24.local
6/26 09:41:34 DaemonCore: Command Socket at <x.x.x.x:40656>
6/26 09:41:35 Initializing a VANILLA shadow
6/26 09:41:35 (8.0) (2880): Not enough reserved swap space
6/26 09:41:35 (8.0) (2880): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 105
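
For anyone comparing several ShadowLog files, here is a minimal sketch, assuming the line format shown in the excerpt above, that pairs each shadow exit status with the last message logged for the same job and pid. The function name shadow_exits is hypothetical and not part of Condor; the script only reads text and makes no claim about what a given status code means.

```python
import re

# Any per-job line: "M/D HH:MM:SS (cluster.proc) (pid): message"
JOB_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)
# The exit message itself, e.g.
# "**** condor_shadow (condor_SHADOW) EXITING WITH STATUS 105"
EXIT_RE = re.compile(r"^\*+ condor_shadow .* EXITING WITH STATUS (?P<status>\d+)")


def shadow_exits(lines):
    """Return one dict per shadow exit found in the given log lines.

    Each dict holds the job id, the pid, the exit status, and the last
    message logged for that (job, pid) before the exit line, if any.
    Hypothetical helper: the format is assumed from the excerpt above.
    """
    last_msg = {}   # (job, pid) -> most recent non-exit message
    results = []
    for raw in lines:
        m = JOB_RE.match(raw.rstrip("\n"))
        if not m:
            continue  # daemon-level line with no job id; skip it
        key = (m.group("job"), m.group("pid"))
        exit_m = EXIT_RE.match(m.group("msg"))
        if exit_m:
            results.append({
                "job": m.group("job"),
                "pid": int(m.group("pid")),
                "status": int(exit_m.group("status")),
                "reason": last_msg.get(key),
            })
        else:
            last_msg[key] = m.group("msg")
    return results


if __name__ == "__main__":
    sample = [
        "6/26 09:41:35 Initializing a VANILLA shadow",
        "6/26 09:41:35 (8.0) (2880): Not enough reserved swap space",
        "6/26 09:41:35 (8.0) (2880): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 105",
    ]
    for entry in shadow_exits(sample):
        print(entry)
```

To run it on a real file, pass an open file object, for example shadow_exits(open(path)) with path pointing at the ShadowLog.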

 condor_q -analyze:
-- Submitter: snycmfnedad24.mydomainname : <x.x.x.x:40628> : snycmfnedad24.mydomainname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
---
008.000:  Run analysis summary.  Of 19 machines,
      3 are rejected by your job's requirements
      8 reject your job because of their own requirements
      2 match, but are serving users with a better priority in the pool
      6 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Mon Jun 26 09:41:32 2006

1 jobs; 1 idle, 0 running, 0 held