[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Job remains in idle (worked until I increased pool size)




I have a pool of 200 cores with various OS for Windows machines. Yesterday, I expanded the pool from 100 cores to 200 cores and since than any jobs that I submit remain in the idle state. I have evaluated the SchedLod and NegotiatorLog and there is no indication that a problem exists. I have tried to remove machines from the pool, but the problem persists. Does anyone have any ideas as to what the problem might be. I am testing a job that I know will work (ran on friday)?

thank you,
Mike

condor_q -better-analyze:
005.000:  Run analysis summary.  Of 203 machines,
    203 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
        No successful match recorded.
        Last failed match: Wed Feb 09 11:58:49 2011
        Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements _expression_ for your job is:

( ( target.OpSys == "WINNT51" || target.OpSys == "WINNT61" || target.OpSys == "WINNT60" ) ) &&
( target.Arch == "INTEL" ) && ( target.Disk >= DiskUsage ) &&
( ( ( target.Memory * 1024 ) >= ImageSize ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) ) && ( target.HasFileTransfer ) &&
( target.HasWindowsRunAsOwner &&
( target.LocalCredd is "CM" ) )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( ( 1024 * target.Memory ) >= 4500 ) && ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,4.394
531250000000E+000)) ) >= 4500 ) )
                                      0                   REMOVE
2   ( target.HasWindowsRunAsOwner && ( target.LocalCredd is "CM" ) )
                                      0                   REMOVE
3   ( ( target.OpSys == "WINNT51" || target.OpSys == "WINNT61" || target.OpSys == "WINNT60" ) )
                                      203
4   ( target.Arch == "INTEL" )        203
5   ( target.Disk >= 4500 )           203
6   ( target.HasFileTransfer )        203





An excerpt from my SchedLog:
02/09 10:56:18 (pid:2844) SECMAN: startCommand succeeded.
02/09 10:56:18 (pid:2844) Authorizing server '*/IP'.
02/09 10:56:18 (pid:2844) SEND [1000] <IP> <IP>
02/09 10:56:18 (pid:2844) SEND [1000] <IP> <IP>
02/09 10:56:18 (pid:2844) SEND [199] <IP> <IP>
02/09 10:56:18 (pid:2844) Sent ad to 1 collectors for odonnellm@xxxxxxxxxx
02/09 10:56:18 (pid:2844) ============ Begin clean_shadow_recs =============
02/09 10:56:18 (pid:2844) ============ End clean_shadow_recs =============
02/09 10:56:18 (pid:2844) Return from Timer handler 7 (timeout)
02/09 10:56:25 (pid:2844) Calling Handler <<Negotiator Command>> (4)
02/09 10:56:25 (pid:2844) Activity on stashed negotiator socket
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=89,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844)
02/09 10:56:25 (pid:2844) Entered negotiate
02/09 10:56:25 (pid:2844) ## 2. Negotiating with CM
02/09 10:56:25 (pid:2844) *** SwapSpace = 2147483647
02/09 10:56:25 (pid:2844) *** ReservedSwap = 0
02/09 10:56:25 (pid:2844) *** Shadow Size Estimate = 800
02/09 10:56:25 (pid:2844) *** Start Limit For Swap = 2684354
02/09 10:56:25 (pid:2844) *** Current num of active shadows = 0
02/09 10:56:25 (pid:2844) Negotiating for owner: odonnellm@xxxxxxxxxx
02/09 10:56:25 (pid:2844) AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts) invoked
02/09 10:56:25 (pid:2844) Reusing prioritized runnable job list because nothing has changed.
02/09 10:56:25 (pid:2844) Job 4.0: is runnable
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=8,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_write(fd=1228 <IP2>,,size=4096,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_write(fd=1228 <IP>,,size=2077,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) Sent job 4.0 (autocluster=0)
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=31,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) Job 4.0 rejected: no match found
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228 <IP>,,size=8,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read(): fd=1228
02/09 10:56:25 (pid:2844) condor_read(): select returned 1
02/09 10:56:25 (pid:2844) condor_write(fd=1228 <IP>,,size=29,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
02/09 10:56:25 (pid:2844) Return from Handler <<Negotiator Command>>
02/09 10:56:30 (pid:2844) Calling Timer handler 316 (HandleFlushJobQueueLogTimer)
02/09 10:56:30 (pid:2844) Return from Timer handler 316 (HandleFlushJobQueueLogTimer)
02/09 10:56:31 (pid:2844) ACCEPT from=<IP> newfd=1276 to=<IP>
02/09 10:56:31 (pid:2844) condor_read(fd=1276 <IP>,,size=4,timeout=1,flags=2)
02/09 10:56:31 (pid:2844) condor_read(): fd=1276
02/09 10:56:31 (pid:2844) condor_read(): select returned 1
02/09 10:56:31 (pid:2844) condor_read(fd=1276 <IP>,,size=5,timeout=1,flags=0)
02/09 10:56:31 (pid:2844) condor_read(): fd=1276
02/09 10:56:31 (pid:2844) condor_read(): select returned 1
02/09 10:56:31 (pid:2844) condor_read(fd=1276 <IP>,,size=391,timeout=1,flags=0)
02/09 10:56:31 (pid:2844) condor_read(): fd=1276
02/09 10:56:31 (pid:2844) condor_read(): select returned 1
02/09 10:56:31 (pid:2844) DC_AUTHENTICATE: received DC_AUTHENTICATE from <159.189.162.84:2749>
02/09 10:56:31 (pid:2844) DC_AUTHENTICATE: received following ClassAd:


An excerpt from my NegotiatorLog:
02/09 12:08:43 condor_read(fd=596 collector at <IP>,,size=3012,timeout=60,flags=0)
02/09 12:08:43 condor_read(): fd=596
02/09 12:08:43 condor_read(): select returned 1
02/09 12:08:43 CLOSE <159.189.160.37:62623> fd=596
02/09 12:08:43 Got ads: 292 public and 202 private
02/09 12:08:43 Public ads include 1 submitter, 202 startd
02/09 12:08:43 Entering compute_significant_attrs()
02/09 12:08:43 Leaving compute_significant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts
02/09 12:08:43 Phase 2:  Performing accounting ...
02/09 12:08:43 Phase 3:  Sorting submitter ads by priority ...
02/09 12:08:43 Phase 4.1:  Negotiating with schedds ...
02/09 12:08:43     numSlots = 202
02/09 12:08:43     slotWeightTotal = 202.000000
02/09 12:08:43     pieLeft = 202.000
02/09 12:08:43     NormalFactor = 1.000000
02/09 12:08:43     MaxPrioValue = 0.500000
02/09 12:08:43     NumSubmitterAds = 1
02/09 12:08:43   Negotiating with odonnellm@xxxxxxxxxx at <IP>
02/09 12:08:43 0 seconds so far
02/09 12:08:43   Calculating submitter limit with the following parameters
02/09 12:08:43     SubmitterPrio       = 0.500000
02/09 12:08:43     SubmitterPrioFactor = 1.000000
02/09 12:08:43     submitterShare      = 1.000000
02/09 12:08:43     submitterAbsShare   = 1.000000
02/09 12:08:43     submitterLimit    = 202.000000
02/09 12:08:43     submitterUsage    = 0.000000
02/09 12:08:43 Socket to odonnellm@xxxxxxxxxx (<IP>) already in cache, reusing
02/09 12:08:43 condor_write(fd=604 schedd odonnellm@xxxxxxxxxx,,size=110,timeout=30,flags=0)
02/09 12:08:43     Sending SEND_JOB_INFO/eom
02/09 12:08:43 condor_write(fd=604 schedd odonnellm@xxxxxxxxxx,,size=29,timeout=30,flags=0)
02/09 12:08:43     Getting reply from schedd ...
02/09 12:08:43 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select returned 1
02/09 12:08:43 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=4075,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select returned 1
02/09 12:08:43 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select returned 1
02/09 12:08:43 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=2056,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select returned 1
02/09 12:08:43     Got JOB_INFO command; getting classad/eom
02/09 12:08:43     Request 00005.00000:
02/09 12:08:43 matchmakingAlgorithm: limit 202.000000 used 0.000000 pieLeft 202.000000
02/09 12:08:43 FileLock::obtain(1) - @1297278523.987000 lock on C:\Condor\log\negotiator_sql.log now WRITE
02/09 12:08:44 FileLock::obtain(2) - @1297278524.003000 lock on C:\Condor\log\negotiator_sql.log now UNLOCKED
02/09 12:08:44       Rejected 5.0 odonnellm@xxxxxxxxxx <IP>: no match found
02/09 12:08:44 condor_write(fd=604 schedd odonnellm@xxxxxxxxxx,,size=52,timeout=30,flags=0)
02/09 12:08:44     Sending SEND_JOB_INFO/eom
02/09 12:08:44 condor_write(fd=604 schedd odonnellm@xxxxxxxxxx,,size=29,timeout=30,flags=0)
02/09 12:08:44     Getting reply from schedd ...
02/09 12:08:44 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:44 condor_read(): fd=604
02/09 12:08:44 condor_read(): select returned 1
02/09 12:08:44 condor_read(fd=604 schedd odonnellm@xxxxxxxxxx,,size=8,timeout=30,flags=0)
02/09 12:08:44 condor_read(): fd=604
02/09 12:08:44 condor_read(): select returned 1
02/09 12:08:44     Got NO_MORE_JOBS;  done negotiating
02/09 12:08:44   Submitter odonnellm@xxxxxxxxxx got all it wants; removing it.
02/09 12:08:44 ---------- Finished Negotiation Cycle ----------