
[Condor-users] Jobs stay Idle ... been looking for 24 hours....



Hi,

My jobs stay idle forever.

Here is the diagnostic output:

1) condor_status:

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   378  0+00:10:09
vm2@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   378  0+00:10:10
vm3@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   378  0+00:10:11
vm4@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   378  0+00:10:12
comparch.bing LINUX       INTEL  Owner      Idle       0.000   241  0+00:10:04
vm1@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.030   504  0+00:10:09
vm2@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   504  0+00:10:10
vm3@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.000   504  0+00:10:11
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.890   250  0+00:35:10
vm2@xxxxxxxxx LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:00:05
vm3@xxxxxxxxx LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:00:06
vm4@xxxxxxxxx LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:00:07
vm1@clouseau. LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:10:04
vm2@clouseau. LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:10:05
vm3@clouseau. LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:10:06
vm4@clouseau. LINUX       X86_64 Unclaimed  Idle       0.000   250  0+00:10:07
vm1@dogmatix. LINUX       X86_64 Owner      Idle       0.110   501  0+00:10:10
vm2@dogmatix. LINUX       X86_64 Owner      Idle       0.000   501  0+00:10:11
vm3@dogmatix. LINUX       X86_64 Owner      Idle       0.000   501  0+00:10:12
vm4@dogmatix. LINUX       X86_64 Owner      Idle       0.000   501  0+00:10:13
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.000   250  0+00:10:10
vm2@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.000   250  0+00:10:11
vm3@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.000   250  0+00:10:12
vm4@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.000   250  0+00:10:13

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     8     8       0         0       0          0        0
        X86_64/LINUX    16     9       0         7       0          0        0

               Total    24    17       0         7       0          0        0

NOTE: no problem here; all machines are recognized by the central manager.

2) condor_q -analyze 2.0
-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
002.000:  Run analysis summary.  Of 24 machines,
     16 are rejected by your job's requirements
      8 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        No successful match recorded.
        Last failed match: Thu Apr 26 14:32:52 2007
        Reason for last match failure: no match found
---------------------------------------------------------------------------------------------------------------------------------------------------------
NOTE: 8 reject your job because of their own requirements

3) condor_q -better 2.0

-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
---
002.000:  Run analysis summary.  Of 24 machines,
     16 are rejected by your job's requirements
      8 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        No successful match recorded.
        Last failed match: Thu Apr 26 14:32:52 2007
        Reason for last match failure: no match found

The Requirements expression for your job is:

( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( target.Arch == "INTEL" )        8
2   ( target.OpSys == "LINUX" )       24
3   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )
                                      24
4   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )
                                      24
5   ( target.Disk >= 10000 )          24
6   ( ( 1024 * target.Memory ) >= 10000 )
                                      24
----------------------------------------------------------------------------------------------------------------------------------
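
One way to read the two analysis summaries against the condor_status listing is sketched below. This is only a minimal illustration, not Condor's matchmaking code: the (Arch, State) counts are transcribed from the listing in section 1, and the rule that a slot in the Owner state turns the job away is an assumption made here to reproduce the summary, not something the output itself states.

```python
from collections import Counter

# (Arch, State) pairs transcribed from the condor_status listing:
# 8 INTEL slots in Owner, 9 X86_64 slots in Owner, 7 X86_64 slots Unclaimed.
slots = (
    [("INTEL", "Owner")] * 8
    + [("X86_64", "Owner")] * 9
    + [("X86_64", "Unclaimed")] * 7
)

def job_accepts(arch):
    # Only the Arch clause of the job's Requirements is modelled; the
    # other clauses each matched all 24 machines in the table above.
    return arch == "INTEL"

def machine_accepts(state):
    # Assumption for illustration: a slot in the Owner state rejects jobs.
    return state != "Owner"

summary = Counter()
for arch, state in slots:
    if not job_accepts(arch):
        summary["rejected by job requirements"] += 1
    elif not machine_accepts(state):
        summary["rejected by machine requirements"] += 1
    else:
        summary["available"] += 1

for reason in ("rejected by job requirements",
               "rejected by machine requirements",
               "available"):
    print(f"{summary[reason]:3d} {reason}")
```

Under those assumptions the sketch gives the same 16 / 8 / 0 split as the analysis: the 16 X86_64 slots fail the Arch == "INTEL" clause, and the 8 INTEL slots are all in the Owner state.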

4) SchedLog on central manager:

4/26 14:24:50 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) Called reschedule_negotiator()
4/26 14:24:50 (pid:1888) DaemonCore: Command received via TCP from host <128.226.128.31:42297>
4/26 14:24:50 (pid:1888) DaemonCore: received command 493 (NEGOTIATE_WITH_SIGATTRS), calling handler (doNegotiate)
4/26 14:24:50 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) AutoCluster:config() significant atttributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
4/26 14:24:50 (pid:1888) Checking consistency running and runnable jobs
4/26 14:24:50 (pid:1888) Tables are consistent
4/26 14:24:50 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
4/26 14:29:50 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Activity on stashed negotiator socket
4/26 14:29:50 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Checking consistency running and runnable jobs
4/26 14:29:50 (pid:1888) Tables are consistent
4/26 14:29:50 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
4/26 14:32:48 (pid:1888) DaemonCore: Command received via TCP from host <128.226.128.31:54711>
4/26 14:32:48 (pid:1888) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
4/26 14:32:52 (pid:1888) DaemonCore: Command received via UDP from host <128.226.128.31:35612>
4/26 14:32:52 (pid:1888) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
4/26 14:32:52 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Called reschedule_negotiator()
4/26 14:32:52 (pid:1888) Activity on stashed negotiator socket
4/26 14:32:52 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Checking consistency running and runnable jobs
4/26 14:32:52 (pid:1888) Tables are consistent
4/26 14:32:52 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected

5) StartLog on central manager:

4/26 13:57:49 ******************************************************
4/26 13:57:49 ** condor_startd (CONDOR_STARTD) STARTING UP
4/26 13:57:49 ** /home/condor/condor/sbin/condor_startd
4/26 13:57:49 ** $CondorVersion: 6.8.4 Feb  1 2007 $
4/26 13:57:49 ** $CondorPlatform: I386-LINUX_RHEL3 $
4/26 13:57:49 ** PID = 1887
4/26 13:57:49 ** Log last touched 4/26 13:57:43
4/26 13:57:49 ******************************************************
4/26 13:57:49 Using config source: /home/condor/condor/etc/condor_config
4/26 13:57:49 Using local config sources:
4/26 13:57:49    /home/condor/hosts/comparch/condor_config.local
4/26 13:57:49 DaemonCore: Command Socket at <128.226.128.31:34245>
4/26 13:57:56 New machine resource allocated
4/26 13:57:56 About to run initial benchmarks.
4/26 13:58:00 Completed initial benchmarks.
4/26 14:13:00 State change: IS_OWNER is false
4/26 14:13:00 Changing state: Owner -> Unclaimed
4/26 14:23:00 State change: IS_OWNER is TRUE
4/26 14:23:00 Changing state: Unclaimed -> Owner

6) condor_q -l 2.0
-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 2
QDate = 1177612372
CompletionDate = 0
Owner = "condor"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.4 Feb  1 2007 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/home/condor"
JobUniverse = 1
Cmd = "/home/condor/leftouts"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = TRUE
WantCheckpoint = TRUE
JobStatus = 1
EnteredCurrentStatus = 1177612372
JobPrio = 0
User = "condor@xxxxxxxxxxxxxxxxxxxxxxx"
NiceUser = FALSE
MaxJobRetirementTime = 0
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/home/condor/leftouts.log"
CoreSize = 0
KillSig = "SIGTSTP"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "leftouts.out"
StreamOut = FALSE
Err = "/dev/null"
TransferErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 4
ImageSize = 10000
ExecutableSize_RAW = 4
ExecutableSize = 10000
DiskUsage_RAW = 4
DiskUsage = 10000
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "comparch.binghamton.edu"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
LeaveJobInQueue = FALSE
Arguments = ""
GlobalJobId = "comparch.binghamton.edu#1177612372#2.0"
ProcId = 0
AutoClusterId = 1
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,Requirements,NiceUser"
WantMatchDiagnostics = TRUE
LastRejMatchReason = "no match found"
LastRejMatchTime = 1177612672
ServerTime = 1177612764
---------------------------------------------------------------
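
The timestamps in the job ad are Unix epoch seconds, which makes them hard to line up with the log entries by eye. A small sketch for converting them, using only the values shown in the ad above (the helper name to_utc is made up for this example, and the output is in UTC, not the local time used in the logs):

```python
from datetime import datetime, timezone

# Epoch-second values copied from the condor_q -l output above.
stamps = {
    "QDate": 1177612372,
    "LastRejMatchTime": 1177612672,
    "ServerTime": 1177612764,
}

def to_utc(epoch_seconds):
    # Hypothetical helper: interpret epoch seconds as a UTC datetime.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

for name, value in stamps.items():
    print(f"{name:17} {to_utc(value).isoformat()}")

# Seconds between submission and the moment condor_q -l was answered.
idle_seconds = stamps["ServerTime"] - stamps["QDate"]
print("idle for", idle_seconds, "seconds")
```

QDate comes out as 18:32:52 UTC on 26 April 2007. If the 14:32:52 entries in the SchedLog refer to the same moment, the logs would be four hours behind UTC.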

I think this is all the output needed to debug the problem.
I haven't specified any Requirements in the job submit file.
I don't get any PERMISSION_DENIED errors either.
My condor_config file is correct; it's all set.

I have been trying to debug this for 24 hours now.

Any help would be appreciated.

thanks,
Askar