
[HTCondor-users] Another jobs stuck in idle issue




Hi

I just updated my condor cluster from Fedora 20 to Fedora 22 (x86_64).

Condor is installed from the Fedora repos and went from 8.1.1 to 8.3.1.

I kept the same configuration files.

Now my jobs sit in the Idle state on the submit node.

SchedLog on the submit host has:
06/18/15 12:14:57 Matched 6.0 aaa@xxxxxxx <0.0.0.0:57923> preempting none <0.0.0.0:47591> slot1@xxxxxxxxxxx

On the execute node zzz.xxx.xxx:

06/18/15 12:16:58 slot1_1: Request to claim resource refused.
06/18/15 12:16:58 slot1_1: Claiming protocol failed
06/18/15 12:16:58 slot1_1: Changing state: Owner -> Delete
06/18/15 12:16:58 Trying to update collector <yyy.xxx.xxx:9618>
06/18/15 12:16:58 Attempting to send update via UDP to collector yyy.xxx.xxx <0.0.0.0:9618>
06/18/15 12:16:58 slot1_1: Resource no longer needed, deleting
06/18/15 12:16:58 slot1: Total execute space: 11476772
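
With D_FULLDEBUG on, the logs are large, so it may help to pull out only the slot-prefixed lines before reading further. The following is a rough sketch, assuming every line follows the "date time slotN_M: message" layout seen in the excerpt above. The name slot_events and the regular expression are made up for illustration and are not part of HTCondor.

```python
import re

# Assumed layout, taken from the excerpt above:
#   06/18/15 12:16:58 slot1_1: Request to claim resource refused.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (slot\d+(?:_\d+)?): (.*)$"
)

def slot_events(lines, slot=None):
    """Return (date, time, slot, message) tuples for slot-prefixed lines.

    If slot is given, keep only lines for that exact slot name.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and (slot is None or m.group(3) == slot):
            events.append(m.groups())
    return events

# The log lines quoted above, used here as sample input.
sample = """\
06/18/15 12:16:58 slot1_1: Request to claim resource refused.
06/18/15 12:16:58 slot1_1: Claiming protocol failed
06/18/15 12:16:58 slot1_1: Changing state: Owner -> Delete
06/18/15 12:16:58 Trying to update collector <yyy.xxx.xxx:9618>
06/18/15 12:16:58 Attempting to send update via UDP to collector yyy.xxx.xxx <0.0.0.0:9618>
06/18/15 12:16:58 slot1_1: Resource no longer needed, deleting
06/18/15 12:16:58 slot1: Total execute space: 11476772
"""

for date, clock, name, message in slot_events(sample.splitlines(), slot="slot1_1"):
    print(clock, name, message)
```

In practice the lines would be read from the log file rather than from a string; the path to that file depends on the local LOG setting.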

The output from

$ condor_q -better-analyse

is below.

I'd appreciate any thoughts on how to diagnose this further.

I have D_FULLDEBUG set for STARTD and STARTER on the execute node, and for SCHEDD, COLLECTOR and NEGOTIATOR on the submit node, so in principle there is lots of information, but I couldn't see anything obviously relevant. StartLog on the execute node does, however, have entries like:

06/18/15 12:38:35 /proc format unknown for kernel version 4.0.4

Thanks

Roderick Johnstone



-- Submitter: yyy.xxx.xxx : <x.x.x.x:57923> : yyy.xxx.xxx
---
006.000:  Request has not yet been considered by the matchmaker.

User priority for aaa@xxxxxxx is not available, attempting to analyze without it.
---
006.000:  Run analysis summary.  Of 12 machines,
     11 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are available to run your job

The Requirements expression for your job is:

    ( Machine == "zzz.xxx.xxx" ) && ( TARGET.Arch == "X86_64" ) &&
    ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    FileSystemDomain = "xxx.xxx"
    DiskUsage = 1
    RequestDisk = 1
    RequestMemory = 10

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  Machine == "zzz.xxx.xxx"
[9]          12  TARGET.HasFileTransfer

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( Machine == "zzz.xxx.xxx" )      0                   REMOVE
2   ( TARGET.Arch == "X86_64" )       12
3   ( TARGET.OpSys == "LINUX" )       12
4   ( TARGET.Disk >= 1 )              12
5   ( TARGET.Memory >= 10 )           12
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "xxx.xxx" ) )
                                      12
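
For reference, the Requirements expression above can be checked by hand against a single machine. The sketch below does that in plain Python, assuming a hypothetical machine ad: the Disk, Memory, HasFileTransfer and FileSystemDomain values for the machine are invented for illustration, and only the job attributes come from the output above. It does not model ClassAd semantics such as UNDEFINED.

```python
def requirements(job, target):
    """Plain-Python rendering of the job's Requirements expression."""
    return (
        target["Machine"] == "zzz.xxx.xxx"
        and target["Arch"] == "X86_64"
        and target["OpSys"] == "LINUX"
        and target["Disk"] >= job["RequestDisk"]
        and target["Memory"] >= job["RequestMemory"]
        and (
            target["HasFileTransfer"]
            or target["FileSystemDomain"] == job["FileSystemDomain"]
        )
    )

# Job attributes as reported by condor_q.
job = {
    "FileSystemDomain": "xxx.xxx",
    "DiskUsage": 1,
    "RequestDisk": 1,
    "RequestMemory": 10,
}

# Hypothetical machine ad; every value here is an assumption.
machine = {
    "Machine": "zzz.xxx.xxx",
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "Disk": 1000000,
    "Memory": 2048,
    "HasFileTransfer": True,
    "FileSystemDomain": "xxx.xxx",
}

print(requirements(job, machine))
```

With a machine ad like this one, the only clause that can exclude other machines is the Machine == "zzz.xxx.xxx" test, which is consistent with 11 of the 12 machines being rejected by the job's requirements.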