
[HTCondor-users] Node matched and able to run, but the job is idle



Dear Colleagues,

I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. I first studied the documentation, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)

I am now running experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features I find attractive is Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) I am therefore trying to submit a Windows job from a Linux machine. I have run into a rather strange case that I cannot explain by myself, so I am asking for your help: the job I submit stays idle even though `condor_q` reports that there is a node able to run it.


> condor_q -better-analyze

htcondor: Wed Mar  6 17:32:43 2019

-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...
The Requirements expression for job 5.000 is

    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||
      (TARGET.HasFileTransfer))

Job 5.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "htcondor.localdomain"
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 5.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  OpSys == "WINDOWS"
[8]           5  TARGET.HasFileTransfer

Last successful match: Wed Mar  6 17:32:00 2019

005.000:  Run analysis summary ignoring user priority.  Of 5 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job



Frankly, I am stuck here. I am not sure whether it is useful, but here is the output of condor_status as well:

> condor_status                                                                   

Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18
slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

  X86_64/LINUX     4     0       0         4       0          0        0      0
X86_64/WINDOWS     1     0       0         1       0          0        0      0

         Total     5     0       0         5       0          0        0      0
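
As a cross-check of the numbers above, here is a minimal Python sketch (plain Python, not HTCondor code and not its ClassAd evaluator) that applies the job's Requirements by hand to the five slots listed by condor_status. Several things in it are assumptions: HasFileTransfer is taken to be true for every slot because step [8] of the analysis matched all 5, MemoryUsage is taken to be undefined, the Disk and FileSystemDomain clauses are left out because the slots' values for them do not appear in the output, and the slot names slot1..slot4 stand in for the redacted host names. The helper names request_memory and matches are illustrative only.

```python
# Slot attributes copied from the condor_status output.
# HasFileTransfer=True is an assumption based on step [8] matching 5 slots.
slots = [
    {"Name": "Win7", "OpSys": "WINDOWS", "Arch": "X86_64", "Memory": 2047, "HasFileTransfer": True},
    {"Name": "slot1", "OpSys": "LINUX", "Arch": "X86_64", "Memory": 244, "HasFileTransfer": True},
    {"Name": "slot2", "OpSys": "LINUX", "Arch": "X86_64", "Memory": 244, "HasFileTransfer": True},
    {"Name": "slot3", "OpSys": "LINUX", "Arch": "X86_64", "Memory": 244, "HasFileTransfer": True},
    {"Name": "slot4", "OpSys": "LINUX", "Arch": "X86_64", "Memory": 244, "HasFileTransfer": True},
]

# Job attributes from the analysis; MemoryUsage is assumed to be undefined.
job = {"ImageSize": 1, "MemoryUsage": None}


def request_memory(job):
    """Mirror ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024)."""
    usage = job.get("MemoryUsage")
    if usage is not None:
        return usage
    return (job["ImageSize"] + 1023) // 1024  # integer division


def matches(slot, job):
    """Apply the OpSys, Arch, Memory and HasFileTransfer clauses only.

    The Disk and FileSystemDomain clauses are omitted because the slot-side
    values are not shown in the output above.
    """
    return (
        slot["OpSys"] == "WINDOWS"
        and slot["Arch"] == "X86_64"
        and slot["Memory"] >= request_memory(job)
        and slot["HasFileTransfer"]
    )


matched = [s["Name"] for s in slots if matches(s, job)]
rejected = [s["Name"] for s in slots if not matches(s, job)]
print(f"{len(matched)} of {len(slots)} slots match: {matched}")
```

Under these assumptions the sketch agrees with the summary: only the Win7 slot satisfies the requirements and the four Linux slots are rejected, so the Requirements expression itself does not seem to explain why the job stays idle.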

All the best,
Alexander A. Prokhorov