
Re: [HTCondor-users] Node matched and able to run, but the job is idle



I just figured out that something is going wrong on the Windows node.

The StarterLog file shows the following; this text is appended to the log every single minute:

03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP
03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe
03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $
03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $
03/06/19 17:57:02 (pid:5900) ** PID = 5900
03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config
03/06/19 17:57:02 (pid:5900) Using local config sources: 
03/06/19 17:57:02 (pid:5900)    C:\condor\condor_config.local
03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776
03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF
03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.
03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring
03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"
03/06/19 17:57:02 (pid:5900) setting the orig job name in starter
03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter
03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.
03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.
03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!
03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054 
03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed
03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp
03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.
03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).
03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.
03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header
03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect
03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0


A quick search led me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I observe, but the log looks similar. Any ideas?
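
For anyone triaging a similar StarterLog, here is a minimal Python sketch that pulls out the lines carrying an errno or a fatal ERROR and names the Winsock code. It is not part of HTCondor; the function name `summarize_failures` and the small error table are illustrative assumptions. The one code that appears in the log above, 10054, is the standard Winsock WSAECONNRESET (connection reset by peer).

```python
import re

# Standard Winsock error codes; only 10054 actually appears in the log above.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (software caused connection abort)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Matches both "errno=10054" and "errno = 10054", as the log uses both forms.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def summarize_failures(log_text):
    """Return (line, explanation) pairs for lines with an errno or an ERROR."""
    findings = []
    for line in log_text.splitlines():
        match = ERRNO_RE.search(line)
        if match:
            code = int(match.group(1))
            label = WINSOCK_ERRORS.get(code, "unrecognized errno %d" % code)
            findings.append((line.strip(), label))
        elif " ERROR " in line:
            findings.append((line.strip(), "fatal starter error"))
    return findings


if __name__ == "__main__":
    demo = (
        "03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when "
        "trying to write 39 bytes to daemon at <127.0.0.1:9618>, "
        "fd is 596, errno=10054"
    )
    for line, label in summarize_failures(demo):
        print(label)
        print("  " + line)
```

Run against the excerpt above, it would flag the condor_write() and condor_read() lines as connection resets and the "Could not initiate file transfer" line as the fatal error. Note that the reset write was addressed to 127.0.0.1:9618, the loopback address the shadow advertised.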

All the best,
Alexander A. Prokhorov



On 6 Mar 2019, at 17:42, Alexander Prokhorov <prokher@xxxxxxxxx> wrote:

Dear Colleagues,

I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. First I studied the docs, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)

I am now running experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features of HTCondor I find attractive is Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) I am therefore trying to submit a Windows job from a Linux machine. I have run into a rather strange case that I cannot explain myself, so I am asking for your help. The job I submit stays idle even though `condor_q` reports that there is a node able to run it.


> condor_q -better-analyze

htcondor: Wed Mar  6 17:32:43 2019

-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...
The Requirements expression for job 5.000 is

    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||
      (TARGET.HasFileTransfer))

Job 5.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "htcondor.localdomain"
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 5.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  OpSys == "WINDOWS"
[8]           5  TARGET.HasFileTransfer

Last successful match: Wed Mar  6 17:32:00 2019

005.000:  Run analysis summary ignoring user priority.  Of 5 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job



Frankly, I am stuck here. I am not sure whether it is useful, but here is also the output of condor_status:

> condor_status                                                                   

Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18
slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

  X86_64/LINUX     4     0       0         4       0          0        0      0
X86_64/WINDOWS     1     0       0         1       0          0        0      0

         Total     5     0       0         5       0          0        0      0
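
As a cross-check of the analysis against this listing, the Requirements expression can be sketched in plain Python. This is only an illustration, not how HTCondor evaluates ClassAds. The slot dictionaries are transcribed by hand from the condor_status output, with the masked Linux slot names replaced by placeholders. `HasFileTransfer` is set to true for every slot because the analysis reports 5 matches for that condition. The Disk clause is left out because condor_status does not show disk space, and `RequestMemory` is assumed to evaluate to 1 (ImageSize = 1, MemoryUsage undefined, integer division).

```python
# Slot attributes transcribed from the condor_status listing above.
# The Linux slot names are placeholders; the real ones are masked.
slots = [
    {"Name": "Win7", "OpSys": "WINDOWS", "Arch": "X86_64",
     "Memory": 2047, "HasFileTransfer": True},
] + [
    {"Name": "linux-slot%d" % i, "OpSys": "LINUX", "Arch": "X86_64",
     "Memory": 244, "HasFileTransfer": True}
    for i in range(1, 5)
]

# Assumption: ifthenelse(...) falls through to (ImageSize + 1023) / 1024.
IMAGE_SIZE = 1
REQUEST_MEMORY = (IMAGE_SIZE + 1023) // 1024


def requirements(slot):
    """Approximate the job's Requirements expression (Disk clause omitted)."""
    return (
        slot["OpSys"] == "WINDOWS"
        and slot["Arch"] == "X86_64"
        and slot["Memory"] >= REQUEST_MEMORY
        and slot["HasFileTransfer"]
    )


matched = [slot["Name"] for slot in slots if requirements(slot)]
rejected = [slot["Name"] for slot in slots if not requirements(slot)]
print("able to run:", matched)
print("rejected by job requirements:", len(rejected))
```

Under these assumptions only the Win7 slot satisfies the expression and the four Linux slots are rejected, which agrees with the "1 are able to run your job" summary. That suggests the job is not idle for lack of a matching slot.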

All the best,
Alexander A. Prokhorov