
Re: [HTCondor-users] Node matched and able to run, but the job is idle



Error 10054 is “An existing connection was forcibly closed by the remote host.”

 

So either the condor_shadow or a firewall forcibly closed the connection. You should look in the ShadowLog on the submit machine at 03/06/19 17:57:02 to see if it was the shadow that closed the connection. If it did, it should give a reason.
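
For example, on the submit machine something like this will pull out the relevant lines (the ShadowLog path is read from the configuration rather than guessed):

    > grep '03/06/19 17:57' $(condor_config_val SHADOW_LOG)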

 

Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case? And how can it be, when you say you are submitting from Linux but running on Windows?
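
If the schedd really is advertising the loopback address, you can check which interface the submit machine binds to (the commands are standard; the output depends on your pool):

    > condor_config_val NETWORK_INTERFACE
    > condor_status -schedd -autoformat Name MyAddress

and, if it is wrong, pin it in the submit machine's local config. The address below is only a guess at the submit host's routable interface, based on the 10.211.55.10 peer seen in the StarterLog:

    NETWORK_INTERFACE = 10.211.55.10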

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 9:18 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle

 

I just figured out that something bad is happening on the Windows node.

 

The StarterLog file shows the following; this text is appended to the log every single minute:

 

03/06/19 17:57:02 (pid:5900) ******************************************************

03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP

03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe

03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)

03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $

03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $

03/06/19 17:57:02 (pid:5900) ** PID = 5900

03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02

03/06/19 17:57:02 (pid:5900) ******************************************************

03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config

03/06/19 17:57:02 (pid:5900) Using local config sources: 

03/06/19 17:57:02 (pid:5900)    C:\condor\condor_config.local

03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776

03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF

03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR

03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.

03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>

03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>

03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring

03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>

03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"

03/06/19 17:57:02 (pid:5900) setting the orig job name in starter

03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter

03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.

03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.

03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!

03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054 

03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed

03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp

03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.

03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).

03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.

03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header

03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect

03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting

03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0

03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0

 

 

Quick googling led me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I am observing, but the log looks similar. Any ideas?

 

All the best,

Alexander A. Prokhorov

 

 



On 6 Mar 2019, at 17:42, Alexander Prokhorov <prokher@xxxxxxxxx> wrote:

 

Dear Colleagues,

 

I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. First I studied the docs, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)

 

So I am performing experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features of HTCondor I find attractive is its Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) Thus I am trying to submit a Windows job from a Linux machine. Eventually, I ran into a rather strange case that I cannot explain myself, so I am asking for your help. The job I submit stays idle even though `condor_q` reports that there is a node able to run it.
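
For reference, a submit file along these lines would produce the Requirements expression shown below (the executable name here is just a placeholder):

    universe                = vanilla
    executable              = hello.bat
    requirements            = (OpSys == "WINDOWS")
    should_transfer_files   = IF_NEEDED
    when_to_transfer_output = ON_EXIT
    queue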

 

 

> condor_q -better-analyze

 

htcondor: Wed Mar  6 17:32:43 2019

 

-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...

The Requirements expression for job 5.000 is

 

    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&

    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&

    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||

      (TARGET.HasFileTransfer))

 

Job 5.000 defines the following attributes:

 

    DiskUsage = 1

    FileSystemDomain = "htcondor.localdomain"

    ImageSize = 1

    RequestDisk = DiskUsage

    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

 

The Requirements expression for job 5.000 reduces to these conditions:

 

         Slots

Step    Matched  Condition

-----  --------  ---------

[0]           1  OpSys == "WINDOWS"

[8]           5  TARGET.HasFileTransfer

 

Last successful match: Wed Mar  6 17:32:00 2019

 

005.000:  Run analysis summary ignoring user priority.  Of 5 machines,

      4 are rejected by your job's requirements

      0 reject your job because of their own requirements

      0 match and are already running your jobs

      0 match but are serving other users

      1 are able to run your job

 

 

 

Frankly, I am stuck here. I am not sure whether it is useful, but here is also the output of condor_status:

 

> condor_status

 

Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

 

Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03

slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18

slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

 

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

 

  X86_64/LINUX     4     0       0         4       0          0        0      0

X86_64/WINDOWS     1     0       0         1       0          0        0      0

 

         Total     5     0       0         5       0          0        0      0

 

All the best,

Alexander A. Prokhorov