
Re: [HTCondor-users] Node matched and able to run, but the job is idle



So your condor_shadow is advertising <127.0.0.1:9618...&sock=1327422_a214_245> as its address. But this will only work if the startd is on the same machine as the shadow.
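To illustrate the point about the advertised address, here is a minimal Python sketch that splits an address string of the shape seen in these logs into host, port, and parameters, and checks whether the host is a loopback address. The helper names `parse_sinful` and `is_loopback` are hypothetical (they are not part of HTCondor), and the parsing only covers the IPv4 `<host:port?key=value&...>` shape that appears in this thread.

```python
import ipaddress


def parse_sinful(sinful):
    """Split '<host:port?key=val&flag>' into (host, port, params).

    Hypothetical helper: handles only the IPv4 form quoted in this thread.
    """
    body = sinful.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    hostport, _, query = body.partition("?")
    host, _, port = hostport.rpartition(":")
    params = {}
    for item in query.split("&"):
        if item:
            key, _, value = item.partition("=")
            params[key] = value  # bare flags such as noUDP map to ""
    return host, int(port), params


def is_loopback(host):
    """True if host is a literal loopback IP such as 127.0.0.1."""
    return ipaddress.ip_address(host).is_loopback


# Addresses copied from the ShadowLog and StarterLog excerpts in this thread.
shadow = ("<127.0.0.1:9618?addrs=127.0.0.1-9618+"
          "[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>")
starter = "<10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>"

for label, addr in (("shadow", shadow), ("starter", starter)):
    host, port, params = parse_sinful(addr)
    print(label, host, port, "loopback" if is_loopback(host) else "routable")
```

A loopback primary address on the shadow side is only reachable from the submit machine itself, which is why a starter on another host cannot connect back to it.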

 

Try running this command on your submit machine:

 

condor_config_val -dump NETWORK

 

I'm expecting that you have something like

NETWORK_INTERFACE=127.0.0.1

 

which would be causing the shadow to advertise that as its primary address.
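As a rough way to automate that check, the following Python sketch scans text in the `NAME = value` form and reports any setting whose value is a literal loopback IP. The function name `loopback_settings` is hypothetical, the sample text is made up for illustration, and the assumption that the dump output uses `NAME = value` lines (with optional `#` comment lines) should be verified against your own output.

```python
import ipaddress


def loopback_settings(dump_text):
    """Return {name: value} for settings whose value is a loopback IP.

    Assumes 'NAME = value' lines; comment and malformed lines are skipped.
    """
    flagged = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        try:
            if ipaddress.ip_address(value).is_loopback:
                flagged[name] = value
        except ValueError:
            pass  # value is not a literal IP address (hostname, empty, etc.)
    return flagged


# Hypothetical sample, not real output from the machine in this thread.
sample = """\
# illustrative dump text
NETWORK_HOSTNAME =
NETWORK_INTERFACE = 127.0.0.1
"""

print(loopback_settings(sample))
```

If the result is non-empty for NETWORK_INTERFACE, that setting is a likely reason for the shadow advertising 127.0.0.1.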

 

By the way, if this is the case, I would expect that you could not run jobs from this submit node on a Linux execute node either.

Your first email implied that running jobs on a Linux execute node from this submit machine works. Have you tested this? Can you run jobs anywhere but on the local machine from this submit node?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 3:52 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle

 

Dear John,

 

Thank you for a quick response.

 

 

Here are all the lines that appeared in the log at 17:57:02:

 

03/06/19 17:57:02 ******************************************************

03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP

03/06/19 17:57:02 ** /usr/sbin/condor_shadow

03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $

03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $

03/06/19 17:57:02 ** PID = 1420616

03/06/19 17:57:02 ** Log last touched 3/6 17:56:02

03/06/19 17:57:02 ******************************************************

03/06/19 17:57:02 Using config source: /etc/condor/condor_config

03/06/19 17:57:02 Using local config sources:

03/06/19 17:57:02    /etc/condor/config.d/00debconf

03/06/19 17:57:02    /etc/condor/config.d/10parallel

03/06/19 17:57:02    /etc/condor/config.d/20allifaces

03/06/19 17:57:02    /etc/condor/condor_config.local

03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352

03/06/19 17:57:02 CLASSAD_CACHING is OFF

03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR

03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_245

03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>

03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>

03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0

03/06/19 17:57:02 (5.0) (1420616): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was ACCEPTED

03/06/19 17:57:02 (5.0) (1420616): ERROR "Error from Win7: Could not initiate file transfer" at line 565 in file /slots/01/dir_13152/userdir/.tmp95luG3/condor-8.8.1/src/condor_shadow.V6.1/pseudo_ops.cpp

03/06/19 17:57:02 ******************************************************

03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP

03/06/19 17:57:02 ** /usr/sbin/condor_shadow

03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $

03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $

03/06/19 17:57:02 ** PID = 1420620

03/06/19 17:57:02 ** Log last touched 3/6 17:57:02

03/06/19 17:57:02 ******************************************************

03/06/19 17:57:02 Using config source: /etc/condor/condor_config

03/06/19 17:57:02 Using local config sources:

03/06/19 17:57:02    /etc/condor/config.d/00debconf

03/06/19 17:57:02    /etc/condor/config.d/10parallel

03/06/19 17:57:02    /etc/condor/config.d/20allifaces

03/06/19 17:57:02    /etc/condor/condor_config.local

03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352

03/06/19 17:57:02 CLASSAD_CACHING is OFF

03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR

03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_246

03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>

03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>

03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0

03/06/19 17:57:02 (5.0) (1420620): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was REFUSED

03/06/19 17:57:02 (5.0) (1420620): Job 5.0 is being evicted from Win7

03/06/19 17:57:02 (5.0) (1420620): logEvictEvent with unknown reason (108), not logging.

03/06/19 17:57:02 (5.0) (1420620): **** condor_shadow (condor_SHADOW) pid 1420620 EXITING WITH STATUS 108

 

 

Speaking of firewalls, I have already disabled the firewall completely on the Windows machine, and I use a freshly installed Ubuntu 18.04 as the main machine, where I have not set up any firewall yet.

 

Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?

How can that be the case when you say you are submitting from Linux but running on Windows?

 

That is interesting. Indeed, I submit the job from a Linux machine, so I do not understand how this is possible. What can I check?

 

All the best,

Alexander A. Prokhorov

 

 



On 6 Mar 2019, at 19:03, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

 

Error 10054 is "An existing connection was forcibly closed by the remote host."

 

So either the condor_shadow or a firewall forcibly closed the connection. You should look in the ShadowLog on the submit machine at 03/06/19 17:57:02 to see if it was the shadow that closed the connection. If it did, it should give a reason.
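A small Python sketch of that lookup, assuming the log text has already been read into a string: it keeps only the lines that start with the given timestamp and contain one of a few keywords. The function name `lines_at` and the keyword list are hypothetical choices; the sample lines are copied from the ShadowLog excerpt quoted in this thread.

```python
def lines_at(log_text, stamp, keywords=("ERROR", "REFUSED", "evicted")):
    """Return log lines that begin with `stamp` and mention any keyword."""
    hits = []
    for line in log_text.splitlines():
        if line.startswith(stamp) and any(k in line for k in keywords):
            hits.append(line)
    return hits


# Lines copied from the ShadowLog excerpt in this thread.
sample = '''03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0
03/06/19 17:57:02 (5.0) (1420616): ERROR "Error from Win7: Could not initiate file transfer" at line 565 in file /slots/01/dir_13152/userdir/.tmp95luG3/condor-8.8.1/src/condor_shadow.V6.1/pseudo_ops.cpp
03/06/19 17:57:02 (5.0) (1420620): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was REFUSED
03/06/19 17:57:02 (5.0) (1420620): Job 5.0 is being evicted from Win7
'''

for hit in lines_at(sample, "03/06/19 17:57:02"):
    print(hit)
```

In practice the text would come from the ShadowLog file on the submit machine; the file's location depends on the local configuration, so it is left out here.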

 

Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?

How can that be the case when you say you are submitting from Linux but running on Windows?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 9:18 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle

 

I just figured out that something bad is happening on the Windows node.

 

The StarterLog file shows the following; this text is added to the log every minute:

 

03/06/19 17:57:02 (pid:5900) ******************************************************

03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP

03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe

03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)

03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $

03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $

03/06/19 17:57:02 (pid:5900) ** PID = 5900

03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02

03/06/19 17:57:02 (pid:5900) ******************************************************

03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config

03/06/19 17:57:02 (pid:5900) Using local config sources: 

03/06/19 17:57:02 (pid:5900)    C:\condor\condor_config.local

03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776

03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF

03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR

03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.

03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>

03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>

03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring

03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>

03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"

03/06/19 17:57:02 (pid:5900) setting the orig job name in starter

03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter

03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.

03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.

03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!

03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054 

03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed

03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp

03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.

03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).

03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.

03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header

03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect

03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting

03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0

03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0

 

 

A quick search led me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I observe, but the log looks similar. Any ideas?

 

All the best,

Alexander A. Prokhorov

 

 




On 6 Mar 2019, at 17:42, Alexander Prokhorov <prokher@xxxxxxxxx> wrote:

 

Dear Colleagues,

 

I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. First I studied the docs, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)

 

I am running experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features of HTCondor I find attractive is Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) So I am trying to submit a Windows job from a Linux machine. I have run into a rather strange case that I cannot explain myself, so I am asking for your help. The job I submit stays idle even though `condor_q` reports that there is a node able to run the job.

 

 

> condor_q -better-analyze

 

htcondor: Wed Mar  6 17:32:43 2019

 

-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...

The Requirements expression for job 5.000 is

 

    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&

    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&

    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||

      (TARGET.HasFileTransfer))

 

Job 5.000 defines the following attributes:

 

    DiskUsage = 1

    FileSystemDomain = "htcondor.localdomain"

    ImageSize = 1

    RequestDisk = DiskUsage

    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

 

The Requirements expression for job 5.000 reduces to these conditions:

 

         Slots

Step    Matched  Condition

-----  --------  ---------

[0]           1  OpSys == "WINDOWS"

[8]           5  TARGET.HasFileTransfer

 

Last successful match: Wed Mar  6 17:32:00 2019

 

005.000:  Run analysis summary ignoring user priority.  Of 5 machines,

      4 are rejected by your job's requirements

      0 reject your job because of their own requirements

      0 match and are already running your jobs

      0 match but are serving other users

      1 are able to run your job

 

 

 

Frankly, I am stuck here. I am not sure if it is useful, but here is also the output of condor_status:

 

> condor_status                                                                   

 

Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

 

Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03

slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18

slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46

 

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

 

  X86_64/LINUX     4     0       0         4       0          0        0      0

X86_64/WINDOWS     1     0       0         1       0          0        0      0

 

         Total     5     0       0         5       0          0        0      0

 

All the best,

Alexander A. Prokhorov

 

 

 

 
