
Re: [HTCondor-users] Node matched and able to run, but the job is idle



John,

Here is the output you asked about:

> condor_config_val -dump NETWORK
# Parameters with names that match NETWORK:
NETWORK_HOSTNAME =
NETWORK_INTERFACE = 127.0.0.1
NETWORK_MAX_PENDING_CONNECTS = 0
OPENMPI_EXCLUDE_NETWORK_INTERFACES = docker0,virbr0
PRIVATE_NETWORK_INTERFACE =
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)
VM_NETWORKING = false
VM_NETWORKING_DEFAULT_TYPE =
VM_NETWORKING_MAC_PREFIX =
VM_NETWORKING_TYPE =
VMWARE_NETWORKING_TYPE =
# Contributing configuration file(s):
# /etc/condor/condor_config
# /etc/condor/config.d/00debconf
# /etc/condor/config.d/10parallel
# /etc/condor/config.d/20allifaces
# /etc/condor/condor_config.local


Actually, the Windows machine initially could not connect to the main Linux HTCondor server at all, which is why I added the following file:

> cat /etc/condor/config.d/20allifaces
BIND_ALL_INTERFACES = TRUE

After this, the connection works and I see my Windows node in the condor_status output.

Could you please tell me what I should do next, or just point me to the relevant part of the documentation? Thanks in advance.

All the best,
Alexander A. Prokhorov



On 7 Mar 2019, at 01:03, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

So your condor_shadow is advertising <127.0.0.1:9618...&sock=1327422_a214_245> as its address. But this will only work if the startd is on the same machine as the shadow.
 
Try running this command on your submit machine:
 
condor_config_val -dump NETWORK
 
I'm expecting that you have something like
NETWORK_INTERFACE=127.0.0.1
 
which would be causing the shadow to advertise that as its primary address.
 
By the way, if this is the case, I would expect that you could not run jobs from this submit node on a Linux execute node either.
Your first email implied that running jobs on a Linux execute node from this submit machine works. Have you tested this? Can you run
jobs anywhere but on the local machine from this submit node?
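
To make the diagnosis above concrete, here is a minimal sketch in Python. It is not an HTCondor tool; the function names `primary_host` and `advertises_loopback` are made up for illustration, and it only looks at the IPv4 host that appears first in an address of the `<host:port?...>` form seen in the logs. It simply reports whether that host is a loopback address, which is the condition that would make the shadow unreachable from another machine.

```python
import ipaddress
import re


def primary_host(address: str) -> str:
    """Return the IPv4 host at the start of an address like <host:port?params>."""
    match = re.match(r"<([0-9.]+):(\d+)", address.strip())
    if match is None:
        raise ValueError(f"unrecognised address: {address!r}")
    return match.group(1)


def advertises_loopback(address: str) -> bool:
    """True if the primary host is a loopback address (127.0.0.0/8)."""
    return ipaddress.ip_address(primary_host(address)).is_loopback


if __name__ == "__main__":
    # Both addresses are copied from the logs quoted in this thread.
    shadow = ("<127.0.0.1:9618?addrs=127.0.0.1-9618+"
              "[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>")
    starter = "<10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>"
    for name, addr in (("shadow", shadow), ("starter", starter)):
        print(name, primary_host(addr), "loopback:", advertises_loopback(addr))
```

A loopback result for the shadow address means a starter on a different machine would end up trying to contact itself instead of the submit machine.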
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 3:52 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle
 
Dear John,
 
Thank you for a quick response.
 
 
Here are all the lines that appeared in the log at 17:57:02:
 
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/06/19 17:57:02 ** /usr/sbin/condor_shadow
03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $
03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $
03/06/19 17:57:02 ** PID = 1420616
03/06/19 17:57:02 ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 Using config source: /etc/condor/condor_config
03/06/19 17:57:02 Using local config sources:
03/06/19 17:57:02    /etc/condor/config.d/00debconf
03/06/19 17:57:02    /etc/condor/config.d/10parallel
03/06/19 17:57:02    /etc/condor/config.d/20allifaces
03/06/19 17:57:02    /etc/condor/condor_config.local
03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352
03/06/19 17:57:02 CLASSAD_CACHING is OFF
03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_245
03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0
03/06/19 17:57:02 (5.0) (1420616): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was ACCEPTED
03/06/19 17:57:02 (5.0) (1420616): ERROR "Error from Win7: Could not initiate file transfer" at line 565 in file /slots/01/dir_13152/userdir/.tmp95luG3/condor-8.8.1/src/condor_shadow.V6.1/pseudo_ops.cpp
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/06/19 17:57:02 ** /usr/sbin/condor_shadow
03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $
03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $
03/06/19 17:57:02 ** PID = 1420620
03/06/19 17:57:02 ** Log last touched 3/6 17:57:02
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 Using config source: /etc/condor/condor_config
03/06/19 17:57:02 Using local config sources:
03/06/19 17:57:02    /etc/condor/config.d/00debconf
03/06/19 17:57:02    /etc/condor/config.d/10parallel
03/06/19 17:57:02    /etc/condor/config.d/20allifaces
03/06/19 17:57:02    /etc/condor/condor_config.local
03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352
03/06/19 17:57:02 CLASSAD_CACHING is OFF
03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_246
03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>
03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>
03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0
03/06/19 17:57:02 (5.0) (1420620): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was REFUSED
03/06/19 17:57:02 (5.0) (1420620): Job 5.0 is being evicted from Win7
03/06/19 17:57:02 (5.0) (1420620): logEvictEvent with unknown reason (108), not logging.
03/06/19 17:57:02 (5.0) (1420620): **** condor_shadow (condor_SHADOW) pid 1420620 EXITING WITH STATUS 108
 
 
Speaking of firewalls, I have already disabled the firewall completely on the Windows machine, and I use a freshly installed Ubuntu 18.04 as the main machine, where I have not set up any firewall yet.
 
Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?
How can that be the case when you say you are submitting from Linux but running on Windows?
 
That is interesting. Indeed, I submit the job from a Linux machine, and I do not understand how this is possible. What can I check?
 
All the best,
Alexander A. Prokhorov
 
 


On 6 Mar 2019, at 19:03, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
 
Error 10054 is "An existing connection was forcibly closed by the remote host."
 
So either the condor_shadow or a firewall forcibly closed the connection. You should look in the ShadowLog on the submit machine
at 03/06/19 17:57:02 to see if it was the shadow that closed the connection. If it did, it should give a reason.
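
For a long log, the lookup described above can be sketched in Python as follows. This is only an illustration and not part of HTCondor: the function name `suspicious_lines` is hypothetical, and the choice of the words ERROR and REFUSED as markers is an assumption about what a relevant line will contain.

```python
import re

# Whole-word match, so the "D_ERROR" in the "Daemon Log is logging" line is ignored.
MARKERS = re.compile(r"\b(ERROR|REFUSED)\b")


def suspicious_lines(log_text: str, timestamp: str) -> list[str]:
    """Return log lines that start with the given timestamp and contain a marker word."""
    return [
        line
        for line in log_text.splitlines()
        if line.startswith(timestamp) and MARKERS.search(line)
    ]
```

It would be called with the full text of the log and the timestamp string exactly as it appears at the start of each line, for example "03/06/19 17:57:02".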
 
Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?
How can that be the case when you say you are submitting from Linux but running on Windows?
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 9:18 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle
 
I just figured out that something bad is happening on the Windows node.
 
The StarterLog file shows the following; this text is added to the log every single minute:
 
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP
03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe
03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $
03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $
03/06/19 17:57:02 (pid:5900) ** PID = 5900
03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config
03/06/19 17:57:02 (pid:5900) Using local config sources: 
03/06/19 17:57:02 (pid:5900)    C:\condor\condor_config.local
03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776
03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF
03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.
03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring
03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"
03/06/19 17:57:02 (pid:5900) setting the orig job name in starter
03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter
03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.
03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.
03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!
03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054 
03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed
03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp
03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.
03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).
03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.
03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header
03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect
03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0
 
 
A quick search led me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I observe, but the log looks similar. Any ideas?
 
All the best,
Alexander A. Prokhorov
 
 



On 6 Mar 2019, at 17:42, Alexander Prokhorov <prokher@xxxxxxxxx> wrote:
 
Dear Colleagues,
 
I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. First I studied the docs, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)
 
So I am performing experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features of HTCondor I find attractive is Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) So I am trying to submit a Windows job from a Linux machine. I have run into a rather strange case that I cannot explain myself, so I am asking for your help. The job I submit stays idle even though `condor_q` reports that there is a node able to run the job.
 
 
> condor_q -better-analyze
 
htcondor: Wed Mar  6 17:32:43 2019
 
-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...
The Requirements expression for job 5.000 is
 
    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||
      (TARGET.HasFileTransfer))
 
Job 5.000 defines the following attributes:
 
    DiskUsage = 1
    FileSystemDomain = "htcondor.localdomain"
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
 
The Requirements expression for job 5.000 reduces to these conditions:
 
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  OpSys == "WINDOWS"
[8]           5  TARGET.HasFileTransfer
 
Last successful match: Wed Mar  6 17:32:00 2019
 
005.000:  Run analysis summary ignoring user priority.  Of 5 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job
 
 
 
Frankly, I am stuck here. I am not sure if it is useful, but here is also the output of condor_status:
 
> condor_status                                                                   
 
Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
 
Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18
slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
 
               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
 
  X86_64/LINUX     4     0       0         4       0          0        0      0
X86_64/WINDOWS     1     0       0         1       0          0        0      0
 
         Total     5     0       0         5       0          0        0      0
 
All the best,
Alexander A. Prokhorov
 
 
 
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to 
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
 