
Re: [HTCondor-users] Node matched and able to run, but the job is idle



Dear John,

Thank you for the quick response.


Here are all the lines that appeared in the log at 17:57:02:

03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/06/19 17:57:02 ** /usr/sbin/condor_shadow
03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $
03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $
03/06/19 17:57:02 ** PID = 1420616
03/06/19 17:57:02 ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 Using config source: /etc/condor/condor_config
03/06/19 17:57:02 Using local config sources:
03/06/19 17:57:02    /etc/condor/config.d/00debconf
03/06/19 17:57:02    /etc/condor/config.d/10parallel
03/06/19 17:57:02    /etc/condor/config.d/20allifaces
03/06/19 17:57:02    /etc/condor/condor_config.local
03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352
03/06/19 17:57:02 CLASSAD_CACHING is OFF
03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_245
03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0
03/06/19 17:57:02 (5.0) (1420616): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was ACCEPTED
03/06/19 17:57:02 (5.0) (1420616): ERROR "Error from Win7: Could not initiate file transfer" at line 565 in file /slots/01/dir_13152/userdir/.tmp95luG3/condor-8.8.1/src/condor_shadow.V6.1/pseudo_ops.cpp
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/06/19 17:57:02 ** /usr/sbin/condor_shadow
03/06/19 17:57:02 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/06/19 17:57:02 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/06/19 17:57:02 ** $CondorVersion: 8.8.1 Feb 19 2019 BuildID: Debian-8.8.1-1 PackageID: 8.8.1-1 Debian-8.8.1-1 $
03/06/19 17:57:02 ** $CondorPlatform: X86_64-Ubuntu_18.04 $
03/06/19 17:57:02 ** PID = 1420620
03/06/19 17:57:02 ** Log last touched 3/6 17:57:02
03/06/19 17:57:02 ******************************************************
03/06/19 17:57:02 Using config source: /etc/condor/condor_config
03/06/19 17:57:02 Using local config sources:
03/06/19 17:57:02    /etc/condor/config.d/00debconf
03/06/19 17:57:02    /etc/condor/config.d/10parallel
03/06/19 17:57:02    /etc/condor/config.d/20allifaces
03/06/19 17:57:02    /etc/condor/condor_config.local
03/06/19 17:57:02 config Macros = 80, Sorted = 80, StringBytes = 2256, TablesBytes = 1352
03/06/19 17:57:02 CLASSAD_CACHING is OFF
03/06/19 17:57:02 Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 SharedPortEndpoint: waiting for connections to named socket 1327422_a214_246
03/06/19 17:57:02 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>
03/06/19 17:57:02 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_246>
03/06/19 17:57:02 Initializing a VANILLA shadow for job 5.0
03/06/19 17:57:02 (5.0) (1420620): Request to run on Win7 <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=2380_3d04_3> was REFUSED
03/06/19 17:57:02 (5.0) (1420620): Job 5.0 is being evicted from Win7
03/06/19 17:57:02 (5.0) (1420620): logEvictEvent with unknown reason (108), not logging.
03/06/19 17:57:02 (5.0) (1420620): **** condor_shadow (condor_SHADOW) pid 1420620 EXITING WITH STATUS 108
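Most of the lines in a dump like the one above are startup banner. As a minimal sketch (not an HTCondor tool), the lines that point at the failure can be picked out with a few regular expressions. The pattern list and the function name `trouble_lines` are assumptions made for illustration, based only on the wording seen in these excerpts.

```python
import re

# Patterns that mark trouble in the excerpts in this thread. This list is an
# assumption for illustration, not an official HTCondor classification.
# \bERROR\b deliberately does not match the debug-flag name D_ERROR.
TROUBLE = re.compile(
    r"\bERROR\b|\bREFUSED\b|errno\s*=\s*\d+|(?i:failed)|Lost connection"
)

# Timestamp format used in the logs above: MM/DD/YY HH:MM:SS, then the message.
LINE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")


def trouble_lines(log_text):
    """Return (timestamp, message) pairs for lines that look like failures."""
    hits = []
    for raw in log_text.splitlines():
        m = LINE.match(raw)
        if m and TROUBLE.search(m.group(2)):
            hits.append((m.group(1), m.group(2)))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python trouble.py < ShadowLog
    for stamp, message in trouble_lines(sys.stdin.read()):
        print(stamp, message)
```

Run against the shadow log above, this keeps the "Could not initiate file transfer" and "was REFUSED" lines and drops the banner.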


Speaking of firewalls, I have already disabled the firewall completely on the Windows machine. The main machine is a freshly installed Ubuntu 18.04, and I have not set up any firewall there yet.

Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?
How can that be the case when you say you are submitting from Linux but running on Windows?

That is interesting. I do indeed submit the job from a Linux machine, so I do not understand how this is possible. What can I check?

All the best,
Alexander A. Prokhorov



On 6 Mar 2019, at 19:03, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

error 10054 is "An existing connection was forcibly closed by the remote host."
 
So either the condor_shadow or a firewall forcibly closed the connection. You should look in the ShadowLog on the submit machine
at 03/06/19 17:57:02 to see if it was the shadow that closed the connection. If it did, it should give a reason.
 
Also, the address <127.0.0.1:9618> indicates that the submit machine is the same as the execute machine. Is that the case?
How can that be the case when you say you are submitting from Linux but running on Windows?
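
The loopback check can be made mechanical. Below is a minimal sketch that splits an address string of the form seen in these logs into its primary host, port, and the alternates listed under `addrs=`, then tests whether the primary host is a loopback address. The parsing rules are inferred from the log lines in this thread only (entries joined by `+`, host and port joined by `-`, IPv6 written in brackets with `-` in place of `:`), so treat them as assumptions; the function names are hypothetical.

```python
import ipaddress
import re


def parse_sinful(sinful):
    """Split '<host:port?key=value&...>' into (host, port, alternate_hosts).

    The layout of the 'addrs' parameter is inferred from the log excerpts in
    this thread, not taken from HTCondor documentation.
    """
    m = re.fullmatch(r"<([^?>]+):(\d+)(?:\?([^>]*))?>", sinful.strip())
    if m is None:
        raise ValueError(f"unrecognised address string: {sinful!r}")
    host, port, query = m.group(1), int(m.group(2)), m.group(3) or ""

    # Split the query by hand: '+' is a separator here, not an encoded space.
    params = {}
    for item in query.split("&"):
        if item:
            key, _, value = item.partition("=")
            params[key] = value

    alternates = []
    for entry in params.get("addrs", "").split("+"):
        if not entry:
            continue
        addr, _, _port = entry.rpartition("-")
        if addr.startswith("["):
            # IPv6 appears as [a-b-c-...]; restore the usual colon form.
            addr = addr.strip("[]").replace("-", ":")
        alternates.append(addr)
    return host, port, alternates


def is_loopback(host):
    """True if host is a loopback IP address such as 127.0.0.1 or ::1."""
    return ipaddress.ip_address(host).is_loopback


if __name__ == "__main__":
    shadow = ("<127.0.0.1:9618?addrs=127.0.0.1-9618+"
              "[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618"
              "&noUDP&sock=1327422_a214_245>")
    host, port, alternates = parse_sinful(shadow)
    print(host, port, alternates, is_loopback(host))
```

A loopback primary address on the shadow means the starter on the Windows node is being told to connect back to 127.0.0.1, which from its point of view is the Windows node itself.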
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Wednesday, March 6, 2019 9:18 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Node matched and able to run, but the job is idle
 
I just figured out that something bad is happening on the Windows node.
 
The StarterLog file shows the following; this text is appended to the log every single minute:
 
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP
03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe
03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $
03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $
03/06/19 17:57:02 (pid:5900) ** PID = 5900
03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config
03/06/19 17:57:02 (pid:5900) Using local config sources: 
03/06/19 17:57:02 (pid:5900)    C:\condor\condor_config.local
03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776
03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF
03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.
03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring
03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"
03/06/19 17:57:02 (pid:5900) setting the orig job name in starter
03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter
03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.
03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.
03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!
03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054 
03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed
03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp
03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.
03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).
03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.
03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header
03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect
03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0
 
 
Quick googling got me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I observe, but the log looks similar. Any ideas?
 
All the best,
Alexander A. Prokhorov
 
 


On 6 Mar 2019, at 17:42, Alexander Prokhorov <prokher@xxxxxxxxx> wrote:
 
Dear Colleagues,
 
I am evaluating HTCondor as a resource management system for a piece of software I am in charge of. I first studied the docs, and it seems to be exactly what we need, so I moved on to experiments. (Great job, impressive!)
 
I am now running experiments to check whether HTCondor's capabilities match our needs in practice. One of the key features of HTCondor I find attractive is Windows support. (Our software is cross-platform, so Windows support is a strong requirement.) So I am trying to submit a Windows job from a Linux machine. I have run into a rather strange case that I cannot explain myself, so I am asking for your help. The job I submit stays idle even though `condor_q` reports that there is a node able to run it.
 
 
> condor_q -better-analyze
 
htcondor: Wed Mar  6 17:32:43 2019
 
-- Schedd: htcondor.localdomain : <127.0.0.1:9618?...
The Requirements expression for job 5.000 is
 
    (OpSys == "WINDOWS") && (TARGET.Arch == "X86_64") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.FileSystemDomain == MY.FileSystemDomain) ||
      (TARGET.HasFileTransfer))
 
Job 5.000 defines the following attributes:
 
    DiskUsage = 1
    FileSystemDomain = "htcondor.localdomain"
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
 
The Requirements expression for job 5.000 reduces to these conditions:
 
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  OpSys == "WINDOWS"
[8]           5  TARGET.HasFileTransfer
 
Last successful match: Wed Mar  6 17:32:00 2019
 
005.000:  Run analysis summary ignoring user priority.  Of 5 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job
 
 
 
Frankly, I am stuck here. I am not sure if it is useful, but here is also the output of condor_status:
 
> condor_status                                                                   
 
Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
 
Win7                       WINDOWS    X86_64 Unclaimed Idle      0.000 2047  0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:18
slot2@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot3@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
slot4@xxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  244  0+01:48:46
 
               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
 
  X86_64/LINUX     4     0       0         4       0          0        0      0
X86_64/WINDOWS     1     0       0         1       0          0        0      0
 
         Total     5     0       0         5       0          0        0      0
 
All the best,
Alexander A. Prokhorov
 
 
 
 