Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)

Date: Wed, 30 Jul 2014 06:30:31 +0200
From: tiho.can@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)

Just figured it out after more random testing: it works if the installation directory is C:\Condor instead of D:\Condor (even though the latter was the default install dir).

-=- Olivier

> Message of 30/07/14 05:22
> From: "Olivier Delalleau"
> To: htcondor-users@xxxxxxxxxxx
> Cc:
> Subject: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)
>
>
Hi,

>
I've been trying to install HTCondor (condor-8.2.1-256063-Windows-x86.msi) on my Windows 7 computer at work, and I've been stuck at the point where jobs I submit never start.

>
Before giving more details on the problem, I just want to point out that there is a typo in the default condor_config file, which reads:
CONDOR_HOST = $$(FULL_HOSTNAME)
instead of:
CONDOR_HOST = $(FULL_HOSTNAME)
(with a single $)
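
A quick way to check an existing condor_config for the same pattern is to scan it for `$$(`. The following is only a minimal sketch: the function name `find_double_dollar` and the sample text are hypothetical, and the script merely flags occurrences without deciding whether each one is actually a mistake.

```python
import re

# Literal "$$(" anywhere in a line.
DOUBLE_DOLLAR = re.compile(r"\$\$\(")


def find_double_dollar(config_text):
    """Return (line_number, line) pairs for non-comment lines containing '$$('."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        if DOUBLE_DOLLAR.search(stripped):
            hits.append((number, stripped))
    return hits


# Hypothetical sample; in practice, read the text of your own condor_config.
sample = (
    "# sample configuration\n"
    "CONDOR_HOST = $$(FULL_HOSTNAME)\n"
    "COLLECTOR_NAME = $(CONDOR_HOST)\n"
)

for number, line in find_double_dollar(sample):
    print(f"line {number}: {line}")
```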

>
Now, for the main issue: it seems to be the same as http://stackoverflow.com/questions/24647062/condor-on-win7-connection-issue-errno-10054. Here are more details. First, the job's log file says:

>
000 (009.000.000) 07/29 22:52:49 Job submitted from host: <10.128.20.195:63107>
...
022 (009.000.000) 07/29 22:52:50 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.128.20.195:63109>
...
024 (009.000.000) 07/29 22:52:50 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
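
For longer job logs, the events can be pulled apart programmatically. This is a rough sketch that assumes the layout seen in the excerpt above (a three-digit event code, a job id in parentheses, a timestamp, then detail lines, with `...` separating events). The names `EVENT_HEADER` and `parse_events` are hypothetical and not part of HTCondor.

```python
import re

# Header layout assumed from the excerpt: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS text"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def parse_events(log_text):
    """Split a job log into events; lines after a header are kept as details."""
    events = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "...":  # event separator
            current = None
            continue
        match = EVENT_HEADER.match(line)
        if match:
            code, cluster, proc, _subproc, stamp, text = match.groups()
            current = {
                "code": code,
                "job": f"{int(cluster)}.{int(proc)}",
                "time": stamp,
                "text": text,
                "details": [],
            }
            events.append(current)
        elif current is not None:
            current["details"].append(line)
    return events


sample = """\
000 (009.000.000) 07/29 22:52:49 Job submitted from host: <10.128.20.195:63107>
...
022 (009.000.000) 07/29 22:52:50 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
...
024 (009.000.000) 07/29 22:52:50 Job reconnection failed
Job not found at execution machine
"""

for event in parse_events(sample):
    print(event["code"], event["job"], event["text"], event["details"])
```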

>
In the ShadowLog, I have this error:

>
07/29/14 22:52:49 ******************************************************
07/29/14 22:52:49 Using config source: D:\condor\condor_config

07/29/14 22:52:49 Using local config sources:
07/29/14 22:52:49 D:\condor\condor_config.local

07/29/14 22:52:49 config Macros = 42, Sorted = 42, StringBytes = 743, TablesBytes = 360
07/29/14 22:52:49 CLASSAD_CACHING is OFF

07/29/14 22:52:49 Daemon Log is logging: D_ALWAYS D_ERROR
07/29/14 22:52:49 DaemonCore: command socket at <10.128.20.195:63235>

07/29/14 22:52:49 DaemonCore: private command socket at <10.128.20.195:63235>
07/29/14 22:52:49 Initializing a VANILLA shadow for job 9.0

07/29/14 22:52:49 (9.0) (1272): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.128.20.195:63109> was ACCEPTED

07/29/14 22:52:50 (9.0) (1272): condor_read() failed: recv(fd=636) returned -1, errno = 10054 , reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.

07/29/14 22:52:50 (9.0) (1272): IO: Failed to read packet header
07/29/14 22:52:50 (9.0) (1272): Can no longer talk to condor_starter <10.128.20.195:63109>

07/29/14 22:52:50 (9.0) (1272): Trying to reconnect to disconnected job
07/29/14 22:52:50 (9.0) (1272): LastJobLeaseRenewal: 1406688769 Tue Jul 29 22:52:49 2014

07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration: 1200 seconds
07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration remaining: 1199

07/29/14 22:52:50 (9.0) (1272): Attempting to locate disconnected starter
07/29/14 22:52:50 (9.0) (1272): locateStarter(): ClaimId (<10.128.20.195:63109>#1406688549#18#c23a6392bd34c97659a3880901f2fa3aa84a3a0b) and GlobalJobId ( my-computer-name.mydomain.org#9.0#1406688769 ) not found

07/29/14 22:52:50 (9.0) (1272): Reconnect FAILED: Job not found at execution machine
07/29/14 22:52:50 (9.0) (1272): **** condor_shadow (condor_SHADOW) pid 1272 EXITING WITH STATUS 107

07/29/14 22:53:49 ******************************************************
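
For reference, errno 10054 on Windows is the Winsock code WSAECONNRESET, meaning the other end reset the connection. Below is a small sketch for pulling such failures out of a daemon log and labelling them. The regular expression is modelled on the `condor_read() failed` line above, the lookup table covers only a few common Winsock codes, and the function name `explain_read_failures` is hypothetical.

```python
import re

# A few common Winsock error codes (not an exhaustive list).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Pattern modelled on the log line quoted in this thread.
READ_FAILURE = re.compile(
    r"condor_read\(\) failed: recv\(fd=(\d+)\) returned -1, errno = (\d+)"
)


def explain_read_failures(log_text):
    """Return (fd, errno, description) for each condor_read() failure line."""
    results = []
    for line in log_text.splitlines():
        match = READ_FAILURE.search(line)
        if match:
            fd = int(match.group(1))
            errno_value = int(match.group(2))
            description = WINSOCK_ERRORS.get(errno_value, "unknown")
            results.append((fd, errno_value, description))
    return results


sample = (
    "07/29/14 22:52:49 (9.0) (1272): Request to run on slot1 was ACCEPTED\n"
    "07/29/14 22:52:50 (9.0) (1272): condor_read() failed: recv(fd=636) "
    "returned -1, errno = 10054 , reading 5 bytes from startd slot1.\n"
)

for fd, errno_value, description in explain_read_failures(sample):
    print(f"fd {fd}: errno {errno_value} = {description}")
```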

>
Finally, I enabled D_ALL logging for the StartLog, which I suspect is the main log of interest. Here is what I see around the error (which occurs near the bottom of this excerpt):

>
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=4096,timeout=0,flags=0,non_blocking=0)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=738,timeout=0,flags=0,non_blocking=0)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore::Create_Process(D:\condor\bin\condor_starter.exe,...)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) LISTEN <10.128.20.195:63244> fd=628

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) InitCommandSocket(IPv4, 1, want UDP, non-fatal errors) created <10.128.20.195:63244>

07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: created non-negotiated security session 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5 for 0 (inf) seconds.

07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: exporting session info for 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5: [CurrentTime=time();Encryption="NO";Integrity="NO";CryptoMethods="3DES";]

07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) About to register family for PID 8160 with the ProcD
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) Result of "register_subfamily" operation from ProcD: SUCCESS

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) Child Process: pid 8160 at <10.128.20.195:63244> (0.00 sec)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63244> fd=1276
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63243> fd=1256

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Got universe "VANILLA" (5) from request classad
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: State change: claim-activation protocol successful

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Changing activity: Idle -> Busy
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) in DaemonCore NewTimer()

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=67
07/29/14 22:52:49 (fd:3) (pid:1428) (D_COMMAND) Return from HandleReq (handler: 0.016s, sec: 0.000s, payload: 0.000s)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63109> fd=688
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PRIV) PRIV_CONDOR --> PRIV_CONDOR at c:\condor\execute\dir_18128\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4101

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore Timeout()
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 4

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) PERF: entering select
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) PERF: leaving select

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) State = FDS_READY
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) max_fd = 1156

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Selection FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {576 600 684 1156 } = 4

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Ready FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {684 } = 1

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Timeout = 4.000000 seconds
07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Calling Handler for Socket

07/29/14 22:52:50 (fd:3) (pid:1428) (D_COMMAND) Calling Handler (2)
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) condor_read(fd=684 <127.0.0.1:63243>,,size=5,timeout=10,flags=0,non_blocking=0)

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) condor_read() failed: recv(fd=684) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:63243>.

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) IO: Failed to read packet header
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) Stream::get(int) failed to read padding

07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Cancel_Socket: cancelled socket 2 01D51760
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63242> fd=684

>
I've tried tons of config variants, toying with options COLLECTOR_HOST, use SECURITY, NO_DNS, NETWORK_INTERFACE, UID_DOMAIN, DEFAULT_DOMAIN_NAME, BIND_ALL_INTERFACES, UPDATE_COLLECTOR_WITH_TCP... but the error is basically always the same.

>
This is on a single computer which I set up to be both a submit and an execute node.

>
I wonder if the problem might be that it's using 127.0.0.1. I'm not sure why it uses that address instead of 10.128.20.195, which is the computer's IP on the network. I'm only saying that because if I try to force the IP to 127.0.0.1 through NETWORK_INTERFACE, then nothing works at all (I can't even submit a job). It's just a wild guess though.
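
To see at a glance which endpoints in a log are loopback and which are on the network, the `<ip:port>` addresses can be extracted and classified. This is only a sketch using Python's standard `ipaddress` module; the names `SINFUL` and `classify_endpoints` are hypothetical, and it does not explain why a daemon chose a given address.

```python
import ipaddress
import re

# Matches "<a.b.c.d:port>" as it appears in the log excerpts.
SINFUL = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def classify_endpoints(log_text):
    """Return a sorted list of unique (address, port, is_loopback) tuples."""
    seen = set()
    for address, port in SINFUL.findall(log_text):
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not a valid IPv4 address, ignore it
        seen.add((address, int(port), ip.is_loopback))
    return sorted(seen)


sample = (
    "(D_NETWORK) CLOSE <10.128.20.195:63109> fd=688\n"
    "(D_NETWORK) condor_read(fd=684 <127.0.0.1:63243>,,size=5,timeout=10,"
    "flags=0,non_blocking=0)\n"
)

for address, port, is_loopback in classify_endpoints(sample):
    kind = "loopback" if is_loopback else "network"
    print(f"{address}:{port} -> {kind}")
```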

>
Thanks for any help,

>
-=- Olivier
>
> [ (no file name) (0.3 KB) ]

