
Re: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)



Just figured it out after more random testing: it works if the installation directory is C:\Condor instead of D:\Condor (even though the latter was the default install dir).
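For anyone hitting the same thing, here is a rough Python sketch for checking which drive the daemons actually read their configuration from. It relies on the "Using config source:" line that appears in the ShadowLog quoted in the original message; the function name and the sample line are my own, and this is only a quick check, not an HTCondor tool.

```python
import ntpath
import re

# Matches the startup line daemons print, e.g.
# "07/29/14 22:52:49 Using config source: D:\condor\condor_config"
CONFIG_SOURCE = re.compile(r"Using config source: (.+)$")

def config_drives(log_text):
    """Return the set of drive letters found in 'Using config source:' lines."""
    drives = set()
    for line in log_text.splitlines():
        match = CONFIG_SOURCE.search(line)
        if match:
            # ntpath parses Windows paths regardless of the OS running this script
            drive, _rest = ntpath.splitdrive(match.group(1).strip())
            drives.add(drive.upper())
    return drives

sample = r"07/29/14 22:52:49 Using config source: D:\condor\condor_config"
print(config_drives(sample))
```

If this reports anything other than C: on a setup like mine, that may be worth trying to change first.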


-=- Olivier


> Message of 30/07/14 05:22
> From: "Olivier Delalleau"
> To: htcondor-users@xxxxxxxxxxx
> Cc:
> Subject: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)
>
>

Hi,


>

I've been trying to install HTCondor (condor-8.2.1-256063-Windows-x86.msi) on my Windows 7 computer at work, and I've been stuck at the point where jobs I submit never start.


>

Before giving more details on the problem, I just want to point out what looks like a typo in the default condor_config file, which reads:

CONDOR_HOST = $$(FULL_HOSTNAME)

instead of:

CONDOR_HOST = $(FULL_HOSTNAME)

(with a single $)
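To spot this in an existing condor_config, a small Python sketch along these lines should do. The helper name and the sample text are made up for illustration; it simply flags any assignment whose right-hand side contains a doubled $ before an opening parenthesis, and assumes that is never intended in this file.

```python
import re

# "NAME = ... $$(MACRO) ..." on a single line; hypothetical helper, not part of HTCondor
DOUBLE_DOLLAR = re.compile(r"^\s*(\w+)\s*=.*\$\$\(")

def find_double_dollar(config_text):
    """Return (line_number, setting_name) for each assignment containing '$$('."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore commented-out settings
        match = DOUBLE_DOLLAR.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits

sample = "CONDOR_HOST = $$(FULL_HOSTNAME)\nCOLLECTOR_HOST = $(CONDOR_HOST)\n"
print(find_double_dollar(sample))
```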


>

Now, for the main issue, it seems to be the same as http://stackoverflow.com/questions/24647062/condor-on-win7-connection-issue-errno-10054. Here are more details. First, in the job's log file, it says:


>

000 (009.000.000) 07/29 22:52:49 Job submitted from host: <10.128.20.195:63107>

...

022 (009.000.000) 07/29 22:52:50 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.128.20.195:63109>

...

024 (009.000.000) 07/29 22:52:50 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job


>

In the ShadowLog, I have this error:


>

07/29/14 22:52:49 ******************************************************

07/29/14 22:52:49 Using config source: D:\condor\condor_config

07/29/14 22:52:49 Using local config sources: 

07/29/14 22:52:49    D:\condor\condor_config.local

07/29/14 22:52:49 config Macros = 42, Sorted = 42, StringBytes = 743, TablesBytes = 360

07/29/14 22:52:49 CLASSAD_CACHING is OFF

07/29/14 22:52:49 Daemon Log is logging: D_ALWAYS D_ERROR

07/29/14 22:52:49 DaemonCore: command socket at <10.128.20.195:63235>

07/29/14 22:52:49 DaemonCore: private command socket at <10.128.20.195:63235>

07/29/14 22:52:49 Initializing a VANILLA shadow for job 9.0

07/29/14 22:52:49 (9.0) (1272): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.128.20.195:63109> was ACCEPTED

07/29/14 22:52:50 (9.0) (1272): condor_read() failed: recv(fd=636) returned -1, errno = 10054 , reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.

07/29/14 22:52:50 (9.0) (1272): IO: Failed to read packet header

07/29/14 22:52:50 (9.0) (1272): Can no longer talk to condor_starter <10.128.20.195:63109>

07/29/14 22:52:50 (9.0) (1272): Trying to reconnect to disconnected job

07/29/14 22:52:50 (9.0) (1272): LastJobLeaseRenewal: 1406688769 Tue Jul 29 22:52:49 2014

07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration: 1200 seconds

07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration remaining: 1199

07/29/14 22:52:50 (9.0) (1272): Attempting to locate disconnected starter

07/29/14 22:52:50 (9.0) (1272): locateStarter(): ClaimId (<10.128.20.195:63109>#1406688549#18#c23a6392bd34c97659a3880901f2fa3aa84a3a0b) and GlobalJobId ( my-computer-name.mydomain.org#9.0#1406688769 ) not found

07/29/14 22:52:50 (9.0) (1272): Reconnect FAILED: Job not found at execution machine

07/29/14 22:52:50 (9.0) (1272): **** condor_shadow (condor_SHADOW) pid 1272 EXITING WITH STATUS 107

07/29/14 22:53:49 ******************************************************
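For reference, errno 10054 is the Winsock code WSAECONNRESET, meaning the other end of the connection was reset. Here is a minimal Python sketch that pulls these failures out of a log and names the code; the lookup table is deliberately partial and the function name is my own.

```python
import re

# Partial table of Winsock error codes; 10054 is the one seen in the log above
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED",
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
    10061: "WSAECONNREFUSED",
}

READ_FAILURE = re.compile(
    r"condor_read\(\) failed: recv\(fd=(\d+)\) returned -1, errno = (\d+)"
)

def read_failures(log_text):
    """Return (fd, errno, symbolic_name) for each condor_read() failure line."""
    results = []
    for line in log_text.splitlines():
        match = READ_FAILURE.search(line)
        if match:
            fd = int(match.group(1))
            errno = int(match.group(2))
            results.append((fd, errno, WINSOCK_ERRORS.get(errno, "unknown")))
    return results

sample = (
    "07/29/14 22:52:50 (9.0) (1272): condor_read() failed: recv(fd=636) "
    "returned -1, errno = 10054 , reading 5 bytes from startd"
)
print(read_failures(sample))
```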


>

Finally, I enabled D_ALL logging for the StartLog, which I suspect is the most relevant log, and here is what I see around the error (which occurs near the bottom of this excerpt):


>

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=4096,timeout=0,flags=0,non_blocking=0)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=738,timeout=0,flags=0,non_blocking=0)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore::Create_Process(D:\condor\bin\condor_starter.exe,...)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) LISTEN <10.128.20.195:63244> fd=628

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) InitCommandSocket(IPv4, 1, want UDP, non-fatal errors) created <10.128.20.195:63244>

07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: created non-negotiated security session 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5 for 0 (inf) seconds.

07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: exporting session info for 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5: [CurrentTime=time();Encryption="NO";Integrity="NO";CryptoMethods="3DES";]

07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) About to register family for PID 8160 with the ProcD

07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) Result of "register_subfamily" operation from ProcD: SUCCESS

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) Child Process: pid 8160 at <10.128.20.195:63244> (0.00 sec)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63244> fd=1276

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63243> fd=1256

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Got universe "VANILLA" (5) from request classad

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: State change: claim-activation protocol successful

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Changing activity: Idle -> Busy

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) in DaemonCore NewTimer()

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=67

07/29/14 22:52:49 (fd:3) (pid:1428) (D_COMMAND) Return from HandleReq (handler: 0.016s, sec: 0.000s, payload: 0.000s)

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63109> fd=688

07/29/14 22:52:49 (fd:3) (pid:1428) (D_PRIV) PRIV_CONDOR --> PRIV_CONDOR at c:\condor\execute\dir_18128\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4101

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore Timeout()

07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 4 

07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) PERF: entering select

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) PERF: leaving select

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) State = FDS_READY

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) max_fd = 1156

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Selection FD's

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {576 600 684 1156 } = 4

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Ready FD's

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {684 } = 1

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Timeout = 4.000000 seconds

07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Calling Handler for Socket

07/29/14 22:52:50 (fd:3) (pid:1428) (D_COMMAND) Calling Handler (2)

07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) condor_read(fd=684 <127.0.0.1:63243>,,size=5,timeout=10,flags=0,non_blocking=0)

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) condor_read() failed: recv(fd=684) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:63243>.

07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) IO: Failed to read packet header

07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) Stream::get(int) failed to read padding

07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Cancel_Socket: cancelled socket 2 01D51760

07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63242> fd=684


>

I've tried tons of config variants, toying with options COLLECTOR_HOST, use SECURITY, NO_DNS, NETWORK_INTERFACE, UID_DOMAIN, DEFAULT_DOMAIN_NAME, BIND_ALL_INTERFACES, UPDATE_COLLECTOR_WITH_TCP... but the error is basically always the same.


>

This is on a single computer, which I set up to be both a submit and an execute node.


>

I wonder if the problem might be that it's using 127.0.0.1. I'm not sure why it uses that address instead of 10.128.20.195, which is the computer's IP on the network. I only mention it because if I try to force the IP to 127.0.0.1 through NETWORK_INTERFACE, then nothing works at all (I can't even submit a job). It's just a wild guess though.


>

Thanks for any help,


>

-=- Olivier
