
Re: [HTCondor-users] Jobs not starting on Windows (condor_read() failed)



It looks like the shadow and the starter disagree about which IP address to use to talk to each other.
If this is a single-node HTCondor pool, then it should be possible to get things working by configuring:

MY_IP = 10.128.20.195
FULL_HOSTNAME = my-computer-name.mydomain.org
UID_DOMAIN = mydomain.org
NETWORK_INTERFACE = $(MY_IP)
CONDOR_HOST = $(FULL_HOSTNAME)


Of course, my-computer-name.mydomain.org needs to be the actual DNS name for 10.128.20.195.
You can run nslookup to make sure:

   nslookup 10.128.20.195
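
If you would rather script that check, here is a minimal Python sketch of the same idea: it confirms that the hostname resolves to the IP and that the IP resolves back to the hostname. The function name dns_is_consistent and its forward/reverse parameters are my own invention for illustration, not anything HTCondor provides, and the address and hostname are the placeholders from the config above.

```python
import socket


def dns_is_consistent(ip, hostname,
                      forward=socket.gethostbyname_ex,
                      reverse=socket.gethostbyaddr):
    """Return True if hostname resolves to ip AND ip resolves back to hostname.

    forward/reverse default to the standard-library resolvers; they are
    parameters (an illustrative choice) so the check can be exercised
    without touching the real DNS.
    """
    try:
        # Both calls return (primary_name, alias_list, ip_address_list).
        _, _, addresses = forward(hostname)
        primary, aliases, _ = reverse(ip)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    names = {name.lower() for name in [primary, *aliases]}
    return ip in addresses and hostname.lower() in names


# Example call, using the placeholder values from the config above:
#   dns_is_consistent("10.128.20.195", "my-computer-name.mydomain.org")
```

A False result only tells you that the forward and reverse lookups do not agree (or that one of them failed); it does not say which side is wrong, so nslookup is still the tool for looking at the actual answers.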

Also make sure that MY_IP or FULL_HOSTNAME (or both) is authorized in the various ALLOW_* configuration values.

On 7/29/2014 10:21 PM, Olivier Delalleau wrote:
Hi,

I've been trying to install HTCondor (condor-8.2.1-256063-Windows-x86.msi) on my Windows 7 computer at work, and I've been stuck at the point where jobs I submit never start.

Before giving more details on the problem, I just want to point out there is a typo in the default condor_config file, which is written:
CONDOR_HOST = $$(FULL_HOSTNAME)
instead of:
CONDOR_HOST = $(FULL_HOSTNAME)
(with a single $)

Now, for the main issue, it seems to be the same as http://stackoverflow.com/questions/24647062/condor-on-win7-connection-issue-errno-10054. Here are more details. First, in the job's log file, it says:

000 (009.000.000) 07/29 22:52:49 Job submitted from host: <10.128.20.195:63107>
...
022 (009.000.000) 07/29 22:52:50 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
...
024 (009.000.000) 07/29 22:52:50 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

In the ShadowLog, I have this error:

07/29/14 22:52:49 ******************************************************
07/29/14 22:52:49 Using config source: D:\condor\condor_config
07/29/14 22:52:49 Using local config sources: 
07/29/14 22:52:49    D:\condor\condor_config.local
07/29/14 22:52:49 config Macros = 42, Sorted = 42, StringBytes = 743, TablesBytes = 360
07/29/14 22:52:49 CLASSAD_CACHING is OFF
07/29/14 22:52:49 Daemon Log is logging: D_ALWAYS D_ERROR
07/29/14 22:52:49 DaemonCore: command socket at <10.128.20.195:63235>
07/29/14 22:52:49 DaemonCore: private command socket at <10.128.20.195:63235>
07/29/14 22:52:49 Initializing a VANILLA shadow for job 9.0
07/29/14 22:52:49 (9.0) (1272): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.128.20.195:63109> was ACCEPTED
07/29/14 22:52:50 (9.0) (1272): condor_read() failed: recv(fd=636) returned -1, errno = 10054 , reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
07/29/14 22:52:50 (9.0) (1272): IO: Failed to read packet header
07/29/14 22:52:50 (9.0) (1272): Can no longer talk to condor_starter <10.128.20.195:63109>
07/29/14 22:52:50 (9.0) (1272): Trying to reconnect to disconnected job
07/29/14 22:52:50 (9.0) (1272): LastJobLeaseRenewal: 1406688769 Tue Jul 29 22:52:49 2014
07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration: 1200 seconds
07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration remaining: 1199
07/29/14 22:52:50 (9.0) (1272): Attempting to locate disconnected starter
07/29/14 22:52:50 (9.0) (1272): locateStarter(): ClaimId (<10.128.20.195:63109>#1406688549#18#c23a6392bd34c97659a3880901f2fa3aa84a3a0b) and GlobalJobId ( my-computer-name.mydomain.org#9.0#1406688769 ) not found
07/29/14 22:52:50 (9.0) (1272): Reconnect FAILED: Job not found at execution machine
07/29/14 22:52:50 (9.0) (1272): **** condor_shadow (condor_SHADOW) pid 1272 EXITING WITH STATUS 107
07/29/14 22:53:49 ******************************************************

Finally, I enabled D_ALL logs in the StartLog, which I suspect is the main one of interest, and here is what I see around the error (which occurs near the bottom of this excerpt):

07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=4096,timeout=0,flags=0,non_blocking=0)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=738,timeout=0,flags=0,non_blocking=0)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore::Create_Process(D:\condor\bin\condor_starter.exe,...)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) LISTEN <10.128.20.195:63244> fd=628
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) InitCommandSocket(IPv4, 1, want UDP, non-fatal errors) created <10.128.20.195:63244>
07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: created non-negotiated security session 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5 for 0 (inf) seconds.
07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: exporting session info for 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5: [CurrentTime=time();Encryption="NO";Integrity="NO";CryptoMethods="3DES";]
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) About to register family for PID 8160 with the ProcD
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) Result of "register_subfamily" operation from ProcD: SUCCESS
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) Child Process: pid 8160 at <10.128.20.195:63244> (0.00 sec)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63244> fd=1276
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63243> fd=1256
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Got universe "VANILLA" (5) from request classad
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: State change: claim-activation protocol successful
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Changing activity: Idle -> Busy
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) in DaemonCore NewTimer()
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=67
07/29/14 22:52:49 (fd:3) (pid:1428) (D_COMMAND) Return from HandleReq <command_activate_claim> (handler: 0.016s, sec: 0.000s, payload: 0.000s)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <10.128.20.195:63109> fd=688
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PRIV) PRIV_CONDOR --> PRIV_CONDOR at c:\condor\execute\dir_18128\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4101
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore Timeout()
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 4 
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) PERF: entering select
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) PERF: leaving select
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) State = FDS_READY
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) max_fd = 1156
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Selection FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {576 600 684 1156 } = 4
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Ready FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {684 } = 1
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Timeout = 4.000000 seconds
07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Calling Handler <receiveJobClassAdUpdate> for Socket <starter ClassAd update socket>
07/29/14 22:52:50 (fd:3) (pid:1428) (D_COMMAND) Calling Handler <receiveJobClassAdUpdate> (2)
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) condor_read(fd=684 <127.0.0.1:63243>,,size=5,timeout=10,flags=0,non_blocking=0)
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) condor_read() failed: recv(fd=684) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:63243>.
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) IO: Failed to read packet header
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) Stream::get(int) failed to read padding
07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Cancel_Socket: cancelled socket 2 <starter ClassAd update socket> 01D51760
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63242> fd=684

I've tried tons of config variants, toying with options COLLECTOR_HOST, use SECURITY, NO_DNS, NETWORK_INTERFACE, UID_DOMAIN, DEFAULT_DOMAIN_NAME, BIND_ALL_INTERFACES, UPDATE_COLLECTOR_WITH_TCP... but the error is basically always the same.

This is on a single computer which I set up to be both a submit and an execute node.

I wonder if the problem might be that it's using 127.0.0.1. I'm not sure why it uses that instead of 10.128.20.195, which is the computer's IP on the network. I'm only saying that because if I try to force the IP to 127.0.0.1 through NETWORK_INTERFACE, then nothing works at all (I can't even submit a job). It's just a wild guess though.

Thanks for any help,

-=- Olivier

