[Condor-users] Strange standard universe failure



Hi,

We're trying to submit standard universe jobs from a pool of INTEL/Linux machines to a flocked pool of X86_64/Linux machines, but without success. The jobs get matched to an execute machine, but the Shadow on the submit host then fails to connect to a Starter on the execute node. In the following logs, the submit host is at 172.24.116.7, the execute node is at 172.24.89.200, and the execute node's central manager is at 172.24.89.201. Here's the relevant snippet from the ShadowLog:

11/24 16:18:59 (?.?) (10531):******* Standard Shadow starting up *******
11/24 16:18:59 (?.?) (10531):** $CondorVersion: 6.8.2 Oct 12 2006 $
11/24 16:18:59 (?.?) (10531):** $CondorPlatform: I386-LINUX_RHEL3 $
11/24 16:18:59 (?.?) (10531):*******************************************
11/24 16:18:59 (?.?) (10531):uid=0, euid=1001, gid=0, egid=100
11/24 16:18:59 (?.?) (10531):Hostname = "<172.24.89.200:9677>", Job = 118.0
11/24 16:18:59 (118.0) (10531):Requesting Primary Starter
11/24 16:18:59 (118.0) (10531):Shadow: Request to run a job was ACCEPTED
11/24 16:18:59 (118.0) (10531):connect returns -1, errno = 113
11/24 16:18:59 (118.0) (10531):failed to connect to scheduler on <172.24.89.200:9696>
11/24 16:18:59 (118.0) (10531):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster118.proc0.subproc0.tmp'
11/24 16:18:59 (118.0) (10531):Trying to unlink /home/condor/spool/cluster118.proc0.subproc0.tmp
11/24 16:18:59 (118.0) (10531):********** Shadow Exiting(108) **********
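For reference, the "errno = 113" in the connect line above can be translated into a symbolic name with a few lines of Python. This is only a small sketch; the helper name describe_errno is made up for illustration, and errno numbering is platform-specific, so it should be run on a Linux host like the submit machine.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value.

    Errno numbers differ between platforms, so run this on the same
    kind of OS that produced the log (here: Linux).
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)
```

On Linux, describe_errno(113) should give EHOSTUNREACH ("No route to host"), which points at a routing or packet-filtering problem between the two hosts rather than at nothing listening on the port (that would normally be ECONNREFUSED).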

And here's the corresponding entry from the StartLog on the execute node:

11/24 16:18:54 vm1: Received match <172.24.89.200:9677>#1164119595#192
11/24 16:18:54 vm1: Started match timer (6775) for 120 seconds.
11/24 16:18:54 vm1: State change: match notification protocol successful
11/24 16:18:54 vm1: Changing state: Unclaimed -> Matched
11/24 16:18:55 DaemonCore: Command received via TCP from host <172.24.116.7:9610>
11/24 16:18:55 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/24 16:18:55 vm1: Canceled match timer (6775)
11/24 16:18:55 vm1: Schedd addr = <172.24.116.7:9690>
11/24 16:18:55 vm1: Alive interval = 300
11/24 16:18:55 vm1: Received ClaimId from schedd (<172.24.89.200:9677>#1164119595#192)
11/24 16:18:55 vm1: Rank of this claim is: 0.000000
11/24 16:18:55 vm1: Request accepted.
11/24 16:18:55 vm1: Remote owner is mcal00@xxxxxxxxxxxxxxxxxxxxxxxxx
11/24 16:18:55 vm1: State change: claiming protocol successful
11/24 16:18:55 vm1: Changing state: Matched -> Claimed
11/24 16:18:55 vm1: Started ClaimLease timer (6777) w/ 1800 second lease duration
11/24 16:18:58 Trying to update collector <172.24.89.201:9618>
11/24 16:18:58 Attempting to send update via UDP to collector appcs--ra--phy.grid.private.cam.ac.uk <172.24.89.201:9618>
11/24 16:18:58 vm1: Sent update to 1 collector(s)
11/24 16:18:59 DaemonCore: Command received via TCP from host <172.24.116.7:9647>
11/24 16:18:59 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/24 16:18:59 vm1: Got activate_claim request from shadow (<172.24.116.7:9647>)
11/24 16:18:59 vm1: Read request ad and starter from shadow.
11/24 16:18:59 Swap space: 6288944
11/24 16:18:59 47238576 kbytes available for "/usr/condor/local/execute"
11/24 16:18:59 Looking up RESERVED_DISK parameter
11/24 16:18:59 Reserving 5120 kbytes for file system
11/24 16:18:59 Disk space: 47233456
11/24 16:18:59 Job wants old RSC/Ckpt starter, skipping /usr/condor/sbin/condor_starter
11/24 16:18:59 Job wants old RSC/Ckpt starter, skipping /usr/condor/sbin/condor_starter.pvm
11/24 16:21:29 vm1: accept timed out
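Since the startd ends up timing out in accept while the Shadow's connect fails, one way to check plain TCP reachability from the submit host, independently of Condor, is a small probe like the one below. It is a sketch only: the function name probe and the 3 second timeout are arbitrary choices, and it tests nothing beyond whether a TCP connection can be opened.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try to open a TCP connection to host:port.

    Returns (True, None) on success, or (False, reason) on failure,
    where reason is the symbolic errno name when one is available
    (e.g. "ECONNREFUSED") and the exception text otherwise.
    """
    try:
        # create_connection resolves the host and connects with a timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Timeouts carry no errno, so fall back to the exception text
        return False, errno.errorcode.get(exc.errno, str(exc))
```

Running probe("172.24.89.200", 9696) on the submit host while a job is being activated would try the same address and port that the Shadow reports failing on. The port has to be open at that moment for the result to mean anything, since these ports are allocated dynamically.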

There's no activity in the StarterLog, which I find strange. Also, why is the Shadow trying to connect to a "scheduler" on the execute host, as reported in the ShadowLog? The execute host only runs two daemons, a master and a startd, so what is that message referring to?

Thanks for any pointers/clues,
Mark