
[Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master



First, thanks for taking the time to read this.

I upgraded to Condor 7.6.2 from condor-7.6.2-x86_rhap_5-unstripped.tar.gz on a dual Xeon 32 bit machine running the i686 version of CentOS 5.6, stock kernel, with 12 GB DRAM.

The previous install was 7.5.3, 32 bit, controlling only 2 Windows XP Pro 32 bit machines, and it worked.

I upgraded to 7.6.2 so that we could add Windows 7 64 bit slaves and Windows 2008 Server 64 bit slaves.

I uninstalled all the old code on the Linux master and the Windows slaves.
Our test code runs fine, standalone, on all the machines (both 32 and 64 bit).

I unpacked condor-7.6.2-x86_rhap_5-unstripped.tar.gz on the Linux machine and downloaded the MSI files and the redistributable 2008 and 2011 C executables for the Windows machines.

I then ran the installer on the Linux machine named sched1.am1.mnet (./condor_install --type=manager,submit --central-manager=sched1.am1.mnet --verbose).

I then installed the msi files on the old XP Pro 32 bit machines, the Windows 7 64 Bit machine and Windows 2008 Server machine.

Question 1: How do I get the Windows 2008 Server machine to start the Condor service as a local service? If I give the service a user and login password, it starts. If I choose local service, it won't start. This machine has a full install of W2K8 STD SVR 64 bit but is used for nothing else. Condor starts correctly on the XP and W7 machines.

Question 2: Once the service on the W2K8 STD SVR machine is started with a user and password in the services screen, I see:

[root@sched1 condor]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:07:25
slot2@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:06:26
slot1@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:04
slot2@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:54:45
slot3@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:06
slot4@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:07
slot1@HP3          WINNT61    X86_64 Unclaimed Idle     0.010  1023  0+02:11:56
slot2@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:13:57
slot3@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:12:58
slot4@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+03:15:07

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     2     0       0         2       0          0        0
      X86_64/WINNT61     8     0       0         8       0          0        0

               Total    10     0       0        10       0          0        0

condor-xp1  -  Windows XP Pro 32 Bit
HP2         -  Windows 7 64 Bit
HP3         -  Windows 2008 STD Server 64 Bit

but when I try to submit a job using condor_submit file_name I get these errors:

08/25/11 13:18:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:18:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:18:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:18:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:18:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:18:34 (pid:3451) Response problem from startd when requesting claim slot2@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:18:34 (pid:3451) Match record (slot2@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
08/25/11 13:19:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:19:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:19:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:19:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:19:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:19:34 (pid:3451) Response problem from startd when requesting claim slot3@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:19:34 (pid:3451) Match record (slot3@HP3 <172.28.96.120:49215> for rita, 7.0) deleted

The job gets matched to one or more free cores on a slave machine, but after the match I get a "condor_read() failed: recv() returned -1, errno = 104" (Connection reset by peer) error.
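
To see at a glance which slots and hosts the resets come from, the failure lines can be tallied with a short script. This is only a sketch: the regular expression is modeled on the two condor_read() lines quoted above, the function name count_read_failures is made up for illustration, and other Condor versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the condor_read() failure lines quoted in this post.
# It is an assumption that other log entries follow the same wording.
FAILURE = re.compile(
    r"condor_read\(\) failed: recv\(\) returned -1, errno = (\d+) "
    r".*?from startd (\S+) <([\d.]+):(\d+)>"
)


def count_read_failures(log_text):
    """Count condor_read() failures per (slot, ip, port, errno)."""
    counts = Counter()
    for match in FAILURE.finditer(log_text):
        err, slot, ip, port = match.groups()
        counts[(slot, ip, int(port), int(err))] += 1
    return counts


# Two lines copied from the log excerpt above, used as sample input.
SAMPLE = (
    "08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, "
    "errno = 104 Connection reset by peer, reading 5 bytes from startd "
    "slot2@HP3 <172.28.96.120:49215> for rita.\n"
    "08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, "
    "errno = 104 Connection reset by peer, reading 5 bytes from startd "
    "slot3@HP3 <172.28.96.120:49215> for rita.\n"
)

if __name__ == "__main__":
    for key, n in sorted(count_read_failures(SAMPLE).items()):
        print(key, n)
```

In the excerpt above both failures point at HP3 (172.28.96.120), so running this over the full log would show whether the XP and Windows 7 machines are ever affected.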

The Condor processes running on the master are:

[root@sched1 condor]# ps -eflc | grep condor_
5 S condor    3448     1 TS  21 -  2094 -      11:26 ?        00:00:09 condor_master
4 S condor    3449  3448 TS  21 -  2290 -      11:26 ?        00:00:01 condor_collector -f
4 S condor    3450  3448 TS  20 -  2182 -      11:26 ?        00:00:04 condor_negotiator -f
4 S condor    3451  3448 TS  21 -  2585 -      11:26 ?        00:00:00 condor_schedd -f
4 S root      3452  3451 TS  21 -   978 -      11:26 ?        00:00:03 condor_procd -A /tmp/condor-lock.sched10.974037967463122/procd_pipe.SCHEDD -R 10000000 -S 60 -C 1016
0 S root      4786  3866 TS  21 -  1005 pipe_w 15:44 pts/2    00:00:00 grep condor_

The scheduler (sched1) is 172.28.96.118; the slave nodes are .119 and .120 for the 64 bit machines and .79 for the WinXP Pro 32 bit machine.
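
With the addresses known, one basic check from sched1 is whether a startd port accepts a TCP connection at all. The sketch below uses only Python's standard socket module; the function name probe is made up for illustration, and the port in the usage comment is the one from the log excerpt, which may change when the startd restarts. A successful connect only shows that the port is reachable. It does not rule out the startd dropping the connection afterwards, which is what a reset during a read looks like.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return "open" if a TCP connection succeeds, else a failure description."""
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable hosts
        return f"failed: {exc}"


# Example use from the scheduler, with the startd address seen in the log:
#   print(probe("172.28.96.120", 49215))
```

If the probe reports "open" while the claim request still fails, the cause is more likely on the startd side (for example its own log or its access settings) than a blocked port.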

The users on all machines are 2 people (rita and seth), a "condor" user, and root (on Linux).

Any help in looking for where to troubleshoot this would be greatly appreciated.

--
Seth Bardash

Integrated Solutions and Systems LLC
1510 Old North Gate Road
Colorado Springs, CO  80921

719-495-5866   Shop Phone
719-337-4779   Cell
seth@xxxxxxxxxxxxxxxxxxxxxxx

Failure cannot survive knowledge and perseverance!