
Re: [Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master - Fixed + server work-around



On 8/25/2011 3:53 PM, Seth Bardash wrote:
First, thanks for taking the time to read this.

I upgraded to Condor 7.6.2 from condor-7.6.2-x86_rhap_5-unstripped.tar.gz on a dual Xeon 32 bit machine running the i686 version of CentOS 5.6, stock kernel, with 12 GB DRAM.

The original install was 7.5.3 (32 bit) and controlled only two Windows XP Pro 32 bit machines; that setup worked.

I upgraded to 7.6.2 so that we could add Windows 7 64 bit slaves and Windows 2008 Server 64 bit slaves.

I uninstalled all the old code on the Linux master and the Windows slaves.
Our test code runs fine, standalone, on all the machines (both 32 and 64 bit).

I unpacked condor-7.6.2-x86_rhap_5-unstripped.tar.gz on the Linux machine and downloaded the MSI files and the redistributable 2008 and 2011 C executables for the Windows machines.

I then ran the installer (./condor_install --type=manager,submit --central-manager=sched1.am1.mnet --verbose) on the Linux machine named sched1.am1.mnet.

I then installed the MSI files on the old XP Pro 32 bit machines, the Windows 7 64 bit machine and the Windows 2008 Server machine.

Question 1: How do I get the Windows 2008 Server machine to start the Condor service as a local service? If I give it a user and login password, it starts. If I choose a local service, it won't start. This machine has a full install of W2K8 STD SVR 64 bit but is used for nothing else. Condor starts correctly on the XP and W7 machines.

Question 2: Once the Condor service on the W2K8 STD SVR machine is started via a user and password in the services screen, I see:

[root@sched1 condor]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:07:25
slot2@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:06:26
slot1@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:04
slot2@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:54:45
slot3@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:06
slot4@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:07
slot1@HP3          WINNT61    X86_64 Unclaimed Idle     0.010  1023  0+02:11:56
slot2@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:13:57
slot3@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:12:58
slot4@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+03:15:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     2     0       0         2       0          0        0
      X86_64/WINNT61     8     0       0         8       0          0        0

               Total    10     0       0        10       0          0        0

condor-xp1  -  Windows XP Pro 32 bit
HP2         -  Windows 7 64 bit
HP3         -  Windows 2008 STD Server 64 bit

But when I try to submit a job using condor_submit file_name, I get these errors:

08/25/11 13:18:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:18:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:18:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:18:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:18:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:18:34 (pid:3451) Response problem from startd when requesting claim slot2@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:18:34 (pid:3451) Match record (slot2@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
08/25/11 13:19:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:19:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:19:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:19:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:19:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:19:34 (pid:3451) Response problem from startd when requesting claim slot3@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:19:34 (pid:3451) Match record (slot3@HP3 <172.28.96.120:49215> for rita, 7.0) deleted

The job gets matched to a free core (or a set of free cores) on a slave node, but right after the match I get a "condor_read() failed: recv() returned -1, errno = 104" (Connection reset by peer) error.
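
To see at a glance which slots reject the claim, the failures can be tallied from the log. The following is only a sketch in Python: the regular expression is modelled on the "Failed to send REQUEST_CLAIM" lines quoted above, the function name failed_claims is made up for illustration, and other Condor versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the log lines quoted in this post (an assumption:
# other Condor versions may format this message differently).
CLAIM_FAILURE = re.compile(
    r"Failed to send REQUEST_CLAIM to startd "
    r"(?P<slot>\S+) <(?P<addr>[\d.]+):(?P<port>\d+)>"
)


def failed_claims(lines):
    """Count REQUEST_CLAIM failures per (slot, address) pair.

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    counts = Counter()
    for line in lines:
        match = CLAIM_FAILURE.search(line)
        if match:
            counts[(match.group("slot"), match.group("addr"))] += 1
    return counts


# Example use (the log path is hypothetical):
#   with open("/path/to/SchedLog") as log:
#       for (slot, addr), n in failed_claims(log).most_common():
#           print(slot, addr, n)
```

If every failure points at the same address, as it does here (172.28.96.120, i.e. HP3), the problem is more likely on that one machine than on the master.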

Output from the master for processes running is:

[root@sched1 condor]# ps -eflc | grep condor_
5 S condor    3448     1 TS   21 -  2094 -      11:26 ?        00:00:09 condor_master
4 S condor    3449  3448 TS   21 -  2290 -      11:26 ?        00:00:01 condor_collector -f
4 S condor    3450  3448 TS   20 -  2182 -      11:26 ?        00:00:04 condor_negotiator -f
4 S condor    3451  3448 TS   21 -  2585 -      11:26 ?        00:00:00 condor_schedd -f
4 S root      3452  3451 TS   21 -   978 -      11:26 ?        00:00:03 condor_procd -A /tmp/condor-lock.sched10.974037967463122/procd_pipe.SCHEDD -R 10000000 -S 60 -C 1016
0 S root      4786  3866 TS   21 -  1005 pipe_w 15:44 pts/2    00:00:00 grep condor_

The scheduler (sched1) is 172.28.96.118; the slave nodes are .119 and .120 for the 64 bit machines and .79 for the WinXP Pro 32 bit machine.

The users on all machines are two people (rita and seth) and a "condor" user, plus root on Linux.

Any help in looking for where to troubleshoot this would be greatly appreciated.


After reading many forum entries that were only marginally applicable, I made the following changes.

I explicitly added the name of the master (sched1) to both of the lines listed below in the config files on all the Windows machines:

##  Negotiator access.  Machines listed here are trusted central (- ADDED "sched1")
##  managers.  You should normally not have to change this.
ALLOW_NEGOTIATOR = $(CONDOR_HOST), sched1
##  Now, with flocking we need to let the SCHEDD trust the other
##  negotiators we are flocking with as well.  You should normally
##  not have to change this either.
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), sched1

I also installed Java on all the machines and rebooted.

After the reboot, the Condor service on the Windows 2008 Server machine would not come up automatically. I made it a manual-start service, waited until the network had started, then started it manually from the services screen, and now it runs correctly.

If someone has an idea of how to get the service to wait for the network to come up, and how to run it as a local service, I would be appreciative.
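
One possible way to automate the "wait for the network, then start" step is a small script run at boot (for example from a scheduled task). The sketch below is in Python and rests on several assumptions that are not confirmed in this thread: that the collector on sched1 listens on TCP port 9618 (the usual Condor default), that the Windows service is registered under the name "condor", and that the script runs with rights to start services. The function names are made up for illustration.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, attempts=30, delay=10.0):
    """Return True once a TCP connection to host:port succeeds.

    Tries up to `attempts` times, sleeping `delay` seconds between
    tries, and returns False if none of them succeeds.
    """
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


def start_condor_when_reachable(host="sched1", port=9618, service="condor"):
    """Start the Windows service once the central manager answers.

    Assumptions: 9618 is the collector port and "condor" is the
    service name; adjust both to match the actual installation.
    """
    if not wait_for_port(host, port):
        return False
    # "net start <service>" is the standard Windows command for
    # starting a service; it needs administrative rights.
    return subprocess.run(["net", "start", service]).returncode == 0


# Example use at boot:
#   start_condor_when_reachable()
```

This only works around the startup ordering; it does not address why the service fails to start as a local service.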

-- 
Seth Bardash

Integrated Solutions and Systems LLC
1510 Old North Gate Road
Colorado Springs, CO  80921

719-495-5866   Shop Phone
719-337-4779   Cell
719-386-0218   Metso Phone
seth@xxxxxxxxxxxxxxxxxxxxxxx

Failure cannot survive knowledge and perseverance!