
[Condor-users] Setting up a new cluster - condor_schedd.exe exited (4)



Condor 7.6

 

Hello, I am configuring a test cluster of 67 Windows 7 PCs and a Windows Server 2008 R2 master, but I am seeing a lot of errors from each node reporting that condor_schedd.exe exited (4). If I submit a job, I get the following message:

 

 

condor_q -analyze

 

 

-- Submitter: icwincondor1.cc.ic.ac.uk : <155.198.30.249:63316> : icwincondor1.cc.ic.ac.uk

---

004.000:  Request has not yet been considered by the matchmaker.

 

And the job will just sit there.

 

The DNS name of the machine also appears to be wrong: it should be maws414-43.ma.ic.ac.uk, not maws414-43.ic.ac.uk.

Does anyone have suggestions for initially setting up the cluster?

 

Thanks

Bryan

 

From:	b.cochrane@xxxxxxxxxxxxxx
Sent:	17 May 2011 14:57
To:	Cochrane, Bryan T
Subject:	[Condor] Problem MAWS414-43.ic.ac.uk: condor_schedd.exe exited (4)

This is an automated email from the Condor system on machine "MAWS414-43.ic.ac.uk".  Do not reply.

"C:\Condor/bin/condor_schedd.exe" on "MAWS414-43.ic.ac.uk" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file C:\Condor/log/SchedLog:
05/17/11 14:43:32 (pid:2432) Using local config sources: 
05/17/11 14:43:32 (pid:2432)    C:\Condor/condor_config.local
05/17/11 14:43:32 (pid:2432) DaemonCore: command socket at <155.198.201.122:51411>
05/17/11 14:43:32 (pid:2432) DaemonCore: private command socket at <155.198.201.122:51411>
05/17/11 14:43:32 (pid:2432) Setting maximum accepts per cycle 4.
05/17/11 14:43:32 (pid:2432) History file rotation is enabled.
05/17/11 14:43:32 (pid:2432)   Maximum history file size is: 20971520 bytes
05/17/11 14:43:32 (pid:2432)   Number of rotated history files is: 2
05/17/11 14:43:32 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 
10051.  Will keep trying for 390 total seconds (390 to go).
05/17/11 14:50:02 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 
10051.
05/17/11 14:50:02 (pid:2432) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at 
<169.254.119.1:49159> (try 1 of 3): CEDAR:6001:Failed to connect to <169.254.119.1:49159>
05/17/11 14:50:02 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 
10051.  Will keep trying for 390 total seconds (390 to go).
05/17/11 14:56:32 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 
10051.
05/17/11 14:56:32 (pid:2432) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at 
<169.254.119.1:49159> (try 2 of 3): CEDAR:6001:Failed to connect to 
<169.254.119.1:49159>|CEDAR:6001:Failed to connect to <169.254.119.1:49159>
05/17/11 14:56:32 (pid:2432) ChildAliveMsg: giving up because deadline expired for sending 
DC_CHILDALIVE to parent.
05/17/11 14:56:32 (pid:2432) ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT 
<169.254.119.1:49159>" at line 9995 in file 
c:\condor\execute\dir_4052\userdir\src\condor_daemon_core.v6\daemon_core.cpp
05/17/11 14:56:32 (pid:2432) Cron: Killing all jobs
05/17/11 14:56:32 (pid:2432) CronJobList: Deleting all jobs
05/17/11 14:56:32 (pid:2432) Cron: Killing all jobs
05/17/11 14:56:32 (pid:2432) CronJobList: Deleting all jobs
*** End of file SchedLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: b.cochrane@xxxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
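
One detail in the SchedLog above may be worth checking: the parent daemon address that the schedd keeps trying to reach, 169.254.119.1, lies in the IPv4 link-local range 169.254.0.0/16, while the schedd's own command socket is on 155.198.201.122. Winsock error 10051 is WSAENETUNREACH (network is unreachable). The following is only a minimal sketch for scanning a log excerpt for such addresses; the function name link_local_addresses and the SAMPLE text are made up for illustration and are not part of Condor.

```python
import ipaddress
import re

# Matches addresses written as <a.b.c.d:port>, the form used in the log above.
SINFUL_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def link_local_addresses(log_text):
    """Return a sorted list of the link-local IPv4 addresses found in log_text."""
    found = set()
    for ip_text, _port in SINFUL_RE.findall(log_text):
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            # Skip anything that looks like an address but is not valid (e.g. 999.1.1.1).
            continue
        if ip.is_link_local:  # True for 169.254.0.0/16
            found.add(str(ip))
    return sorted(found)


# Two lines copied from the SchedLog excerpt above, used here as sample input.
SAMPLE = """\
05/17/11 14:43:32 (pid:2432) DaemonCore: command socket at <155.198.201.122:51411>
05/17/11 14:43:32 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno =
10051.  Will keep trying for 390 total seconds (390 to go).
"""

if __name__ == "__main__":
    for address in link_local_addresses(SAMPLE):
        print("link-local address in log:", address)
```

If an address like this turns up, one thing to look at is which network interfaces are present on the node and which one Condor is binding to. That is a suggestion for where to look, not a confirmed diagnosis.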