Condor 7.6

Hello,

I am configuring a test cluster of 67 Windows 7 PCs and a Windows Server 2008 R2 master, but I am seeing a lot of errors from each node where condor_schedd.exe exited (4). If I submit a job, I get the following message:

    condor_q -analyze

    -- Submitter: icwincondor1.cc.ic.ac.uk : <155.198.30.249:63316> : icwincondor1.cc.ic.ac.uk
    ---
    004.000:  Request has not yet been considered by the matchmaker.

and the job just sits there.

Now, the DNS address of the machine appears to be wrong, i.e. it should be maws414-43.ma.ic.ac.uk and not maws414-43.ic.ac.uk.

Does anyone have suggestions for initially setting up the cluster?

Thanks,
Bryan

From: b.cochrane@xxxxxxxxxxxxxx
Sent: 17 May 2011 14:57
To: Cochrane, Bryan T
Subject: [Condor] Problem MAWS414-43.ic.ac.uk: condor_schedd.exe exited (4)

This is an automated email from the Condor system
on machine "MAWS414-43.ic.ac.uk". Do not reply.

"C:\Condor/bin/condor_schedd.exe" on "MAWS414-43.ic.ac.uk" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file C:\Condor/log/SchedLog:
05/17/11 14:43:32 (pid:2432) Using local config sources:
05/17/11 14:43:32 (pid:2432)    C:\Condor/condor_config.local
05/17/11 14:43:32 (pid:2432) DaemonCore: command socket at <155.198.201.122:51411>
05/17/11 14:43:32 (pid:2432) DaemonCore: private command socket at <155.198.201.122:51411>
05/17/11 14:43:32 (pid:2432) Setting maximum accepts per cycle 4.
05/17/11 14:43:32 (pid:2432) History file rotation is enabled.
05/17/11 14:43:32 (pid:2432) Maximum history file size is: 20971520 bytes
05/17/11 14:43:32 (pid:2432) Number of rotated history files is: 2
05/17/11 14:43:32 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 10051. Will keep trying for 390 total seconds (390 to go).
05/17/11 14:50:02 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 10051.
05/17/11 14:50:02 (pid:2432) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <169.254.119.1:49159> (try 1 of 3): CEDAR:6001:Failed to connect to <169.254.119.1:49159>
05/17/11 14:50:02 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 10051. Will keep trying for 390 total seconds (390 to go).
05/17/11 14:56:32 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 10051.
05/17/11 14:56:32 (pid:2432) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <169.254.119.1:49159> (try 2 of 3): CEDAR:6001:Failed to connect to <169.254.119.1:49159>|CEDAR:6001:Failed to connect to <169.254.119.1:49159>
05/17/11 14:56:32 (pid:2432) ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
05/17/11 14:56:32 (pid:2432) ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <169.254.119.1:49159>" at line 9995 in file c:\condor\execute\dir_4052\userdir\src\condor_daemon_core.v6\daemon_core.cpp
05/17/11 14:56:32 (pid:2432) Cron: Killing all jobs
05/17/11 14:56:32 (pid:2432) CronJobList: Deleting all jobs
05/17/11 14:56:32 (pid:2432) Cron: Killing all jobs
05/17/11 14:56:32 (pid:2432) CronJobList: Deleting all jobs
*** End of file SchedLog

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: b.cochrane@xxxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
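
One detail in the SchedLog excerpt may be worth checking on the other nodes: the schedd's own command socket is on 155.198.201.122, but the parent daemon it fails to reach is at 169.254.119.1. Addresses in 169.254.0.0/16 are IPv4 link-local, and Winsock error 10051 is WSAENETUNREACH ("network is unreachable"). Below is a minimal Python sketch for scanning log text for such addresses. The function name `link_local_addresses` and the regular expression are illustrative assumptions, not part of Condor; the sketch only assumes that addresses appear in the `<ip:port>` form seen in the log.

```python
import ipaddress
import re

# Assumption: addresses appear as <a.b.c.d:port>, as in the SchedLog excerpt.
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def link_local_addresses(log_text):
    """Return a sorted list of the link-local IPv4 addresses found in log_text."""
    found = set()
    for host, _port in ADDR_RE.findall(log_text):
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            # Not a valid IPv4 address (e.g. an octet above 255); skip it.
            continue
        if ip.is_link_local:  # True for 169.254.0.0/16
            found.add(str(ip))
    return sorted(found)


# Two lines copied from the log excerpt, used here as sample input.
sample = (
    "05/17/11 14:43:32 (pid:2432) DaemonCore: command socket at <155.198.201.122:51411>\n"
    "05/17/11 14:50:02 (pid:2432) attempt to connect to <169.254.119.1:49159> failed: connect errno = 10051.\n"
)

print(link_local_addresses(sample))
```

Run against a full log file (for example by reading C:\Condor/log/SchedLog into a string), a non-empty result would suggest that a daemon is advertising an address on an interface without a routable IP. Whether that is the cause of the exits here is not established by the log alone.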