
Re: [Condor-users] Weird flurry of condor_shadow problem emails this morning



On Mar 10, 2008, at 9:58 AM, Ian Chesal wrote:

My inbox was full of messages from my condor_schedd on the other side of
the world telling me about problems with condor_shadow. The emails all
looked like this:

	Subject: [Condor] Condor job 10725.239 put on hold
	
	This is an automated email from the Condor system
	on machine "pg-schedd1.altera.com".  Do not reply.

	Condor job 10725.239 has been put on hold.
	No condor_shadow installed that supports vanilla jobs
	on resources older than V6.3.3
	Please correct this problem and release the job with
	"condor_release"

My first thought was maybe the NFS file system where we host condor went
down. Nope. I got smart to this years ago and now, on my central
servers, I keep Condor on local disk. So there's a copy of condor_shadow
in /opt/condor/sbin. And it says it's 6.8.6 I386-LINUX_RHEL3 just like
it should.

Nothing has been changed in /opt/condor. Time stamps are fine.

Very mysterious. The emails happened around 7:00 am. I didn't see them
until 10:00 am. Looking at the queue on the scheduler now everything is
either I or R, so it all got released automatically.

Can anyone offer some insight into what might have occurred here? We've
*never* run anything older than 6.7.x at Altera. My guess is that this
message might get sent if a condor_shadow binary can't be found -- is
that possible? Somehow /opt/condor/sbin/condor_shadow couldn't be seen
by the condor_schedd process running on the machine perhaps?


A missing condor_shadow binary shouldn't cause this error. What will cause it is a machine ad that's missing its CondorVersion attribute and a missing condor_shadow.std binary.
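
One way to check for the first condition is to scan the machine ads for any that lack CondorVersion. The following is only a rough sketch, not an official tool: it assumes you have saved the output of "condor_status -long" to a string, that ads in that output are separated by blank lines, and that each attribute appears as "Name = value" on its own line. The function name and the host names in the demo are made up for illustration.

```python
import re


def ads_missing_attribute(long_output, attribute="CondorVersion"):
    """Return the Name of every ad in the text that lacks `attribute`.

    `long_output` is assumed to look like `condor_status -long` output:
    one "Attr = value" pair per line, ads separated by blank lines.
    """
    missing = []
    for block in re.split(r"\n\s*\n", long_output.strip()):
        attrs = {}
        for line in block.splitlines():
            # Split on the first "=" only; values may contain "=" themselves.
            key, sep, value = line.partition("=")
            if sep:
                attrs[key.strip()] = value.strip().strip('"')
        # Skip empty blocks; report ads that do not define the attribute.
        if attrs and attribute not in attrs:
            missing.append(attrs.get("Name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Hypothetical sample: the second ad has no CondorVersion line.
    sample = (
        'Name = "node1.example.com"\n'
        'CondorVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $"\n'
        "\n"
        'Name = "node2.example.com"\n'
        'OpSys = "LINUX"\n'
    )
    for name in ads_missing_attribute(sample):
        print(name)
```

Any machine this reports would be a candidate for the ad Jaime describes. Whether condor_shadow.std is present on the submit machine still has to be checked separately.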

Thanks and regards,
Jaime Frey
UW-Madison Condor Team