
Re: [Condor-users] Weird flurry of condor_shadow problem emails this morning

On Mar 10, 2008, at 9:58 AM, Ian Chesal wrote:

My inbox was full of messages from my condor_schedd on the other side of
the world telling me about problems with condor_shadow. The emails all
looked like this:

	Subject: [Condor] Condor job 10725.239 put on hold
	This is an automated email from the Condor system
	on machine "pg-schedd1.altera.com".  Do not reply.

	Condor job 10725.239 has been put on hold.
	No condor_shadow installed that supports vanilla jobs
	on resources older than V6.3.3
	Please correct this problem and release the job with

My first thought was maybe the NFS file system where we host condor went
down. Nope. I got smart to this years ago and now, on my central
servers, I keep Condor on local disk. So there's a copy of condor_shadow
in /opt/condor/sbin. And it says it's 6.8.6 I386-LINUX_RHEL3 just like
it should.

Nothing has been changed in /opt/condor. Time stamps are fine.

Very mysterious. The emails happened around 7:00 am. I didn't see them
until 10:00 am. Looking at the queue on the scheduler now everything is
either I or R, so it all got released automatically.

Can anyone offer some insight into what might have occurred here? We've
*never* run anything older than 6.7.x at Altera. My guess is that this
message might get sent if a condor_shadow binary can't be found -- is
that possible? Somehow /opt/condor/sbin/condor_shadow couldn't be seen
by the condor_schedd process running on the machine perhaps?

A missing condor_shadow binary shouldn't cause this error. What will cause it is a machine ad that's missing its CondorVersion attribute, combined with a missing condor_shadow.std binary.
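
Both conditions can be checked by hand. The following is only a rough sketch, not an official Condor tool. It assumes the machine ads have been dumped as long-format text (for example with condor_status -long), with one "Attribute = value" pair per line and a blank line between ads. The function names, the sample ads, and the host names are all made up for illustration, and the sbin directory is the one mentioned in the original message.

```python
import os


def parse_ads(text):
    """Split long-format ClassAd text into a list of dicts, one per ad.

    Assumes one "Attribute = value" pair per line and a blank line
    between ads. Attribute names are lower-cased because ClassAd
    attribute names are case-insensitive.
    """
    ads = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip().lower()] = value.strip()
    if current:
        ads.append(current)
    return ads


def ads_missing_version(text):
    """Return the Name of every ad that has no CondorVersion attribute."""
    return [
        ad.get("name", "<unnamed>").strip('"')
        for ad in parse_ads(text)
        if "condorversion" not in ad
    ]


def shadow_std_present(sbin_dir="/opt/condor/sbin"):
    """Report whether condor_shadow.std exists in sbin_dir (assumed path)."""
    return os.path.isfile(os.path.join(sbin_dir, "condor_shadow.std"))


# Hypothetical sample input: the second ad lacks CondorVersion.
SAMPLE_ADS = (
    'Name = "node1.example.com"\n'
    'CondorVersion = "$CondorVersion: 6.8.6 $"\n'
    "\n"
    'Name = "node2.example.com"\n'
    'OpSys = "LINUX"\n'
)

if __name__ == "__main__":
    print("Ads without CondorVersion:", ads_missing_version(SAMPLE_ADS))
    print("condor_shadow.std present:", shadow_std_present())
```

Any machine name this prints is a candidate for the ad that triggered the hold, but only if condor_shadow.std is also absent on the submit machine.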

Thanks and regards,
Jaime Frey
UW-Madison Condor Team