
[Condor-users] condor_schedd Crash - Jobs not running



Hi,

We are running a Condor 6.8 pool (around 150 nodes). Until now it has run fine. It looks like somebody submitted 2000 jobs.

After that, none of the jobs are running, and I keep getting email from Condor like this:

 

Condor job 13221.0 has been put on hold.

No condor_shadow installed that supports vanilla jobs on V6.3.3 or newer resources Please correct this problem and release the job with "condor_release"

 

Also, condor_schedd exited with status 44, and I see this file in the log directory:

“dprintf_failure.SCHEDD”

 

Here are the file contents:

 

3/21 18:55:48 dprintf() had a fatal error in pid 6362

Can't link(/u/condor/log/SchedLog,/u/condor/log/SchedLog.old)

errno: 17 (File exists)

euid: 32768, ruid: 0
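
For context on the error above: errno 17 is EEXIST, which link(2) returns when the destination name already exists. The sketch below reproduces only that behaviour with placeholder files in a temporary directory. The file names mirror the ones in the message, the helper try_link is hypothetical, and this is not Condor's actual log rotation code.

```python
import errno
import os
import tempfile


def try_link(src, dst):
    """Attempt a hard link; return 0 on success or the errno on failure."""
    try:
        os.link(src, dst)
        return 0
    except OSError as exc:
        return exc.errno


with tempfile.TemporaryDirectory() as log_dir:
    # Placeholder names only; these are not the real Condor log files.
    current = os.path.join(log_dir, "SchedLog")
    rotated = os.path.join(log_dir, "SchedLog.old")

    # Create both files so the link destination already exists.
    for path in (current, rotated):
        with open(path, "w") as handle:
            handle.write("placeholder\n")

    # Linking onto an existing name fails instead of overwriting it.
    result = try_link(current, rotated)
    print("link failed with EEXIST:", result == errno.EEXIST)
```

If this is what happened here, a leftover SchedLog.old in /u/condor/log would be worth checking for, although the message alone does not show why the file was still there.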

 

 

The SchedLog is at the end of this message. I even killed condor_master, restarted it, and then released all the jobs, but the problem persists: all the jobs go back on hold.

Could you please help me work out what the problem might be and how to fix it?

 

ps -ef | grep condor

condor   19294     1  0 Mar21 ?        00:00:03 /usr/local/condor/sbin/condor_master

condor   19295 19294  0 Mar21 ?        00:00:11 condor_collector -f

condor   19296 19294  0 Mar21 ?        00:00:03 condor_negotiator -f

condor   19297 19294  0 Mar21 ?        00:00:06 condor_startd -f

condor   19298 19294  4 Mar21 ?        00:03:25 condor_schedd -f -p 9600

 

The daemons are running, but the pool still cannot run jobs: they are put on hold, with the same complaint about "No condor_shadow installed ...".

 

Is it possible to fix this without reinstalling Condor?

 

 

Thanks,

Senthil

 

SchedLog

***********

3/22 00:00:22 Marked job 14508.0 as IDLE

3/22 00:00:22 Job 14508.0 put on hold: No condor_shadow installed that supports vanilla jobs on V6.3.3 or newer resources

3/22 00:00:22 abort_job_myself: 14508.0 action:Hold log_hold:true notify:true

3/22 00:00:22 Writing record to user logfile=/u/jum18/AIRTP/Patient_278.log owner=jum18

3/22 00:00:22 Forking Mailer process...

3/22 00:00:22 start next job after 2 sec, JobsThisBurst 0

3/22 00:00:22 DaemonCore: No more children processes to reap.

3/22 00:00:24 Job prep for 14509.0 will not block, calling aboutToSpawnJobHandler() directly

3/22 00:00:24 aboutToSpawnJobHandler() completed for job 14509.0, attempting to spawn job handler

3/22 00:00:24 Trying to run a VANILLA job on a 6.3.3 or later resource, but you do not have condor_shadow that will work, aborting