
[Condor-users] jobs terminated with SIGQUIT signal



Hi there,

We have a strange problem on our systems: jobs are terminated with a SIGQUIT signal a few minutes after they start. This happens from time to time, and it seems all jobs are affected and none can run for more than 10 minutes. Below is a typical log from a worker node's StarterLog.slot*:


12/19 07:40:03 ******************************************************
12/19 07:40:03 ** condor_starter (CONDOR_STARTER) STARTING UP
12/19 07:40:03 ** /opt/condor-7.0.4/sbin/condor_starter
12/19 07:40:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
12/19 07:40:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/19 07:40:03 ** PID = 7352
12/19 07:40:03 ** Log last touched 12/19 07:39:58
12/19 07:40:03 ******************************************************
12/19 07:40:03 Using config source: /etc/condor/condor_config
12/19 07:40:03 Using local config sources:
12/19 07:40:03    /opt/condor/condor_config.local
12/19 07:40:03 DaemonCore: Command Socket at <128.227.221.104:59102>
12/19 07:40:03 Done setting resource limits
12/19 07:40:03 Communicating with shadow <128.227.221.12:36599>
12/19 07:40:03 Submitting machine is "hg.ihepa.ufl.edu"
12/19 07:40:03 setting the orig job name in starter
12/19 07:40:03 setting the orig job iwd in starter
12/19 07:40:03 Job 1219662.0 set to execute immediately
12/19 07:40:03 Starting a VANILLA universe job with ID: 1219662.0
12/19 07:40:03 IWD: /share/home/cms31291/gram_scratch_RRbmV1FBMe
12/19 07:40:03 Output file: /share/home/cms31291/.globus/job/hg.ihepa.ufl.edu/358.1229689455/stdout
12/19 07:40:03 Error file: /share/home/cms31291/.globus/job/hg.ihepa.ufl.edu/358.1229689455/stderr
12/19 07:40:03 About to exec /share/home/cms31291/.globus/.gass_cache/local/md5/02/d5a0e9fd5e13006d0bf2c5381f3b0f/md5/96/55ad8460528337ce276628a2e787ad/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
12/19 07:40:03 Create_Process succeeded, pid=7353
12/19 07:46:57 Process exited, pid=7353, status=0
12/19 07:46:57 Got SIGQUIT.  Performing fast shutdown.
12/19 07:46:57 ShutdownFast all jobs.
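
To check how long each job ran before the SIGQUIT arrived, the timestamps in a StarterLog can be compared mechanically. The following is only a minimal sketch in Python, not a Condor tool: the function name `sigquit_runtimes` is made up for illustration, the log lines carry no year, so the year (2008 here) is an assumption passed in as a parameter, and the matched strings are simply the messages seen in the excerpt above.

```python
import re
from datetime import datetime

# Starter log lines look like "MM/DD HH:MM:SS message".
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
CREATE_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")


def sigquit_runtimes(lines, year=2008):
    """Return (pid, seconds) for each job start that is followed by a SIGQUIT.

    `year` is an assumption: the log timestamps do not include one.
    """
    started = None  # (pid, start time) of the most recent Create_Process
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # banner fragments, wrapped lines, blank lines
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %m/%d %H:%M:%S")
        msg = m.group(2)
        created = CREATE_RE.search(msg)
        if created:
            started = (int(created.group(1)), stamp)
        elif msg.startswith("Got SIGQUIT") and started is not None:
            pid, t0 = started
            results.append((pid, (stamp - t0).total_seconds()))
            started = None
    return results


# Lines taken from the log excerpt above.
sample = [
    "12/19 07:40:03 Create_Process succeeded, pid=7353",
    "12/19 07:46:57 Process exited, pid=7353, status=0",
    "12/19 07:46:57 Got SIGQUIT.  Performing fast shutdown.",
]
print(sigquit_runtimes(sample))  # [(7353, 414.0)]
```

For the job shown above this gives 414 seconds, i.e. just under 7 minutes between Create_Process and the SIGQUIT. Running the same function over whole StarterLog.slot* files (for example `sigquit_runtimes(open(path))`) would show whether every job is cut off at a similar interval.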


According to the MatchLog on the gatekeeper, the jobs were preempted, but I don't understand why jobs with higher rank and priority were preempted by lower ones. We are running Condor 7.0.4.

Thanks,

Yu