
Re: [Condor-users] jobs are being killed after 30-45 minutes



Hi again,

Jobs are still being killed - not as quickly as before, but within a couple of hours. The logs on the execute node say this:

7/25 09:56:13 ******************************************************
7/25 09:56:13 ** condor_starter (CONDOR_STARTER) STARTING UP
7/25 09:56:13 ** /opt/condor-6.6.11/sbin/condor_starter
7/25 09:56:13 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/25 09:56:13 ** $CondorPlatform: I386-LINUX_RH9 $
7/25 09:56:13 ** PID = 29124
7/25 09:56:13 ******************************************************
7/25 09:56:13 Using config file: /etc/condor/condor_config
7/25 09:56:13 Using local config files: /home/condorr/condor_config.local
7/25 09:56:13 DaemonCore: Command Socket at <172.24.116.138:9506>
7/25 09:56:13 Done setting resource limits
7/25 09:56:13 Starter communicating with condor_shadow <172.24.116.151:9579>
7/25 09:56:13 Submitting machine is "serv03--hep--phy.grid.private.cam.ac.uk"
7/25 09:56:13 VM1_USER set, so running job as condor_user1
7/25 09:56:13 File transfer completed successfully.
7/25 09:56:14 Starting a VANILLA universe job with ID: 7773.0
7/25 09:56:14 IWD: /home/condorr/execute/dir_29124
7/25 09:56:14 Output file: /home/condorr/execute/dir_29124/globus-cache-export.t10953.batch.out
7/25 09:56:14 Error file: /home/condorr/execute/dir_29124/globus-cache-export.t10953.batch.err
7/25 09:56:14 Using wrapper /opt/condor/etc/condor_job_wrapper.sh to exec /home/condorr/execute/dir_29124/condor_exec.exe
7/25 09:56:14 Create_Process succeeded, pid=29127
7/25 10:07:17 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
7/25 10:07:17 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


I got a couple of these sets of lines for the same job ID (i.e. 7773) - is that normal? The StartLog on the same execute node says this for that job ID:


7/25 09:37:01 DaemonCore: Command received via UDP from host <172.24.116.151:9557>
7/25 09:37:01 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
7/25 09:37:01 match_info called
7/25 09:37:01 Received match <172.24.116.138:9662>#1875612231
7/25 09:37:01 State change: match notification protocol successful
7/25 09:37:01 Changing state: Unclaimed -> Matched
7/25 09:37:01 DaemonCore: Command received via TCP from host <172.24.116.151:9692>
7/25 09:37:01 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
7/25 09:37:01 Request accepted.
7/25 09:37:01 Remote owner is lhcb011@xxxxxxxxxxxxxxxxxxxxxxxx
7/25 09:37:01 State change: claiming protocol successful
7/25 09:37:01 Changing state: Matched -> Claimed
7/25 09:37:08 DaemonCore: Command received via TCP from host <172.24.116.151:9668>
7/25 09:37:08 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
7/25 09:37:08 Got activate_claim request from shadow (<172.24.116.151:9668>)
7/25 09:37:08 Remote job ID is 7773.0
7/25 09:37:08 Got universe "VANILLA" (5) from request classad
7/25 09:37:08 State change: claim-activation protocol successful
7/25 09:37:08 Changing activity: Idle -> Busy
7/25 09:40:26 Starter pid 21138 died on signal 11 (signal 11)
7/25 09:40:26 State change: starter exited
7/25 09:40:26 Changing activity: Busy -> Idle
7/25 09:40:27 DaemonCore: Command received via TCP from host <172.24.116.151:9684>
7/25 09:40:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
7/25 09:40:27 Got activate_claim request from shadow (<172.24.116.151:9684>)
7/25 09:40:27 Remote job ID is 7773.0
7/25 09:40:27 Got universe "VANILLA" (5) from request classad
7/25 09:40:27 State change: claim-activation protocol successful
7/25 09:40:27 Changing activity: Idle -> Busy
7/25 09:56:00 Starter pid 25026 died on signal 11 (signal 11)
7/25 09:56:00 State change: starter exited
7/25 09:56:00 Changing activity: Busy -> Idle
7/25 09:56:13 DaemonCore: Command received via TCP from host <172.24.116.151:9552>
7/25 09:56:13 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
7/25 09:56:13 Got activate_claim request from shadow (<172.24.116.151:9552>)
7/25 09:56:13 Remote job ID is 7773.0
7/25 09:56:13 Got universe "VANILLA" (5) from request classad
7/25 09:56:13 State change: claim-activation protocol successful
7/25 09:56:13 Changing activity: Idle -> Busy
7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)
7/25 10:07:16 State change: starter exited
7/25 10:07:16 Changing activity: Busy -> Idle
7/25 10:07:17 DaemonCore: Command received via TCP from host <172.24.116.151:9583>
7/25 10:07:17 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
7/25 10:07:17 Got activate_claim request from shadow (<172.24.116.151:9583>)
7/25 10:07:17 Remote job ID is 7773.0
7/25 10:07:17 Got universe "VANILLA" (5) from request classad
7/25 10:07:17 State change: claim-activation protocol successful
7/25 10:07:17 Changing activity: Idle -> Busy


Is this due to the "signal 11" issue? What does that actually mean?

It looks like lengthy jobs are being killed after several hours. I didn't keep a backup of the logs on the submit host, so they are lost for this job. But I get a number of these in the MasterLog on the submit host:

6/26 13:13:41 ******************************************************
6/26 13:13:41 ** condor_master (CONDOR_MASTER) STARTING UP
6/26 13:13:41 ** /opt/condor-6.6.11/sbin/condor_master
6/26 13:13:41 ** $CondorVersion: 6.6.11 Mar 23 2006 $
6/26 13:13:41 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 13:13:41 ** PID = 5200
6/26 13:13:41 ******************************************************
6/26 13:13:41 Using config file: /etc/condor/condor_config
6/26 13:13:41 Using local config files: /home/condorr/condor_config.local
6/26 13:13:41 DaemonCore: Command Socket at <172.24.116.151:9639>
6/26 13:13:42 Started DaemonCore process "/opt/condor/sbin/condor_collector", pid and pgroup = 5320
6/26 13:13:42 Started DaemonCore process "/opt/condor/sbin/condor_negotiator", pid and pgroup = 5321
6/26 13:13:42 Started DaemonCore process "/opt/condor/sbin/condor_schedd", pid and pgroup = 5322
6/26 13:46:34 DaemonCore: Command received via TCP from host <172.24.116.151:9623>
6/26 13:46:34 DaemonCore: received command 60004 (DC_RECONFIG), calling handler (handle_reconfig())
6/26 13:46:34 Reconfiguring all running daemons.
6/26 13:46:34 Sent SIGHUP to COLLECTOR (pid 5320)
6/26 13:46:34 Sent SIGHUP to NEGOTIATOR (pid 5321)
6/26 13:46:34 Sent SIGHUP to SCHEDD (pid 5322)
6/26 13:59:44 DaemonCore: Command received via TCP from host <172.24.116.151:9681>
6/26 13:59:44 DaemonCore: received command 453 (RESTART), calling handler (admin_command_handler)
6/26 13:59:44 Sent SIGTERM to COLLECTOR (pid 5320)
6/26 13:59:44 Sent SIGTERM to NEGOTIATOR (pid 5321)
6/26 13:59:44 Sent SIGTERM to SCHEDD (pid 5322)
6/26 13:59:44 The COLLECTOR (pid 5320) exited with status 0
6/26 13:59:44 The NEGOTIATOR (pid 5321) exited with status 0
6/26 14:00:36 The SCHEDD (pid 5322) exited with status 0
6/26 14:00:36 All daemons are gone.  Restarting.
6/26 14:00:36 Restarting master right away.
6/26 14:00:36 Doing exec( "/opt/condor/sbin/condor_master" )
6/26 14:00:36 getExecPath: readlink("/proc/self/exe") failed: errno 13 (Permission denied)

6/26 14:00:36 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
6/26 14:00:36 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


Can anyone please help figure out what's going on?

Thanks,
Santanu



On 25 Jul 2006, at 06:43, Groenwold, Jan wrote:

Your problem is that condor_config.local contains a rule that preempts
the job after 2160 sec. (= 36 min.)
To avoid that, you should modify this rule:

PREEMPT = ( (Activity == "Busy") && (State == "Claimed") && ($(ActivityTimer) > 2160) )

I don't even understand why you put that rule in. condor_config already
defines the macro below, which also gives you options to configure:

UWCS_PREEMPT = ( ((Activity == "Suspended") && \
                  ($(ActivityTimer) > $(MaxSuspendTime))) \
                 || (SUSPEND && (WANT_SUSPEND == False)) )
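
For example, a minimal change could look like the sketch below; the
86400-second window is only an illustrative value, not something you
must use:

# Either drop time-based preemption entirely:
PREEMPT = FALSE

# ...or keep the shape of your rule but with a much longer window
# (86400 sec. = 24 hours, an example value only):
# PREEMPT = ( (Activity == "Busy") && (State == "Claimed") && \
#             ($(ActivityTimer) > 86400) )

Note that, as far as I can see, with WANT_VACATE = FALSE in your file a
PREEMPT that fires goes straight to killing the job instead of vacating
it gracefully, which would match the kills you are seeing.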

Jan

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Santanu Das
Sent: maandag 24 juli 2006 18:02
To: Condor-Users Mail List
Subject: [Condor-users] jobs are being killed after 30-45 minutes

Hi Guys,

A number of jobs are being killed after 30-45 minutes of running on our
site.

An example is a job (https://mu3.matrix.sara.nl:9000/yha-AADzc51_ytdhLM_9Ow) that ran on one of our WNs, for which the RB reports the following timestamps:

       Submitted        : Sat Jul 22 21:56:19 2006
       Waiting          : Sat Jul 22 22:47:46 2006
       Ready            : Sat Jul 22 21:57:04 2006
       Scheduled        : Sat Jul 22 21:57:18 2006
       Running          : Sat Jul 22 21:59:57 2006
       Done             : Sat Jul 22 22:47:38 2006
       Cleared          :                ---
       Aborted          : Sat Jul 22 22:47:48 2006
       Cancelled        :                ---
(all times are UTC)

Now, on that execute node, the StarterLog is pretty normal:


7/22 22:32:59 ******************************************************
7/22 22:32:59 ** condor_starter (CONDOR_STARTER) STARTING UP
7/22 22:32:59 ** /opt/condor-6.6.11/sbin/condor_starter
7/22 22:32:59 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/22 22:32:59 ** $CondorPlatform: I386-LINUX_RH9 $
7/22 22:32:59 ** PID = 28393
7/22 22:32:59 ******************************************************
7/22 22:32:59 Using config file: /etc/condor/condor_config
7/22 22:32:59 Using local config files: /home/condorr/condor_config.local
7/22 22:32:59 DaemonCore: Command Socket at <172.24.116.141:9583>
7/22 22:32:59 Done setting resource limits
7/22 22:32:59 Starter communicating with condor_shadow <172.24.116.151:9685>
7/22 22:32:59 Submitting machine is "serv03--hep--phy.grid.private.cam.ac.uk"
7/22 22:32:59 VM1_USER set, so running job as condor_user1
7/22 22:32:59 File transfer completed successfully.
7/22 22:33:00 Starting a VANILLA universe job with ID: 5169.0
7/22 22:33:00 IWD: /home/condorr/execute/dir_28393
7/22 22:33:00 Output file: /home/condorr/execute/dir_28393/globus-cache-export.q10134.batch.out
7/22 22:33:00 Error file: /home/condorr/execute/dir_28393/globus-cache-export.q10134.batch.err
7/22 22:33:00 Using wrapper /opt/condor/etc/condor_job_wrapper.sh to exec /home/condorr/execute/dir_28393/condor_exec.exe
7/22 22:33:00 Create_Process succeeded, pid=28395
7/22 22:56:28 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
7/22 22:56:28 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


except the last two lines: 7/22 22:56:28 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
This is not an NIS-based setup. I don't have a user named "condor" anywhere, but Condor runs as the same user everywhere.
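
As far as I can tell, the "failed: Success" part just means getpwnam()
returned no entry without setting errno - in other words, the "condor"
user genuinely does not exist. If that is all it is, my guess is that
telling the daemons which account they actually run as would silence
it, with something like the line below in the config (the 501.501 is
only a placeholder for the real UID.GID here):

CONDOR_IDS = 501.501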


This is what I have in my condor_config.local file on every execute
node:

HEPCAM_CONTINUE          = ( $(CPUIdle) && ($(ActivityTimer) > 10) )
CONTINUE                 = $(HEPCAM_CONTINUE)
KILL                     = FALSE
PERIODIC_CHECKPOINT      = FALSE
PREEMPT                  = ( (Activity == "Busy") && (State == "Claimed") && ($(ActivityTimer) > 2160) )
PREEMPTION_RANK          = FALSE
PREEMPTION_REQUIREMENTS  = FALSE
START                    = TRUE
SUSPEND                  = FALSE
WANT_SUSPEND             = FALSE
WANT_VACATE              = FALSE
SEC_DEFAULT_NEGOTIATION  = NEVER
USER_JOB_WRAPPER         = /opt/condor/etc/condor_job_wrapper.sh


The ShadowLog on the submit host is not so good. I get a lot of these in there:


7/23 20:14:58 ******************************************************
7/23 20:14:58 ** condor_shadow (CONDOR_SHADOW) STARTING UP
7/23 20:14:58 ** /opt/condor-6.6.11/sbin/condor_shadow
7/23 20:14:58 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/23 20:14:58 ** $CondorPlatform: I386-LINUX_RH9 $
7/23 20:14:58 ** PID = 3661
7/23 20:14:58 ******************************************************
7/23 20:14:58 Using config file: /opt/condor/etc/condor_config
7/23 20:14:58 Using local config files: /home/condorr/condor_config.local
7/23 20:14:58 DaemonCore: Command Socket at <172.24.116.151:9530>
7/23 20:14:59 Initializing a VANILLA shadow
7/23 20:14:59 (7426.0) (3661): Request to run on <172.24.116.155:9653> was ACCEPTED
7/23 20:14:59 (7462.0) (26192): IO: Failed to read packet header
7/23 20:14:59 (7462.0) (26192): ERROR "Can no longer talk to condor_starter on execute machine (172.24.116.161)" at line 63 in file NTreceivers.C
7/23 20:14:59 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


and this from the StartLog on the respective execute node:

7/23 20:14:59 Starter pid 28113 died on signal 11 (signal 11)
7/23 20:14:59 vm2: State change: starter exited
7/23 20:14:59 vm2: Changing activity: Busy -> Idle


Any idea what's going on? Thanks in advance for your help.

Cheers,
Santanu




_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR