Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] eviction of all jobs after a period of successfulruns.

Date: Tue, 12 Oct 2004 08:02:26 -0400
From: Robert Parrott <parrott@xxxxxxxxxxxxxxxxxxx>
Subject: [Condor-users] eviction of all jobs after a period of successfulruns.

Hi folks,

I have some parametric problems, where I run as many as 40,000 small (1.5 minute) jobs on a condor cluster. The binaries are compiled with checkpointing. Condor version is 6.6.6 on RH 9.0 intel.

What I find is that, besides being less than optimal time-wise to use many 1.5 minute jobs, that the condor cluster after a time stops running the jobs. The Negotiator matches them up with run nodes, but the jobs are evicted immediately after being Started. This happens spontaneously after about two hours of successful runs.

The cluster remains idle even with these jobs in queue.

The Negotiator log just shows successful matchups that schedule jobs on all available resources, the SchedLog on the submitting node shows that a shadow job is started for the jobs, and then the job is VACATED, and the StarterLog on the run node shows that the job receives a CHILD_EXIT event before the job really runs at all.

The logs are below.

Any thoughts on why this is happening? I've seen something like this before.

------------------------------------------------------------------------ ----------

NegotiatorLog:

10/12 07:47:19 Started shadow for job 9.11517 on "<10.101.1.4:40709>", (shadow pid = 21560) 10/12 07:47:21 Started shadow for job 9.11518 on "<10.101.1.4:40709>", (shadow pid = 21563) 10/12 07:47:21 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709> 10/12 07:47:21 Match record (<10.101.1.4:40709>, 9, 11517) deleted 10/12 07:47:21 DaemonCore: Command received via TCP from host <10.101.1.4:43557> 10/12 07:47:21 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service) 10/12 07:47:21 Got VACATE_SERVICE from <10.101.1.4:43557> 10/12 07:47:22 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709> 10/12 07:47:22 Match record (<10.101.1.4:40709>, 9, 11518) deleted 10/12 07:47:22 DaemonCore: Command received via TCP from host <10.101.1.4:43561> 10/12 07:47:22 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service) 10/12 07:47:22 Got VACATE_SERVICE from <10.101.1.4:43561>

The ShadowLog (different job, same behavior)

10/12 07:53:03 (?.?) (22151):******* Standard Shadow starting up ******* 10/12 07:53:03 (?.?) (22151):** $CondorVersion: 6.6.6 Jul 26 2004 $ 10/12 07:53:03 (?.?) (22151):** $CondorPlatform: I386-LINUX_RH9 $ 10/12 07:53:03 (?.?) (22151):******************************************* 10/12 07:53:03 (?.?) (22151):uid=0, euid=10228, gid=0, egid=10228 10/12 07:53:03 (?.?) (22151):Hostname = "<10.101.1.48:32771>", Job = 9.11579 10/12 07:53:03 (9.11579) (22151):Requesting Primary Starter 10/12 07:53:03 (9.11579) (22151):Shadow: Request to run a job was ACCEPTED 10/12 07:53:03 (9.11579) (22151):Shadow: RSC_SOCK connected, fd = 17 10/12 07:53:03 (9.11579) (22151):Shadow: CLIENT_LOG connected, fd = 18 10/12 07:53:03 (9.11579) (22151):My_Filesystem_Domain = "deas.harvard.edu" 10/12 07:53:03 (9.11579) (22151):My_UID_Domain = "deas.harvard.edu" 10/12 07:53:03 (9.11579) (22151): Entering pseudo_get_file_stream 10/12 07:53:03 (9.11579) (22151): file = "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" 10/12 07:53:03 (9.11579) (22151): Weird 0xa650001 10/12 07:53:03 (9.11579) (22151): Weird 0xa650001 10/12 07:53:04 (9.11579) (22151):Reaped child status - pid 22152 exited with status 0 10/12 07:53:05 (9.11579) (22151):Read: User Job - $CondorPlatform: I386-LINUX-RH9 $ 10/12 07:53:05 (9.11579) (22151):Read: User Job - $CondorVersion: 6.6.5 May 3 2004 $ 10/12 07:53:05 (9.11579) (22151):Read: Checkpoint file name is "/home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0" 10/12 07:53:05 (9.11579) (22151):error: Error: Couldn't open standard file '/dev/null' 10/12 07:53:05 (9.11579) (22151):Shadow: Job 9.11579 exited, termsig = 9, coredump = 0, retcode = 0 10/12 07:53:05 (9.11579) (22151):Shadow: Job was kicked off without a checkpoint 10/12 07:53:05 (9.11579) (22151):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0.tmp' 10/12 07:53:05 (9.11579) (22151):Trying to unlink /home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0.tmp 10/12 07:53:05 (9.11579) (22151):user_time = 0 ticks 10/12 07:53:05 (9.11579) (22151):sys_time = 5 ticks 10/12 07:53:05 (9.11579) (22151):********** Shadow Exiting(107) **********

And the StarterLog on the run node show that the Starter gets a CHILD_EXIT event right after the job starts:

10/12 07:57:23 ********** STARTER starting up *********** 10/12 07:57:23 ** $CondorVersion: 6.6.6 Jul 26 2004 $ 10/12 07:57:23 ** $CondorPlatform: I386-LINUX_RH9 $ 10/12 07:57:23 ****************************************** 10/12 07:57:23 Submitting machine is "grid-gw-01.deas.harvard.edu" 10/12 07:57:23 EventHandler { 10/12 07:57:23 func = 0x8073476 10/12 07:57:23 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 10/12 07:57:23 } 10/12 07:57:23 Done setting resource limits 10/12 07:57:23 *FSM* Transitioning to state "GET_PROC" 10/12 07:57:23 *FSM* Executing state func "get_proc()" [ ] 10/12 07:57:23 Entering get_proc() 10/12 07:57:23 Entering get_job_info() 10/12 07:57:23 Startup Info: 10/12 07:57:23 Version Number: 1 10/12 07:57:23 Id: 9.11517 10/12 07:57:23 JobClass: STANDARD 10/12 07:57:23 Uid: 20563 10/12 07:57:23 Gid: 20563 10/12 07:57:23 VirtPid: -1 10/12 07:57:23 SoftKillSignal: 20 10/12 07:57:23 Cmd: "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" 10/12 07:57:23 Args: "elliptical_fringes.cdf elliptical_fringes_out_-276p0_1852p0 -276.000000 1852.000000" 10/12 07:57:23 Env: "" 10/12 07:57:23 Iwd: "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder" 10/12 07:57:23 Ckpt Wanted: TRUE 10/12 07:57:23 Is Restart: FALSE 10/12 07:57:23 Core Limit Valid: TRUE 10/12 07:57:23 Coredump Limit 0 10/12 07:57:23 User uid set to 20563 10/12 07:57:23 User uid set to 20563 10/12 07:57:23 User Process 9.11517 { 10/12 07:57:23 cmd = /home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL 10/12 07:57:23 args = elliptical_fringes.cdf elliptical_fringes_out_-276p0_1852p0 -276.000000 1852.000000 10/12 07:57:23 env = 10/12 07:57:23 local_dir = dir_32363 10/12 07:57:23 cur_ckpt = dir_32363/condor_exec.9.11517 10/12 07:57:23 core_name = (either 'core' or 'core.<pid>') 10/12 07:57:23 uid = 20563, gid = 20563 10/12 07:57:23 v_pid = -1 10/12 07:57:23 pid = (NOT CURRENTLY EXECUTING) 10/12 07:57:23 exit_status_valid = FALSE 10/12 07:57:23 exit_status = (NEVER BEEN EXECUTED) 10/12 07:57:23 ckpt_wanted = TRUE 10/12 07:57:23 coredump_limit_exists = TRUE 10/12 07:57:23 coredump_limit = 0 10/12 07:57:23 soft_kill_sig = 20 10/12 07:57:23 job_class = STANDARD 10/12 07:57:23 state = NEW 10/12 07:57:23 new_ckpt_created = FALSE 10/12 07:57:23 ckpt_transferred = FALSE 10/12 07:57:23 core_created = FALSE 10/12 07:57:23 core_transferred = FALSE 10/12 07:57:23 exit_requested = FALSE 10/12 07:57:23 image_size = -1 blocks 10/12 07:57:23 user_time = 0 10/12 07:57:23 sys_time = 0 10/12 07:57:23 guaranteed_user_time = 0 10/12 07:57:23 guaranteed_sys_time = 0 10/12 07:57:23 } 10/12 07:57:23 *FSM* Transitioning to state "GET_EXEC" 10/12 07:57:23 *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE ] 10/12 07:57:23 Entering get_exec() 10/12 07:57:23 Executable is located on submitting host 10/12 07:57:23 Expanded executable name is "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" 10/12 07:57:23 Going to try 3 attempts at getting the inital executable 10/12 07:57:23 Entering get_file( /home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL, dir_32363/condor_exec.9.11517, 0755 ) 10/12 07:57:23 Opened "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" via file stream 10/12 07:57:24 Get_file() transferred 14188400 bytes, 43871792 bytes/second 10/12 07:57:24 Fetched orig ckpt file "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" into "dir_32363/condor_exec.9.11517" with 1 attempt 10/12 07:57:24 Executable 'dir_32363/condor_exec.9.11517' is linked with "$CondorVersion: 6.6.5 May 3 2004 $" on a "$CondorPlatform: I386-LINUX-RH9 $" 10/12 07:57:24 *FSM* Executing transition function "spawn_all" 10/12 07:57:24 Pipe built 10/12 07:57:24 New pipe_fds[14,1] 10/12 07:57:24 cmd_fd = 14 10/12 07:57:24 Calling execve( "/home/condor/hosts/node-04/execute/dir_32363/condor_exec.9.11517", "condor_exec.9.11517", "-_condor_cmd_fd", "14", "elliptical_fringes.cdf", "elliptical_fringes_out_-276p0_1852p0", "-276.000000", "1852.000000", 0, "CONDOR_VM=vm1", "CONDOR_SCRATCH_DIR=/home/condor/hosts/node-04/execute/dir_32363", 0 ) 10/12 07:57:24 Started user job - PID = 32364 10/12 07:57:24 cmd_fp = 0x828a720 10/12 07:57:24 end 10/12 07:57:24 *FSM* Transitioning to state "SUPERVISE" 10/12 07:57:24 *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT ] 10/12 07:57:25 *FSM* Got asynchronous event "CHILD_EXIT" 10/12 07:57:25 *FSM* Executing transition function "reaper" 10/12 07:57:25 Process 32364 killed by signal 9 10/12 07:57:25 Process exited by request 10/12 07:57:25 *FSM* Transitioning to state "PROC_EXIT" 10/12 07:57:25 *FSM* Executing state func "proc_exit()" [ DIE ] 10/12 07:57:25 *FSM* Executing transition function "dispose_one" 10/12 07:57:25 Sending final status for process 9.11517 10/12 07:57:25 STATUS encoded as CKPT, *NOT* TRANSFERRED 10/12 07:57:25 User time = 0.000000 seconds 10/12 07:57:25 System time = 0.000000 seconds 10/12 07:57:25 Unlinked "dir_32363/condor_exec.9.11517" 10/12 07:57:25 Removed directory "dir_32363" 10/12 07:57:25 *FSM* Transitioning to state "SUPERVISE" 10/12 07:57:25 *FSM* Got asynchronous event "DIE" 10/12 07:57:25 *FSM* Executing transition function "req_die" 10/12 07:57:25 *FSM* Transitioning to state "TERMINATE" 10/12 07:57:25 *FSM* Executing state func "terminate_all()" [ ] 10/12 07:57:25 *FSM* Transitioning to state "SEND_STATUS_ALL" 10/12 07:57:25 *FSM* Executing state func "dispose_all()" [ ] 10/12 07:57:25 *FSM* Reached state "END" 10/12 07:57:25 ********* STARTER terminating normally **********

References:
- RE: [Condor-users] config problems....
  - From: bruce
- Re: [Condor-users] config problems....
  - From: Nick LeRoy

Prev by Date: [Condor-users] condor setup using nfs....
Next by Date: [Condor-users] RE: condor setup using nfs....
Previous by thread: Re: [Condor-users] config problems....
Next by thread: [Condor-users] condor setup using nfs....
Index(es):
- Date
- Thread

Mailing List Archives

Public Access

[Condor-users] eviction of all jobs after a period of successfulruns.