[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] eviction of all jobs after a period of successfulruns.



Hi folks,

I have some parametric problems, where I run as many as 40,000 small (1.5 minute) jobs on a condor cluster.
The binaries are compiled with checkpointing. Condor version is 6.6.6 on RH 9.0 intel.


What I find is that, besides being less than optimal time-wise to use many 1.5 minute jobs, that the condor cluster after a time stops running the jobs. The Negotiator matches them up with run nodes, but the jobs are evicted immediately after being Started. This happens spontaneously after about two hours of successful runs.

The cluster remains idle even with these jobs in queue.

The Negotiator log just shows successful matchups that schedule jobs on all available resources, the SchedLog on the submitting node shows that a shadow job is started for the jobs, and then the job is VACATED, and the StarterLog on the run node shows that the job receives a CHILD_EXIT event before the job really runs at all.

The logs are below.


Any thoughts on why this is happening? I've seen something like this before.


------------------------------------------------------------------------ ----------

NegotiatorLog:

10/12 07:47:19 Started shadow for job 9.11517 on "<10.101.1.4:40709>", (shadow pid = 21560)
10/12 07:47:21 Started shadow for job 9.11518 on "<10.101.1.4:40709>", (shadow pid = 21563)
10/12 07:47:21 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709>
10/12 07:47:21 Match record (<10.101.1.4:40709>, 9, 11517) deleted
10/12 07:47:21 DaemonCore: Command received via TCP from host <10.101.1.4:43557>
10/12 07:47:21 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/12 07:47:21 Got VACATE_SERVICE from <10.101.1.4:43557>
10/12 07:47:22 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709>
10/12 07:47:22 Match record (<10.101.1.4:40709>, 9, 11518) deleted
10/12 07:47:22 DaemonCore: Command received via TCP from host <10.101.1.4:43561>
10/12 07:47:22 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/12 07:47:22 Got VACATE_SERVICE from <10.101.1.4:43561>


The ShadowLog (different job, same behavior)

10/12 07:53:03 (?.?) (22151):******* Standard Shadow starting up *******
10/12 07:53:03 (?.?) (22151):** $CondorVersion: 6.6.6 Jul 26 2004 $
10/12 07:53:03 (?.?) (22151):** $CondorPlatform: I386-LINUX_RH9 $
10/12 07:53:03 (?.?) (22151):*******************************************
10/12 07:53:03 (?.?) (22151):uid=0, euid=10228, gid=0, egid=10228
10/12 07:53:03 (?.?) (22151):Hostname = "<10.101.1.48:32771>", Job = 9.11579
10/12 07:53:03 (9.11579) (22151):Requesting Primary Starter
10/12 07:53:03 (9.11579) (22151):Shadow: Request to run a job was ACCEPTED
10/12 07:53:03 (9.11579) (22151):Shadow: RSC_SOCK connected, fd = 17
10/12 07:53:03 (9.11579) (22151):Shadow: CLIENT_LOG connected, fd = 18
10/12 07:53:03 (9.11579) (22151):My_Filesystem_Domain = "deas.harvard.edu"
10/12 07:53:03 (9.11579) (22151):My_UID_Domain = "deas.harvard.edu"
10/12 07:53:03 (9.11579) (22151): Entering pseudo_get_file_stream
10/12 07:53:03 (9.11579) (22151): file = "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL"
10/12 07:53:03 (9.11579) (22151): Weird 0xa650001
10/12 07:53:03 (9.11579) (22151): Weird 0xa650001
10/12 07:53:04 (9.11579) (22151):Reaped child status - pid 22152 exited with status 0
10/12 07:53:05 (9.11579) (22151):Read: User Job - $CondorPlatform: I386-LINUX-RH9 $
10/12 07:53:05 (9.11579) (22151):Read: User Job - $CondorVersion: 6.6.5 May 3 2004 $
10/12 07:53:05 (9.11579) (22151):Read: Checkpoint file name is "/home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0"
10/12 07:53:05 (9.11579) (22151):error: Error: Couldn't open standard file '/dev/null'
10/12 07:53:05 (9.11579) (22151):Shadow: Job 9.11579 exited, termsig = 9, coredump = 0, retcode = 0
10/12 07:53:05 (9.11579) (22151):Shadow: Job was kicked off without a checkpoint
10/12 07:53:05 (9.11579) (22151):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0.tmp'
10/12 07:53:05 (9.11579) (22151):Trying to unlink /home/condor/hosts/grid-gw-01/spool/cluster9.proc11579.subproc0.tmp
10/12 07:53:05 (9.11579) (22151):user_time = 0 ticks
10/12 07:53:05 (9.11579) (22151):sys_time = 5 ticks
10/12 07:53:05 (9.11579) (22151):********** Shadow Exiting(107) **********



And the StarterLog on the run node show that the Starter gets a CHILD_EXIT event right after the job starts:


10/12 07:57:23 ********** STARTER starting up ***********
10/12 07:57:23 ** $CondorVersion: 6.6.6 Jul 26 2004 $
10/12 07:57:23 ** $CondorPlatform: I386-LINUX_RH9 $
10/12 07:57:23 ******************************************
10/12 07:57:23 Submitting machine is "grid-gw-01.deas.harvard.edu"
10/12 07:57:23 EventHandler {
10/12 07:57:23 func = 0x8073476
10/12 07:57:23 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
10/12 07:57:23 }
10/12 07:57:23 Done setting resource limits
10/12 07:57:23 *FSM* Transitioning to state "GET_PROC"
10/12 07:57:23 *FSM* Executing state func "get_proc()" [ ]
10/12 07:57:23 Entering get_proc()
10/12 07:57:23 Entering get_job_info()
10/12 07:57:23 Startup Info:
10/12 07:57:23 Version Number: 1
10/12 07:57:23 Id: 9.11517
10/12 07:57:23 JobClass: STANDARD
10/12 07:57:23 Uid: 20563
10/12 07:57:23 Gid: 20563
10/12 07:57:23 VirtPid: -1
10/12 07:57:23 SoftKillSignal: 20
10/12 07:57:23 Cmd: "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL"
10/12 07:57:23 Args: "elliptical_fringes.cdf elliptical_fringes_out_-276p0_1852p0 -276.000000 1852.000000"
10/12 07:57:23 Env: ""
10/12 07:57:23 Iwd: "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder"
10/12 07:57:23 Ckpt Wanted: TRUE
10/12 07:57:23 Is Restart: FALSE
10/12 07:57:23 Core Limit Valid: TRUE
10/12 07:57:23 Coredump Limit 0
10/12 07:57:23 User uid set to 20563
10/12 07:57:23 User uid set to 20563
10/12 07:57:23 User Process 9.11517 {
10/12 07:57:23 cmd = /home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL
10/12 07:57:23 args = elliptical_fringes.cdf elliptical_fringes_out_-276p0_1852p0 -276.000000 1852.000000
10/12 07:57:23 env =
10/12 07:57:23 local_dir = dir_32363
10/12 07:57:23 cur_ckpt = dir_32363/condor_exec.9.11517
10/12 07:57:23 core_name = (either 'core' or 'core.<pid>')
10/12 07:57:23 uid = 20563, gid = 20563
10/12 07:57:23 v_pid = -1
10/12 07:57:23 pid = (NOT CURRENTLY EXECUTING)
10/12 07:57:23 exit_status_valid = FALSE
10/12 07:57:23 exit_status = (NEVER BEEN EXECUTED)
10/12 07:57:23 ckpt_wanted = TRUE
10/12 07:57:23 coredump_limit_exists = TRUE
10/12 07:57:23 coredump_limit = 0
10/12 07:57:23 soft_kill_sig = 20
10/12 07:57:23 job_class = STANDARD
10/12 07:57:23 state = NEW
10/12 07:57:23 new_ckpt_created = FALSE
10/12 07:57:23 ckpt_transferred = FALSE
10/12 07:57:23 core_created = FALSE
10/12 07:57:23 core_transferred = FALSE
10/12 07:57:23 exit_requested = FALSE
10/12 07:57:23 image_size = -1 blocks
10/12 07:57:23 user_time = 0
10/12 07:57:23 sys_time = 0
10/12 07:57:23 guaranteed_user_time = 0
10/12 07:57:23 guaranteed_sys_time = 0
10/12 07:57:23 }
10/12 07:57:23 *FSM* Transitioning to state "GET_EXEC"
10/12 07:57:23 *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE ]
10/12 07:57:23 Entering get_exec()
10/12 07:57:23 Executable is located on submitting host
10/12 07:57:23 Expanded executable name is "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL"
10/12 07:57:23 Going to try 3 attempts at getting the inital executable
10/12 07:57:23 Entering get_file( /home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL, dir_32363/condor_exec.9.11517, 0755 )
10/12 07:57:23 Opened "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" via file stream
10/12 07:57:24 Get_file() transferred 14188400 bytes, 43871792 bytes/second
10/12 07:57:24 Fetched orig ckpt file "/home/parrott/Research/EllipticalFringes/data_runs/toy-run-disorder/./ bin/WP_tip_engine_checkpointing.LINUX.INTEL" into "dir_32363/condor_exec.9.11517" with 1 attempt
10/12 07:57:24 Executable 'dir_32363/condor_exec.9.11517' is linked with "$CondorVersion: 6.6.5 May 3 2004 $" on a "$CondorPlatform: I386-LINUX-RH9 $"
10/12 07:57:24 *FSM* Executing transition function "spawn_all"
10/12 07:57:24 Pipe built
10/12 07:57:24 New pipe_fds[14,1]
10/12 07:57:24 cmd_fd = 14
10/12 07:57:24 Calling execve( "/home/condor/hosts/node-04/execute/dir_32363/condor_exec.9.11517", "condor_exec.9.11517", "-_condor_cmd_fd", "14", "elliptical_fringes.cdf", "elliptical_fringes_out_-276p0_1852p0", "-276.000000", "1852.000000", 0, "CONDOR_VM=vm1", "CONDOR_SCRATCH_DIR=/home/condor/hosts/node-04/execute/dir_32363", 0 )
10/12 07:57:24 Started user job - PID = 32364
10/12 07:57:24 cmd_fp = 0x828a720
10/12 07:57:24 end
10/12 07:57:24 *FSM* Transitioning to state "SUPERVISE"
10/12 07:57:24 *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT ]
10/12 07:57:25 *FSM* Got asynchronous event "CHILD_EXIT"
10/12 07:57:25 *FSM* Executing transition function "reaper"
10/12 07:57:25 Process 32364 killed by signal 9
10/12 07:57:25 Process exited by request
10/12 07:57:25 *FSM* Transitioning to state "PROC_EXIT"
10/12 07:57:25 *FSM* Executing state func "proc_exit()" [ DIE ]
10/12 07:57:25 *FSM* Executing transition function "dispose_one"
10/12 07:57:25 Sending final status for process 9.11517
10/12 07:57:25 STATUS encoded as CKPT, *NOT* TRANSFERRED
10/12 07:57:25 User time = 0.000000 seconds
10/12 07:57:25 System time = 0.000000 seconds
10/12 07:57:25 Unlinked "dir_32363/condor_exec.9.11517"
10/12 07:57:25 Removed directory "dir_32363"
10/12 07:57:25 *FSM* Transitioning to state "SUPERVISE"
10/12 07:57:25 *FSM* Got asynchronous event "DIE"
10/12 07:57:25 *FSM* Executing transition function "req_die"
10/12 07:57:25 *FSM* Transitioning to state "TERMINATE"
10/12 07:57:25 *FSM* Executing state func "terminate_all()" [ ]
10/12 07:57:25 *FSM* Transitioning to state "SEND_STATUS_ALL"
10/12 07:57:25 *FSM* Executing state func "dispose_all()" [ ]
10/12 07:57:25 *FSM* Reached state "END"
10/12 07:57:25 ********* STARTER terminating normally **********