
[HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1



I recently found that our HTCondor jobs were never vacating because we had not set up a way of running condor_kbdd.  So, I arranged for it to be started when a user logs into GNOME and killed when they log out (a sketch of that hookup is at the end of this message).  But then I started getting reports of "user aborted" jobs.  Some debugging showed me that, while nothing bad happens when a checkpoint is made, a job that tries to restart from a checkpoint fails.  This shows up in the user's log file as:

001 (008.000.000) 11/02 11:18:32 Job executing on host: <129.15.nn.nn:9757?addrs=129.15.nn.nn-9757>
...
005 (008.000.000) 11/02 11:18:33 Job terminated.
	(0) Abnormal termination (signal 6)
	(0) No core file
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:10:04, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	334  -  Run Bytes Sent By Job
	4097614  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
009 (008.000.000) 11/02 11:18:33 Job was aborted by the user.


In the shadow log file for the job, I see:

11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated

followed by a backtrace, and then:

11/02/15 11:18:33 (8.0) (2889688):Shadow: Job 8.0 exited, termsig = 6, coredump = 0, retcode = 0
11/02/15 11:18:33 (8.0) (2889688):Shadow: was killed by signal 6.
11/02/15 11:18:33 (8.0) (2889688):user_time = 0 ticks
11/02/15 11:18:33 (8.0) (2889688):sys_time = 2 ticks
11/02/15 11:18:33 (8.0) (2889688):Static Policy: removing job because OnExitRemove has become true
11/02/15 11:18:33 (8.0) (2889688):********** Shadow Exiting(102) **********

On the remote host, in its StartLog, I see:

11/02/15 10:59:39 Starter pid 3317011 exited with status 0
11/02/15 10:59:39 slot1: State change: starter exited
11/02/15 10:59:39 slot1: State change: No preempting claim, returning to owner
11/02/15 10:59:39 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/02/15 11:18:03 slot1: State change: IS_OWNER is false
11/02/15 11:18:03 slot1: Changing state: Owner -> Unclaimed
11/02/15 11:18:32 slot1: Request accepted.
11/02/15 11:18:32 slot1: Remote owner is feldt@xxxxxxxxxx
11/02/15 11:18:32 slot1: State change: claiming protocol successful
11/02/15 11:18:32 slot1: Changing state: Unclaimed -> Claimed
11/02/15 11:18:32 slot1: Got activate_claim request from shadow (129.15.nn.nn)
11/02/15 11:18:32 slot1: Remote job ID is 8.0
11/02/15 11:18:32 slot1: Got universe "STANDARD" (1) from request classad
11/02/15 11:18:32 slot1: State change: claim-activation protocol successful
11/02/15 11:18:32 slot1: Changing activity: Idle -> Busy
11/02/15 11:18:33 condor_write(): Socket closed when trying to write 28 bytes to <129.15.nn.nn:9682>, fd is 8
11/02/15 11:18:33 Buf::write(): condor_write() failed
11/02/15 11:18:33 slot1: Called deactivate_claim_forcibly()
11/02/15 11:18:33 Starter pid 3319125 exited with status 0
11/02/15 11:18:33 slot1: State change: starter exited

So, it looks like the job either dies trying to write to the shadow on the submitting host or is unable to execute the checkpointed file.  Note that this is not a firewall issue: all ports are open between the submit host and the startd host.
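
For what it's worth, a quick way to sanity-check that from the execute host is something along these lines (the address is masked the same way as in the logs, and the port is the one from the condor_write() error above):

# run on the execute host; address masked as in the logs above
iptables -L -n                    # confirm there are no DROP/REJECT rules in the path
nc -zv 129.15.nn.nn 9682          # the shadow address/port from the condor_write() failure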

I did see one SELinux denial and added the following rule in a local policy module based on it:

allow hald_t condor_master_t:bus send_msg;
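
For completeness, a local module carrying that rule can be built and loaded roughly as follows; the module name here is just illustrative:

# sketch: turn the logged AVC denial into a local policy module and install it
ausearch -m avc -ts recent | audit2allow -M condor_local
semodule -i condor_local.pp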

This did not help, so I am stuck.
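
In case it matters, the condor_kbdd hookup itself is nothing exotic: roughly, an XDG autostart entry along the lines of the sketch below starts the daemon when a GNOME session begins, and the kill at logout is handled separately.  The path, file name, and daemon location are just illustrative and may differ on other installs.

# /etc/xdg/autostart/condor_kbdd.desktop  (illustrative path and name)
[Desktop Entry]
Type=Application
Name=HTCondor keyboard activity daemon
Exec=/usr/sbin/condor_kbdd
NoDisplay=true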

Andy