
Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1



Folks,

I neglected to note that this is on RHEL 6.7 in an NIS/NFS environment.  Any thoughts on how to make checkpointing work in this environment are welcome!

Andy

> On Nov 2, 2015, at 3:11 PM, Feldt, Andrew N. <afeldt@xxxxxx> wrote:
> 
> I recently found that our HTCondor jobs were never vacating because we had not set up a method for running condor_kbdd.  So I arranged for it to be started when a user logs into Gnome and killed when that user logs out.  But then I started getting reports of "user aborted" jobs.  Some debugging showed me that, while nothing bad occurs when a checkpoint is made, a job which tries to restart from a checkpoint fails.  This shows up in the user's log file as:
> 
> 001 (008.000.000) 11/02 11:18:32 Job executing on host: <129.15.nn.nn:9757?addrs=129.15.nn.nn-9757>
> ...
> 005 (008.000.000) 11/02 11:18:33 Job terminated.
> 	(0) Abnormal termination (signal 6)
> 	(0) No core file
> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
> 		Usr 0 00:10:04, Sys 0 00:00:00  -  Total Remote Usage
> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
> 	334  -  Run Bytes Sent By Job
> 	4097614  -  Run Bytes Received By Job
> 	0  -  Total Bytes Sent By Job
> 	0  -  Total Bytes Received By Job
> ...
> 009 (008.000.000) 11/02 11:18:33 Job was aborted by the user.
> 
> 
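> For reference, this is a standard universe job (the executable is relinked with condor_compile), submitted with a minimal description along these lines (the file names here are just placeholders, not our actual submit file):
> 
> 	universe   = standard
> 	executable = myjob
> 	output     = myjob.out
> 	error      = myjob.err
> 	log        = myjob.log
> 	queue
> 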
> In the shadow log file for the job, I see:
> 
> 11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
> 11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
> 11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
> 11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated
> 
> followed by a Backtrace, and then:
> 
> 11/02/15 11:18:33 (8.0) (2889688):Shadow: Job 8.0 exited, termsig = 6, coredump = 0, retcode = 0
> 11/02/15 11:18:33 (8.0) (2889688):Shadow: was killed by signal 6.
> 11/02/15 11:18:33 (8.0) (2889688):user_time = 0 ticks
> 11/02/15 11:18:33 (8.0) (2889688):sys_time = 2 ticks
> 11/02/15 11:18:33 (8.0) (2889688):Static Policy: removing job because OnExitRemove has become true
> 11/02/15 11:18:33 (8.0) (2889688):********** Shadow Exiting(102) **********
> 
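> Since it is OnExitRemove that removes the job here, one stopgap I may try is adding something like the following to the submit description, so a restart that dies on a signal leaves the job in the queue instead of removing it (just a sketch, untested):
> 
> 	on_exit_remove = (ExitBySignal == False)
> 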
> On the RemoteHost, in the StartLog, I see:
> 
> 11/02/15 10:59:39 Starter pid 3317011 exited with status 0
> 11/02/15 10:59:39 slot1: State change: starter exited
> 11/02/15 10:59:39 slot1: State change: No preempting claim, returning to owner
> 11/02/15 10:59:39 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 11/02/15 11:18:03 slot1: State change: IS_OWNER is false
> 11/02/15 11:18:03 slot1: Changing state: Owner -> Unclaimed
> 11/02/15 11:18:32 slot1: Request accepted.
> 11/02/15 11:18:32 slot1: Remote owner is feldt@xxxxxxxxxx
> 11/02/15 11:18:32 slot1: State change: claiming protocol successful
> 11/02/15 11:18:32 slot1: Changing state: Unclaimed -> Claimed
> 11/02/15 11:18:32 slot1: Got activate_claim request from shadow (129.15.nn.nn)
> 11/02/15 11:18:32 slot1: Remote job ID is 8.0
> 11/02/15 11:18:32 slot1: Got universe "STANDARD" (1) from request classad
> 11/02/15 11:18:32 slot1: State change: claim-activation protocol successful
> 11/02/15 11:18:32 slot1: Changing activity: Idle -> Busy
> 11/02/15 11:18:33 condor_write(): Socket closed when trying to write 28 bytes to <129.15.nn.nn:9682>, fd is 8
> 11/02/15 11:18:33 Buf::write(): condor_write() failed
> 11/02/15 11:18:33 slot1: Called deactivate_claim_forcibly()
> 11/02/15 11:18:33 Starter pid 3319125 exited with status 0
> 11/02/15 11:18:33 slot1: State change: starter exited
> 
> So, it looks like the job either dies trying to write to the shadow on the submitting host or is unable to execute the checkpointed file.  Note that this is not a firewall issue.  We have all ports open between the submit host and the startd host.
> 
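> If more detail would help with debugging, I can raise the logging on both sides; I assume something like the following in the local configuration of the submit and execute hosts would do it (just a sketch of what I plan to try):
> 
> 	SHADOW_DEBUG  = D_FULLDEBUG
> 	STARTER_DEBUG = D_FULLDEBUG
> 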
> I did see one SELinux issue and added the following local policy rule based on it:
> 
> allow hald_t condor_master_t:bus send_msg;
> 
> This did not help, so I am stuck.
> 
> Andy
> 