
[Condor-users] Condor checkpointing problems?



Hello,

At Purdue University we recently installed a new cluster running
RedHat ES3.  It runs Condor 6.6.5, the same version as our Debian
clusters.  About a week ago a user reported "shadow exception"
errors in the logs; his jobs were failing to checkpoint.  At first
we thought the problem affected all the nodes, but we've since
narrowed it down to nodes in the new RedHat cluster only.

Here's a ShadowLog excerpt from the (Debian) frontend server
where users submit their jobs:

11/19 11:48:55 (?.?) (5876):Hostname = "<172.18.4.49:32773>", Job = 24292.0
11/19 11:48:55 (24292.0) (5876):Requesting Primary Starter
11/19 11:48:55 (24292.0) (5876):Shadow: Request to run a job was ACCEPTED
11/19 11:48:55 (24292.0) (5876):Shadow: RSC_SOCK connected, fd = 17
11/19 11:48:55 (24292.0) (5876):Shadow: CLIENT_LOG connected, fd = 18
11/19 11:48:56 (24292.0) (5876):My_Filesystem_Domain = "radon.rcac.purdue.edu"
11/19 11:48:56 (24292.0) (5876):My_UID_Domain = "rcac.purdue.edu"
11/19 11:48:56 (24292.0) (5876):        Entering pseudo_get_file_stream
11/19 11:48:56 (24292.0) (5876):        file = "/var/condor/spool/cluster24292.ickpt.subproc0"
11/19 11:48:56 (24292.0) (5876):        128.210.9.35
11/19 11:48:56 (24292.0) (5876):        128.210.9.35
11/19 11:48:56 (24292.0) (5876):Reaped child status - pid 5879 exited with status 0
11/19 11:48:57 (24292.0) (5876):Read: condor_restart:
11/19 11:48:57 (24292.0) (5876):Read: Checkpoint file name is "/var/condor/spool/cluster24292.proc0.subproc0"
11/19 11:48:57 (24292.0) (5876):        Entering pseudo_get_file_stream
11/19 11:48:57 (24292.0) (5876):        file = "/var/condor/spool/cluster24292.proc0.subproc0"
11/19 11:48:57 (24292.0) (5876):        128.210.9.71
11/19 11:48:57 (24292.0) (5876):RestoreRequest returned 0 using port 37168
11/19 11:48:57 (24292.0) (5876):Read: Opened "/var/condor/spool/cluster24292.proc0.subproc0" via file stream
11/19 11:48:57 (24292.0) (5876):Read: Read headers OK
11/19 11:48:57 (24292.0) (5876):Read: Read SegMap[0](DATA) OK
11/19 11:48:57 (24292.0) (5876):Read: Read SegMap[1](STACK) OK
11/19 11:48:57 (24292.0) (5876):Read: Read all SegMaps OK
11/19 11:48:57 (24292.0) (5876):Read: Found a DATA block, increasing heap from 0x9934000 to 0x10e30000
11/19 11:48:57 (24292.0) (5876):Read: About to overwrite 0x8c17000 bytes starting at 0x8219000(DATA)
11/19 11:48:57 (24292.0) (5876):Read: in Segmap::Read(): fd = 3, read_size=65536
11/19 11:48:57 (24292.0) (5876):Read: core_loc=8219000, nbytes=-1
11/19 11:48:57 (24292.0) (5876):Read: SegMap::Read() failed!
11/19 11:48:57 (24292.0) (5876):Shadow: Job 24292.0 exited, termsig = 9, coredump = 0, retcode = 0
11/19 11:48:57 (24292.0) (5876):Shadow: Job was kicked off without a checkpoint
11/19 11:48:57 (24292.0) (5876):Shadow: DoCleanup: unlinking TmpCkpt '/var/condor/spool/cluster24292.proc0.subproc0.tmp'
11/19 11:48:57 (24292.0) (5876):Trying to unlink /var/condor/spool/cluster24292.proc0.subproc0.tmp
11/19 11:48:57 (24292.0) (5876):user_time = 8 ticks
11/19 11:48:57 (24292.0) (5876):sys_time = 6 ticks
11/19 11:48:57 (24292.0) (5876):********** Shadow Exiting(107) **********
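
The "nbytes=-1" in the SegMap::Read() lines looks like the raw
return value of a failed read(2); the ShadowLog never prints the
errno that would say why the read died.  Purely for illustration
(a minimal sketch of that chunked-read pattern, not Condor's
actual source), the failure point is roughly:

/* Minimal sketch (not Condor's source) of the chunked read behind
 * "core_loc=..., nbytes=-1": read(2) returns -1 on failure and
 * leaves the real cause in errno, which the ShadowLog omits. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int restore_segment(int fd, char *core_loc, size_t len,
                           size_t read_size)
{
    size_t done = 0;
    while (done < len) {
        size_t chunk = len - done < read_size ? len - done : read_size;
        ssize_t nbytes = read(fd, core_loc + done, chunk);
        if (nbytes == 0) {          /* EOF: peer closed mid-transfer */
            fprintf(stderr, "read failed: unexpected EOF\n");
            return -1;
        }
        if (nbytes < 0) {
            /* This is the spot the ShadowLog reports as nbytes=-1;
             * strerror(errno) is the missing piece of information. */
            fprintf(stderr, "read failed: nbytes=-1, errno=%d (%s)\n",
                    errno, strerror(errno));
            return -1;
        }
        done += (size_t)nbytes;
    }
    return 0;
}

int main(void)
{
    /* Hypothetical use: pull up to 64 KB from stdin. */
    static char buf[65536];
    return restore_segment(STDIN_FILENO, buf, sizeof buf, sizeof buf) ? 1 : 0;
}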

All the failures look the same: the restore reads the checkpoint
headers and segment maps fine, then fails while reading the DATA
segment back in.  The checkpoint server's transfer log gets this:

11/19 12:15:29 S F -1 bytes 1 sec 172.18.4.53 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35
11/19 12:15:32 S F -1 bytes 0 sec 172.18.4.42 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35
11/19 12:15:32 S F -1 bytes 0 sec 172.18.4.65 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35
11/19 12:15:35 S F -1 bytes 0 sec 172.18.4.53 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35
11/19 12:15:35 S F -1 bytes 0 sec 172.18.4.55 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35

and for each of those transfer-log lines the checkpoint server log shows:

11/19 12:15:35  Request to restore checkpoint file GRANTED
11/19 12:15:35 stream_file_xfer: 983040 bytes written, 11928575 bytes to go
11/19 12:15:35 stream_file_xfer: write returns -1 (errno=104) when attempting to write 65536 bytes
11/19 12:15:35 ERROR: Bad file size (sent -1, expected 12911615)
11/19 12:15:35  File transmission self-terminated; error #24
                        occurred during file transmission
11/19 12:15:35  Receiving service request from 128.210.9.35
                        Using descriptor 8 to handle request
11/19 12:15:35  Service request cannot be completed:
                        Attempt returns error #91
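
Assuming the errno=104 in that stream_file_xfer line is a plain
Linux errno (a guess on my part, but the failed write() suggests
it), 104 is ECONNRESET, meaning the connection was reset by the
peer mid-transfer.  The later "error #24" and "error #91" look
like Condor-internal codes, so I won't guess at those.  A
one-liner to check the errno mapping:

/* Prints what errno 104 means on this platform; on Linux it is
 * ECONNRESET, "Connection reset by peer".  Assumes the checkpoint
 * server is logging a standard C errno value. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("errno 104: %s\n", strerror(104));
    return 0;
}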

By comparison, the Debian nodes work fine:

(CkptServerLog:)
11/16 10:59:37  Request to store checkpoint file GRANTED
11/16 10:59:43  Receiving service request from 128.210.9.35
                        Using descriptor 8 to handle request
11/16 10:59:43  Service request successfully completed
11/16 10:59:44  File successfully received
11/16 10:59:44 RECV transferred 69055487 bytes in 7 seconds (9865069 bytes / sec)
(TransferLog:)
11/16 10:59:44 R S 69055487 bytes 7 sec 128.211.130.32 xyz@xxxxxxxxxxxxxxxxxxxxx@128.210.9.35

Does anyone know what could be causing this?  I looked through
the 6.6.6 changelog, but it doesn't mention a fix for anything
like this, and the 6.6.7 changelog wasn't available yet, so I
can't tell whether upgrading would help.

Any help is greatly appreciated.  Please keep shuey@xxxxxxxxxx on
cc so he receives replies.  Thank you!

Regards,
-- 
wca
Purdue University Rosen Center for Advanced Computing
Office: MATH G135 @ Purdue - http://www.itap.purdue.edu/rcac/