
[Condor-users] Checkpointing Errors




Hi All,

I seem to be having major problems with checkpointing. The jobs run OK, but when they are interrupted I get the following messages in ShadowLog:


5/8 11:56:09 (1.1) (2846):Read: About to write checkpoint
5/8 11:56:09 (1.1) (2846):Read: Image::Write(): fd -1 file_name /var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0
5/8 11:56:09 (1.1) (2846):Read: Checkpoint name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0"
5/8 11:56:09 (1.1) (2846):Read: Tmp name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0.tmp"
5/8 11:56:09 (1.1) (2846):      Entering pseudo_put_file_stream
5/8 11:56:09 (1.1) (2846): file = "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0.tmp"
5/8 11:56:09 (1.1) (2846):      len = 66511871
5/8 11:56:09 (1.1) (2846):      owner = condor
5/8 11:56:09 (1.1) (2846):       Weird 0xf77cd89
5/8 11:56:09 (1.1) (2846):Returned addr
5/8 11:56:09 (1.1) (2846):      137.205.119.15
5/8 11:56:09 (1.1) (2846):Returned port 53211
5/8 11:56:09 (1.1) (2846):Read: connect() failed - errno = 111
5/8 11:56:09 (1.1) (2846):Read: open_tcp_stream() failed
5/8 11:56:09 (1.1) (2846):Read: ERROR:open_ckpt_file failed, aborting ckpt
5/8 11:56:09 (1.1) (2846):Read: Ckpt exit
5/8 11:56:09 (1.1) (2846):Read: Write failed with [-1]
5/8 11:56:09 (1.1) (2846):Shadow: Job 1.1 exited, termsig = 9, coredump = 0, retcode = 0
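
The `errno = 111` reported by connect() above can be decoded into a symbolic name. On Linux, 111 is ECONNREFUSED, meaning nothing was accepting connections on the returned address and port. The numeric value is platform-specific, so this minimal sketch assumes it is run on the same kind of machine that produced the log; `describe_errno` is a hypothetical helper name, not part of Condor.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    # errno.errorcode maps numeric values to names such as "ECONNREFUSED".
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 111 is the value reported in the ShadowLog above.
    name, message = describe_errno(111)
    print(name, "-", message)
```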

Our LOWPORT is 9000; HIGHPORT is 9500 on the servers and 9060 on the clients. I'm confused as to why the checkpointing system is picking port 53211, and I can't find a configuration option to change it! There aren't any checkpoint files on disk, and the TransferLog shows a negative number of bytes being received, so I assume the transfer counts as failed?
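
The mismatch between the configured range and the returned port can be written out as a quick check. This is only a sketch: `in_port_range` is a hypothetical helper, and the values are copied from the server configuration and the ShadowLog above.

```python
def in_port_range(port, low, high):
    """True if port lies within the inclusive range [low, high]."""
    return low <= port <= high


if __name__ == "__main__":
    LOWPORT = 9000          # configured lower bound
    HIGHPORT = 9500         # configured upper bound on the servers
    returned_port = 53211   # "Returned port" from the ShadowLog

    # Prints False: the returned port is outside the configured range.
    print(in_port_range(returned_port, LOWPORT, HIGHPORT))
```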

The image size of the job does seem to update in the condor_q listing, but the jobs seem to run forever and never finish, which makes me think the checkpointing isn't happening and the jobs are being restarted from the beginning.

I'd be grateful for any help!!!

Thanks,

Si Hammond
Univ. of Warwick