
Re: [Condor-users] Checkpointing Errors




Hi All,


> Is Condor configured to send the checkpoint back to the condor_shadow process, or have you configured a checkpoint server?

We have configured a checkpoint server. It runs on the machine we designate as our server, so it has a LOWPORT of 9000 and a HIGHPORT of 9500.

In the TransferLog I get:

5/8 13:54:54 R F -1075417848 bytes 120 sec 0.0.0.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:54:58 R F -1075417848 bytes 120 sec 0.0.0.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:54:58 R F -1075417848 bytes 120 sec 0.0.0.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:58:13 R F 0 bytes 120 sec 0.0.0.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 14:19:18 R F -1075417848 bytes 120 sec 0.0.0.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15

The number of bytes is obviously a concern, since it is a large negative value in four of the entries and zero in the other.
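As a quick sanity check on that byte count, here is a minimal Python sketch that reinterprets the logged value as an unsigned 32-bit integer. It assumes the TransferLog prints the size through a signed 32-bit field, which is only my guess and not something I have confirmed in the Condor source. The helper name as_unsigned32 is made up for this illustration.

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF

# Byte count exactly as it appears in the TransferLog entries above.
logged = -1075417848

raw = as_unsigned32(logged)
print(raw, hex(raw))  # 3219549448 0xbfe66d08
```

The result is about 3.2 GB, far larger than the 66511871-byte checkpoint image, so under that assumption the field does not look like a real transfer size at all. That would fit with the transfers never actually completing, though it does not prove it.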



5/8 11:17:14 (1.23) (4640):Shadow: Request to run a job was ACCEPTED
5/8 11:17:14 (1.23) (4640):Shadow: RSC_SOCK connected, fd = 17
5/8 11:17:14 (1.23) (4640):Shadow: CLIENT_LOG connected, fd = 18
5/8 11:17:14 (1.23) (4640):My_Filesystem_Domain = "dcs.warwick.ac.uk"
5/8 11:17:14 (1.23) (4640):My_UID_Domain = "dcs.warwick.ac.uk"
5/8 11:17:14 (1.23) (4640):     Entering pseudo_get_file_stream
5/8 11:17:14 (1.23) (4640):     file = "/dcs/condor/condor/bin/octave"
5/8 11:17:23 (1.23) (4640):Reaped child status - pid 4645 exited with status 0
5/8 11:17:23 (1.23) (4640):Read: User Job - $CondorPlatform: I386-LINUX_RHEL3 $
5/8 11:17:23 (1.23) (4640):Read: User Job - $CondorVersion: 6.9.2 Apr 9 2007 $
5/8 11:17:23 (1.23) (4640):Read: Checkpoint file name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc23.subproc0"
5/8 11:17:45 (1.20) (3751):Read: Got SIGTSTP
5/8 11:17:45 (1.20) (3751):Read: Saved signal state.
5/8 11:17:45 (1.20) (3751):Read: About to save file state
5/8 11:17:45 (1.20) (3751):Read: CondorFileTable::checkpoint
5/8 11:17:45 (1.20) (3751):Read: OPEN FILE TABLE:
5/8 11:17:45 (1.20) (3751):Read: fd 0
5/8 11:17:45 (1.20) (3751):Read: logical name: /dcs/condor/condor/riteshfiles/riteshgrid/./S_20//octaveinput.txt
5/8 11:17:45 (1.20) (3751):Read:        offset:       23
5/8 11:17:45 (1.20) (3751):Read:        dups:         1
5/8 11:17:45 (1.20) (3751):Read:        open flags:   0x0
5/8 11:17:45 (1.20) (3751):Read: url: local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//octaveinput.txt
5/8 11:17:45 (1.20) (3751):Read:        size:         23
5/8 11:17:45 (1.20) (3751):Read:        opens:        1
5/8 11:17:45 (1.20) (3751):Read: fd 1
5/8 11:17:45 (1.20) (3751):Read: logical name: /dcs/condor/condor/riteshfiles/riteshgrid/./S_20//out.log
5/8 11:17:45 (1.20) (3751):Read:        offset:       2472980
5/8 11:17:45 (1.20) (3751):Read:        dups:         1
5/8 11:17:45 (1.20) (3751):Read:        open flags:   0x1
5/8 11:17:45 (1.20) (3751):Read: url: local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//out.log
5/8 11:17:45 (1.20) (3751):Read:        size:         2472980
5/8 11:17:45 (1.20) (3751):Read:        opens:        1
5/8 11:17:45 (1.20) (3751):Read: fd 2
5/8 11:17:45 (1.20) (3751):Read: logical name: /dcs/condor/condor/riteshfiles/riteshgrid/./S_20//err.log
5/8 11:17:45 (1.20) (3751):Read:        offset:       90
5/8 11:17:45 (1.20) (3751):Read:        dups:         1
5/8 11:17:45 (1.20) (3751):Read:        open flags:   0x1
5/8 11:17:45 (1.20) (3751):Read: url: local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//err.log
5/8 11:17:45 (1.20) (3751):Read:        size:         90
5/8 11:17:45 (1.20) (3751):Read:        opens:        1
5/8 11:17:45 (1.20) (3751):Read: fd 3
5/8 11:17:45 (1.20) (3751):Read: logical name: /dcs/condor/condor/riteshfiles/riteshgrid/./S_20//Y_20.txt
5/8 11:17:45 (1.20) (3751):Read:        offset:       0
5/8 11:17:45 (1.20) (3751):Read:        dups:         1
5/8 11:17:45 (1.20) (3751):Read:        open flags:   0x1
5/8 11:17:45 (1.20) (3751):Read: url: local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//Y_20.txt
5/8 11:17:45 (1.20) (3751):Read:        size:         0
5/8 11:17:45 (1.20) (3751):Read:        opens:        1
5/8 11:17:45 (1.20) (3751):Read: working dir = /dcs/condor/condor/riteshfiles/riteshgrid/./S_20/
5/8 11:17:45 (1.20) (3751):Read: Done saving file state
5/8 11:17:45 (1.20) (3751):Read: About to update MyImage
5/8 11:17:45 (1.20) (3751):Read: Size of ckpt image = 66511871 bytes
5/8 11:17:45 (1.20) (3751):Read: About to write checkpoint
5/8 11:17:45 (1.20) (3751):Read: Image::Write(): fd -1 file_name /var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0
5/8 11:17:45 (1.20) (3751):Read: Checkpoint name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0"
5/8 11:17:45 (1.20) (3751):Read: Tmp name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp"
5/8 11:17:45 (1.20) (3751):     Entering pseudo_put_file_stream
5/8 11:17:45 (1.20) (3751): file = "/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp"
5/8 11:17:45 (1.20) (3751):     len = 66511871
5/8 11:17:45 (1.20) (3751):     owner = condor
5/8 11:17:45 (1.20) (3751):      Weird 0xf77cd89
5/8 11:17:45 (1.20) (3751):Returned addr
5/8 11:17:45 (1.20) (3751):     137.205.119.15
5/8 11:17:45 (1.20) (3751):Returned port 53075
5/8 11:17:45 (1.20) (3751):Read: connect() failed - errno = 111
5/8 11:17:45 (1.20) (3751):Read: open_tcp_stream() failed
5/8 11:17:45 (1.20) (3751):Read: ERROR:open_ckpt_file failed, aborting ckpt
5/8 11:17:45 (1.20) (3751):Read: Ckpt exit
5/8 11:17:45 (1.20) (3751):Read: Write failed with [-1]
5/8 11:17:45 (1.20) (3751):Shadow: Job 1.20 exited, termsig = 9, coredump = 0, retcode = 0
5/8 11:17:45 (1.20) (3751):Shadow: Job was kicked off without a checkpoint
5/8 11:17:45 (1.20) (3751):Shadow: DoCleanup: unlinking TmpCkpt '/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp'

The above is from the ShadowLog. As you can see, the connection is attempted on port 53075, which is outside the configured range. We have a checkpoint server set up (as described above), and the 'clients' are configured to use it. What is very odd is that all of the checkpoints seem to come back to the server to be written: in the TransferLog, all of the receives are from the server, not the clients. Should this be the case?
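To check whether this happens on every checkpoint attempt, something like the following Python sketch could scan the ShadowLog for the "Returned port" lines and flag any port outside the configured range. The function name ports_outside_range and the sample list are my own illustration (the sample lines are copied from the log above). Nothing here is part of Condor itself.

```python
import re

# Port range configured on our server, as described earlier in this message.
LOWPORT, HIGHPORT = 9000, 9500

# Matches the "Returned port N" lines seen in the ShadowLog excerpt.
PORT_RE = re.compile(r"Returned port (\d+)")

def ports_outside_range(log_lines, low=LOWPORT, high=HIGHPORT):
    """Return every 'Returned port N' value that falls outside [low, high]."""
    bad = []
    for line in log_lines:
        match = PORT_RE.search(line)
        if match:
            port = int(match.group(1))
            if not (low <= port <= high):
                bad.append(port)
    return bad

sample = [
    "5/8 11:17:45 (1.20) (3751):Returned addr",
    "5/8 11:17:45 (1.20) (3751):     137.205.119.15",
    "5/8 11:17:45 (1.20) (3751):Returned port 53075",
    "5/8 11:17:45 (1.20) (3751):Read: connect() failed - errno = 111",
]
print(ports_outside_range(sample))  # [53075]
```

To run it over a whole log rather than the sample, pass an open file object instead of the list, since the function only needs an iterable of lines.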

I'm just not sure why the port is incorrect. Does checkpointing work by opening a port to copy the file over? If so, why does it not use one in the range 9000-9500?
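For reference, the "errno = 111" reported by connect() in the ShadowLog is ECONNREFUSED on Linux, i.e. nothing was accepting connections on 137.205.119.15:53075 at that moment. A minimal Python sketch to decode an errno value on the local machine follows. The helper describe_errno is a name I made up, and errno numbering differs between platforms, so it should be run on a Linux host to match the log.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# 111 is the value from the "connect() failed" line in the ShadowLog.
print(describe_errno(111))
```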

Thanks for your help,

Si Hammond
Univ. of Warwick