
[Condor-users] (no subject)



Hi,

For some time I have had a problem where my jobs make progress only until the first shadow exception ("ckpt server store failed"). They keep running after that but seem to make no further progress: the job log reports the same byte counts at every exception.

It looks like this problem has come up before:
  https://lists.cs.wisc.edu/archive/condor-users/2006-February/msg00280.shtml

Any guidance would be much appreciated. I've appended the job log and the
shadow log from the submitting machine below. Please let me know if you need
more information.
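
In case it helps anyone check their own logs for the same symptom, here is a rough Python sketch of how the "no progress" pattern could be spotted in a job log. It assumes the event layout seen in the job log below (a three-digit event code, the job id, a timestamp, then indented detail lines). The function names shadow_exceptions and looks_stuck are made up for illustration and are not part of Condor.

```python
import re

# Event header as it appears in the job log below, e.g.
# "007 (863.000.000) 03/11 03:54:59 Shadow exception!"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
# Indented detail line, e.g. "  10102987  -  Run Bytes Sent By Job"
BYTES_RE = re.compile(r"^\s+(\d+)\s+-\s+Run Bytes (Sent|Received) By Job$")


def shadow_exceptions(log_text):
    """Return one dict per event 007 (shadow exception) found in log_text."""
    results = []
    current = None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = None
            if header.group(1) == "007":
                current = {"time": header.group(5), "reason": None,
                           "sent": None, "received": None}
                results.append(current)
            continue
        # Skip detail lines of other events and the "..." separators.
        if current is None or line.strip() in ("", "..."):
            continue
        counts = BYTES_RE.match(line)
        if counts:
            current[counts.group(2).lower()] = int(counts.group(1))
        elif current["reason"] is None:
            current["reason"] = line.strip()
    return results


def looks_stuck(exceptions):
    """True if two or more exceptions all report identical byte counts."""
    distinct = {(e["sent"], e["received"]) for e in exceptions}
    return len(exceptions) >= 2 and len(distinct) == 1


# Two exception events copied from the job log below.
SAMPLE = """\
007 (863.000.000) 03/11 03:54:59 Shadow exception!
  ckpt server store failed
  10102987  -  Run Bytes Sent By Job
  8830852  -  Run Bytes Received By Job
...
001 (863.000.000) 03/11 03:55:05 Job executing on host: <128.83.144.121:44911>
...
007 (863.000.000) 03/11 08:30:53 Shadow exception!
  ckpt server store failed
  10102987  -  Run Bytes Sent By Job
  8830852  -  Run Bytes Received By Job
...
"""

if __name__ == "__main__":
    found = shadow_exceptions(SAMPLE)
    for exc in found:
        print(exc["time"], exc["reason"], exc["sent"], exc["received"])
    print("stuck:", looks_stuck(found))
```

Identical byte counts are only a hint that each run repeats the same work. They do not say why the checkpoint store fails.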

Thanks,
Kartik

===JOB LOG FILE
000 (863.000.000) 03/10 22:19:01 Job submitted from host: <128.83.120.83:60629>
...
001 (863.000.000) 03/10 22:19:11 Job executing on host: <128.83.144.121:44911>
...
006 (863.000.000) 03/11 01:29:13 Image size of job updated: 29757
...
007 (863.000.000) 03/11 03:54:59 Shadow exception!
  ckpt server store failed
  10102987  -  Run Bytes Sent By Job
  8830852  -  Run Bytes Received By Job
...
001 (863.000.000) 03/11 03:55:05 Job executing on host: <128.83.144.121:44911>
...
006 (863.000.000) 03/11 07:05:08 Image size of job updated: 29757
...
007 (863.000.000) 03/11 08:30:53 Shadow exception!
  ckpt server store failed
  10102987  -  Run Bytes Sent By Job
  8830852  -  Run Bytes Received By Job
...
001 (863.000.000) 03/11 08:30:59 Job executing on host: <128.83.144.121:44911>
...
006 (863.000.000) 03/11 11:41:04 Image size of job updated: 29757
...
007 (863.000.000) 03/11 13:06:49 Shadow exception!
  ckpt server store failed
  10102987  -  Run Bytes Sent By Job
  8830852  -  Run Bytes Received By Job
...
001 (863.000.000) 03/11 13:06:55 Job executing on host: <128.83.144.121:44911>
...

===SHADOW LOG OF SUBMITTING MACHINE
3/10 22:19:10 (?.?) (18955):******* Standard Shadow starting up *******
3/10 22:19:10 (?.?) (18955):** $CondorVersion: 6.8.2 Oct 12 2006 $
3/10 22:19:10 (?.?) (18955):** $CondorPlatform: I386-LINUX_RHEL3 $
3/10 22:19:10 (?.?) (18955):*******************************************
3/10 22:19:10 (?.?) (18955):uid=0, euid=9586, gid=0, egid=110
3/10 22:19:10 (?.?) (18955):Hostname = "<128.83.144.121:44911>", Job = 863.0
3/10 22:19:10 (863.0) (18955):Requesting Primary Starter
3/10 22:19:10 (863.0) (18955):Shadow: Request to run a job was ACCEPTED
3/10 22:19:10 (863.0) (18955):Shadow: RSC_SOCK connected, fd = 17
3/10 22:19:10 (863.0) (18955):Shadow: CLIENT_LOG connected, fd = 18
3/10 22:19:10 (863.0) (18955):My_Filesystem_Domain = "cs.utexas.edu"
3/10 22:19:10 (863.0) (18955):My_UID_Domain = "cs.utexas.edu"
3/10 22:19:10 (863.0) (18955):	Entering pseudo_get_file_stream
3/10 22:19:10 (863.0) (18955):	file = "/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/sim-alpha"
3/10 22:19:10 (863.0) (18955):Reaped child status - pid 18956 exited with status 0
3/10 22:19:11 (863.0) (18955):Read: User Job - $CondorPlatform: I386-LINUX_RHEL3 $
3/10 22:19:11 (863.0) (18955):Read: User Job - $CondorVersion: 6.8.2 Oct 12 2006 $
3/10 22:19:11 (863.0) (18955):Read: Checkpoint file name is "/var/condor/spool/cluster863.proc0.subproc0"
3/10 22:19:11 (863.0) (18955):error: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/10 22:19:11 (863.0) (18955):Read: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 01:29:13 (863.0) (18955):Read: Got SIGUSR2
3/11 01:29:13 (863.0) (18955):Read: Saved signal state.
3/11 01:29:13 (863.0) (18955):Read: About to save file state
3/11 01:29:13 (863.0) (18955):Read: CondorFileTable::checkpoint
3/11 01:29:13 (863.0) (18955):Read: OPEN FILE TABLE:
3/11 01:29:13 (863.0) (18955):Read: fd 0
3/11 01:29:13 (863.0) (18955):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 01:29:13 (863.0) (18955):Read: 	offset:       413
3/11 01:29:13 (863.0) (18955):Read: 	dups:         1
3/11 01:29:13 (863.0) (18955):Read: 	open flags:   0x0
3/11 01:29:13 (863.0) (18955):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 01:29:13 (863.0) (18955):Read: 	size:         413
3/11 01:29:13 (863.0) (18955):Read: 	opens:        1
3/11 01:29:13 (863.0) (18955):Read: fd 1
3/11 01:29:13 (863.0) (18955):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 01:29:13 (863.0) (18955):Read: 	offset:       9447423
3/11 01:29:13 (863.0) (18955):Read: 	dups:         1
3/11 01:29:13 (863.0) (18955):Read: 	open flags:   0x1
3/11 01:29:13 (863.0) (18955):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 01:29:13 (863.0) (18955):Read: 	size:         9447423
3/11 01:29:13 (863.0) (18955):Read: 	opens:        1
3/11 01:29:13 (863.0) (18955):Read: fd 2
3/11 01:29:13 (863.0) (18955):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 01:29:13 (863.0) (18955):Read: 	offset:       25410
3/11 01:29:13 (863.0) (18955):Read: 	dups:         1
3/11 01:29:13 (863.0) (18955):Read: 	open flags:   0x1
3/11 01:29:13 (863.0) (18955):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 01:29:13 (863.0) (18955):Read: 	size:         25410
3/11 01:29:13 (863.0) (18955):Read: 	opens:        1
3/11 01:29:13 (863.0) (18955):Read: fd 3
3/11 01:29:13 (863.0) (18955):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 01:29:13 (863.0) (18955):Read: 	offset:       8192
3/11 01:29:13 (863.0) (18955):Read: 	dups:         1
3/11 01:29:13 (863.0) (18955):Read: 	open flags:   0x0
3/11 01:29:13 (863.0) (18955):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 01:29:13 (863.0) (18955):Read: 	size:         1196608
3/11 01:29:13 (863.0) (18955):Read: 	opens:        1
3/11 01:29:13 (863.0) (18955):Read: fd 4
3/11 01:29:13 (863.0) (18955):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 01:29:13 (863.0) (18955):Read: 	offset:       0
3/11 01:29:13 (863.0) (18955):Read: 	dups:         1
3/11 01:29:13 (863.0) (18955):Read: 	open flags:   0x2
3/11 01:29:13 (863.0) (18955):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 01:29:13 (863.0) (18955):Read: 	size:         0
3/11 01:29:13 (863.0) (18955):Read: 	opens:        1
3/11 01:29:13 (863.0) (18955):Read: working dir = /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0
3/11 01:29:13 (863.0) (18955):Read: Done saving file state
3/11 01:29:13 (863.0) (18955):Read: About to update MyImage
3/11 01:29:13 (863.0) (18955):Read: Size of ckpt image = 20231167 bytes
3/11 01:29:13 (863.0) (18955):Read: About to write checkpoint
3/11 01:29:13 (863.0) (18955):Read: Image::Write(): fd -1 file_name /var/condor/spool/cluster863.proc0.subproc0
3/11 01:29:13 (863.0) (18955):Read: Checkpoint name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 01:29:13 (863.0) (18955):Read: Tmp name is "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 01:29:13 (863.0) (18955):	Entering pseudo_put_file_stream
3/11 01:29:13 (863.0) (18955):	file = "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 01:29:13 (863.0) (18955):	len = 20231167
3/11 01:29:13 (863.0) (18955):	owner = akkartik
3/11 01:29:16 (863.0) (18955):store request to ckpt server failed, trying again in 5 seconds
3/11 01:29:24 (863.0) (18955):store request to ckpt server failed, trying again in 10 seconds
3/11 01:29:37 (863.0) (18955):store request to ckpt server failed, trying again in 20 seconds
3/11 01:30:00 (863.0) (18955):store request to ckpt server failed, trying again in 40 seconds
3/11 01:30:43 (863.0) (18955):store request to ckpt server failed, trying again in 80 seconds
3/11 01:32:06 (863.0) (18955):store request to ckpt server failed, trying again in 160 seconds
3/11 01:34:49 (863.0) (18955):store request to ckpt server failed, trying again in 320 seconds
3/11 01:40:12 (863.0) (18955):store request to ckpt server failed, trying again in 640 seconds
3/11 01:50:55 (863.0) (18955):store request to ckpt server failed, trying again in 1280 seconds
3/11 03:12:19 (863.0) (18955):store request to ckpt server failed, trying again in 2560 seconds
3/11 03:54:59 (863.0) (18955):ERROR "ckpt server store failed" at line 959 in file pseudo_ops.C
3/11 03:54:59 (863.0) (18955):Shadow: DoCleanup: unlinking TmpCkpt '/var/condor/spool/cluster863.proc0.subproc0.tmp'
3/11 03:54:59 (863.0) (18955):Trying to unlink /var/condor/spool/cluster863.proc0.subproc0.tmp
3/11 03:55:04 (?.?) (9314):Hostname = "<128.83.144.121:44911>", Job = 863.0
3/11 03:55:04 (863.0) (9314):Requesting Primary Starter
3/11 03:55:04 (863.0) (9314):Shadow: Request to run a job was ACCEPTED
3/11 03:55:04 (863.0) (9314):Shadow: RSC_SOCK connected, fd = 17
3/11 03:55:04 (863.0) (9314):Shadow: CLIENT_LOG connected, fd = 18
3/11 03:55:04 (863.0) (9314):My_Filesystem_Domain = "cs.utexas.edu"
3/11 03:55:04 (863.0) (9314):My_UID_Domain = "cs.utexas.edu"
3/11 03:55:04 (863.0) (9314):	Entering pseudo_get_file_stream
3/11 03:55:04 (863.0) (9314):	file = "/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/sim-alpha"
3/11 03:55:05 (863.0) (9314):Reaped child status - pid 9315 exited with status 0
3/11 03:55:05 (863.0) (9314):Read: User Job - $CondorPlatform: I386-LINUX_RHEL3 $
3/11 03:55:05 (863.0) (9314):Read: User Job - $CondorVersion: 6.8.2 Oct 12 2006 $
3/11 03:55:05 (863.0) (9314):Read: Checkpoint file name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 03:55:05 (863.0) (9314):error: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 03:55:05 (863.0) (9314):Read: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 07:05:07 (863.0) (9314):Read: Got SIGUSR2
3/11 07:05:07 (863.0) (9314):Read: Saved signal state.
3/11 07:05:07 (863.0) (9314):Read: About to save file state
3/11 07:05:07 (863.0) (9314):Read: CondorFileTable::checkpoint
3/11 07:05:07 (863.0) (9314):Read: OPEN FILE TABLE:
3/11 07:05:07 (863.0) (9314):Read: fd 0
3/11 07:05:07 (863.0) (9314):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 07:05:07 (863.0) (9314):Read: 	offset:       413
3/11 07:05:07 (863.0) (9314):Read: 	dups:         1
3/11 07:05:07 (863.0) (9314):Read: 	open flags:   0x0
3/11 07:05:07 (863.0) (9314):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 07:05:07 (863.0) (9314):Read: 	size:         413
3/11 07:05:07 (863.0) (9314):Read: 	opens:        1
3/11 07:05:07 (863.0) (9314):Read: fd 1
3/11 07:05:07 (863.0) (9314):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 07:05:07 (863.0) (9314):Read: 	offset:       9447423
3/11 07:05:07 (863.0) (9314):Read: 	dups:         1
3/11 07:05:07 (863.0) (9314):Read: 	open flags:   0x1
3/11 07:05:07 (863.0) (9314):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 07:05:07 (863.0) (9314):Read: 	size:         9447423
3/11 07:05:07 (863.0) (9314):Read: 	opens:        1
3/11 07:05:07 (863.0) (9314):Read: fd 2
3/11 07:05:07 (863.0) (9314):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 07:05:07 (863.0) (9314):Read: 	offset:       25410
3/11 07:05:07 (863.0) (9314):Read: 	dups:         1
3/11 07:05:07 (863.0) (9314):Read: 	open flags:   0x1
3/11 07:05:07 (863.0) (9314):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 07:05:07 (863.0) (9314):Read: 	size:         25410
3/11 07:05:07 (863.0) (9314):Read: 	opens:        1
3/11 07:05:07 (863.0) (9314):Read: fd 3
3/11 07:05:07 (863.0) (9314):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 07:05:07 (863.0) (9314):Read: 	offset:       8192
3/11 07:05:07 (863.0) (9314):Read: 	dups:         1
3/11 07:05:07 (863.0) (9314):Read: 	open flags:   0x0
3/11 07:05:07 (863.0) (9314):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 07:05:07 (863.0) (9314):Read: 	size:         1196608
3/11 07:05:07 (863.0) (9314):Read: 	opens:        1
3/11 07:05:07 (863.0) (9314):Read: fd 4
3/11 07:05:07 (863.0) (9314):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 07:05:07 (863.0) (9314):Read: 	offset:       0
3/11 07:05:07 (863.0) (9314):Read: 	dups:         1
3/11 07:05:07 (863.0) (9314):Read: 	open flags:   0x2
3/11 07:05:07 (863.0) (9314):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 07:05:07 (863.0) (9314):Read: 	size:         0
3/11 07:05:07 (863.0) (9314):Read: 	opens:        1
3/11 07:05:07 (863.0) (9314):Read: working dir = /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0
3/11 07:05:08 (863.0) (9314):Read: Done saving file state
3/11 07:05:08 (863.0) (9314):Read: About to update MyImage
3/11 07:05:08 (863.0) (9314):Read: Size of ckpt image = 20231167 bytes
3/11 07:05:08 (863.0) (9314):Read: About to write checkpoint
3/11 07:05:08 (863.0) (9314):Read: Image::Write(): fd -1 file_name /var/condor/spool/cluster863.proc0.subproc0
3/11 07:05:08 (863.0) (9314):Read: Checkpoint name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 07:05:08 (863.0) (9314):Read: Tmp name is "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 07:05:08 (863.0) (9314):	Entering pseudo_put_file_stream
3/11 07:05:08 (863.0) (9314):	file = "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 07:05:08 (863.0) (9314):	len = 20231167
3/11 07:05:08 (863.0) (9314):	owner = akkartik
3/11 07:05:11 (863.0) (9314):store request to ckpt server failed, trying again in 5 seconds
3/11 07:05:19 (863.0) (9314):store request to ckpt server failed, trying again in 10 seconds
3/11 07:05:32 (863.0) (9314):store request to ckpt server failed, trying again in 20 seconds
3/11 07:05:55 (863.0) (9314):store request to ckpt server failed, trying again in 40 seconds
3/11 07:06:38 (863.0) (9314):store request to ckpt server failed, trying again in 80 seconds
3/11 07:08:01 (863.0) (9314):store request to ckpt server failed, trying again in 160 seconds
3/11 07:10:44 (863.0) (9314):store request to ckpt server failed, trying again in 320 seconds
3/11 07:16:07 (863.0) (9314):store request to ckpt server failed, trying again in 640 seconds
3/11 07:26:50 (863.0) (9314):store request to ckpt server failed, trying again in 1280 seconds
3/11 07:48:13 (863.0) (9314):store request to ckpt server failed, trying again in 2560 seconds
3/11 08:30:53 (863.0) (9314):ERROR "ckpt server store failed" at line 959 in file pseudo_ops.C
3/11 08:30:53 (863.0) (9314):Shadow: DoCleanup: unlinking TmpCkpt '/var/condor/spool/cluster863.proc0.subproc0.tmp'
3/11 08:30:53 (863.0) (9314):Trying to unlink /var/condor/spool/cluster863.proc0.subproc0.tmp
3/11 08:30:58 (?.?) (19268):Hostname = "<128.83.144.121:44911>", Job = 863.0
3/11 08:30:58 (863.0) (19268):Requesting Primary Starter
3/11 08:30:58 (863.0) (19268):Shadow: Request to run a job was ACCEPTED
3/11 08:30:58 (863.0) (19268):Shadow: RSC_SOCK connected, fd = 17
3/11 08:30:58 (863.0) (19268):Shadow: CLIENT_LOG connected, fd = 18
3/11 08:30:58 (863.0) (19268):My_Filesystem_Domain = "cs.utexas.edu"
3/11 08:30:58 (863.0) (19268):My_UID_Domain = "cs.utexas.edu"
3/11 08:30:58 (863.0) (19268):	Entering pseudo_get_file_stream
3/11 08:30:58 (863.0) (19268):	file = "/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/sim-alpha"
3/11 08:30:59 (863.0) (19268):Reaped child status - pid 19269 exited with status 0
3/11 08:30:59 (863.0) (19268):Read: User Job - $CondorPlatform: I386-LINUX_RHEL3 $
3/11 08:30:59 (863.0) (19268):Read: User Job - $CondorVersion: 6.8.2 Oct 12 2006 $
3/11 08:30:59 (863.0) (19268):Read: Checkpoint file name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 08:31:00 (863.0) (19268):error: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 08:31:00 (863.0) (19268):Read: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 11:41:03 (863.0) (19268):Read: Got SIGUSR2
3/11 11:41:04 (863.0) (19268):Read: Saved signal state.
3/11 11:41:04 (863.0) (19268):Read: About to save file state
3/11 11:41:04 (863.0) (19268):Read: CondorFileTable::checkpoint
3/11 11:41:04 (863.0) (19268):Read: OPEN FILE TABLE:
3/11 11:41:04 (863.0) (19268):Read: fd 0
3/11 11:41:04 (863.0) (19268):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 11:41:04 (863.0) (19268):Read: 	offset:       413
3/11 11:41:04 (863.0) (19268):Read: 	dups:         1
3/11 11:41:04 (863.0) (19268):Read: 	open flags:   0x0
3/11 11:41:04 (863.0) (19268):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty.in
3/11 11:41:04 (863.0) (19268):Read: 	size:         413
3/11 11:41:04 (863.0) (19268):Read: 	opens:        1
3/11 11:41:04 (863.0) (19268):Read: fd 1
3/11 11:41:04 (863.0) (19268):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 11:41:04 (863.0) (19268):Read: 	offset:       9447423
3/11 11:41:04 (863.0) (19268):Read: 	dups:         1
3/11 11:41:04 (863.0) (19268):Read: 	open flags:   0x1
3/11 11:41:04 (863.0) (19268):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/out
3/11 11:41:04 (863.0) (19268):Read: 	size:         9447423
3/11 11:41:04 (863.0) (19268):Read: 	opens:        1
3/11 11:41:04 (863.0) (19268):Read: fd 2
3/11 11:41:04 (863.0) (19268):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 11:41:04 (863.0) (19268):Read: 	offset:       25410
3/11 11:41:04 (863.0) (19268):Read: 	dups:         1
3/11 11:41:04 (863.0) (19268):Read: 	open flags:   0x1
3/11 11:41:04 (863.0) (19268):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/err
3/11 11:41:04 (863.0) (19268):Read: 	size:         25410
3/11 11:41:04 (863.0) (19268):Read: 	opens:        1
3/11 11:41:04 (863.0) (19268):Read: fd 3
3/11 11:41:04 (863.0) (19268):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 11:41:04 (863.0) (19268):Read: 	offset:       8192
3/11 11:41:04 (863.0) (19268):Read: 	dups:         1
3/11 11:41:04 (863.0) (19268):Read: 	open flags:   0x0
3/11 11:41:04 (863.0) (19268):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/crafty
3/11 11:41:04 (863.0) (19268):Read: 	size:         1196608
3/11 11:41:04 (863.0) (19268):Read: 	opens:        1
3/11 11:41:04 (863.0) (19268):Read: fd 4
3/11 11:41:04 (863.0) (19268):Read: 	logical name: /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 11:41:04 (863.0) (19268):Read: 	offset:       0
3/11 11:41:04 (863.0) (19268):Read: 	dups:         1
3/11 11:41:04 (863.0) (19268):Read: 	open flags:   0x2
3/11 11:41:04 (863.0) (19268):Read: 	url:          buffer:remote:/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/game.001
3/11 11:41:04 (863.0) (19268):Read: 	size:         0
3/11 11:41:04 (863.0) (19268):Read: 	opens:        1
3/11 11:41:04 (863.0) (19268):Read: working dir = /v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0
3/11 11:41:04 (863.0) (19268):Read: Done saving file state
3/11 11:41:04 (863.0) (19268):Read: About to update MyImage
3/11 11:41:04 (863.0) (19268):Read: Size of ckpt image = 20231167 bytes
3/11 11:41:04 (863.0) (19268):Read: About to write checkpoint
3/11 11:41:04 (863.0) (19268):Read: Image::Write(): fd -1 file_name /var/condor/spool/cluster863.proc0.subproc0
3/11 11:41:04 (863.0) (19268):Read: Checkpoint name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 11:41:04 (863.0) (19268):Read: Tmp name is "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 11:41:04 (863.0) (19268):	Entering pseudo_put_file_stream
3/11 11:41:04 (863.0) (19268):	file = "/var/condor/spool/cluster863.proc0.subproc0.tmp"
3/11 11:41:04 (863.0) (19268):	len = 20231167
3/11 11:41:04 (863.0) (19268):	owner = akkartik
3/11 11:41:07 (863.0) (19268):store request to ckpt server failed, trying again in 5 seconds
3/11 11:41:15 (863.0) (19268):store request to ckpt server failed, trying again in 10 seconds
3/11 11:41:28 (863.0) (19268):store request to ckpt server failed, trying again in 20 seconds
3/11 11:41:51 (863.0) (19268):store request to ckpt server failed, trying again in 40 seconds
3/11 11:42:34 (863.0) (19268):store request to ckpt server failed, trying again in 80 seconds
3/11 11:43:57 (863.0) (19268):store request to ckpt server failed, trying again in 160 seconds
3/11 11:46:40 (863.0) (19268):store request to ckpt server failed, trying again in 320 seconds
3/11 11:52:03 (863.0) (19268):store request to ckpt server failed, trying again in 640 seconds
3/11 12:02:46 (863.0) (19268):store request to ckpt server failed, trying again in 1280 seconds
3/11 12:24:09 (863.0) (19268):store request to ckpt server failed, trying again in 2560 seconds
3/11 13:06:49 (863.0) (19268):ERROR "ckpt server store failed" at line 959 in file pseudo_ops.C
3/11 13:06:49 (863.0) (19268):Shadow: DoCleanup: unlinking TmpCkpt '/var/condor/spool/cluster863.proc0.subproc0.tmp'
3/11 13:06:49 (863.0) (19268):Trying to unlink /var/condor/spool/cluster863.proc0.subproc0.tmp
3/11 13:06:54 (?.?) (28767):Hostname = "<128.83.144.121:44911>", Job = 863.0
3/11 13:06:54 (863.0) (28767):Requesting Primary Starter
3/11 13:06:54 (863.0) (28767):Shadow: Request to run a job was ACCEPTED
3/11 13:06:54 (863.0) (28767):Shadow: RSC_SOCK connected, fd = 17
3/11 13:06:54 (863.0) (28767):Shadow: CLIENT_LOG connected, fd = 18
3/11 13:06:54 (863.0) (28767):My_Filesystem_Domain = "cs.utexas.edu"
3/11 13:06:54 (863.0) (28767):My_UID_Domain = "cs.utexas.edu"
3/11 13:06:54 (863.0) (28767):	Entering pseudo_get_file_stream
3/11 13:06:54 (863.0) (28767):	file = "/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/sim-alpha"
3/11 13:06:55 (863.0) (28767):Reaped child status - pid 28768 exited with status 0
3/11 13:06:55 (863.0) (28767):Read: User Job - $CondorPlatform: I386-LINUX_RHEL3 $
3/11 13:06:55 (863.0) (28767):Read: User Job - $CondorVersion: 6.8.2 Oct 12 2006 $
3/11 13:06:55 (863.0) (28767):Read: Checkpoint file name is "/var/condor/spool/cluster863.proc0.subproc0"
3/11 13:06:56 (863.0) (28767):error: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
3/11 13:06:56 (863.0) (28767):Read: Warning: READWRITE: File '/v/filer3/v0q059/apps/projects/benchMemView/ismm/186.crafty/ph/0/SS_APP_FILE' used for both reading and writing.  This is not checkpoint-safe.
=eof