
RE: [condor-users] Output files not being returned upon preemption (transfer_files = ALWAYS)



Mark,

Thanks for the suggestion.  I set the debug level to D_ALL for several logfiles. The node does appear to attempt to transfer the output files, and they seem to reach the Central Manager okay, but they never appear on its hard drive.

From StarterLog on the node:
2/25 07:44:53 (fd:3) SECMAN: not negotiating, just sending command (61001)
2/25 07:44:53 (fd:3) FileTransfer::UploadFiles: sent TransKey=1#403c98468965f77
2/25 07:44:53 (fd:3) entering FileTransfer::Upload
2/25 07:44:53 (fd:3) entering FileTransfer::DoUpload
2/25 07:44:53 (fd:3) PRIV_USER --> PRIV_USER at ..\src\condor_c++_util\file_transfer.C:1093
2/25 07:44:53 (fd:3) DoUpload: send file sleeper.out
2/25 07:44:53 (fd:4) ReliSock: put_file: sent 562 bytes
2/25 07:44:53 (fd:3) DoUpload: send file sleeper.err
2/25 07:44:53 (fd:4) ReliSock: put_file: sent 0 bytes
2/25 07:44:53 (fd:3) DoUpload: send file begin.flg
2/25 07:44:53 (fd:4) ReliSock: put_file: sent 0 bytes
2/25 07:44:53 (fd:3) DoUpload: send file dir.txt
2/25 07:44:53 (fd:4) ReliSock: put_file: sent 610 bytes
2/25 07:44:53 (fd:3) DoUpload: send file end.flg
2/25 07:44:53 (fd:4) ReliSock: put_file: sent 0 bytes
2/25 07:44:53 (fd:3) DoUpload: exiting at 1179
2/25 07:44:53 (fd:3) PRIV_USER --> PRIV_USER at ..\src\condor_c++_util\file_transfer.C:1183
2/25 07:44:53 (fd:3) CLOSE <192.168.33.130:3594> fd=1840
2/25 07:44:53 (fd:3) PRIV_USER --> PRIV_CONDOR at ..\src\condor_starter.V6.1\starter_class.C:938
2/25 07:44:53 (fd:3) Inside OsProc::JobExit()

Note that those filenames and file sizes are correct.  They correspond to the file sizes given about 8 lines into ShadowLog on the Central Manager (pasted below), but the files do not appear on the Central Manager's hard drive.  I don't see any errors on the Central Manager.

From ShadowLog on the Central Manager (note that the byte counts written below match the ones sent by the node):
2/25 07:44:53 (fd:5) (1084.0) (3540): DaemonCore: Command received via TCP from host <192.168.33.130:3594>
2/25 07:44:53 (fd:5) (1084.0) (3540): DaemonCore: received command 61001 (FILETRANS_DOWNLOAD), calling handler (FileTransfer::HandleCommands())
2/25 07:44:53 (fd:5) (1084.0) (3540): entering FileTransfer::HandleCommands
2/25 07:44:53 (fd:5) (1084.0) (3540): FileTransfer::HandleCommands read transkey=1#403c98468965f77
2/25 07:44:53 (fd:5) (1084.0) (3540): entering FileTransfer::Download
2/25 07:44:53 (fd:5) (1084.0) (3540): entering FileTransfer::DoDownload sync=0
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:917
2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 562 bytes
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:917
2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 0 bytes
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:917
2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 0 bytes
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:917
2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 610 bytes
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:917
2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 0 bytes
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:994
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:551
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:320
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:458
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:471
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:549
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:361
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:447
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:479
2/25 07:44:53 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:551
2/25 07:44:54 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\directory.C:330
2/25 07:44:54 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:1015
2/25 07:44:54 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_c++_util\file_transfer.C:978
2/25 07:44:54 (fd:5) (1084.0) (3540): CLOSE <192.168.33.4:2248> fd=1880
2/25 07:44:54 (fd:5) (1084.0) (3540): PRIV_CONDOR --> PRIV_CONDOR at ..\src\condor_daemon_core.V6\daemon_core.C:2134
2/25 07:44:54 (fd:5) (1084.0) (3540): In DaemonCore Timeout()
2/25 07:44:54 (fd:5) (1084.0) (3540): 
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> Timers
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> ~~~~~~
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> id = 4, when = 1077713112, period = 20, handler_descrip=<checkPeriodic>
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> id = 0, when = 1077713255, period = 0, handler_descrip=<check_session_cache>
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> id = 5, when = 1077713872, period = 900, handler_descrip=<periodicUpdateQ>
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> id = 2, when = 1077714131, period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore--> id = 1, when = 1077741756, period = 0, handler_descrip=<DaemonCore::ReInit()>
2/25 07:44:54 (fd:5) (1084.0) (3540): 
2/25 07:44:54 (fd:5) (1084.0) (3540): DaemonCore Timeout() Complete, returning 18 
2/25 07:44:54 (fd:5) (1084.0) (3540): Calling Handler <HandleSyscalls> for Socket <RSC Socket>
2/25 07:44:54 (fd:5) (1084.0) (3540): perm::init() starting up for account (dvestal) domain (NOREGON)
2/25 07:44:54 (fd:5) (1084.0) (3540): perm::init: Found Account Name dvestal
2/25 07:44:54 (fd:5) (1084.0) (3540): About to decode condor_sysnum
2/25 07:44:54 (fd:5) (1084.0) (3540): Got request for syscall -65
2/25 07:44:54 (fd:5) (1084.0) (3540): in pseudo_job_exit: status=0,reason=107
2/25 07:44:54 (fd:5) (1084.0) (3540): Inside RemoteResource::updateFromStarter()
2/25 07:44:54 (fd:5) (1084.0) (3540): Update ad:
MyType = "(unknown type)"
TargetType = "(unknown type)"
RemoteSysCpu = 0
RemoteUserCpu = 0
ImageSize = 1788
JobState = "Exited"
NumPids = 0
ExitBySignal = FALSE
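
For anyone who wants to repeat the comparison above without reading the logs by eye, here is a minimal sketch in Python. It assumes the D_ALL line formats shown in the two excerpts ("DoUpload: send file ...", "put_file: sent N bytes", "wrote N bytes") and that the shadow writes the files in the same order the starter sends them. The function names are made up for illustration and are not part of Condor.

```python
import re

# Patterns inferred from the log excerpts above; other Condor versions
# or debug levels may format these lines differently (assumption).
SEND_FILE = re.compile(r"DoUpload: send file (\S+)")
SENT_BYTES = re.compile(r"put_file: sent (\d+) bytes")
WROTE_BYTES = re.compile(r"wrote (\d+) bytes")


def starter_uploads(lines):
    """Pair each 'send file NAME' line with the next 'sent N bytes' line."""
    uploads = []
    pending = None
    for line in lines:
        match = SEND_FILE.search(line)
        if match:
            pending = match.group(1)
            continue
        match = SENT_BYTES.search(line)
        if match and pending is not None:
            uploads.append((pending, int(match.group(1))))
            pending = None
    return uploads


def shadow_writes(lines):
    """Collect the byte counts from the shadow's 'wrote N bytes' lines, in order."""
    writes = []
    for line in lines:
        match = WROTE_BYTES.search(line)
        if match:
            writes.append(int(match.group(1)))
    return writes


def compare_transfers(starter_lines, shadow_lines):
    """Return (filename, bytes_sent, bytes_wrote_or_None, sizes_match) per file.

    Files are matched by position only, since the shadow excerpt does not
    name the file next to each 'wrote' line.
    """
    writes = shadow_writes(shadow_lines)
    report = []
    for index, (name, sent) in enumerate(starter_uploads(starter_lines)):
        wrote = writes[index] if index < len(writes) else None
        report.append((name, sent, wrote, sent == wrote))
    return report


if __name__ == "__main__":
    starter = [
        "2/25 07:44:53 (fd:3) DoUpload: send file sleeper.out",
        "2/25 07:44:53 (fd:4) ReliSock: put_file: sent 562 bytes",
        "2/25 07:44:53 (fd:3) DoUpload: send file dir.txt",
        "2/25 07:44:53 (fd:4) ReliSock: put_file: sent 610 bytes",
    ]
    shadow = [
        "2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 562 bytes",
        "2/25 07:44:53 (fd:6) (1084.0) (3540): wrote 610 bytes",
    ]
    for row in compare_transfers(starter, shadow):
        print(row)
```

A matching report only shows that the shadow received the right number of bytes. It says nothing about where the files were written or whether they were removed afterwards, which is the open question here.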

-----Original Message-----
From: Mark Silberstein [mailto:marks@xxxxxxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, February 24, 2004 4:12 PM
To: condor-users@xxxxxxxxxxx
Subject: [SPAM] - RE: [SPAM] - RE: [condor-users] Output files not being
returned upon preemption (transfer_files = ALWAYS) - Email found in
subject - Email found in subject


As a last resort, I'd suggest setting D_ALL for the starter and shadow
debug logs to see whether the former at least attempts to send something.

On Tue, 2004-02-24 at 22:01, David Vestal wrote:
> The creation time looks valid.  I took your suggestion, changing my batch file to:
> type nul > begin.flg
> dir
> dir > dir.txt
> sleeper 120
> type nul > end.flg
> 
> When I vacated the job, none of the three files created (begin.flg, dir.txt, end.flg) were returned, and the condor output file was empty.
> 
> When I resubmitted the job and let it run to completion, all files were created and returned normally.
> 
> The errors I originally found in the StartLog on the node also appeared upon normal completion, so I suppose those aren't relevant to the problem.  In fact, the only difference I see between the logfiles of the successful and unsuccessful runs is that in the successful run, CONDOR_SHADOW returned 100 instead of 107.
> 
> -David
> 
> -----Original Message-----
> From: Mark Silberstein [mailto:marks@xxxxxxxxxxxxxxxxxxxxxxx]
> Sent: Tuesday, February 24, 2004 2:21 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [SPAM] - RE: [condor-users] Output files not being returned
> upon preemption (transfer_files = ALWAYS) - Email found in subject
> 
> 
> Using transfer_output_files shouldn't help, since it refers to the
> transfer after successful execution.
> In any case, using transfer_output_files is _HIGHLY_ discouraged: if
> Condor fails to locate the specified files after a successful
> execution, it will mistakenly conclude that something went wrong while
> transferring them back, count that as its own problem, and try to run
> the job again.
> As for Alexander Klyubin's suggestion to set when_to_transfer_output:
> it is not supported on 6.4.7.
> Finally, I would suggest adding a "dir" command to your batch file
> and checking that the creation timestamp of your file is correct and is
> AFTER Condor starts the job (look at the creation time of your
> execution dir).
> I have run into a problem on Windows where this timestamp is not updated
> when you make a copy of an old file.
> Just a thought.
> 
> On Tue, 2004-02-24 at 19:59, Kewley, J (John) wrote:
> > > transfer_input_files = sleeper.exe 
> > 
> > would setting transfer_output_files help in this case, or is that 
> > just for restricting the files that are passed back?
> > 
> > JK
> > Condor Support Information:
> > http://www.cs.wisc.edu/condor/condor-support/
> > To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> > unsubscribe condor-users <your_email_address>
> > 