RE: [condor-users] Output files not being returned upon preemption -- new info



To all,

Many thanks for all the replies so far.  I've dug deep into the source code, and I have a much better idea now why my files are not being returned from the grid nodes.

To quickly recap:  my Vanilla jobs, running on Condor 6.4.7, create output files, which are not returned to the Central Manager from the grid node if the job is preempted or vacated, even though the submit file specifies "transfer_files = ALWAYS."

Both the Central Manager and grid node are running WinXP.

From what I can reconstruct from the logfiles and source code, here's the basic sequence of events, with the ones I believe are unimportant removed.  All these logfile entries are from StarterLog on the grid node.

1) I submit a Vanilla job to a grid node from the Central Manager
2) I verify that it has created an empty file signifying that execution has begun.
3) I kill it with "condor_vacate -graceful [grid node name]"
4) The grid node's StarterLog notes: "DaemonCore received UNAUTHENTICATED command 60000."  Command 60000 is "DC_RAISESIGNAL."  It's a SIGTERM.
5) The grid node's StarterLog notes (after a few more messages): "Got SIGTERM. Performing graceful shutdown."
6) The grid node's StarterLog notes the beginning of DaemonCore::Shutdown_Graceful.
6a) StarterLog notes "Skipping Winsta0."
6b) Several calls of DCFindWinSta occur.
6c) StarterLog notes "Error: Failed to open desktop on winsta SAWinSta."
6d) Several more calls of DCFindWinSta occur.
6e) 6a-6d repeat once.
7) StarterLog notes: "Shutdown_Graceful: Failed cuz no hWnd"
8) The node appears to enter a waiting state.  The ProcFamily::takesnapshot() timer initiated by VanillaProc::StartJob continues to fire every 15 seconds.
9) 90 seconds after step 7, StarterLog again notes: "DaemonCore received UNAUTHENTICATED command 60000."

Note:  For debugging purposes, I have changed MaxSuspendTime and MaxVacateTime to 90.

10) StarterLog notes:  "Got SIGQUIT.  Performing fast shutdown."
11) handle_dc_sigquit calls main_shutdown_fast, which calls Starter->ShutdownFast(0), which sets the transfer_at_vacate flag to false.  This means that later, CStarter::Reaper will not attempt to upload files with FileTransfer::UploadFiles.
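
To make the control flow in steps 5-11 easier to follow, here is a toy Python model of it.  This is not Condor code: the class, its methods, and the has_window_handle flag are names made up for illustration, and the assumption that a successful graceful shutdown leaves transfer_at_vacate set is my reading of the source, not something I have verified.

```python
from dataclasses import dataclass, field


@dataclass
class StarterModel:
    """Toy model of the starter's shutdown path (illustrative, not Condor code)."""
    has_window_handle: bool           # hypothetical: can a HWND for the job be found?
    transfer_at_vacate: bool = True   # stands in for "transfer_files = ALWAYS"
    job_running: bool = True
    log: list = field(default_factory=list)

    def shutdown_graceful(self) -> bool:
        # Steps 5-7: SIGTERM arrives; the graceful path needs a window handle.
        self.log.append("Got SIGTERM. Performing graceful shutdown.")
        if not self.has_window_handle:
            self.log.append("Shutdown_Graceful: Failed cuz no hWnd")
            return False
        self.job_running = False
        return True

    def shutdown_fast(self) -> None:
        # Steps 10-11: SIGQUIT arrives; the fast path clears the transfer flag.
        self.log.append("Got SIGQUIT.  Performing fast shutdown.")
        self.transfer_at_vacate = False
        self.job_running = False

    def reaper(self) -> bool:
        # Returns True if output files would be uploaded after the job exits.
        return self.transfer_at_vacate


def vacate(starter: StarterModel) -> bool:
    """Model 'condor_vacate -graceful': try graceful first, then escalate."""
    starter.shutdown_graceful()
    if starter.job_running:
        # Steps 8-9: the job is still alive when the vacate timeout expires.
        starter.shutdown_fast()
    return starter.reaper()


if __name__ == "__main__":
    print(vacate(StarterModel(has_window_handle=False)))
    print(vacate(StarterModel(has_window_handle=True)))
```

In this model, files come back only when the graceful path succeeds; when no window handle is found, the escalation to the fast path is what clears the flag.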

So, at this point it appears that because DaemonCore::Shutdown_Graceful can't find a HWND associated with the job, the job is shut down fast instead of gracefully, which prevents the return of files.  I don't think I can make any further progress on this issue alone.  Does anyone know how I can fix this problem?

Thanks,
David
