
RE: [condor-users] Output files not being returned upon preemption -- new info



From your item 6c:

6c) StarterLog notes "Error: Failed to open desktop on winsta SAWinSta."

It seems like the 'graceful shutdown' is trying to interact with the
desktop by opening a window.  Somewhere in the manual, I've read that
the Condor-created user that runs your job does *not* interact with the
desktop.  Sounds like an unresolved contradiction in the Condor
software.

Edward Diamond
Engineer, Water Resources
DWR Bay-Delta Office
(916) 653-4603
ediamond@xxxxxxxxxxxx


-----Original Message-----
From: David Vestal [mailto:dvestal@xxxxxxxxxxx] 
Sent: Sunday, February 29, 2004 2:30 PM
To: condor-users@xxxxxxxxxxx
Subject: RE: [condor-users] Output files not being returned upon
preemption -- new info

To all,

Many thanks for all the replies so far.  I've dug deep into the source
code, and I have a much better idea now why my files are not being
returned from the grid nodes.

To quickly recap:  my Vanilla jobs, running on Condor 6.4.7, create
output files, which are not returned to the Central Manager from the
grid node if the job is preempted or vacated, even though the submit
file specifies "transfer_files = ALWAYS".

Both the Central Manager and grid node are running WinXP.

From what I can reconstruct from the logfiles and source code, here's
the basic sequence of events, with unimportant ones (I think) removed.
All these logfile entries are from StarterLog on the grid node.

1) I submit a Vanilla job to a grid node from the Central Manager
2) I verify that it has created an empty file signifying that execution
has begun.
3) I kill it with "condor_vacate -graceful [grid node name]"
4) The grid node's StarterLog notes: "DaemonCore received
UNAUTHENTICATED command 60000."  Command 60000 is "DC_RAISESIGNAL."
It's a SIGTERM.
5) The grid node's StarterLog notes (after a few more messages): "Got
SIGTERM. Performing graceful shutdown."
6) The grid node's StarterLog notes the beginning of
DaemonCore::Shutdown_Graceful.
6a) StarterLog notes "Skipping Winsta0."
6b) Several calls of DCFindWinSta occur.
6c) StarterLog notes "Error: Failed to open desktop on winsta SAWinSta."
6d) Several more calls of DCFindWinSta occur.
6e) 6a-6d repeat once.
7) StarterLog notes: "Shutdown_Graceful: Failed cuz no hWnd"
8) The node appears to enter a waiting state.  The
ProcFamily::takesnapshot() timer initiated by VanillaProc::StartJob
continues to fire every 15 seconds.
9) 90 seconds after #7, StarterLog again notes: "DaemonCore received
UNAUTHENTICATED command 60000."

Note:  For debugging purposes, I have changed MaxSuspendTime and
MaxVacateTime to 90.

10) StarterLog notes:  "Got SIGQUIT.  Performing fast shutdown."
11) handle_dc_sigquit calls main_shutdown_fast, which calls
Starter->ShutdownFast(0), which sets the transfer_at_vacate flag to
false.  This means that later, CStarter::Reaper will not attempt to
upload files with FileTransfer::UploadFiles.
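
The sequence in steps 4-11 can be condensed into a toy model.  This is
only a sketch of the control flow as described above, not Condor source:
the Starter class, its fields, and vacate() are hypothetical names, and
the assumption that a successful graceful shutdown leaves the transfer
flag set (so the reaper uploads) is inferred from step 11, not verified.

```python
from dataclasses import dataclass


@dataclass
class Starter:
    """Toy stand-in for the starter's state (hypothetical, not CStarter)."""
    has_hwnd: bool                   # could a window be found for the job?
    transfer_at_vacate: bool = True  # flag described in step 11
    job_running: bool = True
    uploaded: bool = False

    def shutdown_graceful(self) -> bool:
        # Step 7: with no hWnd the graceful request cannot be delivered.
        if not self.has_hwnd:
            return False
        self.job_running = False
        return True

    def shutdown_fast(self) -> None:
        # Step 11: fast shutdown clears the flag, then the job is killed.
        self.transfer_at_vacate = False
        self.job_running = False

    def reaper(self) -> None:
        # Step 11: the reaper uploads only if the flag is still set.
        if not self.job_running and self.transfer_at_vacate:
            self.uploaded = True


def vacate(has_hwnd: bool) -> bool:
    """Return True if output files would be uploaded after a vacate."""
    starter = Starter(has_hwnd=has_hwnd)
    if not starter.shutdown_graceful():
        # Steps 8-10: the job keeps running until the vacate timeout
        # expires, then SIGQUIT triggers the fast path.
        starter.shutdown_fast()
    starter.reaper()
    return starter.uploaded
```

In this model vacate(False) reproduces the reported behaviour: the
graceful attempt fails, the fast path clears the flag, and nothing is
uploaded.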

So, at this point it appears that because DaemonCore::Shutdown_Graceful
can't find a HWND associated with the job, the job is shut down fast
instead of gracefully, which prevents the return of files.  I don't
think I can progress any further with this issue alone.  Does anyone
know how I can fix this problem?
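
For anyone who wants to check whether their own nodes hit the same path,
here is a minimal sketch that scans StarterLog lines for the messages
quoted above.  The function names and the MARKERS table are my own
invention; only the message texts come from the log excerpts in this
post, and I match on short substrings because the exact spacing and
line prefixes in a real StarterLog may differ.

```python
# Substrings taken from the StarterLog messages quoted in this post.
MARKERS = {
    "graceful_requested": "Got SIGTERM.",
    "no_hwnd": "Shutdown_Graceful: Failed cuz no hWnd",
    "fast_shutdown": "Got SIGQUIT.",
}


def classify(lines):
    """Report which of the known messages appear anywhere in the lines."""
    seen = {name: False for name in MARKERS}
    for line in lines:
        for name, text in MARKERS.items():
            if text in line:
                seen[name] = True
    return seen


def graceful_fell_back_to_fast(lines):
    """True if the log shows the failed-hWnd message and a fast shutdown."""
    seen = classify(lines)
    return seen["no_hwnd"] and seen["fast_shutdown"]
```

Usage would be along the lines of passing open(path).readlines() for a
node's StarterLog; a True result suggests the node went through the
failure described in steps 7-10.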

Thanks,
David

Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>
