
Re: [HTCondor-users] Dagman job resubmission after condor_rm



This was the DAG I used:

Job Windows.0 job.windows.submit
Job Windows.1 job.windows.submit
Job Windows.2 job.windows.submit
Job Windows.3 job.windows.submit
Job Windows.4 job.windows.submit
Job Windows.5 job.windows.submit
Job Windows.6 job.windows.submit
Job Windows.7 job.windows.submit
Job Windows.8 job.windows.submit
Job Windows.9 job.windows.submit
Job Windows.10 job.windows.submit
Job Windows.11 job.windows.submit
Job Windows.12 job.windows.submit
Job Windows.13 job.windows.submit
Job Windows.14 job.windows.submit
Job Windows.15 job.windows.submit
Job Windows.16 job.windows.submit
Job Windows.17 job.windows.submit
Job Windows.18 job.windows.submit
Job Windows.19 job.windows.submit
PARENT Windows.0 CHILD Windows.10
PARENT Windows.1 CHILD Windows.11
PARENT Windows.2 CHILD Windows.12
PARENT Windows.3 CHILD Windows.13
PARENT Windows.4 CHILD Windows.14
PARENT Windows.5 CHILD Windows.15
PARENT Windows.6 CHILD Windows.16
PARENT Windows.7 CHILD Windows.17
PARENT Windows.8 CHILD Windows.18
PARENT Windows.9 CHILD Windows.19
FINAL FinalNode finalnode/job.submit
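(As an aside, the 30 repetitive Job/PARENT lines above don't have to be maintained by hand; they can be generated with a short script. A minimal Python sketch, with the node and submit-file names taken from the DAG above and the output filename largedag.dag assumed:)

```python
# Sketch: regenerate the test DAG above (20 nodes, 10 parent/child pairs,
# plus the FINAL node). Names match the DAG shown; "largedag.dag" is assumed.

N = 10  # ten parent/child pairs -> 20 nodes total

lines = []
for i in range(2 * N):
    lines.append(f"Job Windows.{i} job.windows.submit")
for i in range(N):
    # node i is the parent of node i + N, as in the DAG above
    lines.append(f"PARENT Windows.{i} CHILD Windows.{i + N}")
lines.append("FINAL FinalNode finalnode/job.submit")

with open("largedag.dag", "w") as f:
    f.write("\n".join(lines) + "\n")
```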




This is the submit file that is used for each of the jobs:

Initialdir = /path/to/wkflw
Executable = /path/to/wkflw/job.bat
Universe = vanilla
Output = /path/to/wkflw/log/largedag.stdout
Error = /path/to/wkflw/log/largedag.stderr
Log = /path/to/wkflw/log/largedag.log

Requirements = (( OpSys >= "WINNT" || OpSys >= "WINDOWS" ) && (Machine == "server3003.ds.susq.com"))

notification = never
RunAsOwner = True
should_transfer_files = yes
WhenToTransferOutput = ON_EXIT_OR_EVICT
stream_output = True
stream_error = True
TransferOut = True

Priority = 0
nice_user = False

+NTDomain="DOMAIN"

request_memory = 2048

Queue





This is the submit file for the FINAL node:

Initialdir = /path/to/wkflw
Executable = /path/to/wkflw/finalnode/job.bat
Universe = vanilla
Output = /path/to/wkflw/finalnode/largedag.stdout
Error = /path/to/wkflw/finalnode/largedag.stderr
Log = /path/to/wkflw/finalnode/largedag.log

Requirements = ( OpSys >= "WINNT" || OpSys >= "WINDOWS" )

notification = never
RunAsOwner = True
should_transfer_files = yes
WhenToTransferOutput = ON_EXIT_OR_EVICT
stream_output = True
stream_error = True
TransferOut = True

Priority = 0
nice_user = False

+NTDomain="SUSQ"

request_memory = 2048

Queue





This is job.bat:

echo "start"
date /t
time /t
hostname.exe
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -command Start-Sleep 1800
echo "end"
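(The reproduction sequence described in the quoted message below boils down to roughly the following commands; the DAG filename and the cluster id placeholder are assumptions, and this of course needs a live HTCondor pool to run:)

```shell
# Hypothetical reproduction sketch, assuming the test DAG above is largedag.dag
condor_submit_dag largedag.dag      # DAGMan starts, then submits the first 10 nodes

# wait until the first nodes have been submitted (roughly >10s after submission)
condor_q -dag                       # note the DAGMan cluster id

condor_rm <dagman cluster id>       # DAGMan goes to the "X" state, its nodes are
                                    # removed, and the FINAL node is queued
condor_q -dag                       # expected: only FinalNode remains; observed:
                                    # removed running/idle nodes re-queued with it
```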



Thanks!
Eric Gross
System Engineer
Susquehanna International Group


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
Sent: Wednesday, June 10, 2015 9:43 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Dagman job resubmission after condor_rm

On Wed, 10 Jun 2015, Gross, Eric wrote:

> This morning, we saw a dagman process in our cluster that was stuck in
> an "X" state (via condor_q -dag) after it had been removed (condor_rm
> <dag cluster id>). A few of the nodes from the DAG were still running
> or idle. We weren't sure why these nodes were still executing after
> the dagman was removed; but we think it has something to do with the
> FINAL node, and the way the dagman parses the DAG. Since these jobs
> take a very long time to complete, this ends up causing us issues with
> slots being held by jobs that aren't actually supposed to be running.
> After we cleaned up these extra processes, we were able to reproduce
> this with a simple job.
>
> We created a simple bat script that sleeps for 30min (arbitrary time),
> which each node in the dag will use as the executable in the submit
> file. Our test DAG had 20 nodes total (also arbitrary), 10 of which
> were children of the first 10. The most important part is the FINAL
> statement at the end of the DAG. This just sleeps for 15 seconds (also
> arbitrary, but short enough that we don't have to wait a long time to
> watch it finish). When we submit the DAG, we see a dagman enter a
> running state, then begin submitting the nodes of the DAG. If we kill
> this off after a certain amount of time (we weren't able to figure
> this out, but it is likely less than 5-10sec after the first nodes are
> submitted), the dagman exits and doesn't run the workflow; and we
> don't see any processes remaining from the DAG. If we wait a bit
> longer (maybe >10sec), we can do a condor_rm to put the dagman into an
> "X" state, and any running and/or idle nodes will be removed. This is
> where the problem occurs. We expect this to happen, and we expect the
> FINAL node to be queued, execute, and return; killing the rest of the
> workflow. What actually happens is that the FINAL node is queued, and
> any running/idle jobs that were in the queue when dagman was removed
> are *also* queued. I'm not certain if this is expected behavior, but
> we didn't anticipate it when removing a dagman; we assumed every node
> would be removed. We also noticed that these processes continue running
> after the FINAL node exits.

No, that's not the expected behavior.  Once you've condor_rm'ed the DAGMan job, the only thing that should run is the FINAL node.

I'm curious about the timing issue -- offhand I can't see what would be causing that.

> We think this has something to do with the FINAL node, since we can't
> reproduce the issue without it. Also, since we don't see it if we kill
> the dagman early enough in the workflow, we think that the FINAL node
> might not be evaluated right away; maybe the DAG is still being parsed
> when the first set of jobs is queued?

Hmm, it wouldn't surprise me if the FINAL node is related somehow, but parsing isn't the issue -- the whole DAG file is parsed before any jobs are submitted.

> Could this be caused by a config setting we are using? Has anyone else
> seen this behavior (can you reproduce it)?

Possibly -- I haven't seen that before.

Can you send me the relevant DAG files and the dagman.out files (from your original DAG, and from the two cases of the tests you ran)?

Kent Wenger
CHTC Team
