
Re: [HTCondor-users] POST on each Proc



Awesome, and thank you for the help. The service piece worked out
perfectly, properly aborting when a true success is found!
The last issue I'm having now (really a question about how not to
overwhelm condor) is: is it possible to have a DAG with many JOB +
SERVICE pairs where one successful ABORT-DAG-ON doesn't abort every job
in the single DAG file? It works a treat when I only put identical
IDs in a single DAG, but given the number of JOBs * IDs, loading them
all can take hours while I make sure I'm not clobbering condor. In
other words, is it possible to have SERVICE A-SVC in the example
below abort only the job A-JOB when they're in a single DAG, where
A.sub and B.sub each use queue > 1?

JOB A-JOB A.sub
SERVICE A-SVC A.svc
JOB B-JOB B.sub
SERVICE B-SVC B.svc

tl;dr: Is there an ABORT-JOB-ON, or some other way to tie a service to a specific job, etc.?
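
For reference, the single-ID layout that does work today looks roughly
like this (just a sketch; A.svc is the watcher submit file, and it exits
0 only on a verified true success, which ABORT-DAG-ON then catches):

JOB A-JOB A.sub                 # A.sub queues all procs for this one ID
SERVICE A-SVC A.svc             # watcher; exits 0 on a verified success
ABORT-DAG-ON A-SVC 0 RETURN 0   # aborts the whole DAG, reporting success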

Thanks!
Chris

On Mon, Jan 29, 2024 at 1:51 PM Cole Bollig via HTCondor-users
<htcondor-users@xxxxxxxxxxx> wrote:
>
> Hi Christopher,
>
> I was alluding to exactly what you described: a node running on the side that does the intended POST script analysis and action. Some clarifications on service nodes:
>
> - Service nodes are just normal HTCondor jobs that live for as long as the specified executable runs. DAGMan gives them the distinction of being started before any other node in the DAG (besides the provisioner).
> - Service nodes are best effort: if a service node fails to submit correctly, DAGMan will ignore the failure and carry on with execution.
> - I would recommend running the service node job as universe = local. It is similar to the scheduler universe (it runs locally on the AP), but it is not managed directly by the Schedd; rather, it gets a Shadow and Starter like a vanilla universe job (see the sketch after this list).
> - ABORT-DAG-ON can be specified for a service node. So if you have a service node that watches the results directory for verification, then when it finds a valid success file it can remove all the jobs with a constraint as you described and/or exit with a specific value to notify DAGMan to abort. This will cause DAGMan to remove the entirety of the DAG. You can even specify an exit value of 0 for a successful exit.
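>
> For example, the service node's submit file could look roughly like this (just a sketch; the executable name and paths are placeholders):
>
> universe   = local
> executable = watch_results.sh
> arguments  = /path/to/results
> log        = watcher.log
> output     = watcher.out
> error      = watcher.err
> queue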
>
> Finally, I do want to note that within the last half year, some bug fixes have been made to DAGMan's handling of service nodes, as they were previously causing DAGMan to fail assertion checks and crash.
>
> Hope This Helps,
> Cole Bollig
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
> Sent: Monday, January 29, 2024 10:40 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] POST on each Proc
>
> There are certain situations where a success doesn't warrant an exit, and
> in those cases the script would let the DAG jobs run to completion;
> only if the script determined the success result is valid would it
> end the entire DAG and any other DAGs that are handling
> similar jobs looking for the same result. Each computation is
> identified by an ID, and the current POST script removes the work
> using that ID as a constraint against all job ClassAds. It's a simple
> <pre><job><post> DAG where the job submit file queues 50000, and the
> process within knows how to split the work based on ProcId being part
> of the arguments to the application. On success, a "<uuid>.success"
> file is generated and returned via Condor to the working directory.
> The POST script currently reads this success file, determines whether
> it's a true success, and then runs a nuclear `condor_rm -const
> 'computation_uuid==<uuid>'` (this isn't ideal, we know) to kill all of
> the pertaining work.
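>
> (For reference, the submit file is roughly the sketch below; the
> executable and argument names are placeholders, but the queue count,
> the ProcId-based split, and the computation_uuid attribute are what I
> described above.)
>
> universe          = vanilla
> executable        = compute
> arguments         = --id <uuid> --slice $(ProcId)
> +computation_uuid = "<uuid>"
> log               = compute.log
> queue 50000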
>
> It's funny you mention a job that runs alongside it to monitor. I
> just wrote a Service node that I'm about to kick the tires on: it
> will monitor the return folder for success files and do the work of
> the POST script, to see if that meets my need for semi-real-time
> exiting. Is this what you were referring to? The documentation on
> SERVICE nodes is misleading, but I understand it to mean that I treat
> it like a submit file, except this one should remain running (while
> true) and, to ensure it has access to the return working folder
> defined in the job submit, be run in the scheduler universe.
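>
> The watcher is roughly the sketch below (assumptions: the return
> folder path, the verify() check, and the computation_uuid attribute
> are placeholders for what we already have; it exits 0 after a
> verified success so something like ABORT-DAG-ON could hook it later):
>
> #!/usr/bin/env python3
> # Poll the return folder for <uuid>.success files, verify them,
> # condor_rm the matching work, then exit successfully.
> import glob
> import os
> import subprocess
> import time
>
> RESULT_DIR = "/path/to/return/dir"  # placeholder for the return working folder
>
> def verify(path):
>     # Placeholder for the real "true success" validation the POST script does today.
>     return os.path.getsize(path) > 0
>
> while True:
>     for success_file in glob.glob(os.path.join(RESULT_DIR, "*.success")):
>         if not verify(success_file):
>             continue
>         uuid = os.path.basename(success_file)[: -len(".success")]
>         # Remove every job tagged with this computation ID (string attribute, so quote it).
>         subprocess.run(
>             ["condor_rm", "-constraint", f'computation_uuid == "{uuid}"'],
>             check=False,
>         )
>         raise SystemExit(0)  # a 0 exit here could later drive ABORT-DAG-ON
>     time.sleep(30)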
>
> Thanks,
> Chris
>
> On Mon, Jan 29, 2024 at 10:59 AM Cole Bollig via HTCondor-users
> <htcondor-users@xxxxxxxxxxx> wrote:
> >
> > Hi Christopher,
> >
> > As you have noticed, the Post Script executes for a DAG node when the job cluster is complete (success or failure), and not per proc. The only way for a Post Script to run at the end of each job proc is if the node were not multi-proc but instead spread out into individual nodes. I will also note that with the current setup, if any of the 50,000-some procs fail, then all other procs will be removed by DAGMan and the node goes into failure mode. I also would not recommend using a wrapper script around the actual executable itself, since wrapper scripts are evil: they hide vital information from condor, risk not handling certain failures correctly, and make debugging even more difficult.
> >
> > All that being said, one idea that comes to mind is creating a local universe job that runs under the DAG and monitors the other nodes, either by querying the queue or by tailing the *.nodes.log file like DAGMan proper does. What are the full requirements of this situation? If one of the many procs succeeds, you want to remove the entire cluster, but should the entire DAG exit? Is there only a single node in the DAG that this applies to? Are there other nodes in the DAG? If so, how complex is the DAG structure? Also, what version of HTCondor are you currently using?
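> >
> > (As a rough sketch of the queue/history-polling flavor, and assuming the computation_uuid attribute you mentioned, such a monitor could periodically check whether any proc has already finished successfully with something like
> >
> > condor_history -constraint 'computation_uuid == "<uuid>" && ExitCode == 0' -limit 1 -af:j ExitCode
> >
> > and then act on any match.)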
> >
> > -Cole Bollig
> > ________________________________
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
> > Sent: Monday, January 29, 2024 8:32 AM
> > To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> > Subject: [HTCondor-users] POST on each Proc
> >
> > I'm in a situation where I need a DAG to run a POST script (or
> > something equivalent) after each ProcId finishes when queue is
> > greater than 1 in the submit file, to determine whether the remainder
> > of the jobs within that cluster, or other jobs performing a similar
> > action within another DAG, should be aborted. Obviously the DAG only
> > runs POST after an entire job cluster completes, but I'm curious if
> > there is another way to have something run after each ProcId finishes
> > so we can kill the remainder of the jobs if we get a result elsewhere.
> >
> > Our runs generally consist of about 50,000 15-hour jobs that can/should
> > be exited if one of the processes (ProcIds) finishes with a positive
> > result. Given we only have 10,000 cores to work against, we could have
> > a positive result minutes into processing, but then have to wait the days
> > necessary for all jobs to complete before POST can tell us.
> >
> > I've researched ABORT-DAG-ON, but we have little control over the
> > application we run, so I'd need to write a wrapper that interprets the
> > results and exits appropriately to stop the jobs within the cluster,
> > and then handle the removal of like jobs in FINAL after the abort. I'm
> > just curious whether there is a way to keep using the native binary
> > as-is, without a wrapper and without having to define each
> > ProcId manually in the DAG.
> >
> > Thanks,
> > ChrisP
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> It will be happened; it shall be going to be happening; it will be was
> an event that could will have been taken place in the future. Simple
> as that. ~ Arnold Rimmer
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



-- 
It will be happened; it shall be going to be happening; it will be was
an event that could will have been taken place in the future. Simple
as that. ~ Arnold Rimmer