Re: [HTCondor-users] Maximum SERVICE's run in local universe



Hi Christopher,

That warning message is a bug; thanks for reporting it. As for the other issue, it is concerning to hear that the service node is still running after the rest of the work has completed. DAGMan should recognize that the actual work has completed and remove the service nodes. I am curious: does the service node linger after the DAGMan job proper has exited the queue, or is the DAG stuck waiting for the service node to exit? Would you be willing to send me the *.dagman.out file for one of the workflows that saw this behavior? (Feel free to send it to me directly offline.) Also, just to be safe, did you modify the watcher job I provided at all?

-Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
Sent: Thursday, February 22, 2024 9:02 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Maximum SERVICE's run in local universe
 
One more thing/correction: we're running 23.3.0. My brain can't read
right this morning.

On Thu, Feb 22, 2024 at 6:42 AM Christopher Phipps
<hawtdogflvrwtr@xxxxxxxxx> wrote:
>
> I should also add that the corresponding logs for each DAG with a
> service that's still running say this:
>
> Warning: DAGMan thinks there are -1 idle jobs, even though the DAG is completed!
> ERROR: Warning is fatal error because of DAGMAN_USE_STRICT setting
> Aborting Dag...
> Writing Rescue DAG to x.rescue001...
> Removing submitted jobs...
> Removing any/all submitted HTCondor jobs...
>
> On Thu, Feb 22, 2024 at 5:43 AM Christopher Phipps
> <hawtdogflvrwtr@xxxxxxxxx> wrote:
> >
> > I forgot to report back on this. It worked perfectly! I have noticed
> > though, that sometimes the service node doesn't end when all of the
> > work associated with the service node completes. In fact, the service
> > job separates from the parent DAG and sits in the running state until
> > you remove it manually. At first I thought it was because the job
> > started and finished so quickly, that it didn't start the service
> > until after the job had been completed, but it's happening with jobs
> > that take the better part of 15 hours to complete, and I've confirmed
> > that the service started far before anyone picked up the work. Have
> > you seen this before? Other than writing logic into the service to
> > check regularly for any remaining work, is there another way to force
> > the service to end gracefully when the rest of its dag is done?
> >
> > Also, I forgot to mention last time that I'm running 23.0.3.
> >
> > On Tue, Feb 6, 2024 at 2:29 PM Cole Bollig via HTCondor-users
> > <htcondor-users@xxxxxxxxxxx> wrote:
> > >
> > > Hi Christopher,
> > >
> > > Assuming this relates to the DAGMan setup I helped with recently, the change to this would have to be in the Schedd configuration. You just have to set START_LOCAL_UNIVERSE in the AP configuration (host that the Schedd/DAGMan is running on). This defaults to TotalLocalJobsRunning < 200 so something like:
> > >
> > > START_LOCAL_UNIVERSE = TotalLocalJobsRunning < n
> > >
> > > where n is the desired cap on local universe jobs that can run at once on the host. Don't forget to reconfigure HTCondor afterwards (i.e., run condor_reconfig).
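> > >
> > > As a minimal sketch of that setting, assuming a cap of 300 (an example value only, not a recommendation) and a hypothetical drop-in file name in the AP's configuration directory, the fragment might look like:
> > >
> > > ```
> > > # Hypothetical file: /etc/condor/config.d/99-local-universe.conf
> > > # (the file name and directory are assumptions; use your site's layout)
> > > #
> > > # Allow a new local universe job to start only while fewer than 300
> > > # are already running on this host. The default expression is
> > > # TotalLocalJobsRunning < 200.
> > > START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 300
> > > ```
> > >
> > > After running condor_reconfig, condor_config_val START_LOCAL_UNIVERSE on the AP should print the new expression if the file was picked up.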
> > >
> > > Cheers,
> > > Cole Bollig
> > > ________________________________
> > > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
> > > Sent: Tuesday, February 6, 2024 11:35 AM
> > > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > > Subject: [HTCondor-users] Maximum SERVICE's run in local universe
> > >
> > > Is there a way to increase the number of SERVICE jobs that can be
> > > running at the same time in the local universe? It appears to be
> > > limited by default to 200 and I'd like to increase it slightly.
> > >
> > > Thanks,
> > > Chris
> > > _______________________________________________
> > > HTCondor-users mailing list
> > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > > subject: Unsubscribe
> > > You can also unsubscribe by visiting
> > > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > >
> > > The archives can be found at:
> > > https://lists.cs.wisc.edu/archive/htcondor-users/
> >
> >
> >
> > --
> > It will be happened; it shall be going to be happening; it will be was
> > an event that could will have been taken place in the future. Simple
> > as that. ~ Arnold Rimmer
>
>
>
> --
> It will be happened; it shall be going to be happening; it will be was
> an event that could will have been taken place in the future. Simple
> as that. ~ Arnold Rimmer



--
It will be happened; it shall be going to be happening; it will be was
an event that could will have been taken place in the future. Simple
as that. ~ Arnold Rimmer
