Re: [HTCondor-users] Maximum SERVICE's run in local universe



I should also add that the corresponding logs for each DAG with a
service that's still running say this:

Warning: DAGMan thinks there are -1 idle jobs, even though the DAG is completed!
ERROR: Warning is fatal error because of DAGMAN_USE_STRICT setting
Aborting Dag...
Writing Rescue DAG to x.rescue001...
Removing submitted jobs...
Removing any/all submitted HTCondor jobs...

On Thu, Feb 22, 2024 at 5:43 AM Christopher Phipps
<hawtdogflvrwtr@xxxxxxxxx> wrote:
>
> I forgot to report back on this. It worked perfectly! I have noticed,
> though, that sometimes the service node doesn't end when all of the
> work associated with the service node completes. In fact, the service
> job separates from the parent DAG and sits in the running state until
> you remove it manually. At first I thought it was because the job
> started and finished so quickly that the service didn't start until
> after the job had been completed, but it's happening with jobs
> that take the better part of 15 hours to complete, and I've confirmed
> that the service started far before anyone picked up the work. Have
> you seen this before? Other than writing logic into the service to
> check regularly for any remaining work, is there another way to force
> the service to end gracefully when the rest of its DAG is done?
>
> Also, I forgot to mention last time that I'm running 23.0.3.
>
> On Tue, Feb 6, 2024 at 2:29 PM Cole Bollig via HTCondor-users
> <htcondor-users@xxxxxxxxxxx> wrote:
> >
> > Hi Christopher,
> >
> > Assuming this relates to the DAGMan setup I helped with recently, the change would have to be made in the Schedd configuration. You just have to set START_LOCAL_UNIVERSE in the AP configuration (the AP being the host that the Schedd/DAGMan is running on). This defaults to TotalLocalJobsRunning < 200, so you would use something like:
> >
> > START_LOCAL_UNIVERSE = TotalLocalJobsRunning < n
> >
> > where n is the desired cap on local universe jobs that can run at once on the host. Don't forget to reconfigure HTCondor afterwards (i.e. run condor_reconfig).
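> >
> > As a rough illustration, assuming a hypothetical cap of 300 and a
> > hypothetical drop-in file name (neither comes from your setup), the
> > AP configuration might contain:
> >
> > # /etc/condor/config.d/99-local-universe.conf (assumed path and name)
> > # Allow up to 300 local universe jobs to run at once on this AP
> > START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 300
> >
> > Once the configuration has been reloaded, running
> > condor_config_val START_LOCAL_UNIVERSE on the AP should print the
> > new expression.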
> >
> > Cheers,
> > Cole Bollig
> > ________________________________
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
> > Sent: Tuesday, February 6, 2024 11:35 AM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Subject: [HTCondor-users] Maximum SERVICE's run in local universe
> >
> > Is there a way to increase the number of SERVICE jobs that can be
> > running at the same time in the local universe? It appears to be
> > limited by default to 200 and I'd like to increase it slightly.
> >
> > Thanks,
> > Chris
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> It will be happened; it shall be going to be happening; it will be was
> an event that could will have been taken place in the future. Simple
> as that. ~ Arnold Rimmer



-- 
It will be happened; it shall be going to be happening; it will be was
an event that could will have been taken place in the future. Simple
as that. ~ Arnold Rimmer