
Re: [HTCondor-users] Dagman Issue V9



Thanks Mark,
Many of my users submit similar DAGs without any issues. This particular submitter sends far more jobs, so the problem may be load related.
The issue happens sporadically.

The DAG file:
SUBDAG EXTERNAL A xxx/xxx/a.dag
SUBDAG EXTERNAL B yyy/yyy/b.dag
PARENT A CHILD B

DOT pipeline.dag.dot
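
For readers who want to inspect the structure of a DAG like the one above, here is a minimal sketch, assuming only the two line types used in it (SUBDAG EXTERNAL and PARENT ... CHILD). The function name parse_dag and the SAMPLE_DAG string are illustrative and not part of HTCondor; every other DAG command, including DOT, is simply skipped.

```python
def parse_dag(text):
    """Return (subdags, edges) parsed from the text of a DAG file.

    Only SUBDAG EXTERNAL and PARENT ... CHILD lines are handled;
    every other line (DOT, comments, blanks, ...) is ignored.
    """
    subdags = {}  # node name -> path of the external .dag file
    edges = []    # (parent, child) pairs
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        keyword = tokens[0].upper()
        if keyword == "SUBDAG" and len(tokens) >= 4 and tokens[1].upper() == "EXTERNAL":
            subdags[tokens[2]] = tokens[3]
        elif keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" in upper:
                split = upper.index("CHILD")
                parents = tokens[1:split]
                children = tokens[split + 1:]
                # Every parent listed is a prerequisite of every child listed.
                edges.extend((p, c) for p in parents for c in children)
    return subdags, edges


# Sample input copied from the DAG file shown above.
SAMPLE_DAG = """\
SUBDAG EXTERNAL A xxx/xxx/a.dag
SUBDAG EXTERNAL B yyy/yyy/b.dag
PARENT A CHILD B

DOT pipeline.dag.dot
"""

subdags, edges = parse_dag(SAMPLE_DAG)
print(subdags)
print(edges)
```

Here both nodes are external sub-DAGs, and B cannot start until the whole of A has finished.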

This is our configuration (DAG related):
DAGMAN_MAX_JOBS_IDLE = 25000
DAGMAN_MAX_JOB_HOLDS = 5000
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 100
DAGMAN_MAX_SUBMIT_ATTEMPTS = 1
DAGMAN_USE_STRICT = 0
DAGMAN_USE_CONDOR_SUBMIT = TRUE

DAGMAN_MAX_JOBS_SUBMITTED is not configured, so its value is 0.
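
When comparing a configuration like this against the documented knobs, a small name check can help, because a misspelled knob name is not necessarily reported as an error. The following is a minimal sketch: KNOWN_KNOBS contains only the knob names mentioned in this thread (not the full HTCondor list), and check_knobs, SAMPLE_CONFIG, and the misspelled entry in it are hypothetical.

```python
import difflib

# Only the DAGMan knobs mentioned in this thread; NOT a complete list.
KNOWN_KNOBS = [
    "DAGMAN_MAX_JOBS_IDLE",
    "DAGMAN_MAX_JOB_HOLDS",
    "DAGMAN_MAX_SUBMITS_PER_INTERVAL",
    "DAGMAN_MAX_SUBMIT_ATTEMPTS",
    "DAGMAN_USE_STRICT",
    "DAGMAN_USE_CONDOR_SUBMIT",
    "DAGMAN_MAX_JOBS_SUBMITTED",
]


def check_knobs(config_text, known=KNOWN_KNOBS):
    """Map each name not found in `known` to its closest known name, or None."""
    # Configuration names are compared case-insensitively.
    known_upper = [k.upper() for k in known]
    problems = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.upper() in known_upper:
            continue
        close = difflib.get_close_matches(name.upper(), known_upper, n=1, cutoff=0.8)
        problems[name] = close[0] if close else None
    return problems


# Hypothetical sample: the second line is a deliberately misspelled knob,
# the third is an unrelated site-specific macro.
SAMPLE_CONFIG = """\
DAGMAN_MAX_JOBS_IDLE = 25000
DAGMAN_MAX_JOBS_IDEL = 25000
MY_SITE_MACRO = 1
"""

problems = check_knobs(SAMPLE_CONFIG)
for name, suggestion in problems.items():
    print(name, "->", suggestion or "no close match")
```

A name with no close match is not necessarily wrong; it may simply be a knob or macro outside the short list used here.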

Sometimes I can see that the shared port daemon is under high load and refuses connections, but that does not happen at the same time as this issue.


Many Thanks
David




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mark Coatsworth <coatsworth@xxxxxxxxxxx>
Sent: 07 June 2021 19:19
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dagman Issue V9
 
Hi David,

Can you describe the dag where this is happening (or better yet, send
me the .dag file)? When you mention a child dag, are you talking about
an external subdag or something different?

By default every dag is supposed to have these attributes in its
classad. I just did a quick test to verify this. So I'm wondering if
there's something special about your environment causing it to not be
there.

Are you setting a custom value for max jobs? (either with the
DAGMAN_MAX_JOBS_SUBMITTED configuration knob or the -maxjobs submit
flag)
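
One way to check a particular DAGMan job ad for these attributes is to save the output of condor_q -long for that job and scan it. Below is a minimal sketch, assuming the usual "Name = value" line format of that output; the function missing_dagman_attrs and the SAMPLE_AD text are hypothetical, and the attribute names are the five listed in the quoted message.

```python
# Attribute names taken from the warnings quoted in this thread.
EXPECTED_ATTRS = (
    "DAGMan_MaxIdle",
    "DAGMan_MaxJobs",
    "DAGMan_MaxPreScripts",
    "DAGMan_MaxPostScripts",
    "DAGMan_MaxHoldScripts",
)


def missing_dagman_attrs(long_ad_text, expected=EXPECTED_ATTRS):
    """Return the expected attributes that do not appear in a job ad dump."""
    present = set()
    for raw in long_ad_text.splitlines():
        # Split on the first "=" only, so values containing "==" are harmless.
        name, sep, _value = raw.partition("=")
        if sep:
            # ClassAd attribute names are case-insensitive.
            present.add(name.strip().lower())
    return [attr for attr in expected if attr.lower() not in present]


# Hypothetical job ad excerpt, not taken from a real pool.
SAMPLE_AD = """\
ClusterId = 1
ProcId = 0
DAGMan_MaxIdle = 25000
DAGMan_MaxJobs = 0
"""

missing = missing_dagman_attrs(SAMPLE_AD)
print(missing)
```

Running it against both the top-level DAG's job ad and a sub-DAG's job ad would show whether only the sub-DAGs lack the attributes.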

Mark



On Mon, Jun 7, 2021 at 6:54 AM <duduhandelman@xxxxxxxxxxx> wrote:
>
> Hi Again,
> I forgot to mention that this happens on a DAG that submits other DAGs; only the child DAGs are affected.
>
> Also, the child DAGs do not have these ClassAd attributes, and I don't know whether that is OK or not:
>
> DAGMan_MaxIdle
> DAGMan_MaxJobs
> DAGMan_MaxPreScripts
> DAGMan_MaxPostScripts
> DAGMan_MaxHoldScripts
>
> Thanks Again,
> David
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of duduhandelman@xxxxxxxxxxx <duduhandelman@xxxxxxxxxxx>
> Sent: 07 June 2021 14:29
> To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Dagman Issue V9
>
> Hi All,
> A week ago I upgraded from HTCondor 8.8 to 9.0.1, and I'm now facing an issue with DAGMan jobs.
> Most of the jobs run as expected, but some DAGMan jobs stop submitting jobs after a while.
> It seems that the DAGMan job asks for DAGMan_MaxJobs and sometimes gets a positive value but sometimes gets a negative number, which I assume is causing the issue.
>
> The schedd debug log prints:
> GetAttributeInt(968372, 0 , DAGMAN_MaxJobs) not found.
>
> The DAG output shows the following every few minutes:
> Warning: failed to get attribute DAGMan_MaxIdle
> Warning: failed to get attribute DAGMan_MaxJobs
> Warning: failed to get attribute DAGMan_MaxPreScripts
> Warning: failed to get attribute DAGMan_MaxPostScripts
> Warning: failed to get attribute DAGMan_MaxHoldScripts
>
>
> It seems like the value is garbage, probably uninitialized.
> Any clues? Could it be a security issue?
>
> Many Thanks
> David
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison