
Re: [HTCondor-users] Dagman Issue V9



Thanks Mark.
I think you are correct. 

On Thursday I ran a small test, submitting 500 DAGs, and the errors appeared during submission.


Instead of holding and releasing I rebooted the server and the errors came back. 

I then decided to disable shared_port and reboot the server. This time the errors were gone and the DAGs look OK.

So now it's running with no shared_port for the weekend and hopefully on Monday we will have a clear answer. 

Many Thanks 
David 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mark Coatsworth <coatsworth@xxxxxxxxxxx>
Sent: Saturday, June 12, 2021, 00:34
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Dagman Issue V9

Hi David,

It sounds like this is not a specific DAGMan problem, but something
more sinister is happening to your classads.

You mentioned previously "It seems that Dagman job is asking for
DagMan_Max_jobs and sometimes gets a positive value but sometimes gets
negative number and that causing the issue I assume." You should not
be getting different values here. Unless somebody is actively changing
the value of this attribute using condor_qedit, you should be seeing a
value of 0 (the default) every time.

Moreover, the error "GetAttributeInt(968372, 0 , DAGMAN_MaxJobs) not
found." is also alarming because that attribute should be there by
default.
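
One way to check this, as a rough sketch rather than an official tool:
dump the affected DAGMan job's ClassAd in long form (condor_q -long) and
look for the five attributes named in the warnings. The Python below only
parses text in the "Name = Value" form that the long listing prints. The
helper names and the sample ad are made up for illustration.

```python
# Hypothetical helper: report which DAGMan throttle attributes are absent
# from a job ClassAd printed in long form ("Name = Value", one per line).
EXPECTED_ATTRS = [
    "DAGMan_MaxIdle",
    "DAGMan_MaxJobs",
    "DAGMan_MaxPreScripts",
    "DAGMan_MaxPostScripts",
    "DAGMan_MaxHoldScripts",
]


def parse_long_ad(text):
    """Parse 'Name = Value' lines into a dict keyed by lower-cased name."""
    ad = {}
    for line in text.splitlines():
        # Split on the first '=' only, so expressions containing '==' survive.
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip().lower()] = value.strip()
    return ad


def missing_dagman_attrs(text):
    """Return the expected attribute names that the ad does not contain."""
    ad = parse_long_ad(text)
    # ClassAd attribute names are case-insensitive, so compare lower-cased.
    return [a for a in EXPECTED_ATTRS if a.lower() not in ad]


# Made-up sample ad, not real output from the pool discussed here.
sample = """ClusterId = 1
DAGMan_MaxIdle = 25000
DAGMan_MaxJobs = 0
"""
print(missing_dagman_attrs(sample))
```

On a live schedd the input text would come from condor_q -long for the
cluster and proc reported in the error. An empty list means all five
attributes are present in the ad.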

Because this is only happening sporadically I'm assuming it must be
something load-related, but beyond that it's really hard to say. Your
best bet is probably to set both DAGMAN_DEBUG and SCHEDD_DEBUG up to
D_FULLDEBUG, and check the log files for unusual errors next time it
happens.
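
With the extra logging on, a small script can help show how often the
lookups fail and for which attributes. This is only a sketch: it counts
lines matching the two messages quoted in this thread. The helper name
count_attr_failures and the sample lines are invented for the example,
and real log lines will carry timestamps and other prefixes.

```python
import re
from collections import Counter

# Patterns taken from the two messages quoted in this thread.
SCHEDD_PAT = re.compile(
    r"GetAttributeInt\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\w+)\s*\) not found"
)
DAGMAN_PAT = re.compile(r"Warning: failed to get attribute (\w+)")


def count_attr_failures(lines):
    """Count failed attribute lookups per attribute name (lower-cased)."""
    counts = Counter()
    for line in lines:
        m = SCHEDD_PAT.search(line)
        if m:
            counts[m.group(3).lower()] += 1
            continue
        m = DAGMAN_PAT.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts


# Made-up sample lines modelled on the messages above.
sample_lines = [
    "GetAttributeInt(968372, 0 , DAGMAN_MaxJobs) not found.",
    "Warning: failed to get attribute DAGMan_MaxIdle",
    "Warning: failed to get attribute DAGMan_MaxJobs",
    "some unrelated line",
]
for attr, n in sorted(count_attr_failures(sample_lines).items()):
    print(attr, n)
```

To run it against a real log, pass an open file object instead of
sample_lines; the function only needs an iterable of lines.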

Mark


On Tue, Jun 8, 2021 at 4:49 AM <duduhandelman@xxxxxxxxxxx> wrote:
>
> Thanks Mark,
> I have many users who send similar DAGs and have no issues. This particular submitter sends a lot more jobs, so maybe it's load related.
> The issue happens sporadically.
>
> The DAG file:
> SUBDAG EXTERNAL A xxx/xxx/a.dag
> SUBDAG EXTERNAL B yyy/yyy/b.dag
> PARENT A CHILD B
>
> DOT pipeline.dag.dot
>
> This is our configuration (DAG related):
> DAGMAN_MAX_JOBS_IDLE = 25000
> DAGMAN_MAX_JOB_HOLDS = 5000
> DAGMAN_MAX_SUBMITS_PER_INTERVAL = 100
> DAGMAN_MAX_SUBMIT_ATTEMPTS = 1
> DAGMAN_USE_STRICT = 0
> DAGMAN_USE_CONDOR_SUBMIT = TRUE
>
> DAGMAN_MAX_JOBS_SUBMITTED is not configured, so its value is 0.
>
> Sometimes I can see that the shared port daemon is under high load and refuses connections, but that does not happen at the same time as these errors.
>
>
> Many Thanks
> David
>
>
>
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mark Coatsworth <coatsworth@xxxxxxxxxxx>
> Sent: 07 June 2021 19:19
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Dagman Issue V9
>
> Hi David,
>
> Can you describe the dag where this is happening (or better yet, send
> me the .dag file)? When you mention a child dag, are you talking about
> an external subdag or something different?
>
> By default every dag is supposed to have these attributes in its
> classad. I just did a quick test to verify this. So I'm wondering if
> there's something special about your environment causing it to not be
> there.
>
> Are you setting a custom value for max jobs? (either with the
> DAGMAN_MAX_JOBS_SUBMITTED configuration knob or the -maxjobs submit
> flag)
>
> Mark
>
>
>
> On Mon, Jun 7, 2021 at 6:54 AM <duduhandelman@xxxxxxxxxxx> wrote:
> >
> > Hi Again,
> > I forgot to mention that it's happening on DAGs that submit other DAGs; only the child DAGs are affected.
> >
> > Also, the child DAGs do not have the ClassAd attributes below. I don't know whether that is OK or not.
> >
> > DAGMan_MaxIdle
> > DAGMan_MaxJobs
> > DAGMan_MaxPreScripts
> > DAGMan_MaxPostScripts
> > DAGMan_MaxHoldScripts
> >
> > Thanks Again,
> > David
> > ________________________________
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of duduhandelman@xxxxxxxxxxx <duduhandelman@xxxxxxxxxxx>
> > Sent: 07 June 2021 14:29
> > To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> > Subject: [HTCondor-users] Dagman Issue V9
> >
> > Hi All,
> > A week ago I upgraded from HTCondor 8.8 to 9.0.1, and I'm now facing an issue with DAGMan jobs.
> > Most of the jobs run as expected, but some DAGMan instances stop submitting jobs after a while.
> > It seems that Dagman job is asking for DagMan_Max_jobs and sometimes gets a positive value but sometimes gets negative number and that causing the issue I assume.
> >
> > The schedd debug log prints:
> > GetAttributeInt(968372, 0 , DAGMAN_MaxJobs) not found.
> >
> > The DAGMan output shows these warnings every few minutes:
> > Warning: failed to get attribute DAGMan_MaxIdle
> > Warning: failed to get attribute DAGMan_MaxJobs
> > Warning: failed to get attribute DAGMan_MaxPreScripts
> > Warning: failed to get attribute DAGMan_MaxPostScripts
> > Warning: failed to get attribute DAGMan_MaxHoldScripts
> >
> >
> > It seems like the value is garbage, probably uninitialized.
> > Any clues? Could it be a security issue?
> >
> > Many Thanks
> > David
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison