
Re: [HTCondor-users] Random job in DAGman throwing submitERROR



Hi Mark,

Thanks for your email.

We are not using late materialization here. This issue does not happen every time; it occurs at random, following no pattern in timing or job. Once the job submit failure happens, DAGMan automatically resubmits it, and the second attempt is successful. The current debug setting is

SCHEDD_DEBUG = D_PID

Let me see if I can enable full debug for a long period, considering the amount of logs it will generate.
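If log volume is the concern, something like this might work; the rotation values below are only an example, not a recommendation:

```
# Turn on verbose schedd logging
SCHEDD_DEBUG = D_FULLDEBUG

# Allow larger and more rotated SchedLog files, so the window around
# a random failure is not rotated away before it can be inspected
MAX_SCHEDD_LOG = 100000000
MAX_NUM_SCHEDD_LOG = 10
```

That way the extra verbosity should not cost us the log window around one of these random failures.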

Thanks & Regards,
Vikrant Aggarwal


On Wed, May 15, 2019 at 11:50 PM Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
Hi Vikrant, a few questions here:

Late materialization was actually added during the 8.5 series. We only started supporting it for DAGMan jobs in 8.7.4. Are you using late materialization? If so, DAGMan is definitely going to have some strange problems.

As for the SchedLog not reporting this error, you probably don't have your debug level turned up high enough. Can you add the following configuration option:

SCHEDD_DEBUG = D_FULLDEBUG

Then try running your DAG again until you see this failure? With this option set, the SchedLog should contain more useful information.

One other idea: is it the same job (ie: same .submit file) failing every time? If so, there might be something about your requirements that the condor_schedd doesn't like. Try running the following:

condor_submit -debug myfile.submit

That should output some helpful information which explains why it's failing.
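If you need to find where the SchedLog lives on that machine, something along these lines should work (the timestamp below is just the failure time from your logs):

```
# Locate the schedd's log file
condor_config_val SCHEDD_LOG

# Then look at the entries around the failure
grep '05/08/19 04:01' $(condor_config_val SCHEDD_LOG)
```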

Mark



On Wed, May 15, 2019 at 6:56 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Thanks for your response.

As per the following link, it seems that the late materialization feature was introduced in 8.7.4.

Version I am using is: 8.5.8

Thanks & Regards,
Vikrant Aggarwal


On Wed, May 15, 2019 at 3:02 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi Vikrant,

I might be wrong (the wife assures me I often am), but there is a documented problem with late materialization and DAGs:


The error message I saw in the rescheduling case was different from yours, but if your problem is actually about rescheduling inside a DAG, it might be related ...

Best
Christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Mittwoch, 15. Mai 2019 09:23:02
Betreff: Re: [HTCondor-users] Random job in DAGman throwing submitERROR

Hello Team,
Any more thoughts on this issue?

Thanks & Regards,
Vikrant Aggarwal


On Sat, May 11, 2019 at 5:11 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Thanks for your response. No failing disk on the node.

There is no SchedLog entry for 04:01:16, when the job submission failed.

05/08/19 04:01:15 (pid:1316698) Shadow pid 1055394 for job 147153.3 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053482 for job 147151.3 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056671 for job 147155.1 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056674 for job 147155.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:17 (pid:1316698) Activity on stashed negotiator socket: <IPaddress:32651>
05/08/19 04:01:17 (pid:1316698) Using negotiation protocol: NEGOTIATE

This is the main logic used in the DAGMan submit file.

remove_kill_sig = SIGUSR1
+OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove _expression_:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False

Jobs "147157.4" and "147154.6" are part of the same DAGMan output.

# grep -ir '147154.6' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.5.0) to (147154.6.0)
05/08/19 03:59:28 Event: ULOG_SUBMIT for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:27}
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.6.0) to (147154.7.0)
05/08/19 03:59:45 Event: ULOG_EXECUTE for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:28}
05/08/19 04:01:16 Event: ULOG_JOB_TERMINATED for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 04:00:39}
05/08/19 04:01:16 Node test_21090507_ job proc (147154.6.0) completed successfully.

# grep -ir '147157.4' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.


Thanks & Regards,
Vikrant Aggarwal


On Thu, May 9, 2019 at 11:04 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

It's still not the right time period of the SchedLog, and the error about zombie checking is for a different job; it's not even the same cluster.


You might check to see if you have a failing disk; that *might* explain both of these problems. I can't think of anything else that could.


-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 10:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Random job in DAGman throwing submitERROR


Sorry for the typo. The snippet was from the SchedLog, not the ShadowLog.

On Thu, 9 May, 2019, 21:07 John M Knoeller, <johnkn@xxxxxxxxxxx> wrote:

The ShadowLog isn't where to look for this error. When submit fails, there will never be a shadow for that job.

You should look in the SchedLog at 05/08/19 04:01:16 to see if it recorded a reason why the submit failed.


-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 3:15 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Random job in DAGman throwing submitERROR


Hello Team,


We are facing a weird issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG randomly throws an error at no particular frequency. I am trying to debug the reason for it.


05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed


The shadow logs show no indication for this job, but they do show the "status in check_zombie" message for another job of the same DAG. Most of the time I noticed this zombie message appearing in the sched logs around the time of the issue, but not every time.


05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100


Condor version details:


$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $

$CondorPlatform: x86_64_RedHat6 $


Has anyone else seen this issue?


Thanks & Regards,

Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison