
Re: [HTCondor-users] Random job in DAGman throwing submitERROR



It's still not the right time period of the SchedLog, and the error about zombie checking is for a different job; it's not even the same cluster.

 

You might check to see if you have a failing disk; that *might* explain both of these problems. I can't think of anything else that could.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 10:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Random job in DAGman throwing submitERROR

 

Sorry for the typo. The snippet was from the SchedLog only, not the ShadowLog.

On Thu, 9 May, 2019, 21:07 John M Knoeller, <johnkn@xxxxxxxxxxx> wrote:

The ShadowLog isn't where to look for this error. When a submit fails, there will never be a shadow for that job.

You should look in the SchedLog at time 05/08/19 04:01:16 to see if it has a reason why the submit failed.
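
For anyone following along, here is a minimal sketch of how one might pull the SchedLog lines around that timestamp. It assumes every relevant line starts with a `MM/DD/YY HH:MM:SS` stamp, as in the excerpts quoted in this thread; the function name `lines_near` and the window size are illustrative choices, not part of HTCondor.

```python
from datetime import datetime, timedelta

# Assumption: log lines begin with a timestamp shaped like "05/08/19 04:01:16".
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LENGTH = 17  # length of "MM/DD/YY HH:MM:SS"


def lines_near(lines, target, window_seconds=5):
    """Return the lines whose leading timestamp is within window_seconds of target."""
    center = datetime.strptime(target, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # no leading timestamp (e.g. a continuation line): skip it
        if abs(stamp - center) <= window:
            selected.append(line.rstrip("\n"))
    return selected
```

To use it, pass an open file object for the SchedLog (its location varies by installation) together with the time of the failure, for example `lines_near(open(path), "05/08/19 04:01:16")`, where `path` is whatever your SchedLog path is.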

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 3:15 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Random job in DAGman throwing submitERROR

 

Hello Team,

 

We are facing a weird issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG randomly throws an error, at no particular frequency. I am trying to find the reason for it.

 

05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed

 

The shadow logs show no indication of this job, but they do show the "status in check_zombie" message for another job of the same DAG. Most of the time I have noticed this zombie message appearing in the sched logs around the time of the issue, but not every time.

 

05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
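
HTCondor job IDs have the form `cluster.proc`, so a quick way to check whether two log lines refer to the same cluster is to parse the IDs out and compare them. The sketch below does this for the failed-submit line and the check_zombie line quoted above; the helper name `job_ids` and the regular expression are illustrative assumptions, not an HTCondor API.

```python
import re

# Assumption: a job ID appears as two integers joined by a dot, e.g. "147157.4".
JOB_ID = re.compile(r"\b(\d+)\.(\d+)\b")


def job_ids(line):
    """Return every (cluster, proc) pair found in a log line."""
    return [(int(cluster), int(proc)) for cluster, proc in JOB_ID.findall(line)]


failed_line = ("05/08/19 04:01:16 From submit: ERROR: Failed submission "
               "for job 147157.4 - aborting entire submit")
zombie_line = ("05/08/19 04:01:15 (pid:1316698) ERROR fetching job "
               "(147154.6) status in check_zombie !")

failed_cluster, _ = job_ids(failed_line)[0]
zombie_cluster, _ = job_ids(zombie_line)[0]
print(failed_cluster == zombie_cluster)  # False: clusters 147157 and 147154 differ
```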


Condor version details:

 

$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $

$CondorPlatform: x86_64_RedHat6 $

 

Has anyone else seen this issue?

 

Thanks & Regards,

Vikrant Aggarwal
