
Re: [HTCondor-users] JobRouter debug info



Hi Thomas,

Unfortunately, the Job Router offers little insight when it fails to commit a job submission to the destination schedd. When we see these problems, though, they are frequently caused by a mismatch between the CE job router configuration and the SUBMIT_REQUIREMENTS on the destination schedd.

- Brian

On 6/7/22 07:03, Thomas Hartmann wrote:
Hi all,

I am debugging a few jobs whose assigned routes fail as in [1,2]:
first the DESYPRIO route, then the LRMS Condor route (the LRMS route
failure is probably fallout from the actual DESYPRIO route). On the
Condor side, the routed(?) job IDs do not show up [3], so I have no
insight yet into why the submission actually failed.

I suspect that it might be a user:group issue, but I am still trying to
get more output to better understand the issue. I have already set the
job router to fulldebug output
   JOBROUTER_DEBUG = D_ALL:2
assuming that the job router also uses the daemon debug level syntax.
However, I have not gotten much more output.

Is there maybe another knob to dig a bit deeper into the router internals?

Cheers and thanks,
   Thomas


[1]
06/07/22 13:47:14 JobRouter (src=2895713.0,dest=8180603.0,route=Local_Condor): finalized job
06/07/22 13:47:20 WARNING: Saw slow DNS query, which may impact entire system: getaddrinfo(grid-htc-master02.desy.de) took 5.006753 seconds.
06/07/22 13:47:20 ERROR (schedd grid-htcondorce0.desy.de at pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618) (8192573.0) Failed to commit job submission
06/07/22 13:47:20 JobRouter failure (src=2824628.0,route=DESYPRIO): failed to submit job
06/07/22 13:47:20 ERROR (schedd grid-htcondorce0.desy.de at pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618) (8192574.0) Failed to commit job submission


[2]
06/07/22 13:49:32 JobRouter failure (src=2824628.0,route=Local_Condor): failed to submit job
06/07/22 13:49:33 ERROR (schedd grid-htcondorce0.desy.de at pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618) (8192686.0) Failed to commit job submission
06/07/22 13:49:33 JobRouter failure (src=2881683.0,route=Local_Condor): failed to submit job
06/07/22 13:49:33 ERROR (schedd grid-htcondorce0.desy.de at pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618) (8192687.0) Failed to commit job submission

[3]
grep -r 8192686 /var/lib/condor*/spool/history
echo $?
1
grep -r 8192573 /var/lib/condor*/spool/history
echo $?
1
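
To see at a glance which source jobs fail on which routes, a small
script along the following lines might help. It is only a sketch: it
assumes the log format matches the "JobRouter failure" lines quoted in
[1] and [2], and the names `FAILURE_RE` and `failed_routes` are
hypothetical helpers, not part of HTCondor.

```python
import re
from collections import defaultdict

# Pattern modeled on the "JobRouter failure" lines quoted in [1] and [2].
# \s* tolerates line wrapping between the route and the message text.
FAILURE_RE = re.compile(
    r"JobRouter failure\s*\(src=(?P<src>\d+\.\d+),route=(?P<route>\w+)\):"
    r"\s*failed to submit job"
)


def failed_routes(log_text):
    """Map each source job ID to the routes that failed for it, in log order."""
    failures = defaultdict(list)
    for match in FAILURE_RE.finditer(log_text):
        routes = failures[match["src"]]
        if match["route"] not in routes:
            routes.append(match["route"])
    return dict(failures)


# Sample input copied from the failure lines in [1] and [2].
sample = """\
06/07/22 13:47:20 JobRouter failure (src=2824628.0,route=DESYPRIO):
failed to submit job
06/07/22 13:49:32 JobRouter failure (src=2824628.0,route=Local_Condor):
failed to submit job
06/07/22 13:49:33 JobRouter failure (src=2881683.0,route=Local_Condor):
failed to submit job
"""

for src, routes in failed_routes(sample).items():
    print(src, "->", ", ".join(routes))
```

Grouping by source job makes it easier to tell whether a job failed on
every route it was offered or only on one of them.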
