
Re: [HTCondor-users] condor jobs remaining idle if the host name or the port are misspelled



> On Jun 15, 2020, at 10:48 PM, Marco Mambelli <marcom@xxxxxxxx> wrote:
> 
> Greetings,
> I'm submitting grid universe jobs to an HTCondor-CE, and the jobs remain idle if there is a mistake in the grid_resource string.
> I would have expected an error at submission, or the job to go on hold after some time.
> Instead, even hours later, it remains idle.
> Is that how it is supposed to be?
> I thought it was different in the past.
> If it is supposed to be this way, how can I distinguish between a busy site and a misspelled name?
> If it is not, any suggestions for finding what I might be doing wrong?
> 
> e.g.
> CE host aaa.fnal.gov (std OSG installation, port 9619)
> 
> universe = grid
> grid_resource = condor aaa.fnal.gov aaa.fnal.gov:9619
> ...
> queue
> grid_resource = condor aaaAAA.fnal.gov aaa.fnal.gov:9619
> queue
> grid_resource = condor aaa.fnal.gov aaaAAA.fnal.gov:9619
> queue
> grid_resource = condor aaa.fnal.gov aaa.fnal.gov:9677
> queue
> 
> Of the jobs above, all in the same cluster, the first one completes successfully; the others remain IDLE.

Any hostnames in a grid_resource line are not validated at submit time, so condor_submit won't fail because of a typo in a name.
You should see a "Detected Down Grid Resource" event in the job log, if you have one. Also, the attribute GridResourceUnavailableTime will be set in the job ad. This is done for errors that may be temporary.
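
As a rough illustration of the second check, here is a minimal Python sketch that asks condor_q for jobs whose ad has GridResourceUnavailableTime set. The helper names (build_query, parse_line, find_unreachable_jobs) are made up for this example, and it assumes the attribute holds seconds since the epoch and that condor_q is on the PATH of the submit host. It cannot tell a typo from a CE that is merely down; it only shows which jobs are currently flagged.

```python
import subprocess
from datetime import datetime, timezone

# Attribute order is a choice made for this sketch: GridResource goes last
# because its value contains spaces, so the first three fields split off cleanly.
ATTRS = ["ClusterId", "ProcId", "GridResourceUnavailableTime", "GridResource"]


def build_query():
    """Build a condor_q command listing jobs whose grid resource is marked down."""
    return ["condor_q",
            "-constraint", "GridResourceUnavailableTime =!= undefined",
            "-af"] + ATTRS


def parse_line(line):
    """Split one output line into (cluster, proc, down_since_utc, grid_resource)."""
    cluster, proc, since, resource = line.split(None, 3)
    # Assumption: the attribute value is an integer number of seconds since the epoch.
    down_since = datetime.fromtimestamp(int(since), tz=timezone.utc)
    return int(cluster), int(proc), down_since, resource.strip()


def find_unreachable_jobs():
    """Run the query against the local schedd (requires condor_q on PATH)."""
    result = subprocess.run(build_query(), capture_output=True,
                            text=True, check=True)
    return [parse_line(l) for l in result.stdout.splitlines() if l.strip()]
```

Calling find_unreachable_jobs() on the submit host would return one tuple per flagged job. A job that has been flagged for hours against a resource string you do not recognize is a good candidate for a typo rather than a busy site.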

If a failure to talk to the CE is not temporary, then our guideline is to put the job on hold. An invalid hostname should probably be treated as such. A mistyped port number or schedd name is trickier. These may be caused by the CE service being down temporarily.

 - Jaime