
Re: [HTCondor-users] condor jobs remaining idle if the host name or the port are misspelled

Thanks Jaime,
I see both the "Detected Down Grid Resource" event and the GridResourceUnavailableTime attribute; this helps.

The behavior surprised me because I remember the opposite in the past: with GRAM, for example, the job went on hold, and in GWMS we had a table of recoverable errors and triggered a release for those errors.
I interpreted idle as "just wait, all is OK, I'm working on it" (no need to investigate further, things will eventually run), and I was expecting a hold for problems, letting the submitter decide whether to recover/release or fail.
Here jobs may stay idle for days if there is a typo, and in GWMS we use the number of idle jobs as a measure of the pressure on the system.

I will inspect the log and GridResourceUnavailableTime to trigger a warning and fail the jobs.
I guess GridResourceUnavailableTime is set for failures with any type of resource in the grid universe.
Are there any other attributes I should look for, in the grid universe or in other universes, to alert on possible problems while the job is still idle?
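
A minimal sketch of the attribute check described above, in Python. It assumes the job ads are available as dict-like objects and that GridResourceUnavailableTime holds seconds since the epoch; the function name, the one-hour threshold, and the returned tuple layout are hypothetical choices for illustration, not something HTCondor or GWMS provides.

```python
import time

IDLE = 1            # JobStatus value for idle jobs
GRID_UNIVERSE = 9   # JobUniverse value for the grid universe


def stuck_idle_jobs(ads, max_unavailable_s=3600, now=None):
    """Return (job_id, grid_resource, seconds_unavailable) for idle grid jobs
    whose resource has been flagged unavailable for longer than the threshold.

    `ads` is any iterable of dict-like job ads (hypothetical input shape).
    """
    now = time.time() if now is None else now
    stuck = []
    for ad in ads:
        if ad.get("JobUniverse") != GRID_UNIVERSE or ad.get("JobStatus") != IDLE:
            continue
        since = ad.get("GridResourceUnavailableTime")
        if since is None:
            # Attribute not set: the resource is not currently flagged as down.
            continue
        down_for = now - int(since)
        if down_for > max_unavailable_s:
            job_id = f"{ad['ClusterId']}.{ad['ProcId']}"
            stuck.append((job_id, ad.get("GridResource"), down_for))
    return stuck
```

The ads could come, for example, from a schedd query restricted to idle grid universe jobs. What to do with the result (warn, hold, or remove the jobs) is left to the caller.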

Thank you,

> On Jun 16, 2020, at 9:15 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
>> On Jun 15, 2020, at 10:48 PM, Marco Mambelli <marcom@xxxxxxxx> wrote:
>> Greetings,
>> I'm submitting grid universe jobs to an HTCondor-CE, and the jobs remain idle if there is a mistake in the grid_resource string.
>> I would have expected an error at submission, or the job to go on hold after some time.
>> Instead, even hours later, it remains idle.
>> Is that the intended behavior?
>> I thought it was different in the past.
>> If it is supposed to be this way, how can I distinguish between a busy site and a misspelled name?
>> If it is not, any suggestions on finding what I could be doing wrong?
>> e.g.
>> CE host aaa.fnal.gov (std OSG installation, port 9619)
>> universe = grid
>> grid_resource = condor aaa.fnal.gov aaa.fnal.gov:9619
>> ...
>> queue
>> grid_resource = condor aaaAAA.fnal.gov aaa.fnal.gov:9619
>> queue
>> grid_resource = condor aaa.fnal.gov aaaAAA.fnal.gov:9619
>> queue
>> grid_resource = condor aaa.fnal.gov aaa.fnal.gov:9677
>> queue
>> Of the jobs above, all in the same cluster, the first one completes successfully; the others remain IDLE.
> Any hostnames in a grid_resource line are not validated at submit time, so condor_submit won't fail because of a typo in a name.
> You should see a "Detected Down Grid Resource" event in the job log, if you have one. Also, the attribute GridResourceUnavailableTime will be set in the job ad. This is done for errors that may be temporary.
> If a failure to talk to the CE is not temporary, then our guideline is to put the job on hold. An invalid hostname should probably be treated as such. A mistyped port number or schedd name is trickier. These may be caused by the CE service being down temporarily.
> - Jaime
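
As a rough sketch of the job log check mentioned above, the following Python function scans log lines for the "Detected Down Grid Resource" event. The header layout it expects ("NNN (cluster.proc.subproc) timestamp description") and the "Grid Resource Back Up" recovery text are assumptions about the user log format and should be checked against a real log; the function name is hypothetical.

```python
import re

# Assumed event header layout: "NNN (cluster.proc.subproc) <timestamp> <description>".
# "Grid Resource Back Up" is an assumed description for the matching recovery event.
EVENT_HEADER = re.compile(
    r"^\d{3} \((\d+)\.(\d+)\.\d+\) .*?"
    r"(Detected Down Grid Resource|Grid Resource Back Up)"
)


def jobs_with_down_resource(log_lines):
    """Return the set of 'cluster.proc' ids whose most recent grid resource
    event in the log is a 'down' event."""
    down = set()
    for line in log_lines:
        match = EVENT_HEADER.match(line)
        if not match:
            continue  # not a grid resource up/down event header
        job_id = f"{int(match.group(1))}.{int(match.group(2))}"
        if match.group(3) == "Detected Down Grid Resource":
            down.add(job_id)
        else:
            down.discard(job_id)  # the resource came back after the down event
    return down
```

Passing an open job log file object as `log_lines` is enough, since the function only iterates over lines.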