
Re: [HTCondor-users] group quotas - Nagios tests with "Unspecified gridmanager error"



On Mar 3, 2014, at 9:44 AM, L Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:

> After successfully deploying user quotas and priorities, the system
> stopped working for the Nagios tests. Nagios tests are mapped to one
> of the groups: cms.admin, ops.admin or dteam.
> 
> In the configuration (see attachment fairshares.config) I specify the
> priorities with
> GROUP_PRIO_FACTOR_group_cms.admin =  100.0
> GROUP_PRIO_FACTOR_group_dteam =  100.0
> GROUP_PRIO_FACTOR_group_ops =  100.0
> # all other groups have a priority of 10000.0
> and the quotas with
> GROUP_QUOTA_DYNAMIC_group_cms =  0.80
> GROUP_QUOTA_DYNAMIC_group_cms.admin =  0.05
> GROUP_QUOTA_DYNAMIC_group_dteam =  0.02
> GROUP_QUOTA_DYNAMIC_group_ops =  0.05
> 
> With 324 available slots, this should give the 3 groups between 6.48
> (dteam) and 16.2 slots (ops). The NegotiatorLog (attached,
> negotiator_admin.log) confirms these numbers:
> 03/03/14 14:57:57 group quotas: fairshare (1): group= group_cms.admin  quota= 12.3429  requested= 0
> 
> However, looking at the  I have only one slot for ops and none for the
> other two:
> 03/03/14 14:57:57 group quotas: Group group_dteam  allocated= 0  usage= 0
> 03/03/14 14:57:57 group quotas: Group group_ops  allocated= 0  usage= 0
> 03/03/14 14:57:57 group quotas: Group group_cms.admin  allocated= 0  usage= 0
> and everything is used by the cms group:
> 03/03/14 14:57:57 group quotas: Group group_cms  allocated= 315  usage= 315
> 03/03/14 14:57:57 group quotas: Group group_cms.production  allocated= 7  usage= 7

I assume you’re using all of these group quota settings for matching of vanilla universe or similar jobs, not grid universe jobs.
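
For reference, the nominal slot counts quoted above (6.48 for dteam, 16.2 for ops) can be reproduced with a small Python sketch. It assumes a dynamic quota is simply a fraction of the pool for top-level groups. It deliberately leaves out subgroups such as group_cms.admin, surplus sharing, and any other adjustment the negotiator may make. The function name nominal_slots is hypothetical and not part of HTCondor.

```python
# Sketch only: nominal slot share per top-level group, computed as
# (dynamic quota fraction) x (total slots). This is NOT the negotiator's
# algorithm; subgroups and surplus sharing are intentionally ignored.

TOTAL_SLOTS = 324  # pool size quoted in the original message

# GROUP_QUOTA_DYNAMIC_* values for the top-level groups in the quoted config
DYNAMIC_QUOTAS = {
    "group_cms": 0.80,
    "group_dteam": 0.02,
    "group_ops": 0.05,
}


def nominal_slots(total_slots, quotas):
    """Return {group: fraction * total_slots}, rounded to two decimals."""
    return {
        group: round(total_slots * fraction, 2)
        for group, fraction in quotas.items()
    }


shares = nominal_slots(TOTAL_SLOTS, DYNAMIC_QUOTAS)
for group, slots in shares.items():
    print(f"{group}: {slots} slots")
```

Comparing these nominal figures with the "allocated=" values in the NegotiatorLog shows how far the actual allocation is from the configured share.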

> And all Nagios jobs are aborted with
> ================================================
> - Got a job held event, reason: Unspecified gridmanager error
> - Got a job held event, reason: Unspecified gridmanager error
> - Job got an error while in the CondorG queue.
> Status Reason: hit job retry count (0)
> ================================================
> 
> Am I doing something wrong?


To debug these failures, I’d have to see the underlying HTCondor submit files and gridmanager logs.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project