
Re: [HTCondor-users] Many jobs failure for Alice VO



Hi Mihai,
could it simply have been bad jobs exceeding the JobMemoryLimit?
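
The PeriodicRemove expression quoted further down has two independent halves, so it may help to check which one actually fired for a removed job. A minimal Python sketch of that logic follows. The dictionary keys mirror the ClassAd attribute names from the error message; the function name, the clause labels and the sample values are hypothetical, and ResidentSetSize and JobMemoryLimit are assumed to be in the same units.

```python
IDLE = 1  # HTCondor JobStatus value for an idle job


def periodic_remove_clauses(job):
    """Return the names of the expression halves that are true for a job dict."""
    fired = []
    # First half: the job is idle again although it has already started once.
    if job.get("JobStatus") == IDLE and (job.get("NumJobStarts") or 0) > 0:
        fired.append("idle_after_start")
    # Second half: memory use above the limit; an undefined RSS counts as 0.
    rss = job.get("ResidentSetSize") or 0
    limit = job.get("JobMemoryLimit")
    if limit is not None and rss > limit:
        fired.append("memory_limit")
    return fired


# Hypothetical sample jobs, not taken from the real history.
requeued = {"JobStatus": 1, "NumJobStarts": 1,
            "ResidentSetSize": 100, "JobMemoryLimit": 2000}
too_big = {"JobStatus": 2, "NumJobStarts": 1,
           "ResidentSetSize": 3000000, "JobMemoryLimit": 2000000}
print(periodic_remove_clauses(requeued))
print(periodic_remove_clauses(too_big))
```

If most removed jobs hit only the first half, the memory limit is probably not the cause and the jobs are instead being removed after returning to idle.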



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mihai Ciubancan <ciubancan@xxxxxxxx>
Sent: Monday, April 3, 2023 9:48 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Many jobs failure for Alice VO
 
Hi Jason,

I wasn't able to find this info. In the meantime I have also modified
the fairshare policy, and I saw that from that moment the number of
failed jobs dropped:


[root@arc6atlas1 ~]# condor_history |grep alice01|head -7500|grep " X "|wc -l
1192
[root@arc6atlas1 ~]# condor_history |grep alice01|head -3500|grep " X "|wc -l
175
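
Because the second count is a subset of the first, the two numbers can be split into a recent and an older slice. The Python sketch below does that arithmetic, assuming condor_history lists the most recent jobs first and that " X " marks removed jobs; the function and variable names are only illustrative.

```python
def removal_fractions(total_window, removed_in_window,
                      recent_window, removed_in_recent):
    """Split nested `head -N` counts into a recent and an older slice."""
    older_jobs = total_window - recent_window
    older_removed = removed_in_window - removed_in_recent
    return removed_in_recent / recent_window, older_removed / older_jobs


# Counts taken from the two condor_history commands above.
recent, older = removal_fractions(7500, 1192, 3500, 175)
print(f"recent 3500 jobs: {recent:.1%} removed")   # 5.0%
print(f"older 4000 jobs:  {older:.1%} removed")    # 25.4%
```

Under those assumptions, roughly one in four of the older alice01 jobs was removed, against about one in twenty of the most recent ones.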

It also dropped for LHCb in the last 5 days.


Anyway, I will keep an eye on the situation over the next few days to see if
the number of failed jobs remains low.

Thank you,
Mihai

On 2023-03-31 23:29, Jason Patton via HTCondor-users wrote:
> Hi Mihai,
>
> Were you able to determine if one of the VO users was setting
> periodic_remove in their jobs?
>
> Thanks
>
> Jason Patton
>
> On Thu, Mar 30, 2023 at 7:36 AM Mihai Ciubancan <ciubancan@xxxxxxxx>
> wrote:
>
>> Hi Max,
>>
>> Thank you for the hint!
>>
>> I will check it.
>>
>> Cheers,
>> Mihai
>>
>> On 2023-03-29 14:30, Fischer, Max (SCC) wrote:
>>> Hi Mihai,
>>>
>>> there are two kinds of periodic remove expressions: One for the system
>>> affecting all jobs, and one for *each* job. The error message suggests
>>> this is the per-job remove expression triggering.
>>>
>>> I would recommend checking whether the VOs' jobs have a
>>> `PERIODIC_REMOVE` attribute or similar, then trying to find out where it
>>> is set.
>>>
>>> Cheers,
>>> Max
>>>
>>>> On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:
>>>>
>>>> In the last weeks I see a lot of Alice VO (and also LHCb) jobs failing
>>>> with the following message:
>>>>
>>>> The job attribute PeriodicRemove expression '(JobStatus == 1 &&
>>>> NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize
>>>> : 0) > JobMemoryLimit)' evaluated to TRUE
>>>>
>>>> I have set the remove reason as I saw in an older email on the list
>>>> (a few months ago):
>>>>
>>>> # SYSTEM_PERIODIC_REMOVE with reasons
>>>> ########################
>>>>
>>>> # remove jobs running longer than 7 days
>>>> RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 7 * 24 * 3600 ))
>>>>
>>>> # remove jobs on hold for longer than 7 days
>>>> RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )
>>>>
>>>> # remove jobs with too many job starts or shadow starts
>>>> RemoveMultipleRunJobs = ( NumJobStarts >= 10 )
>>>>
>>>> # remove jobs idle for too long
>>>> MaxJobIdleTime = 7 * 24 * 3600
>>>> RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))
>>>>
>>>> # do it
>>>> SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)        || \
>>>>                          $(RemoveMultipleRunJobs) || \
>>>>                          $(RemoveIdleJobs)        || \
>>>>                          $(RemoveReadyJobs)
>>>>
>>>> # set reason for remove
>>>> SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
>>>>     ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
>>>>     ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
>>>>     ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
>>>>     "being in idle state for 7 days"))), ".")
>>>>
>>>> I have reconfigured the master and schedd daemons, but the problem
>>>> persists.
>>>>
>>>> Do you have any idea how to fix this?
>>>>
>>>> Thank you,
>>>> Mihai
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/