
Re: [HTCondor-users] Many jobs failure for Alice VO



Hi Jason,

I wasn't able to find this info. In the meantime I have also modified the fairshare policy, and from that moment the number of failed jobs dropped:


[root@arc6atlas1 ~]# condor_history |grep alice01|head -7500|grep " X "|wc -l
1192
[root@arc6atlas1 ~]# condor_history |grep alice01|head -3500|grep " X "|wc -l
175
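
Because `head -7500` also contains the 3500 most recent records, the two counts overlap. A rough sketch of the per-window removal rate follows, assuming `condor_history` lists the newest jobs first and that " X " marks a removed job in its default output; the variable and function names are only illustrative:

```python
# Counts quoted from the two condor_history pipelines above.
removed_last_7500 = 1192  # " X " lines among the 7500 most recent alice01 records
removed_last_3500 = 175   # " X " lines among the 3500 most recent alice01 records


def removal_rate(removed: int, total: int) -> float:
    """Fraction of history records in a window that ended as removed."""
    return removed / total


# The older window is what remains after taking out the newest 3500 records.
older_total = 7500 - 3500
older_removed = removed_last_7500 - removed_last_3500

older_rate = removal_rate(older_removed, older_total)
recent_rate = removal_rate(removed_last_3500, 3500)

print(f"older 4000 records:  {older_removed} removed ({older_rate:.1%})")
print(f"newest 3500 records: {removed_last_3500} removed ({recent_rate:.1%})")
```

Under those assumptions, roughly a quarter of the older 4000 records were removed, against about 5% of the newest 3500.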

It has also dropped for LHCb over the last 5 days.


Anyway, I will keep an eye on the situation over the next few days to see whether the number of failed jobs remains low.

Thank you,
Mihai


On 2023-03-31 23:29, Jason Patton via HTCondor-users wrote:
Hi Mihai,

Were you able to determine whether one of the VO users was setting
periodic_remove in their jobs?

Thanks

Jason Patton

On Thu, Mar 30, 2023 at 7:36 AM Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:

Hi Max,

Thank you for the hint!

I will check it.

Cheers,
Mihai

On 2023-03-29 14:30, Fischer, Max (SCC) wrote:
Hi Mihai,

there are two kinds of periodic remove expressions: one for the system,
affecting all jobs, and one for *each* job. The error message suggests
this is the per-job remove expression triggering.

I would recommend checking the VOs' jobs for a `PERIODIC_REMOVE`
attribute or similar, then trying to find out where it is set.

Cheers,
Max

On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:

In recent weeks I have seen a lot of Alice VO (and also LHCb) jobs
failing with the following message:

The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
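
To see which jobs this expression removes, here is a minimal sketch that models it in plain Python rather than evaluating a real ClassAd. The function name and the sample values are hypothetical, and `JobMemoryLimit` is assumed to be a number present in the job ad:

```python
from typing import Optional

IDLE = 1     # HTCondor JobStatus code for an idle job
RUNNING = 2  # HTCondor JobStatus code for a running job


def periodic_remove(job_status: int, num_job_starts: int,
                    resident_set_size: Optional[int],
                    job_memory_limit: int) -> bool:
    """Plain-Python model of the quoted PeriodicRemove expression."""
    # (JobStatus == 1 && NumJobStarts > 0): idle again after at least one start
    back_to_idle = job_status == IDLE and num_job_starts > 0
    # (ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit
    rss = resident_set_size if resident_set_size is not None else 0
    over_memory = rss > job_memory_limit
    return back_to_idle or over_memory


# Illustrative values only; the units of JobMemoryLimit are not stated in the thread.
fresh_idle = periodic_remove(IDLE, 0, None, 2000)
running_ok = periodic_remove(RUNNING, 1, 1500, 2000)
idle_after_start = periodic_remove(IDLE, 1, 1500, 2000)
running_over_limit = periodic_remove(RUNNING, 1, 2500, 2000)

print(fresh_idle, running_ok, idle_after_start, running_over_limit)
```

The first clause fires for any job that returns to the idle state after having started at least once, for example after an eviction, regardless of its memory use.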

I have set the remove reason as I saw in an older email on the list
(a few months ago):

# SYSTEM_PERIODIC_REMOVE with reasons
########################

# remove jobs running longer than 7 days
RemoveReadyJobs = ( (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )

# remove jobs on hold for longer than 7 days
RemoveHeldJobs = ( (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )

# remove jobs with too many job starts or shadow starts
RemoveMultipleRunJobs = ( NumJobStarts >= 10 )

# remove jobs idle for too long
MaxJobIdleTime = 7 * 24 * 3600
# reference the macro as $(MaxJobIdleTime) so its value is substituted into the expression
RemoveIdleJobs = ( (JobStatus == 1) && ((CurrentTime - EnteredCurrentStatus) > $(MaxJobIdleTime)) )

# do it
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)        || \
                         $(RemoveMultipleRunJobs) || \
                         $(RemoveIdleJobs)        || \
                         $(RemoveReadyJobs)

# set reason for remove
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
    ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
    ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
    ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
    "being in idle state for 7 days"))), ".")

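The nested `ifThenElse` calls pick the first condition that matches, in the order running, held, too many starts, idle. A small sketch of that selection logic in plain Python follows; the function and constant names are hypothetical, and the idle message is derived from the limit rather than hard-coded:

```python
from typing import Optional

DAY = 24 * 3600
WEEK = 7 * DAY               # seconds, the 7 * 24 * 3600 used in the config
MAX_JOB_IDLE_TIME = WEEK     # mirrors MaxJobIdleTime
IDLE, RUNNING, HELD = 1, 2, 5  # HTCondor JobStatus codes


def remove_reason(job_status: int, seconds_in_status: int,
                  num_job_starts: int) -> Optional[str]:
    """Return the removal reason, or None if the job would be kept."""
    ready = job_status == RUNNING and seconds_in_status > WEEK
    held = job_status == HELD and seconds_in_status > WEEK
    multiple = num_job_starts >= 10
    idle = job_status == IDLE and seconds_in_status > MAX_JOB_IDLE_TIME

    # SYSTEM_PERIODIC_REMOVE: any single condition is enough to remove the job.
    if not (held or multiple or idle or ready):
        return None

    # SYSTEM_PERIODIC_REMOVE_REASON: the first matching branch wins.
    if ready:
        cause = "runtime longer than reserved"
    elif held:
        cause = "being in hold state for 7 days"
    elif multiple:
        cause = "more than 10 failed jobstarts"
    else:
        cause = f"being in idle state for {MAX_JOB_IDLE_TIME // DAY} days"
    return "Job removed by SYSTEM_PERIODIC_REMOVE due to " + cause + "."


print(remove_reason(RUNNING, 3600, 1))
print(remove_reason(IDLE, 8 * DAY, 0))
print(remove_reason(RUNNING, 8 * DAY, 10))
```

This models only the schedd-wide policy. The message quoted in the question names the job attribute PeriodicRemove, which is evaluated separately from `SYSTEM_PERIODIC_REMOVE`.
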
I have reconfigured the master and schedd daemons, but the problem
persists.

Do you have any idea how to fix this?

Thank you,
Mihai
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

