
Re: [HTCondor-users] ExitCode persistence in held job



Ok, maybe there's some aspect of the setup of this job that's causing the ExitCode to disappear.

I've set them up as "cron-style" jobs, but using the requirements expression rather than Crondor:

requirements = (WatchdogInterval <= time() - EnteredCurrentStatus || LastJobStatus == 5 || \
	MY.NumJobStarts == 0) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && \
	(TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)

The OnExitRemove is set to false, and LeaveJobInQueue is false as well.

The WatchdogInterval is set at submit time; in these jobs it's 5 minutes. The LastJobStatus and NumJobStarts values are checked so that the job can run immediately when first submitted or when released from hold.

Is there something that might be going on with on_exit_remove being false? Perhaps the job is being interrogated in such a way that ExitCode is cleared?

All I know is that the periodic release checking the ExitCode didn't release it. I'll run some more experiments...

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
Digital Transformation & Innovation
 

-----Original Message-----
From: Jaime Frey <jfrey@xxxxxxxxxxx> 
Sent: Thursday, November 12, 2020 5:54 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael Pelletier <michael.v.pelletier@xxxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] ExitCode persistence in held job

> On Nov 9, 2020, at 1:14 PM, Michael Pelletier via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Hi folks,
> 
> I'm trying to use periodic release to release a job after a time delay when the ExitCode from the process is EX_TEMPFAIL. However, it appears that at some point the ExitCode attribute no longer exists, so the periodic release never triggers.
> 
> At what point does the job lose the ExitCode attribute?
> 
> It looks like by setting the on_exit_hold_subcode to ExitCode, it can be preserved, and I think that gives me a workaround to accomplish the self-release, but I'm wondering what the boundaries of the ExitCode attribute are.
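
[Editorial sketch, not part of the original message] The workaround described above might look roughly like the following submit-file fragment. It assumes EX_TEMPFAIL has its usual sysexits.h value of 75, and the 300-second delay is only an illustrative figure; neither the hold condition nor the release expression comes from the original post.

```
# Stay in the queue after exit (as in the cron-style setup in this thread).
on_exit_remove       = false

# Hold the job when it exits with EX_TEMPFAIL (75 in sysexits.h; assumed here).
on_exit_hold         = (ExitCode =?= 75)

# Copy the exit code into HoldReasonSubCode so it is available while held.
on_exit_hold_subcode = ExitCode

# Release once the job has been held for at least 300 seconds (illustrative delay).
periodic_release     = (HoldReasonSubCode =?= 75) && (time() - EnteredCurrentStatus >= 300)
```

Testing HoldReasonSubCode rather than ExitCode in periodic_release means the release condition does not depend on whether ExitCode is still present in the job ad while the job is held.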

Odd. I don't see any code that will clear the ExitCode attribute.
I tried a quick test job with on_exit_hold=true, and the ExitCode is set when the job exits and goes on hold. And it remains when I release the job and it re-runs.

 - Jaime