
Re: [HTCondor-users] return code of jupyter notebook jobs



On 3/27/2020 5:25 AM, Beyer, Christoph wrote:
Hi,

We have been running Jupyter notebooks in condor slots in production for a while now, so we need to set up some monitoring around this.

One of the bigger obstacles to coming up with something decent is that JupyterHub uses condor_rm to end the notebook once it is no longer needed. This results in a condor_history entry with JobStatus == 3, which is considered a failed job (which in this case it is not). The other option is that the notebook job runs into the time limit and gets removed by the periodic_remove expression, which is presumably a bit more flexible to tweak.

I like the idea of having an option for condor_rm to influence the resulting job state in the history.

I think your idea, whereby condor_rm can influence subsequent history-job-state, is on target. Please note that condor_rm takes a "-reason <string>" argument, which allows you to set the RemoveReason job attribute at the time of removal. This RemoveReason attribute will also be in the history. The Python API also supports setting a removal reason at the time of job removal.
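
As a rough illustration of how monitoring could use this, here is a minimal Python sketch. It assumes the hub can be configured to call condor_rm with "-reason" and a fixed marker string, and that the monitoring side reads history records as dictionaries of job attributes. The marker text, the function names, and the category labels are all made up for this example; only condor_rm, "-reason", JobStatus, and RemoveReason come from HTCondor itself.

```python
import subprocess

# Hypothetical marker text; any string the hub and the monitoring agree on works.
NOTEBOOK_DONE = "jupyterhub: notebook stopped normally"


def build_rm_command(job_id, reason=NOTEBOOK_DONE):
    """Build the condor_rm argument list that removes job_id with a reason."""
    return ["condor_rm", "-reason", reason, str(job_id)]


def remove_notebook_job(job_id, reason=NOTEBOOK_DONE):
    """Run condor_rm for job_id (needs the HTCondor client tools on PATH)."""
    return subprocess.run(build_rm_command(job_id, reason), check=True)


def classify(ad):
    """Classify one history record, given as a dict of job attributes."""
    if ad.get("JobStatus") != 3:  # 3 means the job was removed
        return "other"
    # Substring match, in case the stored reason carries extra text.
    reason = ad.get("RemoveReason", "")
    return "clean_shutdown" if NOTEBOOK_DONE in reason else "removed"
```

The records themselves could come from something like condor_history -constraint 'JobStatus == 3' -af ClusterId ProcId RemoveReason. Jobs removed by the time limit or by an admin would land in "removed" here, so they stay distinguishable from normal notebook shutdowns.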

Does this help?

best regards,
Todd