
Re: [HTCondor-users] searching for job rerun info



On 4/4/2016 5:30 AM, Thomas Hartmann wrote:
> Hi all,
> 
> I am investigating a job at our site that may have been rerun.
> Unfortunately, the original job ran on March 26 and the potential rerun
> on March 29, so both have dropped out of condor_history.
> 
> For the first instance, I know the worker node and found some stats in
> the local startd log.
> Is there a way to start from here to find a potential doppelgaenger
> (faster than grepping over all startd logs in the pool)?
> 
> If a job is rerun by condor due to a non-zero exit code, I suppose the
> rerun job will have the same basic job ID?
> 
> Cheers and thanks,
>    Thomas
> 

Hi Thomas,

From reading the above, is your goal that your job never gets re-run by HTCondor, even
in the event of failures?
If so, see
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
This wiki page also lists the typical reasons why HTCondor will automatically
restart a job. Be aware that, by default, HTCondor on its own will not restart a job that
exits successfully, even if it exits with a non-zero exit code.

As for whether the rerun job will have the same job id: yes, it will, unless you are
using DAGMan. Failed nodes in DAGMan are resubmitted and thus will have a new job id.

As for where you can look now that your history file has rotated: did the
job specify a job event log via "log = /some/file" in the submit file? If so,
you could look there. You could also grep the schedd log for the job id, but
I am guessing that the SchedLog has already rotated. Finally, if you have defined
"EVENT_LOG = /some/file" in the condor_config on your submit node, you could look there.
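
If one of those event logs still exists, a small script can answer the rerun
question directly by counting how often the job started executing. The sketch
below is only an illustration, not an official tool. It assumes the log is in
the classic plain-text event format, where each event begins with a header line
such as "001 (1234.000.000) 03/26 12:00:00 Job executing on host: <...>" and
event number 001 marks an execute event. The function names and the job id
1234.0 are made up for the example.

```python
import re

# Assumed header layout of one event in a plain-text job event log:
#   001 (1234.000.000) 03/26 12:00:00 Job executing on host: <...>
# i.e. a 3-digit event number, then (cluster.proc.subproc), then the rest.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (.*)$")

EXECUTE_EVENT = "001"  # event number logged each time the job starts running


def job_events(lines, cluster, proc=0):
    """Return (event_number, rest_of_header) for every event of one job."""
    found = []
    for line in lines:
        match = EVENT_HEADER.match(line.rstrip("\n"))
        if match is None:
            continue  # event body line or the "..." terminator
        number, c, p, _subproc, rest = match.groups()
        if int(c) == cluster and int(p) == proc:
            found.append((number, rest))
    return found


def execution_count(lines, cluster, proc=0):
    """Count how many times the given job started executing."""
    return sum(
        1
        for number, _rest in job_events(lines, cluster, proc)
        if number == EXECUTE_EVENT
    )
```

To use it, open the log and pass the file object, for example
execution_count(open("/some/file"), 1234, 0). A count above 1 suggests the job
was started more than once, and the matching header lines returned by
job_events() show the dates and the hosts it ran on.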

But you likely want to increase the size specified via the config knob MAX_HISTORY_LOG. :)

Hope the above helps
Todd