
Re: [HTCondor-users] searching for job rerun info



Hi Todd,

many thanks for the link!
Extending the history/logs is probably the most reasonable way - three days
of reaction time may be too short for all parties involved ;)

Cheers and thanks,
  Thomas

PS: Normally we run grid jobs, i.e., pilots, so re-runs should be
no problem. However, in this case there was a problem upstream that caused
a bit of confusion.


On 2016-04-05 22:14, Todd Tannenbaum wrote:
> 
> Hi Thomas,
> 
> From reading the above, is your desire that your job never gets re-run by HTCondor even
> in the event of failures?
> If so see
>  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
> This wiki page also lists all the typical reasons why HTCondor will automatically
> restart a job; be aware that by default HTCondor alone will not restart a job that
> exits on its own, even if it exits with a non-zero exit code.
> 
> As for whether the rerun job will have the same job id: yes it will, unless you are
> using DAGMan -- failed nodes in DAGMan are resubmitted and thus will have a new job id.
> 
> As for where you can look since your history file rotated: did the
> job specify a job event log via "log = /some/file" in the submit file? If so
> you could look there. You could also grep the schedd log for the job id, but
> I am guessing that the SchedLog has already rotated. Finally, if you define
> "EVENT_LOG = /some/file" in the condor_config on your submit node, you could look there.
> 
> But you likely want to increase the size specified via config knob MAX_HISTORY_LOG. :)
> 
> Hope the above helps
> Todd
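
For the job event log suggestion above, here is a minimal sketch of how one
could check whether a job started more than once. It assumes the classic
plain-text event log layout, where each event begins with a header line such
as "001 (cluster.proc.subproc) ...", and that event code 001 means "Job
executing"; the function names count_events and was_rerun are made up for
this example and are not part of HTCondor.

```python
import re
from collections import Counter

# Assumed header layout of a job event log entry: "NNN (cluster.proc.subproc) ..."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")


def count_events(lines, cluster, proc=0):
    """Count how often each event code appears for job cluster.proc."""
    counts = Counter()
    for line in lines:
        m = EVENT_HEADER.match(line)
        if m and int(m.group(2)) == cluster and int(m.group(3)) == proc:
            counts[m.group(1)] += 1
    return counts


def was_rerun(lines, cluster, proc=0):
    """Return True if the job has more than one 001 (executing) event."""
    return count_events(lines, cluster, proc)["001"] > 1
```

Both functions take any iterable of lines, so an open file handle on the
file named by "log = /some/file" (or by EVENT_LOG) can be passed directly.
This only works for as long as that log has not rotated either.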
