
Re: [HTCondor-users] "aborted by the user" in successful job



On 4/9/2019 4:32 AM, Alex Armstrong wrote:
> Dear htcondor users,
> 
> Is there a reason why I would see abort events (see [1]) in the logs of 
> my successful condor jobs? I have not run condor_rm on the job below, 
> and it finished normally and returned the desired output. The 
> full order of log events is below at [2].
> 

I am a bit confused.  Both your job event log entry [1] and your snippet 
apparently from the ShadowLog at [2] show the same thing: that job 
986318.954 was removed by condor_rm.  [2] also shows that job 986321.018 
did terminate normally, but that has nothing to do with what happened to 
job 986318.954.  Note that it is possible, although unlikely, for a removed 
job to still deposit the desired output in your home directory due to 
a race condition - for instance, if the job completed on the execute node 
in the same second that you ran condor_rm on the submit node.
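
Since the log at [2] interleaves events from more than one job, it may help to group events by job ID before deciding which jobs were aborted. Below is a minimal sketch, assuming each event header looks like the lines in [2] (a three-digit event code followed by "(cluster.proc.subproc)"). The function names and the sample list are made up for illustration and are not part of HTCondor.

```python
import re
from collections import defaultdict

# Event header as it appears in the excerpt at [2], e.g.
#   "009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user."
# Assumption: every event starts with a 3-digit code and a job ID in parentheses.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

ABORTED = "009"  # "Job was aborted by the user."


def events_by_job(lines):
    """Return a dict mapping 'cluster.proc' to its event codes, in log order."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            code, cluster, proc, _subproc = match.groups()
            # Keep cluster and proc as written in the log (e.g. "986321.018").
            events[f"{cluster}.{proc}"].append(code)
    return dict(events)


def aborted_jobs(lines):
    """Return the sorted job IDs that have at least one 009 event."""
    return sorted(
        job for job, codes in events_by_job(lines).items() if ABORTED in codes
    )


# A few header lines copied from [2]; two different clusters are interleaved.
sample = [
    "000 (986321.018.000) 04/08 13:16:38 Job submitted from host:",
    "024 (986318.954.000) 04/08 13:20:03 Job reconnection failed",
    "009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user.",
    "005 (986321.018.000) 04/08 13:24:19 Job terminated.",
]

print(aborted_jobs(sample))
```

With these sample lines, the 009 event is attributed to 986318.954 only, while 986321.018 has just its submit and terminate events.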

> I am trying to parse the log files to determine which jobs were aborted 
> and need to be re-run. However, the abort event (i.e. 009) is appearing 
> in log files of jobs that were not aborted and so I cannot use that as a 
> handle for identifying user aborted jobs.
> 

HTCondor will never "abort" (remove) jobs on its own without being told 
to do so.  Either condor_rm was run, or some policy expression in the 
submit file or condor_config file was configured to remove the job upon 
some condition (like after X number of failures) - but in the latter 
case, I don't think the abort entry would say "via condor_rm".  I think 
the only way you see "via condor_rm" is if condor_rm was indeed run.

Did you submit these jobs via DAGMan, i.e. did you use condor_submit_dag? 
If so, be aware that jobs submitted by DAGMan are removed if you remove 
the DAGMan job itself.

Hope this helps
Todd


> Thanks,
> Alex
> 
> [1]
> 009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user.
>     via condor_rm (by user alarmstr)
> 
> [2]
> 000 (986321.018.000) 04/08 13:16:38 Job submitted from host:
> 028 (986321.018.000) 04/08 13:16:38 Job ad information event triggered.
> 001 (986321.018.000) 04/08 13:17:19 Job executing on host:
> 028 (986321.018.000) 04/08 13:17:19 Job ad information event triggered.
> 006 (986321.018.000) 04/08 13:17:27 Image size of job updated: 34912
> 028 (986321.018.000) 04/08 13:17:27 Job ad information event triggered.
> 024 (986318.954.000) 04/08 13:20:03 Job reconnection failed
> 028 (986318.954.000) 04/08 13:20:03 Job ad information event triggered.
> 009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user.
> 028 (986318.954.000) 04/08 13:20:03 Job ad information event triggered.
> 006 (986321.018.000) 04/08 13:22:27 Image size of job updated: 991768
> 028 (986321.018.000) 04/08 13:22:27 Job ad information event triggered.
> 005 (986321.018.000) 04/08 13:24:19 Job terminated.
> 
>     (1) Normal termination (return value 0)
> 
> 028 (986321.018.000) 04/08 13:24:19 Job ad information event triggered.
>