
Re: [HTCondor-users] numjobstarts vs numshadowstarts



On 3/23/2015 11:59 AM, Suchandra Thapa wrote:
Are there any situations where numjobstarts will be different than
numshadowstarts?  Is this something that'll occur frequently?

Thanks,
Suchandra


Hi Suchandra,

NumShadowStarts is incremented by the schedd whenever it launches a condor_shadow (or, in the case of a local universe job, when the schedd launches a condor_starter on the submit machine).

NumJobStarts is incremented by the condor_starter or condor_gridmanager right before it spawns the job, but after the execute node has been successfully claimed and the job's input files have been transferred.
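
To make the timing concrete, here is a rough toy model of the two counters. It is only a sketch, not HTCondor code: the function name run_attempt and the setup_ok flag are made up for illustration, and it assumes a universe that uses a shadow (e.g. vanilla).

```python
# Toy model of the two counters; NOT HTCondor code.
# run_attempt and setup_ok are hypothetical names used only for illustration.

def run_attempt(ad, setup_ok=True):
    """Simulate one attempt to run a job, updating the counters stored in `ad`.

    setup_ok stands for "the execute node was claimed and the input files
    were transferred", i.e. everything that must succeed before the job
    itself is spawned.
    """
    # The schedd launches a condor_shadow: counted right away.
    ad["NumShadowStarts"] = ad.get("NumShadowStarts", 0) + 1

    # If setup fails, the starter never gets as far as spawning the job,
    # so NumJobStarts is left alone.
    if not setup_ok:
        return False

    # The starter bumps NumJobStarts right before it spawns the job.
    ad["NumJobStarts"] = ad.get("NumJobStarts", 0) + 1
    return True


ad = {}
run_attempt(ad, setup_ok=False)  # e.g. input file transfer failed
run_attempt(ad)                  # second attempt gets to spawn the job
print(ad)  # {'NumShadowStarts': 2, 'NumJobStarts': 1}
```

In this model a failed first attempt leaves the shadow count one ahead of the job count.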

I could imagine several scenarios where they will be different. Some examples:

1. If the job specifies a universe that does not launch a shadow (e.g. grid universe, local universe), NumJobStarts would exceed NumShadowStarts.

2. If the condor_shadow is successfully started but encounters some error before the job is spawned, such as an error transferring the input files or spawning the job itself (e.g. the execute node is missing required shared libraries, the executable does not exist on the execute node, etc.), then NumShadowStarts could exceed NumJobStarts.

3. If the job is a parallel universe job, NumJobStarts is incremented for each node (MPI rank) that joins the computation. Thus NumJobStarts would likely exceed NumShadowStarts.
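
If you want to see how often the two differ in your own pool, something along these lines could be a starting point. It is a minimal sketch that works on plain dicts standing in for job ads. The helper name compare_starts and the sample ads are invented for illustration, and treating a missing attribute as 0 is my assumption, not something the schedd guarantees. You would have to fill the dicts from your own queue, for example from the output of condor_q.

```python
# Sketch: compare the two counters in job ads represented as plain dicts.
# compare_starts and the sample ads below are hypothetical, for illustration.

def compare_starts(ad):
    """Return a short label saying which counter is larger for one job ad."""
    # Assumption: an attribute that is absent from the ad counts as 0.
    job_starts = ad.get("NumJobStarts", 0)
    shadow_starts = ad.get("NumShadowStarts", 0)
    if job_starts > shadow_starts:
        return "job starts > shadow starts"
    if job_starts < shadow_starts:
        return "shadow starts > job starts"
    return "equal"


# Made-up ads loosely matching the three scenarios above.
sample_ads = [
    {"ClusterId": 1, "NumJobStarts": 1},                        # no shadow launched
    {"ClusterId": 2, "NumJobStarts": 1, "NumShadowStarts": 3},  # failures before spawn
    {"ClusterId": 3, "NumJobStarts": 8, "NumShadowStarts": 1},  # several nodes joined
    {"ClusterId": 4, "NumJobStarts": 1, "NumShadowStarts": 1},  # ordinary single run
]

for ad in sample_ads:
    print(ad["ClusterId"], compare_starts(ad))
```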


Hope the above helps,
Todd