
Re: [HTCondor-users] Job status of jobs when using USER_JOB_WRAPPER



Thanks Greg for pointing me to condor_chirp. It is new to me and I
think it will be helpful.

The complexity is exactly as you described. That is probably
why pstree does not return a status code.

The reason I would like to have this observability is that we found
some jobs can get stuck in the D state (i.e., the uninterruptible
sleep state) for an extraordinarily long time before anyone considers
them suspicious. Meanwhile, the STAT field of the condor_q query
result still shows R, which is correct but is easily misinterpreted by
users as a sign that their jobs are still running and making progress.
Ideally we would like to detect such jobs early and speculatively
rerun them on different execute nodes, in the hope that the rerun
finishes earlier than the copy stuck in the D state.

The jobs may be getting stuck in the D state because of resource
contention among multiple colocated condor jobs and issues in the
driver of the storage stack.
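
For what it is worth, here is a minimal sketch of the kind of monitor
Greg describes in the quoted message below: it polls /proc for one
process, tracks how long that process has been in the D state, and
pushes both values into the job ad with condor_chirp set_job_attr.
The attribute names (MonitoredProcState, MonitoredProcSecondsInD), the
polling interval, and the choice of which pid to watch are my own
assumptions, not anything HTCondor defines. As far as I understand,
the job also needs +WantIOProxy = true in its submit file for
condor_chirp to work.

```python
import subprocess
import time


def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    return stat_text.rsplit(")", 1)[1].split()[0]


def read_state(pid):
    """Read the current state (R, S, D, ...) of one process from /proc."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def chirp_command(attr, expr):
    """Build the condor_chirp call that sets one job ad attribute.

    expr must already be a ClassAd expression, e.g. '"D"' or '42'.
    """
    return ["condor_chirp", "set_job_attr", attr, expr]


def monitor(pid, interval=60):
    """Poll one process until it exits, publishing its state to the job ad.

    Attribute names are hypothetical; pick whatever suits your pool.
    """
    d_since = None  # time at which the process was first seen in D
    while True:
        try:
            state = read_state(pid)
        except FileNotFoundError:
            break  # the process has exited
        now = time.time()
        if state == "D":
            if d_since is None:
                d_since = now
        else:
            d_since = None
        seconds_in_d = int(now - d_since) if d_since is not None else 0
        # ClassAd strings need their own double quotes; integers do not.
        subprocess.run(chirp_command("MonitoredProcState", f'"{state}"'))
        subprocess.run(chirp_command("MonitoredProcSecondsInD",
                                     str(seconds_in_d)))
        time.sleep(interval)
```

The wrapper would start monitor(pid) in the background for whichever
process matters most. Because the sampling is coarse, a process that
leaves and re-enters D between two polls looks like one long stall, so
the counter is only a rough signal for deciding when to rerun a job.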


Thanks


On Thu, Apr 13, 2017 at 4:31 PM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
> On 04/13/2017 03:20 PM, Weiming Shi wrote:
>>
>>
>>>> that this JobStatus is not designed to match the STAT field (i.e.,
>>>> PROCESS STATE CODE) that could be collected by the 'ps aux' query on
>>>> the condor execute node where a job is executed.
>
>
> Your understanding is correct.  A couple of things complicate this -- an
> HTCondor job can consist of many Unix processes, so which process status
> would you want to display?  Also, the process status in ps can change very
> frequently, from "R"unning to "I"dle to "D"isk wait, etc.
>
> If you really want to get this data back to the submit side, your job could
> spawn a script that periodically looks in /proc to figure out the process
> state of one of the processes in your job, then uses condor_chirp to update
> a custom job attribute in the job classad in the schedd.
>
> Stepping back a bit, what do you want to do with this information?
>
> -greg