[Condor-users] condor losing track of processes under cygwin



Howdy,

Just wondering if someone could take a couple of minutes to answer a
quick query regarding process tracking in the Start daemon or Procd.

We're running condor 6.8.8 and one of our users is executing sh
scripts which run through cygwin on our Windows resources.
Unfortunately, it seems that the Start daemon loses track of forked
processes and starts suspending jobs because it thinks that non-condor
load is high (where in reality it is just the "lost" process causing
the load).

I believe this is a well known problem? It seems that cygwin is
particularly problematic? Earlier posts seem to allude to this problem
- https://lists.cs.wisc.edu/archive/condor-users/2007-April/msg00184.shtml

We've done a number of things to try to get around this:
 - Tracked our processes to see if any are doing a double fork. We
found one instance and rewrote some code so that double fork doesn't
occur anymore. Unfortunately, this didn't seem to fix things.
 - Set EXECUTE_LOGIN_IS_DEDICATED to true. Still the same
issue. I assume this means that condor views ALL processes under
condor-reuse-vm1 as condor jobs, and therefore lost processes should
not be an issue?
 - Upgraded to 7.2 to see whether the new Proc daemon fixes things -
but we're still having the same issue.
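
To make the double-fork concern concrete, here is a toy sketch in Python.
It is not how condor actually tracks processes. It only assumes a tracker
that follows parent-pid links from the job's root process, and the pids
and the tracked_family helper are made up for illustration.

```python
from typing import Dict, Set


def tracked_family(table: Dict[int, int], root: int) -> Set[int]:
    """Return root plus every live pid reachable from it via parent links."""
    family = {root}
    changed = True
    while changed:
        changed = False
        for pid, ppid in table.items():
            if ppid in family and pid not in family:
                family.add(pid)
                changed = True
    return family


# Hypothetical process table: pid -> parent pid.
# 100 = the job's sh, 200 = intermediate child, 300 = grandchild.
table = {100: 1, 200: 100, 300: 200}
before = tracked_family(table, 100)

# The intermediate process exits (the second half of a double fork).
# The grandchild keeps running, but its parent link now points at a dead pid.
del table[200]
after = tracked_family(table, 100)

# Processes that are still alive but no longer reachable from the root.
lost = before - after - {200}
print("tracked before:", sorted(before))
print("tracked after: ", sorted(after))
print("lost:          ", sorted(lost))
```

In this sketch the grandchild is still running after the intermediate
process exits, but a tracker that only walks parent links can no longer
attribute it to the job, so its CPU use would look like non-condor load.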

Has anyone had this issue, and can you comment on your experience or fix?
Can anyone from the condor folks comment on the best approach?

regards,

james