
Re: [Condor-users] condor losing track of processes under cygwin



Hello,

The EXECUTE_LOGIN_IS_DEDICATED change should prevent processes from going untracked, so rather than starting off by debugging whether process tracking is working as expected, I had one other thought. We recently experienced a problem on one of our local Condor pools similar to what you are describing.
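
For reference, a minimal sketch of that setting as it would appear in the execute machine's condor_config (assuming this matches the change you made):

    # Treat every process running as the dedicated execute account
    # (e.g. condor-reuse-vm1) as part of the Condor job, even if the
    # startd/procd has lost track of the process tree.
    EXECUTE_LOGIN_IS_DEDICATED = True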

In our case, it wasn't that a process was being leaked, but rather a deficiency in Condor's determination of Condor vs. non-Condor load average. The problem was triggered by the nature of a particular job: a script ran several short-lived, CPU-bound processes in sequence. Processes like that can start and exit between the startd's periodic load samples, so their CPU usage may never be attributed to the job and instead shows up as non-Condor load. Some things to consider:

- Do your jobs create subprocesses that do most of the heavy lifting?

- What is the suspension policy that you are using that is erroneously
  getting triggered? Is it sensitive to Condor vs. non-Condor load?

- You may be able to get more insight into the problem by enabling
  D_FULLDEBUG and D_LOAD in your STARTD_DEBUG setting (a sketch of this,
  together with a typical load-sensitive suspension policy, follows this
  list).
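
To make those last two points concrete, here is a minimal sketch using the macro names from the stock example condor_config; your pool's policy almost certainly differs, so treat the names and the 0.5 threshold as placeholders to compare against your own settings:

    # Load not attributable to Condor jobs -- the quantity a
    # load-sensitive suspension policy typically tests.
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    HighLoad         = 0.5

    # Suspend when non-Condor load looks high; this is exactly the kind
    # of expression that misattributed load from short-lived processes
    # would trip.
    SUSPEND = ($(NonCondorLoadAvg) > $(HighLoad))

    # Extra startd logging showing how load is split between Condor
    # and non-Condor activity.
    STARTD_DEBUG = D_FULLDEBUG D_LOAD

If the D_LOAD output shows non-Condor load climbing exactly while your Cygwin jobs' subprocesses are running, that points at the attribution problem rather than a genuinely leaked process.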

Thanks,

Greg

Wojtek Goscinski wrote:
Howdy,

Just wondering if someone could take a couple of minutes to answer a
quick query regarding process tracking in the Start daemon or Procd.

We're running Condor 6.8.8, and one of our users is executing sh
scripts which run through Cygwin on our Windows resources.
Unfortunately, it seems that the Start daemon loses track of forked
processes and starts suspending jobs because it thinks that non-Condor
load is high (when in reality it is just the "lost" process causing
the load).

I believe this is a well-known problem, and that Cygwin is
particularly problematic? Earlier posts seem to allude to it:
https://lists.cs.wisc.edu/archive/condor-users/2007-April/msg00184.shtml

We've done a number of things to try to get around this:
 - Tracked our processes to see if any are doing a double fork. We
found one instance and rewrote some code so that the double fork no
longer occurs. Unfortunately, this didn't seem to fix things.
 - Changed EXECUTE_LOGIN_IS_DEDICATED to true. Still the same
issue. I assume this means that Condor views ALL processes under
condor-reuse-vm1 as Condor jobs, and therefore lost processes
shouldn't be an issue?
 - Upgraded to 7.2 to see whether the new Proc daemon fixes things -
but we're still having the same issue.

Has anyone had this issue and can comment on their experience / fix?
Can anyone from the Condor folks comment on the best approach?

regards,

james