
Re: [Condor-users] condor completely stuck?



On Tue, Jul 22, 2008 at 09:22:39AM +0200, Steffen Grunewald wrote:
> One of the users, shortly after finding that his jobs weren't doing what
> they were expected to do, did a "condor_rm" of all of them. condor_q didn't
> come back afterwards, and shutting down Condor using the init script doesn't
> work anymore.
> These are the processes still around:
> 
> # ps auxw | grep condor
> root     13082  0.0  0.1  15584  2688 ?        S    Jul21   0:00 condor_preen -m -r
> uglyuser 16836  0.0  0.4  25488  9368 ?        Ss   Jul19   0:09 condor_schedd -f
> root     16837  0.0  0.1  12220  2904 ?        S    Jul19   1:12 condor_procd -A /usr/share/condor/local/log/procd_pipe.SCHEDD -C 666
> condor   18150  0.0  0.1  18416  3736 ?        Ss   Jul03  25:35 /usr/sbin/condor_master
> root     19007  0.0  0.1  15584  2688 ?        S    Jul20   0:00 condor_preen -m -r
> root     25352  0.0  0.1  15584  2688 ?        S    Jul19   0:00 condor_preen -m -r
> root     30283  0.0  0.1  16380  3052 pts/3    S+   09:16   0:00 condor_q -glo

After some time they disappeared; I am not sure whether they exited
gracefully or because I sent several signals to the condor_master process...

> The last entries in the SchedLog are from a restart of condor_schedd.

I checked /proc/${PID}/fd and ran strace on condor_schedd, and found that it
was referring to a log file in the user's space. I moved that file out of the
way, and the problem disappeared.
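
For anyone retracing this, here is a minimal sketch of the /proc check
(Linux only). The function name list_open_files is made up for illustration,
and the commented usage assumes pgrep is available and that you are root or
the owner of the daemon; adjust it to your setup.

```shell
# List the files a process holds open by reading its /proc/<pid>/fd symlinks.
# list_open_files is a hypothetical helper name, not part of Condor.
list_open_files() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink to the open file, socket, or pipe;
        # skip entries that vanish or cannot be read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        printf '%s\t%s\n' "${fd##*/}" "$target"
    done
}

# Assumed usage: inspect the running schedd and look for paths
# outside Condor's own directories.
# list_open_files "$(pgrep -x condor_schedd)"
```

Any path in the output that points into a user's directory is a candidate
for the kind of log file that was blocking the schedd here.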

There should be a better way to do a low-level "condor_rm" before Condor is
started up; otherwise old jobs stay referenced in hidden places and show
up again.

> This is 7.0.1; is this a known issue of that version, and would upgrading
> fix it? (& when will 7.0.4 be out? :)

Going to check the release notes...

Steffen