[Condor-users] condor_shadow-Process Problem
Maybe you could help me with the following issue.
As far as I understand how Condor works, there should be exactly one
'condor_shadow' process on the submitting host for each submitted job
that is being executed on any client.
In my pool (see below for details), however, there sometimes seem to be
several shadow processes with exactly the same job ID for a single job!
'ps ax | grep condor_shadow' returns
23285 ? SN 0:00 condor_shadow -f 9.137 <18.104.22.168:9307>
23286 ? SN 0:00 condor_shadow -f 9.137 <22.214.171.124:9307>
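To make such duplicates easier to spot among hundreds of processes, the 'ps' output can be grouped by job ID. The following is only a minimal sketch: it assumes the line layout shown above (PID first, then the job ID as the first "cluster.proc" token after 'condor_shadow'), and the function name duplicate_shadows is made up for illustration.

```python
import re
from collections import defaultdict

# A job ID is assumed to look like "cluster.proc", e.g. "9.137".
JOB_ID = re.compile(r"^\d+\.\d+$")


def duplicate_shadows(ps_text):
    """Return {job_id: [pids]} for job IDs that have more than one shadow."""
    pids_by_job = defaultdict(list)
    for line in ps_text.splitlines():
        tokens = line.split()
        # Skip blank lines and the 'ps' header (first column must be a PID).
        if not tokens or not tokens[0].isdigit():
            continue
        # Locate the condor_shadow command, with or without a leading path.
        cmd_at = next(
            (i for i, t in enumerate(tokens) if t.endswith("condor_shadow")),
            None,
        )
        if cmd_at is None:
            continue
        # The first "cluster.proc" token after the command is taken as the job ID.
        # A 'grep condor_shadow' line has no such token and is ignored.
        job_id = next((t for t in tokens[cmd_at + 1:] if JOB_ID.match(t)), None)
        if job_id is not None:
            pids_by_job[job_id].append(int(tokens[0]))
    return {job: pids for job, pids in pids_by_job.items() if len(pids) > 1}
```

The text to pass in would be the captured output of 'ps ax'; any job ID that shows up in the result has more than one shadow process.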
It also happens that several hundred
'condor_shadow' processes are still running on the submitting hosts,
although the corresponding jobs terminated normally long
before. The 'ShadowLog' file on the submitting host
contains some "exited with status 107" messages, but since we are
only running vanilla jobs here, this should not indicate an error,
right? I cannot find any other error messages in my logs.
I am afraid these 'ghost' shadow processes are also responsible for
quite a lot of clients being in the "Claimed" state but with "Idle"
activity according to 'condor_status'. They should be "Busy" when
claimed, at least shortly after getting claimed.
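To put a number on this, the state/activity pairs can be tallied. This is a rough sketch only: it assumes the input has already been reduced to one line per machine with two whitespace-separated columns, state and activity (for instance via the -format option of 'condor_status'), and the function name count_states is hypothetical.

```python
from collections import Counter


def count_states(status_text):
    """Count (state, activity) pairs from lines like 'Claimed Idle'."""
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        # Only lines with exactly two columns (state, activity) are counted;
        # anything else is assumed to be a header or blank line.
        if len(fields) == 2:
            counts[(fields[0], fields[1])] += 1
    return counts
```

A large count for ("Claimed", "Idle") that does not shrink over time would correspond to the situation described above.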
I would be very happy if anybody could explain to me how this can happen
and, even better, how to prevent it, because it always leads to
maximum disagreement between server and clients about the number of
running jobs. As a result, no new jobs are started after some time,
although 'condor_status' shows tens of unclaimed clients ...
Thanks for your help and have a nice day,
Some more information about my pool:
1 Server, Ubuntu Linux, MASTER,NEGOTIATOR,COLLECTOR
30 Clients, Ubuntu Linux, MASTER,SCHEDD,STARTD
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: I386-LINUX_RHEL3 $