
[Condor-users] Aix Condor_shadow Bug ?? v6.6.10



Hello...

We are having problems with an AIX (v5.2) Condor (v6.6.10) Master node...

The AIX node is configured as Master (Collector/Negotiator) and also as
the Scheduler for a pool of 60 Linux machines.

It seems that when we stress the AIX Scheduler by submitting a lot of jobs
in a short time, some of the condor_shadow processes don't receive the
DaemonCore "FILETRANS_DOWNLOAD" command. As a result we have:

- clean logs for the rest of the pool.
- clean job execution. Jobs end without problems and with exit status 0,
  so there is no need to put the job back in the queue.
- NO RESULTS FOR SOME JOBS IN THE SUBMIT DIRECTORY.

Job 4836.3 has finished and its results have been downloaded.
Running:  cat ShadowLog|grep "4836.3"    gives:

5/17 15:27:16 (4836.3) (3833892): Request to run on <172.21.93.238:20786>
was ACCEPTED
5/17 15:27:16 (4836.3) (3833892): DaemonCore: Command received via TCP
from host <172.21.93.238:20070>
5/17 15:27:16 (4836.3) (3833892): DaemonCore: received command 61000
(FILETRANS_UPLOAD), calling handler (FileTransfer::HandleCommands())
5/17 15:27:19 (4836.3) (3833892): DaemonCore: Command received via TCP
from host <172.21.93.238:20527>
5/17 15:27:19 (4836.3) (3833892): DaemonCore: received command 61001
(FILETRANS_DOWNLOAD), calling handler (FileTransfer::HandleCommands())
5/17 15:27:19 (4836.3) (3833892): Job 4836.3 terminated: exited with status 0
5/17 15:27:27 (4836.3) (3833892): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 100

Job 4836.4 has finished OK and exited, but its shadow never called the
FileTransfer handler to do a FILETRANS_DOWNLOAD.
Running:  cat ShadowLog|grep "4836.4"    gives:
5/17 15:27:16 (4836.4) (3186808): Request to run on <172.21.93.247:20843>
was ACCEPTED
5/17 15:27:16 (4836.4) (3186808): DaemonCore: Command received via TCP
from host <172.21.93.247:20906>
5/17 15:27:16 (4836.4) (3186808): DaemonCore: received command 61000
(FILETRANS_UPLOAD), calling handler (FileTransfer::HandleCommands())
5/17 15:27:19 (4836.4) (3186808): Job 4836.4 terminated: exited with status 0
5/17 15:27:27 (4836.4) (3186808): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 100
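
Comparing the two excerpts above by hand does not scale to a whole batch, so here is a minimal sketch of a scan that lists every job whose ShadowLog entries show a clean exit but no FILETRANS_DOWNLOAD. It is not part of Condor: the function name jobs_missing_download is hypothetical, and the line pattern is an assumption based only on the log lines quoted in this message. It also groups entries by job id alone, so a job that ran more than once would be merged into a single record.

```python
import re
import sys
from collections import defaultdict

# Assumed entry prefix, taken from the excerpts above:
#   "5/17 15:27:16 (4836.4) (3186808): message"
ENTRY = re.compile(r"^\d+/\d+ \d+:\d+:\d+ \((\d+\.\d+)\) \(\d+\): (.*)$")


def jobs_missing_download(lines):
    """Return ids of jobs that exited with status 0 but logged no FILETRANS_DOWNLOAD."""
    messages = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            current = match.group(1)
            messages[current].append(match.group(2))
        elif current is not None and line.strip():
            # A line without a timestamp is treated as the wrapped tail
            # of the previous entry (as in the mail-wrapped excerpts).
            messages[current][-1] += " " + line.strip()

    missing = []
    for job, entries in messages.items():
        text = "\n".join(entries)
        clean_exit = "terminated: exited with status 0" in text
        if clean_exit and "FILETRANS_DOWNLOAD" not in text:
            missing.append(job)
    # Sort numerically by (ClusterId, ProcId) rather than as strings.
    return sorted(missing, key=lambda j: tuple(int(p) for p in j.split(".")))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_missing.py /path/to/ShadowLog
    with open(sys.argv[1]) as log:
        for job_id in jobs_missing_download(log):
            print(job_id)
```

Run against the ShadowLog of a failing batch, the printed ids should match the jobs whose results are missing from the submit directory; if they do not, the matching strings above need adjusting to the real log.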


I've checked that there is no problem with the startd nodes: the node
172.21.93.247 has run several jobs before and after ClusterId 4836
without problems, and the problem also happens with nodes other than
172.21.93.247.

We have launched the same batch several times and it never fails on the
same ClusterId-ProcessId. Another strange behaviour is that sometimes it
fails to do the FILETRANS_DOWNLOAD for several jobs, and sometimes it only
fails for a couple of jobs...

In the end, I think it must be an AIX Condor bug... I've done the same
tests with a Linux Master node (Condor version 6.6.10) using the AIX
configuration files, and it works like a charm.

Thanks in advance

Best Regards Carlos Manzanedo.
PS: I will try setting D_FULLDEBUG to narrow down where the bug is.