
[Condor-users] Communication problem



Hi,
We are using Condor to run long-running jobs on 15 Macintosh G5 machines.
Our "vanilla" configuration:
- Central manager: Xserve G4, username = condor
- Submit machine: the same Xserve G4, with another username = submit
- Execution machines: G5
We run 2 condor_master daemons on the same machine (one to manage the pool, one to submit), each under a different username. Can this configuration cause problems?

We have 2 different problems:

1- After a few hours, all the execution machines stop their jobs, and a communication error occurs between the condor_starter and the condor_master (Macintosh Xserve):

Cluster01 crashdump: Unable to determine CPSProcessSerNum pid: 11913 name: condor_starter

and in the Shadow log, we have:
ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C

The problem occurs with Condor 6.6.6 and Condor 6.6.7…
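
To see how often this happens and on which execution machines, the Shadow log could be scanned for the error quoted above. This is only a minimal Python sketch: the regular expression is built from the single log line shown above, the function name count_lost_starters is hypothetical, and the layout of real log lines (timestamps, prefixes) is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the one Shadow log error quoted above.
# Assumption: the message text is identical on every occurrence.
STARTER_LOST = re.compile(
    r"Can no longer talk to condor_starter on execute machine "
    r"\((\d{1,3}(?:\.\d{1,3}){3})\)"
)


def count_lost_starters(lines):
    """Count 'Can no longer talk to condor_starter' errors per machine IP."""
    counts = Counter()
    for line in lines:
        match = STARTER_LOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the error line from the Shadow log plus an unrelated line.
sample = [
    'ERROR "Can no longer talk to condor_starter on execute machine '
    '(192.168.1.23)" at line 63 in file NTreceivers.C',
    "some unrelated log line",
]

for ip, n in count_lost_starters(sample).items():
    print(ip, n)
```

In practice the lines would come from the Shadow log file itself, for example by passing an open file object to the function.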

2- After a few hours, the central manager and the execution machines stop communicating, but the submit machine keeps tracking the jobs. condor_q shows the jobs in "R" status although condor_status indicates that communication has stopped.
Then, when we launch condor_master on the central manager, condor_status returns to normal, that is to say the execution machines are shown in "busy" status. Is this normal for a vanilla configuration?
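
The mismatch in problem 2 (jobs shown as "R" by condor_q while the machines are missing from condor_status) can be expressed as a simple comparison. The sketch below is only an illustration in Python: the function name find_orphaned_running_jobs, the tuple layout, and the machine names are hypothetical, and it assumes the job and machine lists have already been collected (by hand or otherwise) from the two commands.

```python
def find_orphaned_running_jobs(jobs, reported_machines):
    """Return IDs of jobs marked "R" whose machine is not currently reported.

    jobs: iterable of (job_id, status, machine) tuples (hypothetical layout),
          where status is the letter shown in the queue, e.g. "R" or "I".
    reported_machines: names of the machines currently listed in the pool.
    """
    reported = set(reported_machines)
    return [
        job_id
        for job_id, status, machine in jobs
        if status == "R" and machine not in reported
    ]


# Hypothetical example data: two running jobs, one idle job.
jobs = [
    ("1.0", "R", "g5-01"),
    ("2.0", "R", "g5-02"),
    ("3.0", "I", None),
]

# Only g5-01 is still visible to the central manager in this example.
print(find_orphaned_running_jobs(jobs, {"g5-01"}))
```

Any job ID returned here corresponds to the situation described above: the submit side still considers the job running, while the central manager no longer sees its machine.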

After 2 or 3 days, we run into either problem 1 or problem 2!

Does anyone have an idea?


Thank you for your help

Damien

Damien AUTRET:

Unité INSERM 601
Département de Recherche en ImmunoCancérologie
Equipe 6 Biophysique-Cancérologie
9 Quai Moncousu
44093 Nantes Cedex
Tél: 02.40.41.28.21
Fax: 02.40.35.66.97
Sec: 02.40.08.47.47