
[Condor-users] Problems with version 7.4.2



Hello,

We have installed Condor version 7.4.2 on a cluster composed of machines running Fedora and Ubuntu 10.04. Our installation is in shared directories, and we have different binaries for Fedora and Ubuntu (condor-7.4.2-linux-x86-rhel3-dynamic and condor-7.4.2-linux-x86-debian50-dynamic, respectively). We also have the home dir of condor and the configuration files in a shared directory. The local dir of our central manager/dedicated sched is in a local directory; for all the other machines it is in a shared directory. We have been experiencing some serious problems:

1. The condor_submit command hangs:
Sometimes when I submit jobs, condor_submit gets stuck. Although the job enters the queue, the command does not return and I have to kill it with Ctrl+C.

2. Jobs return to Idle state and can't be removed:
One of our users has jobs that return to the Idle state after they terminate or die. When he then tries to remove these jobs from the queue, Condor misbehaves: condor_q stops responding and shows the message:
-- Failed to fetch ads from: <192.168.127.3:39790> : zyon.itqb.unl.pt
and then all the jobs die.

It is worth pointing out that everything works fine when we use an older version of Condor (6.8.4) on our central manager/dedicated sched. However, we only have Fedora binaries for this version, which means that we cannot run it on a machine with Ubuntu (due to library incompatibilities), and our goal is to have a machine with Ubuntu 10.04 as central manager/dedicated sched.

Can anyone help?


--
Diana Lousa
PhD student
Protein Modeling Laboratory
ITQB/UNL
Oeiras, Portugal