[Condor-users] Condor commands getting stuck



Hello,

We made a new installation of Condor on our cluster at the beginning of this week. With this installation we upgraded from Condor 6.8 to version 7.4, and we also changed our dedicated scheduler from a machine running Fedora to one running Ubuntu 10.04.
We have Condor installed in two shared directories (one with binaries for Fedora and another with binaries for Ubuntu), and each machine runs the release corresponding to its OS. Everything ran fine for the first few days (from Monday until today), but today the Condor commands started getting stuck. First condor_q stopped responding, and after a few minutes all the jobs died without our intervention. We then restarted Condor on all our machines and resubmitted the jobs, and the same thing happened again after about 15 minutes. Next, we cleaned out all our Condor log files, killed the daemons on all the machines, restarted the system, and submitted a small number of jobs to see how it handled them. Everything was fine for a few hours, but now that I am trying to submit more jobs, condor_submit gets stuck. The strangest thing is that the jobs are submitted and start running, but the condor_submit command never terminates on its own.
Our entire system is based on NFS.

Can anyone help?

Thanks in advance.

--
Diana Lousa
PhD student
Protein Modeling Laboratory
ITQB/UNL
Oeiras, Portugal