
[Condor-users] Job scheduling



Hi all,

I run a 20-node cluster (160 CPUs, 2 GB RAM per CPU) and am having an issue with the way Condor distributes jobs across the cluster.

A user is launching simulations that grow to over 6 GB of memory, which Condor reports as 15 GB (I assume this is the image size, i.e. memory plus swap). If three of these jobs run on one node, at a certain point the node becomes completely unresponsive: Ganglia shows it as down and ssh hangs. A couple of hours later the condor_startd crashes and restarts, and the node becomes responsive again. I assume this is due to the memory being saturated.

While the jobs are being run outside operating parameters (6 GB >> 2 GB per CPU), they still have to be run, and they run fine if there is only one per node. The problem is that all of the jobs are being packed together onto a single node (compute-1-0 or compute-2-0). Is this an intended function of Condor, or is there a way I can configure Condor to scatter the jobs across the cluster whenever possible?
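For reference, this is roughly what I was imagining on the config side -- a sketch only, I have not tested it and am not sure these are the right knobs for our Condor version:

```
## condor_config on the central manager -- untested sketch.
## Idea: have the negotiator prefer machines that are not already
## running a job, so the pool fills breadth-first instead of
## stacking several jobs onto compute-1-0 / compute-2-0.
NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (KFlops - SlotID)
```

I could presumably also have the user declare the real memory footprint in the submit file (something like request_memory, if our version supports it) so the matchmaker stops placing three 6 GB jobs on a 2 GB/CPU node, but I would rather know the intended way to do this.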

-Patrick