RE: [Condor-users] Torture test
I have encountered a similar problem before. The root cause in my case
was an overloaded schedd: I was submitting at a very high frequency
(one submit every 5 seconds) and transferring a large number of files
back to my initial directory.
To avoid this, I reduced my rate of submission (batching jobs up into a
single cluster) and used other modes of file transfer. The problem has
been reduced significantly! I am currently trying periodic_remove to
clean up the occasional zombies that are still hanging around ;)
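As an illustration of both suggestions, a submit file along these lines
batches many jobs into one cluster and adds a periodic_remove expression
(the one-hour threshold and the executable name here are only assumptions,
not values from my setup):

```
universe    = vanilla
executable  = my_task        # hypothetical executable name

# Remove any job still marked running (JobStatus == 2) more than
# an hour after its last state change -- tune the threshold to the
# expected runtime of your jobs.
periodic_remove = (JobStatus == 2) && \
                  ((CurrentTime - EnteredCurrentStatus) > 3600)

# Queue 100 jobs as a single cluster instead of 100 separate
# condor_submit calls, which keeps the load on the schedd down.
queue 100
```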
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ralf Reinhardt
Sent: Wednesday, June 23, 2004 6:19 PM
To: Condor-Users Mail List
Subject: [Condor-users] Torture test
I am writing a small frontend for bioinformatics tasks, which will be
used by users who are largely unaware of the cluster behind it. Since
the cluster (128 CPUs) should work without continuous supervision,
I ran some torture tests with many very small jobs. The result is
zombie jobs which have finished successfully but are still listed
as running on their nodes, slowly blocking the whole cluster.
- Can this be avoided?
- If not: is there a better way to get the system back in sync than
removing all jobs with the forcex option?
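For reference, the forcex option mentioned above is a flag to condor_rm;
it forces a job out of the queue without the normal cleanup handshake,
so it is a last resort. A typical sequence (the job ID 1234.0 is just an
example):

```
# First try a normal remove:
condor_rm 1234.0

# If the job gets stuck in the X (removed) state, force it out.
# -forcex skips the usual cleanup protocol, which can leave state
# behind on the execute node.
condor_rm -forcex 1234.0
```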