
RE: [Condor-users] Torture test


I encountered a similar problem before. The root cause in my case was an
overloaded schedd. At the time, I was submitting at a very high frequency
(one submit every 5 seconds) and transferring a large number of files back
to my initial directory.

To avoid this problem, I reduced my rate of submission (batching jobs up
into a single cluster) and used other modes of file transfer. The problem
has been reduced significantly!! I am currently trying to use
periodic_remove to clean up the occasional zombies that are still hanging
around  ;)
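As a rough sketch, the two ideas above (batching into one cluster, plus a periodic_remove sweep) might look like this in a submit description file. The executable name, file names, proc count, and the 30-minute threshold are placeholders of my own, not from the original setup:

```
# A single condor_submit of this file creates one cluster of 100 procs,
# rather than 100 separate submissions hammering the schedd.
universe        = vanilla
executable      = analyze
arguments       = input.$(Process)
output          = out.$(Process)
error           = err.$(Process)
log             = batch.log

# Remove any job still marked Running (JobStatus == 2) more than 30
# minutes after its last status change -- a crude zombie sweep.
periodic_remove = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 1800)

queue 100
```

Tune the threshold to something comfortably longer than your longest legitimate job, or genuinely slow jobs will be swept up too.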

Raymond Wong
System Engineer
DID: 7358
Pager: 98028590

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ralf Reinhardt
Sent: Wednesday, June 23, 2004 6:19 PM
To: Condor-Users Mail List
Subject: [Condor-users] Torture test

I am writing a small frontend for bioinformatics tasks, which will be
used by users who are largely unaware of the cluster behind it. Since
the cluster (128 CPUs) should work without continuous supervision,
I ran some torture tests with many very small jobs. The result is
zombie jobs which have finished successfully but are still listed
as running on their nodes, slowly blocking the whole cluster.
- Can it be avoided?
- If not: is there a better way to get the system back in sync than to
remove all jobs with the forcex option?



Condor-users mailing list