
[condor-users] Job cluster management?



Hi all.  I work for a small visual effects company in San Francisco, and 
we're trying out Condor as a means of distributing renders across our 
available machines.  Heretofore, we have used an extremely minimal tool I 
wrote in Python to accomplish this.  It worked OK for what it was, but we 
have now outgrown it.  I've now gotten Condor installed and working 
without too much fuss.

My question is this:  How do other Condor users manage their job clusters?  
Unless I'm missing something, it seems difficult to get a handle on the 
status of a job cluster as a whole using Condor's tools out-of-the-box.  
For example, say we have a 150-frame shot, with three shadow maps needing 
to be generated for each frame.  Counting the frame render itself, that 
comes out to 600 jobs (150 x 4), each of which may take anywhere from a 
minute to an hour or more.  If I have that 
and a couple of other things in the queue, the output from condor_q 
quickly becomes difficult to make sense of.  It'd be great if there were 
an option for many of the Condor tools (-cluster, for example) that would 
return information at the cluster level.  For example:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  18.1   mike           12/8  09:22   0+00:00:01 R  0   0.0  shake
  18.2   mike           12/8  09:22   0+00:00:00 R  0   0.0  shake
  18.3   mike           12/8  09:22   0+00:00:00 R  0   0.0  shake
	.
	.
	.

becomes:

 ID      OWNER          SUBMITTED TOTAL_RUN_TIME ST CMPLTE CMD               
  18.*   mike         12/8  08:32     0+00:41:01 R  52/100 shake
  19.*   mike         12/8  09:22     0+00:00:00 I  0/50   maya
  20.*   mike         12/8  09:44     0+00:00:00 I  0/600  prman


Clearly, I could write a wrapper script for condor_q that would do 
something like this, but it'd be nice if it were built-in.
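
A minimal sketch of such a wrapper, assuming the default condor_q column 
layout shown above (ID, OWNER, SUBMITTED, RUN_TIME, ST, PRI, SIZE, CMD).  
The function names are made up for illustration.  It only sees jobs that 
are still listed in the queue, so it reports per-state counts and the 
summed RUN_TIME rather than a true "completed out of total" figure; that 
would need the original cluster size from somewhere else.

```python
import re

# One job row in condor_q's default listing, e.g.
#   18.1   mike   12/8  09:22   0+00:00:01 R  0   0.0  shake
ROW = re.compile(
    r"^\s*(?P<cluster>\d+)\.(?P<proc>\d+)\s+(?P<owner>\S+)"
    r"\s+\S+\s+\S+"                                     # submit date and time
    r"\s+(?P<d>\d+)\+(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)"  # RUN_TIME
    r"\s+(?P<st>\S)\s+\S+\s+\S+\s+(?P<cmd>\S+)"         # ST, PRI, SIZE, CMD
)


def summarize(lines):
    """Group condor_q job rows by cluster ID; non-job lines are skipped."""
    clusters = {}
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue
        g = match.groupdict()
        secs = ((int(g["d"]) * 24 + int(g["h"])) * 60 + int(g["m"])) * 60 + int(g["s"])
        entry = clusters.setdefault(
            int(g["cluster"]),
            {"owner": g["owner"], "cmd": g["cmd"], "run_secs": 0, "states": {}},
        )
        entry["run_secs"] += secs
        entry["states"][g["st"]] = entry["states"].get(g["st"], 0) + 1
    return clusters


def format_runtime(secs):
    """Render seconds in condor_q's D+HH:MM:SS style."""
    days, rem = divmod(secs, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return "%d+%02d:%02d:%02d" % (days, hours, minutes, seconds)


def render(clusters):
    """Return one summary line per cluster, ordered by cluster ID."""
    out = []
    for cid in sorted(clusters):
        c = clusters[cid]
        states = " ".join("%s:%d" % kv for kv in sorted(c["states"].items()))
        out.append("%4d.*  %-10s %s  %-12s %s"
                   % (cid, c["owner"], format_runtime(c["run_secs"]), states, c["cmd"]))
    return "\n".join(out)


# Demo on the rows quoted above; in practice, pass in the lines of real
# condor_q output instead of this sample text.
sample = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  18.1   mike           12/8  09:22   0+00:00:01 R  0   0.0  shake
  18.2   mike           12/8  09:22   0+00:00:00 R  0   0.0  shake
  18.3   mike           12/8  09:22   0+00:00:00 R  0   0.0  shake
"""
print(render(summarize(sample.splitlines())))
```

Parsing the human-readable listing is fragile if the columns ever change, 
which is part of why a built-in option would be nicer.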

My biggest problem is with email notification.  Obviously, getting hundreds 
of emails a day is too much.  On the other hand, getting AN email when a 
cluster finishes is really handy.  I know I could do this with a DAG, and 
have a job that sends an email to the submitter when all other jobs in the 
cluster have finished.  But it's a pain to have to make a DAG just for 
that, and we lose some of the nice runtime stats that the built-in email 
notification gives.  Hey developers, any chance of getting a 
"Notification=Cluster" setting for condor_submit?

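For reference, the DAG workaround can at least be generated rather than 
written by hand.  This is a rough sketch, not a recommendation: the helper 
name, node names, and submit file names are all hypothetical, and it 
assumes one DAG node per render job plus a final node whose submit file 
runs whatever sends the email.  It relies only on the JOB and PARENT ... 
CHILD lines of the DAGMan input file format.

```python
def build_notify_dag(render_nodes, notify_submit, notify_name="notify"):
    """Return DAG file text in which `notify_name` runs after every render node.

    render_nodes  -- list of (node_name, submit_file) pairs (hypothetical names)
    notify_submit -- submit file for the job that sends the email
    """
    if not render_nodes:
        raise ValueError("need at least one render node")
    lines = ["JOB %s %s" % (name, submit) for name, submit in render_nodes]
    lines.append("JOB %s %s" % (notify_name, notify_submit))
    # Listing every render node as a parent makes the notify node wait for all of them.
    parents = " ".join(name for name, _ in render_nodes)
    lines.append("PARENT %s CHILD %s" % (parents, notify_name))
    return "\n".join(lines) + "\n"


# Demo: three frames, each with its own (hypothetical) submit file.
frames = [("frame%04d" % n, "frame%04d.sub" % n) for n in range(1, 4)]
print(build_notify_dag(frames, "notify.sub"), end="")
```

The resulting text would be saved to a file and handed to 
condor_submit_dag.  It still gives up the runtime stats from the built-in 
notification, so it does not replace a "Notification=Cluster" setting.
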
Are there any other Condor users out there who have had similar issues?  
Anybody come up with any good solutions?  

Also for the developers:  Any chance of getting a Python module?  

Cheers.

-Mike
mike@xxxxxxxxxxxxxx



