Re: [Condor-users] Delay on submit, and other newbie issues
- Date: Mon, 23 Jun 2008 07:37:40 -0500
- From: Matthew Farrellee <mfarrellee@xxxxxxxxxx>
- Subject: Re: [Condor-users] Delay on submit, and other newbie issues
Ira Abramov wrote:
Hello list. I'm a sysadmin of more than a decade, but new to batch
servers and condor. I tried to drink in the list and the manual, but I'm
still not in sync with all the jargon, so apologies in advance...
My Condor setup is pretty small (and probably an overkill): I have one
CentOS 5 (server) and three CentOS 4 machines for execution. I have
installed the 7.0.2 RPM on them and did the bare minimum of tweaking:
changing the 300-second intervals to 30. I see all slots.
The jobs we are running are EDA tools (VLSI design tools like Specman
from Cadence and such). Some are builds of several minutes, and some are
simulations and regressions that run for hours.
Problems I'm seeing now:
1. When I submit a job it only starts running in a slot some 10-20
seconds later. Is that the 30-second interval for matching? Can I set
the server to match immediately on submission? The people here are used
to the immediacy of running the job locally.
Yes, that is the NEGOTIATOR_INTERVAL. You can make it shorter, but I
wouldn't recommend that once you have hundreds of thousands of jobs in
your system.
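As a rough sketch, shortening the cycle is a one-line change in the central manager's local configuration. The value 20 below is only an illustrative assumption, not a recommendation:

```
# condor_config.local on the central manager (the CentOS 5 server here)
# Seconds between negotiation cycles. 20 is an assumed example value.
NEGOTIATOR_INTERVAL = 20
```

Running condor_reconfig on that machine afterwards should make the daemons pick up the new value.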
2. When I use "condor_q" I can see the job is running but not which slot
was allocated for it, and I could not find a switch to add such a
column. What have I missed?
3. Most of our jobs are "well behaved" in the sense that they only take
up a single core anyway. I could not find anywhere to define jobs that
may parallelize (like "make" forking two compilers or a JVM splitting into
threads) or how to tell the manager about them.
Check out the Parallel Universe
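For orientation, a parallel universe submit description file looks roughly like the sketch below. The script name build.sh and the count of 2 are hypothetical. Note that the execute machines also have to be configured for a dedicated scheduler as described in the manual, and that machine_count asks for that many slots, which are not necessarily on the same host:

```
# Submit description file. build.sh is a hypothetical wrapper script.
universe      = parallel
executable    = build.sh
# Number of slots to claim for this job (assumed value).
machine_count = 2
queue
```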
4. Two of the three machines are running Xvnc sessions for several
Windows users, so while there's no console activity, there's a reason we
don't want jobs running on those two machines till after hours. I could
not find anything smart to do about this kind of scheduling other than a
cron job to remove and restate startd at the start and end of the work
Check out the StartD Policy section of the manual:
You can set the START expression so that no jobs can be started during
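A time-of-day START policy could be sketched roughly as follows, in the local configuration of the two Xvnc machines only. ClockMin (minutes since midnight) and ClockDay (0 for Sunday through 6 for Saturday) are machine attributes; the 08:00-18:00, Monday-to-Friday window is an assumption for illustration:

```
# Local config on the two Xvnc machines only.
# ClockMin = minutes since midnight, ClockDay = 0 (Sunday) .. 6 (Saturday).
# The 08:00-18:00 Monday-Friday window is an assumed example.
WorkHours  = ( (ClockMin >= 480 && ClockMin < 1080) && \
               (ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 480 || ClockMin >= 1080) || \
               (ClockDay == 0 || ClockDay == 6) )
# Only accept new jobs outside working hours.
START = $(AfterHours)
```

This only controls whether new jobs may start. A job that began at night and is still running in the morning keeps running unless the SUSPEND or PREEMPT expressions also take the time of day into account. Setting START this way also replaces whatever conditions the existing START expression had, so those may need to be combined.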
5. Many of the tools check out licenses from FlexLM, and I'm not
sure how to tell Condor it's a limited resource, and extra attempts to
fire up such jobs will fail and therefore they need to be queued and the
user informed as to why. Where do I look for that? Also, this is the
reason I shouldn't have some jobs pre-empted I think, because unused
licenses will stay stuck as checked out... unless there's a trick I'm
I'll let someone else field the FlexLM question in more detail.
However, if your application exits in a predictable way you could use
the on_exit_hold attribute of your job to put it on hold and
periodic_release to release it after something changes indicating more
licenses are available.
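A minimal sketch of that idea in a submit description file follows. The script name run_sim.sh and exit code 3 are hypothetical; substitute whatever your tool actually returns when it cannot get a license. The ten-minute timer in periodic_release is only a stand-in for a real "licenses are available" check:

```
# Submit description file. run_sim.sh and exit code 3 are hypothetical.
universe     = vanilla
executable   = run_sim.sh
# Put the job on hold if it exited normally with the assumed
# "no license available" exit code.
on_exit_hold = (ExitBySignal == False) && (ExitCode == 3)
# Release a held job after it has been on hold for 10 minutes
# (a placeholder for a real license-availability condition).
periodic_release = (CurrentTime - EnteredCurrentStatus) > 600
queue
```

A released job goes back to idle and is matched again in a later negotiation cycle, so the user sees it as held or queued rather than failed.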