
Re: [Condor-users] Delay on submit, and other newbie issues



(inline)

Ira Abramov wrote:
Hello list. I'm a sysadmin of more than a decade, but new to batch
servers and condor. I tried to drink in the list and the manual, but I'm
still not in sync with all the jargon, so apologies in advance...

My Condor setup is pretty small (and probably overkill): I have one
CentOS 5 machine (server) and three CentOS 4 machines for execution. I
installed the 7.0.2 RPM on them and did the bare minimum of tweaking:
changing the 300-second intervals to 30. I can see all slots.

The jobs we are running are EDA tools (VLSI design tools like Specman
from Cadence and such). Some are builds of several minutes, and some are
simulations and regressions that run for hours.

Problems I'm seeing now:
1. When I submit a job, it only starts running in a slot some 10-20
seconds later. Is that the 30-second interval for matching? Can I set
the server to match immediately on submission? The people here are used
to the immediacy of running the job locally.

Yes, that is the NEGOTIATOR_INTERVAL. You can make it shorter, but I would not recommend it once you have hundreds of thousands of jobs in your system.
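
A minimal sketch of that setting, assuming it goes in the central manager's local configuration file (the value 10 is only an illustration, not a recommendation):

```
# Seconds between negotiation cycles on the central manager.
# Illustrative value; shorter cycles mean more negotiator load.
NEGOTIATOR_INTERVAL = 10
```

Running condor_reconfig on the central manager afterwards should make the negotiator pick up the change.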


2. When I use "condor_q" I can see the job is running but not which slot
was allocated to it, and I could not find a switch to add such a
column. What have I missed?

condor_q -run


3. Most of our jobs are "well behaved" in the sense that they only take
up a single core anyway. I could not find anywhere to define jobs that
may parallelize (like "make" forking two compilers or a JVM splitting into
threads) or how to tell the manager about them.

Check out the Parallel Universe.


4. Two of the three machines are running Xvnc sessions for several
Windows users. So while there's no console activity, there's a reason we
don't want jobs running on those two machines until after hours. I could
not find anything smart to do about this kind of scheduling other than a
cron job to remove and reinstate startd at the start and end of the work
day. Help?

Check out the StartD Policy section of the manual:

	http://www.cs.wisc.edu/condor/manual/v7.1/3_5Startd_Policy.html

You can set the START expression so that no jobs can be started during business hours.
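
As a rough sketch, along the lines of the time-of-day example in that manual section, the configuration on the two Xvnc machines might look like the following. The WorkHours and AfterHours macro names and the 8:00 to 18:00 weekday window are assumptions for illustration; ClockMin (minutes since midnight) and ClockDay (0 = Sunday) are machine ClassAd attributes.

```
# Assumed business hours: Monday-Friday, 08:00 (480) to 18:00 (1080).
WorkHours  = ( (ClockMin >= 480 && ClockMin < 1080) && (ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 480 || ClockMin >= 1080) || (ClockDay == 0 || ClockDay == 6) )

# Only let new jobs start outside business hours.
START = $(AfterHours)
```

Note that START only controls whether new jobs may begin. A job already running when business hours start keeps running unless the rest of the policy (for example SUSPEND or PREEMPT) also refers to the time.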


5. Many of the tools check out licenses from FlexLM, and I'm not
sure how to tell Condor that it's a limited resource: extra attempts to
fire up such jobs will fail, so they need to be queued and the
user informed as to why. Where do I look for that? Also, I think this is
the reason I shouldn't have some jobs pre-empted, because unused
licenses will stay stuck as checked out... unless there's a trick I'm
unaware of?

I'll let someone else field the FlexLM question in more detail.

However, if your application exits in a predictable way, you could use the on_exit_hold attribute of your job to put it on hold and periodic_release to release it once something changes to indicate that more licenses are available.

http://www.cs.wisc.edu/condor/manual/v7.1/condor_submit.html#57335
http://www.cs.wisc.edu/condor/manual/v7.1/condor_submit.html#57369
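
A minimal sketch of that approach in a submit description file. The wrapper script name run_sim.sh, the exit code 42 as its "no license available" signal, and the 10-minute retry delay are all hypothetical; the real tool's exit codes would need to be checked.

```
universe         = vanilla
executable       = run_sim.sh

# Hold the job instead of letting it leave the queue when the
# (hypothetical) wrapper exits with 42, meaning no license was free.
on_exit_hold     = (ExitBySignal == False) && (ExitCode == 42)

# Release a held job once it has been on hold for 10 minutes.
periodic_release = (CurrentTime - EnteredCurrentStatus) > 600

queue
```

This simply retries on a timer; it does not ask the license server whether a license is actually free.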

Best,


matt

Thanks,
Ira.