
[Condor-users] Delay on submit, and other newbie issues

Hello list. I've been a sysadmin for more than a decade, but I'm new to
batch servers and Condor. I have tried to absorb the list and the
manual, but I'm still not in sync with all the jargon, so apologies in
advance...

My Condor setup is pretty small (and probably overkill): I have one
CentOS 5 machine (the server) and three CentOS 4 machines for execution.
I installed the 7.0.2 RPM on all of them and did the bare minimum of
tweaking: I changed the 300-second intervals to 30, and I can see all
the slots.

The jobs we are running are EDA tools (VLSI design tools such as Specman
from Cadence). Some are builds that take several minutes, and some are
simulations and regressions that run for hours.

Problems I'm seeing now:
1. When I submit a job, it only starts running in a slot some 10-20
seconds later. Is that the 30-second interval for matching? Can I set
the server to match immediately on submission? The people here are used
to the immediacy of running the job locally.

2. When I use "condor_q" I can see that the job is running, but not
which slot was allocated to it, and I could not find a switch to add
such a column. What have I missed?

3. Most of our jobs are "well behaved" in the sense that they only take
up a single core anyway. I could not find any way to declare jobs that
may run in parallel (like "make" forking two compilers, or a JVM
splitting into threads), or how to tell the manager about them.

4. Two of the three machines are running Xvnc sessions for several
Windows users, so while there's no console activity, there's a reason we
don't want jobs running on those two machines until after hours. I could
not find anything smarter for this kind of scheduling than a cron job
that stops startd at the start of the work day and restarts it at the
end. Help?

5. Many of the tools check out licenses from FlexLM, and I'm not sure
how to tell Condor that licenses are a limited resource: extra attempts
to fire up such jobs will fail, so they need to be queued and the user
informed as to why. Where do I look for that? Also, I think this is the
reason I shouldn't have some jobs preempted, because their licenses
will stay stuck as checked out... unless there's a trick I'm unaware
of?
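
To make the cron workaround from item 4 concrete, here is a rough
sketch of the decision such a cron job would make on each run. The
working hours (08:00-18:00), the treatment of weekends as "after hours",
and the function names are all placeholders of mine, and the exact
condor_on/condor_off arguments are my reading of the man pages and need
checking before anyone relies on them.

```python
from datetime import datetime, time

# Assumed working hours; adjust to the site's real schedule.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def startd_should_run(now: datetime) -> bool:
    """Return True when jobs may run on the VNC machines.

    Assumption: Saturday and Sunday count as after hours.
    """
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= now.time() < WORK_END)


def command_for(now: datetime) -> list:
    """Pick the command the cron job would run at this moment.

    The "-subsystem startd" arguments are an assumption to verify
    against the installed Condor version.
    """
    tool = "condor_on" if startd_should_run(now) else "condor_off"
    return [tool, "-subsystem", "startd"]


if __name__ == "__main__":
    # Only print the command; a real cron script would execute it.
    print(" ".join(command_for(datetime.now())))
```

This only prints the command it would run, so it is safe to try out. It
is exactly the kind of blunt on/off switch I would like to replace with
something smarter.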


Cable guy
Ira Abramov