
Re: [Condor-users] SPOOL file clash with multiple submitters



I don't ever use checkpointing so this was never tested with the suggested config in that post, sorry.

I thought the shadow was responsible for stashing the checkpoint files -- it sounds like, with the suggested configuration, the shadows spawned by each schedd are not inheriting that schedd's settings, so they are not getting a unique SPOOL directory.

One thing you could try is to use a SPOOL setting that's unique for every single shadow:

SPOOL = $(LOCAL_DIR)/checkpoints/$(CurrentTime)/$(PID)

That'd stop PID collisions.

Honestly, I'm not sure that'll work but that's probably moving in the right direction.

There's a more convoluted way of setting up multiple schedds that involves pointing each schedd at its own configuration file. It's what we did before 7.6.x, for the 7.2 and 7.4 series. That may be a better way to propagate a unique SPOOL setting to the shadows on a per-schedd basis.
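
A rough sketch of what that older approach might look like is below. I haven't tested it with checkpointing. The config file path /etc/condor/condor_config.q1 is made up for illustration, and the assumption that shadows pick up the same environment as the schedd that spawns them is exactly the part that needs verifying:

```
# Untested sketch; file paths here are hypothetical.
# In the main condor_config read by the condor_master:
SCHEDD1 = $(SBIN)/condor_schedd1
SCHEDD1_ARGS = -f
# Launch this schedd with its own configuration file via its environment.
SCHEDD1_ENVIRONMENT = "CONDOR_CONFIG=/etc/condor/condor_config.q1"

# In /etc/condor/condor_config.q1 (read by that schedd and, presumably,
# by the shadows it spawns, if they inherit its environment):
SCHEDD_NAME = Q1@$(HOSTNAME)
SPOOL = $(SPOOL_ROOT)/schedd1
SCHEDD_LOG = $(LOG)/ScheddLog.1
```

The per-schedd file would have to be a complete configuration in its own right (for example, a copy of the main one with these overrides), since it replaces the main file for that daemon rather than adding to it.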

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing

On Friday, 27 January, 2012 at 7:52 AM, Smith, Ian wrote:

Hello All,

I am trying to set up multiple schedulers on our SMP central manager/submit
host along the lines suggested by Cycle Computing
(see http://www.cyclecomputing.com/wiki/index.php?title=Running_Multiple_Condor_Schedds)

This seemed to be working well until I noticed there was a clash between the
checkpoint files of jobs from one schedd and those of another. As far as I
can see the job IDs of jobs in separate queues are not unique so if a user of one
scheduler has a checkpointed job with say ID 3.1, its checkpoint files will be in

$(SPOOL_ROOT)/3/1/cluster...

But then another user on another schedd has a job with same ID 3.1 and it
attempts to use the same directory which fails because of file permissions.

I've configured Condor with

SPOOL_ROOT = /condor_scratch/spool

SCHEDD1 = $(SBIN)/condor_schedd1
SCHEDD1_ARGS = -f -local-name Q1
SCHEDD1_LOG = $(LOG)/ScheddLog.1
SCHEDD.Q1.SCHEDD_NAME = Q1@$(HOSTNAME)
SCHEDD.Q1.SPOOL = $(SPOOL_ROOT)/schedd1
SCHEDD.Q1.SCHEDD_LOG = $(SCHEDD1_LOG)

SCHEDD2 = $(SBIN)/condor_schedd2
SCHEDD2_ARGS = -f -local-name Q2
SCHEDD2_LOG = $(LOG)/ScheddLog.2
SCHEDD.Q2.SCHEDD_NAME = Q2@$(HOSTNAME)
SCHEDD.Q2.SPOOL = $(SPOOL_ROOT)/schedd2
SCHEDD.Q2.SCHEDD_LOG = $(SCHEDD2_LOG)

...etc

but the checkpointing files always seem to get written under the common $(SPOOL)
directory rather than separate ones causing the clash.

Interestingly, Condor does seem to put these files in individual directories (not
the common spool area):

job_queue.log job_queue.log.1 local_univ_execute spool_version

so it seems to be aware of SCHEDD.Q1.SCHEDD_LOG if not SCHEDD.Q2.SPOOL

If I take out the default spool/ directory and remove the $(SPOOL) definition,
the negotiator fails on startup. Since there's only one negotiator, I would
expect it to use a common directory?

Any suggestions would be very useful.

thanks in advance,

-ian.

---------------------------------------
Dr Ian C. Smith,
Advanced Research Computing,
University of Liverpool.