
Re: [Condor-users] SPOOL file clash with multiple submitters



Thanks for this. As you say, the nub of the matter seems to be how the different daemons interpret the value of $(SPOOL) - particularly the scheduler, the negotiator and the shadows. Can anyone from the Condor team shed any light on this? I can't find much about it in the manual.

 

regards,

 

-ian.

 

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 27 January 2012 13:19
To: Condor-Users Mail List
Subject: Re: [Condor-users] SPOOL file clash with multiple submitters

 

I don't ever use checkpointing so this was never tested with the suggested config in that post, sorry.

 

I thought the shadow was responsible for stashing the checkpoint files -- it sounds like, with the suggested configuration, the shadows spawned by the schedd are not inheriting the schedd's SPOOL setting, so they never get a unique SPOOL directory.

 

One thing you could try is to use a SPOOL setting that's unique for every single shadow:

 

SPOOL = $(LOCAL_DIR)/checkpoints/$(CurrentTime)/$(PID)

 

That'd stop PID collisions.

 

Honestly, I'm not sure that'll work but that's probably moving in the right direction.
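
Another variation that might be worth a try -- untested on my end, and it assumes the shadow is the daemon that actually looks up SPOOL when it stashes the checkpoints -- is to scope the override to the shadow subsystem so the schedd and the negotiator keep using the common spool:

# untested sketch: a <SUBSYS>.<PARAM> setting is only seen by the named daemon,
# so this should leave $(SPOOL) alone for everything except the shadows
SHADOW.SPOOL = $(LOCAL_DIR)/checkpoints/$(CurrentTime)/$(PID)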

 

There's a more convoluted way of setting up multiple schedds that involves pointing each schedd at a unique configuration file. It's what we did pre-7.6.x for the 7.2 and 7.4 series. That may be a better way to propagate a unique SPOOL setting to the shadows on a per-schedd basis.
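
As a rough sketch of the alternative -- treat the exact knobs here as an assumption on my part rather than something I've verified recently -- because the shadows inherit the environment of the schedd that spawned them, you may be able to hand each schedd its own SPOOL through the environment the master starts it with:

# untested sketch: assumes the master honours <DaemonName>_ENVIRONMENT and that a
# _CONDOR_SPOOL variable in a daemon's environment overrides the config file value
# for that daemon and for anything it forks (i.e. the shadows)
SCHEDD1_ENVIRONMENT = "_CONDOR_SPOOL=$(SPOOL_ROOT)/schedd1"
SCHEDD2_ENVIRONMENT = "_CONDOR_SPOOL=$(SPOOL_ROOT)/schedd2"

If that works it would avoid maintaining a separate configuration file per schedd.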

 

Regards,

- Ian

 

---

Ian Chesal

 

Cycle Computing, LLC

Leader in Open Compute Solutions for Clouds, Servers, and Desktops

Enterprise Condor Support and Management Tools

 

http://www.cyclecomputing.com

http://www.cyclecloud.com

http://twitter.com/cyclecomputing

 

On Friday, 27 January, 2012 at 7:52 AM, Smith, Ian wrote:

Hello All,

 

I am trying to set up multiple schedulers on our SMP central manager/submit host along the lines suggested by Cycle Computing.

 

This seemed to be working well until I noticed there was a clash between the checkpoint files of jobs from one schedd and those of another. As far as I can see the job IDs of jobs in separate queues are not unique, so if a user of one scheduler has a checkpointed job with, say, ID 3.1, its checkpoint files will be in

$(SPOOL_ROOT)/3/1/cluster...

But then another user on another schedd has a job with the same ID 3.1, and their job attempts to use the same directory, which fails because of file permissions.
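
In other words both schedds end up resolving the same per-job spool directory, something like:

$(SPOOL_ROOT)/3/1/...   <- created by the Q1 user's job 3.1
$(SPOOL_ROOT)/3/1/...   <- the Q2 user's job 3.1 then hits a permission error here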

 

I've configured Condor with

 

SPOOL_ROOT = /condor_scratch/spool

 

SCHEDD1 = $(SBIN)/condor_schedd1
SCHEDD1_ARGS = -f -local-name Q1
SCHEDD1_LOG = $(LOG)/ScheddLog.1
SCHEDD.Q1.SCHEDD_NAME = Q1@$(HOSTNAME)
SCHEDD.Q1.SPOOL = $(SPOOL_ROOT)/schedd1
SCHEDD.Q1.SCHEDD_LOG = $(SCHEDD1_LOG)

SCHEDD2 = $(SBIN)/condor_schedd2
SCHEDD2_ARGS = -f -local-name Q2
SCHEDD2_LOG = $(LOG)/ScheddLog.2
SCHEDD.Q2.SCHEDD_NAME = Q2@$(HOSTNAME)
SCHEDD.Q2.SPOOL = $(SPOOL_ROOT)/schedd2
SCHEDD.Q2.SCHEDD_LOG = $(SCHEDD2_LOG)

 

...etc

 

but the checkpoint files always seem to get written under the common $(SPOOL) directory rather than the separate ones, causing the clash.

 

Interestingly, Condor does seem to put these files in individual directories (not the common spool area):

job_queue.log  job_queue.log.1  local_univ_execute  spool_version

so it seems to be aware of SCHEDD.Q1.SCHEDD_LOG, if not of SCHEDD.Q2.SPOOL.

 

If I take out the default spool/ directory and remove the $(SPOOL) definition, the negotiator fails on start-up. Since there's only one negotiator I would expect it to use a common directory - is that right?
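
In case it helps, the layout I'm aiming for is roughly the following (I'm assuming the negotiator and the other daemons still want the common definition left in place):

# common spool kept for the negotiator etc. ($(LOCAL_DIR)/spool is the usual default)
SPOOL = $(LOCAL_DIR)/spool
# per-schedd spool directories for the two queues
SCHEDD.Q1.SPOOL = $(SPOOL_ROOT)/schedd1
SCHEDD.Q2.SPOOL = $(SPOOL_ROOT)/schedd2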

 

Any suggestions would be very useful.

 

thanks in advance,

 

-ian.

 

---------------------------------------

Dr Ian C. Smith,

Advanced Research Computing,

University of Liverpool.
