
[Condor-users] Schedd crashes when using SOAP



I've noticed that when a transaction is not properly closed (committed or
aborted), the schedd has a tendency to crash. Shouldn't condor_master
notice this and bring condor_schedd back up? If I kill condor_schedd,
condor_master does its job and brings up a new one. This seems like a
very bad state to get into: the schedd cannot receive new jobs (via SOAP or
condor_submit), but condor_master does not see it as failed.

Here's the stack trace from SchedLog: http://pastebin.com/wtgheafq

Thoughts?

Cheers
David

On Wed, Sep 22, 2010 at 4:05 PM, Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Sep 22, 2010 at 2:24 PM, David Arthur <mumrah@xxxxxxxxx> wrote:
>>
>> My use case is: I have a few low priority long running jobs that will
>> always be running, as well as occasional short running high priority
>> jobs. I would like for the high priority jobs to be able to preempt
>> the lower priority jobs, but I don't want to lose any progress on the
>> low priority ones (since they are costly). I feel like this is
>> possible, but I'm a bit confused on the vocabulary.
>
> Once a job is running in a slot, it owns the slot and Condor can't suspend it
> and give the slot to another job. So in order to achieve what you're after
> you have to make slots that only deal with certain types of jobs, but have
> policies that interact with each other. It's not impossible, but it's not
> trivial either.
> Let's say you've got a 2 CPU machine that you'd normally advertise 2 slots
> from. In order to achieve your goals you'll want to consider forcing the
> machine to advertise 4 slots instead. "Slot pairs" if you will. Slots 1 & 2
> will be a pair and slots 3 & 4 will be a pair.
> To advertise 4 identical slots:
> NUM_CPUS = 4
> This has the side-effect of causing the memory and disk in the machine to
> now be divided 4 ways instead of two. So you may also want to double the memory
> Condor thinks the machine has with:
> MEMORY = $(DETECTED_MEMORY) * 2
> There's not much you can do about disk except perhaps write your job
> requirement expressions to reference TotalDisk instead of Disk from the
> machine's ad.
> In a slot pair the first slot (the lower numbered slot) will *only* run long
> running jobs. How do we know a job is long running? You'll have to tell the
> system when you submit a job:
> +LongRunningJob = True
> And the START expression for the slot will be:
> START = LongRunningJob == True && ...whatever other slot stuff you usually
> have...
> The other slot in the pair will only run fast running jobs. Same deal:
> you'll need to identify them at submit time and tune your start expression
> to look for the attribute in jobs.
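
A minimal sketch of the two kinds of submit description files this implies.
The file and executable names are hypothetical, and setting the attribute
explicitly to False on short jobs is an assumption made for illustration;
only +LongRunningJob itself comes from the message above.

```
# long_job.sub (hypothetical file name)
universe        = vanilla
executable      = long_job.sh
# Custom job ClassAd attribute that the slot's START expression tests
+LongRunningJob = True
queue
```

```
# short_job.sub (hypothetical file name)
universe        = vanilla
executable      = short_job.sh
# Marked explicitly so a START expression can tell the two job types apart
+LongRunningJob = False
queue
```
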
> You'll also want to cross-advertise the state of each slot in every other
> slot's ad, so that you can write START/SUSPEND/CONTINUE expressions for slot 1
> that reference the state of slot 2.
> Still with me?
> To advertise the necessary attributes across all the slots you use
> STARTD_SLOT_ATTRS:
> STARTD_SLOT_ATTRS = State, Activity, EnteredCurrentActivity
> That would make the state and activity of Slot 2 available in the Slot 1 ad
> as:
> Slot2_State
> Slot2_Activity
> So let's try writing a bit of policy around this. First: let's say that we
> won't start long running jobs if a short running job is using the paired slot.
> This translates to: jobs won't run in Slot 1 if Slot 2 is running a job already.
> So:
> START = (SlotID == 1 && (LongRunningJob =?= True && (Slot2_State ==
> "Unclaimed" && Slot2_Activity == "Idle"))) || (SlotID != 1)
> Interesting, eh? Because settings are shared among all the slots (we don't
> have per-slot config files) we need to write an expression that's different
> depending on the slot ID. In this case Slot 1 gets the first bit, and every
> other slot gets True.
> Now what if Slot 1 is running a job and something lands in Slot 2? We want
> to write a policy that suspends the job in Slot 1 while Slot 2 is busy. Not
> a problem:
> SUSPEND = (SlotID == 1 && (Slot2_State == "Claimed" && Slot2_Activity ==
> "Busy")) || (SlotID != 1 && False)
> WANT_SUSPEND = SUSPEND
> CONTINUE = (SlotID == 1 && (Slot2_State == "Unclaimed" && Slot2_Activity ==
> "Idle")) || (SlotID != 1 && True)
> That's, more or less, right I think. I haven't actually tested it but it's
> in the ballpark of what you're after.
> And hopefully you can extrapolate from that to see how you'd expand your
> setup to control the other slot pair (slots 3 & 4) to behave the same way.
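
One way the extrapolation to both slot pairs might look, as an untested
sketch. The PAIR*_ macro names are hypothetical helpers, not Condor
built-ins, and the sketch assumes the cross-advertised attributes for slot 4
follow the same Slot<N>_<Attr> naming used above for slot 2. It also
restricts slots 2 and 4 to jobs that are not marked long running, which the
text describes but the START expression above leaves as True.

```
# Cross-advertise each slot's state into every other slot's ad
STARTD_SLOT_ATTRS = State, Activity, EnteredCurrentActivity

# Hypothetical helper macros describing the short-job slot of each pair
PAIR1_SHORT_IDLE = (Slot2_State == "Unclaimed" && Slot2_Activity == "Idle")
PAIR1_SHORT_BUSY = (Slot2_State == "Claimed"   && Slot2_Activity == "Busy")
PAIR2_SHORT_IDLE = (Slot4_State == "Unclaimed" && Slot4_Activity == "Idle")
PAIR2_SHORT_BUSY = (Slot4_State == "Claimed"   && Slot4_Activity == "Busy")

# Slots 1 and 3 take long running jobs only while their partner is idle;
# slots 2 and 4 take anything not marked long running
START = (SlotID == 1 && LongRunningJob =?= True && $(PAIR1_SHORT_IDLE)) || \
        (SlotID == 3 && LongRunningJob =?= True && $(PAIR2_SHORT_IDLE)) || \
        ((SlotID == 2 || SlotID == 4) && LongRunningJob =!= True)

# Only the long-job slots are ever suspended, and only while the partner is busy
SUSPEND      = (SlotID == 1 && $(PAIR1_SHORT_BUSY)) || \
               (SlotID == 3 && $(PAIR2_SHORT_BUSY))
WANT_SUSPEND = $(SUSPEND)

# Resume once the partner slot is idle again; slots 2 and 4 are never suspended
CONTINUE     = (SlotID == 1 && $(PAIR1_SHORT_IDLE)) || \
               (SlotID == 3 && $(PAIR2_SHORT_IDLE)) || \
               (SlotID == 2 || SlotID == 4)
```
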
> I'd like to point out some caveats though:
> 1. This is infinite suspension. As long as you have jobs running in slot 2,
> the job in slot 1 stays suspended. You can use the PREEMPT setting to evict a
> slot 1 job that's been suspended for a long time and maybe give it a chance
> to run on some other machine.
> 2. Suspending a job just gets you back CPU. It doesn't get you back the
> memory used by the suspended job. And, depending on the tool, it sometimes
> doesn't get you back the licenses it's using either. Worth keeping in the
> back of your mind if you find you're running out of machine or shared
> resources.
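
For caveat 1, a PREEMPT expression along these lines could bound the
suspension time. This is an untested sketch: MAX_SUSPEND_SECONDS is a
hypothetical helper macro and the one-hour limit is an arbitrary example. It
assumes the CurrentTime attribute is available in your Condor version and
that no other PREEMPT policy is already in place (if one is, the two would
need to be combined with ||). A job evicted this way that cannot checkpoint
would restart from the beginning on its next run.

```
# Hypothetical helper macro: longest tolerated suspension, in seconds
MAX_SUSPEND_SECONDS = 3600

# Evict a long running job from slot 1 once it has been suspended too long
PREEMPT = (SlotID == 1) && (Activity == "Suspended") && \
          ((CurrentTime - EnteredCurrentActivity) > $(MAX_SUSPEND_SECONDS))
```
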
> Hopefully that wasn't too much to follow.
> Regards,
> - Ian
>
> ________________________________
> Cycle Computing, LLC
> The Leader in Open Compute Solutions for Clouds, Servers, and Desktops
> Enterprise Condor Support and Management Tools
>
> http://www.cyclecomputing.com
> http://www.cyclecloud.com



-- 
David Arthur