Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Dynamic Slots in Parallel Universe

Date: Mon, 12 Mar 2018 11:11:28 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dynamic Slots in Parallel Universe

On 3/9/2018 11:40 AM, Larne Pekowsky wrote:

Hi Todd,
I'm resurrecting this thread because I think we're still seeing related problems. One of our users has a parallel universe job that has been idle for almost a day. The StartLog on the available nodes seems to indicate that the nodes are held for a while and then released without ever having enough nodes to start the job.

[snip]

Any suggestions? If you need any additional information please let me know.

Cheers,

- Larne


Hi Larne,

It looks like your schedd is indeed running with Greg's v8.7.7 code patch from
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6517
so it should be working for you...

Does your condor_config on your central manager include
  ALLOW_PSLOT_PREEMPTION = True
?

And does the condor_config on all your execute nodes have a RANK expression that prefers your dedicated scheduler submit machine (e.g. like the example at http://tinyurl.com/yaolvshk)?
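
For reference, here is a minimal sketch of the two settings asked about above. It assumes the dedicated scheduler runs on a submit machine called submit.example.edu, which is a placeholder hostname and not the actual host. The RANK lines follow the general pattern of the dedicated-scheduling setup in the HTCondor manual; the linked example may differ in detail.

  # On the central manager (read by the negotiator)
  ALLOW_PSLOT_PREEMPTION = True

  # On every execute node; submit.example.edu is a placeholder
  DedicatedScheduler = "DedicatedScheduler@submit.example.edu"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
  # Prefer jobs coming from the dedicated scheduler over all others
  RANK = Scheduler =?= $(DedicatedScheduler)

Running "condor_config_val ALLOW_PSLOT_PREEMPTION" on the central manager and "condor_config_val RANK" on an execute node shows the values those machines are actually using.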

If the answer to both of the above questions is yes, then the next step is that Greg will likely have more questions for you to get to the bottom of this. After the above patch, Greg observed parallel universe jobs working here at UW with partitionable slots, so I imagine he will need to figure out what is different at Syracuse.


Thanks
Todd
