
Re: [HTCondor-users] mpi job stuck as idle



The only configuration setting that's relevant on the submit node is
UNUSED_CLAIM_TIMEOUT. This controls how long an execute node can
remain both idle and claimed by the dedicated scheduler before it goes
back into the unclaimed state. The default is 10 minutes. In the same
place you found condor_config.local.dedicated.resource, there should
be an example condor_config.local.dedicated.submit config file that
explains this setting in more detail.
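
As a rough sketch, assuming the value is given in seconds (so the
10-minute default corresponds to 600), the entry in the submit node's
local config might look something like this. Which file it goes in is
an assumption here; use whichever local config file your site already
has on the submit node.

```
# Submit node local config (illustrative only).
# Assumption: the value is in seconds, so 600 matches the 10-minute default.
# How long a claimed resource may sit idle before the dedicated
# scheduler gives it back to the pool.
UNUSED_CLAIM_TIMEOUT = 600
```

After changing it, run condor_reconfig on the submit node so the new
value is picked up.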

When you run condor_status, by default, only the Startd ClassAds
(which contain information about your execute slots) in your condor
pool are queried. The DedicatedScheduler config knob is only relevant
for the Startd daemon, so you won't see it defined if you query the
Schedd (submit daemon) or other daemons.

Jason

On Mon, Jan 22, 2018 at 3:16 PM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
>>This config will need to be on all the execute machines that should be
>>allowed to run parallel universe jobs, and then condor_reconfig should
>>be run on them. The config tells the execute node to trust the submit
>>node (what I think you mean by frontend) as the dedicated scheduler
>>for parallel universe jobs.
>
>
> Great :)
> compute-0-0 has been added successfully. I can now see that the undefined
> word is replaced by the dedicated scheduler name.
>
> One more thing. Although I did the same thing on the submit node (rocks7),
> I cannot see it in the list.
>
>
>
> [root@rocks7 etc]# condor_config_val  -config
> Configuration source:
>     /opt/condor/etc/condor_config
> Local configuration sources:
>     /opt/condor/etc/config.d/000Rocks.conf
>     /opt/condor/etc/config.d/99Rocks.conf
>     /opt/condor/etc/config.d/condor_config.local.dedicated.resource
>     /opt/condor/etc/condor_config.local
>
> [root@rocks7 etc]# condor_status -af:h Machine DedicatedScheduler
> Machine           DedicatedScheduler
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
>
>
>
> I can now see that the hellompi job is running on compute-0-0
>
> Regards,
> Mahmood
>