
Re: [Condor-users] Condor Using Parallel Universe



Hi Edier,

Thanks for your input.  I ran condor_q -better-analyze and got:

2067.000:  Run analysis summary.  Of 208 machines,
      0 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
    207 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job

The following attributes are missing from the job ClassAd:

CheckpointPlatform

I will check with the system administrator about the directives you mentioned. Let me know if the condor_q -better-analyze output gives you any insight.

Thanks,
Sara

On Sep 22, 2011, at 5:26 PM, Edier Zapata wrote:

Hi Sara,
Can you run condor_q -better-analyze?
Did you add these directives to the Manager's condor_config.local?

-- PARALLEL DIRECTIVES FOR EXECUTE CENTRAL MANAGER WITH SUBMIT --
UNUSED_CLAIM_TIMEOUT = 0
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
ALTERNATE_STARTER_2 = $(SBIN)/condor_starter
STARTER_2_IS_DC = TRUE
SHADOW_MPI = $(SBIN)/condor_shadow

And these to the execute node's condor_config.local?
-- PARALLEL DIRECTIVES FOR EXECUTE NODE --
DedicatedScheduler = "DedicatedScheduler@YOUR_SCHEDULER'S_NAME"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

Hope this helps.
Bye


On 9/22/11, Sara Rolfe <smrolfe@xxxxxxxxxxxxxxxx> wrote:
Hello,

I'm trying to get a program to run using the parallel universe.  I've
had no problems using the vanilla universe. When I submit my parallel
job, it hangs in idle.

I've also tried the "sleep 30" example from the manual using two
machines, but that isn't working either. The run analysis summary
says:

2067.000:  Run analysis summary.  Of 208 machines,
      0 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
    206 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
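
For reference, the two-machine sleep test described above would use a submit description file roughly like the sketch below. It follows the general shape of the manual's parallel universe example. The output, error, and log file names are placeholders, not taken from the original job.

```
# Sketch of a parallel universe test job: run /bin/sleep 30 on two machines.
# The file names below are placeholders, not from the original submission.
universe      = parallel
executable    = /bin/sleep
arguments     = 30
machine_count = 2
output        = sleep.out
error         = sleep.err
log           = sleep.log
queue
```

A job like this can only match machines that advertise a DedicatedScheduler attribute naming the submitting scheduler, which is what the execute-node settings quoted earlier in the thread are meant to provide.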

Does anyone have ideas on how to debug this?

Thanks,
Sara

--
Edier Alberto Zapata Hernández
Ingeniero de Sistemas
Universidad de Valle