Re: [Condor-users] submit to a remote pool, from a linux machine, to an osx manager
- Date: Thu, 16 Sep 2010 16:43:50 -0400
- From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] submit to a remote pool, from a linux machine, to an osx manager
On Sep 16, 2010, at 15:07 , Peter Doherty wrote:
Hi,
Okay, so I've got an all-Mac cluster (Intel Xeons running OS X 10.5).
I set up the head node as a central manager running the
schedd, collector, etc.
It works pretty well. But I've got a Linux box that I want to use
to submit jobs from. So I installed Condor on the Linux box and set
the condor_config variables to point to the remote pool.
No daemons run on the Linux box; I just changed CONDOR_HOST to
reference the Mac central manager.
I can use condor_submit and run jobs, and everything works great.
But when I want to submit a DAG, I run into problems.
1.) The DAGMan scheduler universe job has requirements (OpSys, Arch)
that are looking for a LINUX X86_64 machine, but the CM is INTEL, OSX.
2.) The CMD it wants to run is the Linux binary of condor_dagman
(because it pulled the config from the Linux box, I presume).
So the job just gets stuck eternally in the queue.
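
To make the mismatch in point 1 concrete, here is a rough Python sketch. It is only a simplified stand-in for ClassAd matching (plain equality on two attributes, not real ClassAd expression evaluation), and the function name unmet_requirements is made up for illustration. The attribute values are the ones quoted above.

```python
# Simplified stand-in for ClassAd matching: real Condor evaluates full
# ClassAd expressions, not plain dictionary equality.

def unmet_requirements(requirements, machine_ad):
    """Return the attributes whose required value differs from the machine's."""
    return {
        attr: (wanted, machine_ad.get(attr))
        for attr, wanted in requirements.items()
        if machine_ad.get(attr) != wanted
    }

# Values from the post: the DAGMan job asks for a Linux x86_64 host...
dagman_requirements = {"OpSys": "LINUX", "Arch": "X86_64"}
# ...but the central manager advertises itself as an Intel OS X machine.
central_manager_ad = {"OpSys": "OSX", "Arch": "INTEL"}

mismatches = unmet_requirements(dagman_requirements, central_manager_ad)
for attr, (wanted, actual) in mismatches.items():
    print(f"{attr}: job requires {wanted!r}, machine advertises {actual!r}")
```

Both attributes fail to match here, which is consistent with the job sitting idle in the queue.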
........
Thanks.
Peter
In true mailing list form, I made some headway shortly after sending
this email.
I changed ARCH and OPSYS in the condor_config on the local box, so the
job now gets matched and runs. (I'm sure there is a more proper way to
do this, however.)
But then the job fails to run, because it's a Linux binary on OS X.
So I found the -dagman flag for condor_submit_dag (and passed it the
full path to the OS X binary), but then I get an error in the
job.dag.dagman.out file that shows that dagman started, ran, and then
exited due to -Dagman being an unknown option. So it looks like
condor_submit_dag passed its option on to condor_dagman, which doesn't
support the flag? Although the fact that it produced the error tells
me that at least condor_dagman actually started and ran on the CM
before producing the error.
At least I'm getting closer.
Peter