
Re: [HTCondor-users] Condor/SGE cluster





On Mon, 6 Jan 2014, Lukas Koschmieder wrote:

Hi,

I'm trying to set up Condor so that I can submit jobs to a local SGE cluster. The SGE cluster is already up and running, and I can execute vanilla universe Condor jobs (e.g. "/usr/bin/condor_run -u vanilla -a periodic_remove=JobStatus==5 /bin/hostname &"). But if I try to submit a grid universe job (grid_resource=sge), the job always ends up in the hold state.

condor_status -analyze
Hold reason: Attempts to submit failed:

(...)

[27666] (77.0) blah_job_submit() failed: submission command failed (exit code = 1) (stdout:) (stderr:)

You first need to find out why the 'sge_submit.sh' script
(/usr/libexec/condor/glite/bin/sge_submit.sh) is failing.
A failure at this point means that the script is either unable to
find 'qsub' or that the generated submit file is incorrect.

The script expects to find the SGE root directory and cell name via
the batch_gahp.config (/usr/libexec/condor/glite/etc/batch_gahp.config)
file. These default to the values of the SGE_ROOT and SGE_CELL
environment variables. If these variables are not defined,
'/usr/local/sge/pro' and 'default' are used for the root path and
cell name. You can set sge_root and sge_cellname in batch_gahp.config
as appropriate.
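
As a rough sketch of the fallback order described above, the following
shell snippet reports which root and cell would apply when only the
environment is consulted. It does not read batch_gahp.config, so values
set there are not reflected. The function names are made up for
illustration, and the qsub check only looks at PATH, which may differ
from how sge_submit.sh itself locates the command.

```shell
#!/bin/sh
# Sketch only: mirrors the fallback described above (environment
# variable first, then the built-in default). Function names are
# hypothetical; batch_gahp.config is NOT consulted here.

effective_sge_root() {
    # SGE_ROOT if set and non-empty, else the default root path
    printf '%s\n' "${SGE_ROOT:-/usr/local/sge/pro}"
}

effective_sge_cell() {
    # SGE_CELL if set and non-empty, else the default cell name
    printf '%s\n' "${SGE_CELL:-default}"
}

report_sge_environment() {
    root=$(effective_sge_root)
    cell=$(effective_sge_cell)
    echo "SGE root: $root"
    echo "SGE cell: $cell"

    # A cell normally lives in a directory named after it under the root.
    if [ -d "$root/$cell" ]; then
        echo "cell directory: found"
    else
        echo "cell directory: MISSING ($root/$cell)"
    fi

    # Rough indicator only: is qsub reachable through PATH?
    if command -v qsub >/dev/null 2>&1; then
        echo "qsub: $(command -v qsub)"
    else
        echo "qsub: not on PATH"
    fi
}

report_sge_environment
```

It is worth running this as the user, and in the environment, that the
Condor daemons run under, since an interactive login shell may have
SGE_ROOT and SGE_CELL set by a profile script that the daemons never see.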

If these settings are correct and sge_submit.sh is still failing, you
can try to execute it directly, giving a simple command as an argument,
say:

  sge_submit.sh -c /bin/date

If you wish to inspect the generated submit file, modify sge_submit.sh
so that $bls_tmp_file is either copied away or not removed by the script.
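
Since the hold reason shows an exit code but empty stdout and stderr, it
can help to capture all three when running the script by hand. The
wrapper below is a minimal sketch; run_and_report is a made-up helper
name, and the commented invocation simply reuses the script path and
the '-c /bin/date' example from above.

```shell
#!/bin/sh
# Sketch only: run a command, then print its exit code, stdout and
# stderr separately. run_and_report is a hypothetical helper name.

run_and_report() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || { rm -f "$out_file"; return 1; }

    # Run the given command, keeping the two output streams apart.
    "$@" >"$out_file" 2>"$err_file"
    status=$?

    echo "exit code: $status"
    echo "stdout:"
    cat "$out_file"
    echo "stderr:"
    cat "$err_file"

    rm -f "$out_file" "$err_file"
    # Pass the command's own exit code back to the caller.
    return "$status"
}

# Example invocation (uncomment on the submit host):
# run_and_report /usr/libexec/condor/glite/bin/sge_submit.sh -c /bin/date
```

If the script is a bash script, running it through 'bash -x' in the same
way would additionally trace each command it executes, which may show
where 'qsub' is looked up.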

I unfortunately have no hands-on experience with SGE. However, if these
scripts contain assumptions that don't make sense in your environment
I can make sure they get fixed.

Hope this helps.
Francesco Prelz
INFN Milano