
Re: [HTCondor-users] Submission to remote slurm cluster is Failing consistently



Hi Jaime,

I have done as you said and it now gives me the following error: 

Testing bosco submission...Passed!
Submission and log files for this job are in /home/cloudcopasi/bosco/local.bosco/bosco-test/boscotest.GX3EFH
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
02/03/21 11:41:48 [28657] (12.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
02/03/21 11:41:48 [28657] (12.0) blah_job_submit() failed: submission command failed (exit code = 2) (stdout:) (stderr:/home/FCAM/hbaig/bosco/glite/libexec/slurm_submit.sh: No such file or directory)
02/03/21 11:41:53 [28657] No jobs left, shutting down
02/03/21 11:41:53 [28657] Got SIGTERM. Performing graceful shutdown.
02/03/21 11:41:53 [28657] **** condor_gridmanager (condor_GRIDMANAGER) pid 28657 EXITING WITH STATUS 0
[cloudcopasi@cloud-copasi-new ~]$ 

I think it is looking for the slurm_submit.sh file under the wrong directory (../bosco/glite/libexec/).
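To confirm, I can check both candidate locations on the Slurm login node (the libexec path comes from the error above, the bin path from your earlier instructions):

ls -l ~/bosco/glite/bin/slurm_submit.sh
ls -l ~/bosco/glite/libexec/slurm_submit.sh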

regards
hb

On Feb 3, 2021, at 11:26 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

So you cannot use the default QoS (and possibly the default partition) of this Slurm cluster. This means you have to explicitly request the "general" QoS/partition when submitting to Slurm. Currently, you can't tell HTCondor and Bosco to specify a non-default QoS for Slurm in the job submission on your local machine. But you can modify the Bosco installation on the Slurm login node by hand to always add the QoS and partition settings.

If you're comfortable editing shell scripts, add the following lines at around line 80 in $HOME/bosco/glite/bin/slurm_submit.sh:

echo "#SBATCH -p general" >> $bls_tmp_file
echo "#SBATCH -q general" >> $bls_tmp_file
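In context, the change might look like this (a rough sketch; the surrounding comments are assumptions about what slurm_submit.sh does near line 80, not a verbatim excerpt):

# ... existing code that builds the temporary Slurm batch script
# in $bls_tmp_file ...
# Added by hand: force a non-default partition and QoS on every job.
echo "#SBATCH -p general" >> $bls_tmp_file
echo "#SBATCH -q general" >> $bls_tmp_file
# ... rest of the script continues unchanged ...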

Then, run bosco_cluster --test on your own machine to test your changes.
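For example, something like this (substituting your own login host for the placeholder):

bosco_cluster --test hbaig@login.example.edu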

 - Jaime

On Feb 2, 2021, at 10:10 PM, hasanbaigg@xxxxxxxxx wrote:

Hi again,

Running the same test.sh script with the --partition=general and --qos=general switches worked fine.
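That is, something like:

sbatch --partition=general --qos=general test.sh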

Can you please tell me now how to fix the original problem, the failing cluster test? What changes am I supposed to make, and where?

regards
Hasan

On Feb 2, 2021, at 10:25 PM, Hasan Baig <hasanbaigg@xxxxxxxxx> wrote:

Hi

I just tried that out and running the sbatch command gives me the following error: 

sbatch: error: Batch job submission failed: Invalid qos specification


regards
Hasan

On Feb 2, 2021, at 3:22 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On the Slurm login machine, create a script like this:

#!/bin/bash
#SBATCH -o test.out
#SBATCH -e test.err
/bin/date

Then, submit your script to Slurm using the sbatch command. You can check its status with squeue. As with HTCondor, if squeue doesn't show your job, that means it's done. Assuming everything works, it should look something like this:

% cat test.sh
#!/bin/bash
#SBATCH -o test.out
#SBATCH -e test.err
/bin/date
% sbatch test.sh
Submitted batch job 12103
% squeue -j 12103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
% cat test.out
Tue Feb  2 14:18:14 CST 2021
%

 - Jaime

On Feb 2, 2021, at 1:24 PM, hasanbaigg@xxxxxxxxx wrote:

I am not really sure how to do that. I will explore how to submit a job directly to Slurm and will get back to you with the output.


Meanwhile, if you know of any relevant documentation, kindly share it.

thanks 

regards
Hasan

On Feb 2, 2021, at 1:53 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

BOSCO and HTCondor don't use the -q/--qos command-line option when submitting a job to Slurm. There is a way to have them set the -p/--partition option, but the bosco_cluster tool doesn't use it when testing a new setup.
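For reference, recent HTCondor versions can pass the partition through the batch_queue submit command for grid-universe batch jobs. A minimal sketch (assuming your HTCondor is new enough to support batch_queue; the user and host names are placeholders):

universe      = grid
grid_resource = batch slurm hbaig@login.example.edu
batch_queue   = general
executable    = /bin/date
output        = test.out
error         = test.err
log           = test.log
queue

Note that bosco_cluster --test does not go through this path, so the test submission would still fail.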

Can you try submitting a job directly to Slurm on the cluster login node without using the -p or -q options of sbatch?

 - Jaime

On Feb 2, 2021, at 12:43 PM, hasanbaigg@xxxxxxxxx wrote:

Thanks for your concern.

I will look into the QoS documentation. Meanwhile, I would welcome any Slurm expert chipping in.

regards
Hasan

On Feb 2, 2021, at 1:32 PM, christoph.beyer@xxxxxxx wrote:

Uh,

hopefully someone with slurm experience can jump in here (I better stick to my alternative HTC facts) :D :D :D

As far as I know, the partition you use is defined with a list of allowed QoS values (e.g. default, debug). At submit time you can request a QoS with the '-q' option. You can inspect a partition's definition with:

scontrol show part <partname>

It looks like your job, coming from the HTC side of things, is requesting a QoS that is not defined for the partition you want to use.
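For example, to see which QoS values a partition allows (the field name is from scontrol's partition output; the values shown here are made up):

scontrol show partition general | grep -i allowqos
   AllowGroups=ALL AllowAccounts=ALL AllowQos=general,debug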

Unfortunately that is even thinner ice for me than usual on this list :(

Also I did not fully get the concept of slurm at the time and started using condor instead ;)

But you seem to be on the right track, and should check the documentation about QoS and how to define it for partitions and users alike.


Best
christoph




--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: hasanbaigg@xxxxxxxxx
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
CC: "bosco-discuss" <bosco-discuss@xxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, February 2, 2021 18:53:07
Subject: Re: [HTCondor-users] Submission to remote slurm cluster is Failing consistently

Hi,

Now I ran it on the server (where the Bosco node is hosted) and it shows me the following output:
     User   Def Acct     Admin    Cluster    Account  Partition     Share MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
     hbaig  pi-mendes      None     xanadu  pi-mendes                    1                                                             general,himem,speci+           
-bash-4.2$ 


Could you please tell me if this shows anything meaningful that needs to be fixed?

regards
Hasan

On Feb 2, 2021, at 12:37 PM, christoph.beyer@xxxxxxx wrote:

sacctmgr show user <user> withassoc
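If the QOS column comes out truncated, you can widen it, something like:

sacctmgr show user <user> withassoc format=User,DefaultQOS,QOS%40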


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/