
Re: [HTCondor-users] Submission to remote slurm cluster is Failing consistently



Hi Jaime,

First of all, sorry for my late reply; I was tied up with some family emergencies.

That folder "libexec" does not exist.

Everything actually worked when I ran things from the remote server. But when I run it from localhost (my own computer), I see this "../libexec" missing error.

regards
Hasan 

On Feb 3, 2021, at 2:43 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

That is odd. The errors from your previous setup/test attempts ("Invalid qos specification") indicate it was finding the slurm_submit.sh script before. The location of the slurm_*.sh scripts was moved in recent versions, but a consistent installation of Bosco on the Slurm machine shouldn't be confused about where some of its files are located.

On the Slurm machine, can you check whether $HOME/bosco/glite/libexec exists? If it does, what files are under it?

If it doesn't exist, running this command should provide a quick workaround for the error you're seeing:

ln -s bin "$HOME/bosco/glite/libexec"
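
A slightly more defensive version of that workaround could look like the sketch below. It assumes the Bosco tree has a glite/bin directory holding the slurm_*.sh scripts; the function name ensure_libexec_link is made up for illustration and is not part of Bosco.

```shell
# Create glite/libexec as a symlink to glite/bin, but only when it is missing.
# Argument: path to the glite directory of the Bosco installation.
ensure_libexec_link() {
    glite_dir="$1"
    if [ -e "$glite_dir/libexec" ]; then
        # Already present: list its contents instead of touching it.
        echo "libexec already exists under $glite_dir; leaving it alone"
        ls "$glite_dir/libexec"
        return 0
    fi
    if [ ! -d "$glite_dir/bin" ]; then
        echo "no bin directory under $glite_dir; nothing to link to" >&2
        return 1
    fi
    # A relative target ("bin") keeps the link valid if the tree is moved.
    ln -s bin "$glite_dir/libexec"
}
```

On the Slurm machine it would be called as: ensure_libexec_link "$HOME/bosco/glite"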

 - Jaime

On Feb 3, 2021, at 11:20 AM, hasanbaigg@xxxxxxxxx wrote:

ok thanks.

Can you please tell me why it is looking in the wrong directory for the slurm_submit.sh file (as mentioned in my last email), and how I can fix it?

regards
hb

On Feb 3, 2021, at 12:11 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

Ümit is talking about components in HTCondor-CE, which is similar to Bosco, but is installed by the administrator of the Slurm cluster for use by all remote users. The Job Router is not part of Bosco.

But the suggestions about sacctmgr may be an alternative to manually editing Bosco's slurm_submit.sh.

 - Jaime

On Feb 3, 2021, at 10:28 AM, hasanbaigg@xxxxxxxxx wrote:

Hi Ümit,

Where are we supposed to specify this "JOB_ROUTER_ENTRIES" parameter?

Also, I don't see a folder "condor" under the /usr/libexec/ directory. Could you please comment on that?

thanks

regards
Hasan

On Feb 2, 2021, at 2:39 PM, uemit.seren@xxxxxxxxxxxxxx wrote:

The SLURM partition can be specified in the HTCondor job routing configuration via the set_default_queue parameter (see below):
 
JOB_ROUTER_ENTRIES @=jre
[
    GridResource = "batch slurm";
    TargetUniverse = 9;
    name = "GRID jobs";
    set_default_queue = "grid";
]
@jre
 
AFAIK there is no way to specify the --qos option in the job routing. One way would be to patch the /usr/libexec/condor/glite/bin/slurm_submit.sh blahp script and add it there.
 
However, you can define a DefaultQOS at the account/association level using sacctmgr, which will be used when you don't pass --qos during submission:
 
sacctmgr modify account myAccount set DefaultQOS='grid'
 
The above statement only works if the account has access to the QOS. You can set this with: sacctmgr modify account myAccount set QOS='debug,grid'
 
You must also make sure that the partition allows the QOS. 
This is specified in /etc/slurm/slurm.conf on the controller.
You can check this with: scontrol show part 
 
Check the "AllowQos" parameter in the output to see whether the QOS is included or the value is set to "ALL".
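
That check can be scripted. The sketch below assumes the usual key=value layout of `scontrol show part` output (for example a line containing AllowQos=ALL); the function name partition_allows_qos is hypothetical and only illustrates the idea.

```shell
# Reads `scontrol show part <name>` output on stdin and succeeds if the
# given QOS is allowed: either listed in AllowQos, or AllowQos=ALL.
partition_allows_qos() {
    qos="$1"
    # Split the key=value pairs onto separate lines and keep the first
    # AllowQos value, e.g. "ALL" or "debug,grid".
    allowed=$(tr ' ' '\n' | sed -n 's/^AllowQos=//p' | head -n 1)
    [ "$allowed" = "ALL" ] && return 0
    # Wrap both sides in commas so "grid" does not match "grid2".
    case ",$allowed," in
        *",$qos,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: scontrol show part grid | partition_allows_qos grid && echo "QOS is allowed"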
 
Hope this helps
Best
 
 
-- 
Ümit Seren MSc
HPC Engineer
+4369910269552
Vienna BioCenter (GMI, IMP, IMBA)
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of "hasanbaigg@xxxxxxxxx" <hasanbaigg@xxxxxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, 2. February 2021 at 20:26
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: bosco-discuss <bosco-discuss@xxxxxxxxxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Submission to remote slurm cluster is Failing consistently
 
I am not really sure how to do that. I will explore how to submit a job directly to Slurm and will get back to you with the output.
 
 
Meanwhile, if you know any relevant documentation, kindly share. 
 
thanks 
 
regards
Hasan


On Feb 2, 2021, at 1:53 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
 
BOSCO and HTCondor don't use the -q/--qos command-line option when submitting a job to Slurm. There is a way to have them set the -p/--partition option, but the bosco_cluster tool doesn't use it when testing a new setup.
 
Can you try submitting a job directly to Slurm on the cluster login node without using the -p or -q options of sbatch?
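
A minimal sketch of such a direct test is below. The job name direct-test and the temporary file location are arbitrary choices for illustration; the point is only that the script carries no partition or QOS request, so Slurm falls back to its defaults.

```shell
# Write a minimal batch script with no partition or QOS directives.
job=$(mktemp /tmp/slurm_direct_test.XXXXXX)
cat > "$job" <<'EOF'
#!/bin/sh
#SBATCH --job-name=direct-test
hostname
EOF

# Submit it only where sbatch is available (the cluster login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job"
else
    echo "sbatch not found; run this on the cluster login node" >&2
fi
```

If sbatch rejects this job with the same "Invalid qos specification" error, the problem is in the account's Slurm defaults rather than in Bosco.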
 
 - Jaime


On Feb 2, 2021, at 12:43 PM, hasanbaigg@xxxxxxxxx wrote:
 
Thanks for your concerns, 
 
I will look into the qos documentation. Meanwhile, I hope a SLURM expert can chip in.
 
regards
Hasan


On Feb 2, 2021, at 1:32 PM, christoph.beyer@xxxxxxx wrote:
 
Uh,
 
hopefully someone with slurm experience can jump in here (I better stick to my alternative HTC facts) :D :D :D
 
As far as I know, the partition you use is defined with a list of permitted qos values (for example default,debug), and at submit time you can request a qos with the '-q' option.
 
scontrol show part <partname>
 
It looks like your job coming from the HTC side of things is requesting a qos that is not defined for the partition you want to use. 
 
Unfortunately that is even thinner ice for me than usual on this list :(
 
Also I did not fully get the concept of slurm at the time and started using condor instead ;)
 
But you seem to be on the right track, and should check the documentation about qos and how to define them for partitions and users alike:
 
 
Best
christoph
 
 
 

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
 

From: hasanbaigg@xxxxxxxxx
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "bosco-discuss" <bosco-discuss@xxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, 2 February 2021 18:53:07
Subject: Re: [HTCondor-users] Submission to remote slurm cluster is Failing consistently
 
Hi,
 
I have now run it on the server (where the Bosco node is hosted), and it shows the following output:
     User   Def Acct     Admin    Cluster    Account  Partition     Share MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
     hbaig  pi-mendes      None     xanadu  pi-mendes                    1                                                             general,himem,speci+           
-bash-4.2$ 
 
 
Could you please tell me whether it shows anything meaningful that needs to be fixed?
 
regards
Hasan
 
On Feb 2, 2021, at 12:37 PM, christoph.beyer@xxxxxxx wrote:
 
sacctmgr show user <user> withassoc
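
The default table view cuts long QOS lists short (a trailing "+" marks the truncation). As a sketch, sacctmgr's -P/--parsable2 option prints pipe-delimited fields without truncation, and a small filter can pull out the QOS column. The function name print_qos_column is made up, and it assumes the header row labels that column exactly "QOS", as in the table view.

```shell
# Reads pipe-delimited `sacctmgr -P show user <user> withassoc` output on
# stdin (header row first) and prints the QOS column of every data row.
print_qos_column() {
    awk -F'|' '
        NR == 1 {
            # Locate the column whose header is exactly "QOS".
            for (i = 1; i <= NF; i++) if ($i == "QOS") col = i
            next
        }
        col { print $col }
    '
}
```

For example: sacctmgr -P show user "$USER" withassoc | print_qos_column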
 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/