
Re: [HTCondor-users] Q: BLAH configuration for non-shared submission to Slurm?



Thanks for the response, Jaime.

Submitting jobs from the Scarf Condor node in a directory that BLAH knows to be shared works fine. What I was actually trying to do was submit from an external Condor system, and that was silently failing, so I tried submitting from the internal node from an "unshared" location to see whether that was (part of) the problem.

After posting, I spotted a comment in slurm_submit.sh that said, "Assume all filesystems are shared." So that's that! Perhaps...

Whilst looking for more configuration examples, I found a more recent version of BLAH (https://github.com/prelz/BLAH/releases, v1.22.3). In that version, slurm_submit.sh inserts commands of the form "scp submit-host:remote-file local-file" into the generated script for any non-shared files. Now, if only Scarf allowed the job owner to use scp at all, and ideally to copy from the condor spool...!
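
To illustrate what I mean (this is my own paraphrase of what the generated wrapper ends up containing, not a verbatim extract; the host name, spool path and temporary file name are invented):

    # staging that the v1.22.3 wrapper adds for non-shared input files
    scp submit-host.example.ac.uk:/var/lib/condor/spool/42/0/job.sh ./bl_tmp_42.sh
    chmod u+x ./bl_tmp_42.sh
    # ... the job itself runs here ...
    # equivalent reverse copies move the output files back afterwards
    scp ./job.out submit-host.example.ac.uk:/var/lib/condor/spool/42/0/job.out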

The Scarf node has HTCondor 8.6.5 installed (and condor-externals to match). I could try updating, but I suspect it won't solve the basic problem.

I probably don't need to transfer the jobscript at all: I think I can get away with just passing a command-line to be executed on the worker. However, getting output back will still be an issue...
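
On the submission side, something like this untested sketch is what I have in mind (the command, arguments and file names are placeholders):

    # HTCondor submit description: run a command that already exists on the
    # Scarf worker nodes, so no jobscript needs to be staged across
    universe            = grid
    grid_resource       = batch slurm
    executable          = /usr/bin/my_analysis
    transfer_executable = false
    arguments           = --input /work/shared/data.in
    output              = job.out
    error               = job.err
    log                 = job.log
    queue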

Brian

----------------------------------------------------------------------

Message: 1
Date: Wed, 1 Aug 2018 21:10:02 +0000
From: Jaime Frey <jfrey@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Q: BLAH configuration for non-shared submission to Slurm?

To the best of my knowledge, SLURM has no facilities for transferring job files between machines. It assumes you have a shared filesystem for all job files. That's why you don't see any directives in slurm_submit.sh.
The BLAHP doesn't copy job files from a local filesystem to a shared one on the submit machine. It should probably give an error if it detects that job files are on a local filesystem and the batch system can't move them, but that currently doesn't happen.

For your current testing, all of the job files (including the original job script) should be on the shared filesystem. In your ultimate setup, the HTCondor spool directory will need to be on the shared filesystem on your custom Scarf node. Also, submission from the other HTCondor node will have to include spooling of job files (either Condor-C or condor_submit -remote).
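
As a rough sketch of the two options (the schedd and pool names here are placeholders for your Scarf node):

    # Option 1: remote submission with spooling of the job files
    condor_submit -remote scarf-ce.example.ac.uk job.sub

    # Option 2: Condor-C, i.e. a grid-universe job that targets the Scarf schedd
    # (these lines go in the submit description used on the external node)
    universe      = grid
    grid_resource = condor scarf-ce.example.ac.uk scarf-pool.example.ac.uk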

 - Jaime

On Jul 30, 2018, at 11:20 AM, Brian Ritchie - UKRI STFC <brian.ritchie@xxxxxxxxxx> wrote:

I'm trying to use HTCondor to submit jobs to our Scarf HPC. At
present, this uses Platform LSF, and (following initial work by Andrew
Lahiff) I've managed to get this to work (to some extent). However,
Scarf is replacing Platform LSF with Slurm, and I'm having trouble
getting submission to work with Slurm in the case where the jobscript
is in a directory that is not shared with the worker nodes. (I am
submitting from a custom Scarf node that has Condor
installed. Ultimately, jobs will be submitted to this node from an
HTCondor node that is external to Scarf, so sharing won't be an
option.)

The problem seems to be that the jobscript that is generated by BLAH's
slurm_submit.sh assumes that the original jobscript has been copied to
a (unique) filename in a sandbox folder, but the copy never happens.
The lsf_submit.sh script generates BSUB directives that (I think)
instruct LSF to perform the initial copy, but I see no equivalent in
slurm_submit.sh.

None of this is reflected in the files created by HTCondor: the log
file implies that the job ran OK (but consumed no resources), and the
output and error files are always empty. Only by modifying the blah
scripts to log to somewhere other than /dev/null (and copying the
generated jobscripts to file) was I able to get more information about
what was going wrong!

batch_gahp.config has many options for defining which directories are
shared, and for overriding default locations for sandboxes etc. I have
tried numerous permutations, to no avail.
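
For example, the kinds of settings I have been varying look like this
(the values are illustrative, not my real paths, and I am not certain
all of them are honoured by the Slurm scripts):

    # batch_gahp.config excerpts
    supported_lrms=slurm
    slurm_binpath=/usr/bin
    # directories that BLAH should treat as shared with the worker nodes
    blah_shared_directories=/home:/work/shared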

Is there a better guide to configuration than the comments in batch_gahp.config?
What special considerations are required for Slurm?

Thanks,
  Brian