
[HTCondor-users] Using condor to "bulk-submit" jobs to SLURM


We run SLURM as our local batch system, but have many users who are
experienced with Condor from its use at other institutions. We would
like to configure/deploy something that allows these users to submit
Condor jobs which will run on our SLURM nodes.

We run CMS and LIGO jobs through the OSG, so we're familiar with
HTCondor-CE, but we're wary of letting users submit directly this way,
since there is a 1:1 mapping of Condor jobs to SLURM jobs, and,
particularly with SLURM's backfill scheduler, many and/or short jobs
arriving at once can severely hamper the scheduler's responsiveness.
Additionally, for a reason I think we're close to diagnosing, the
BLAHP generates enough RPCs to account for >90% of the entire
scheduler load, even though CMS and LIGO make up less than a third of
the total job count on the cluster.

Ideally, we could have a system like glideinWMS, where the local
scheduler receives a long-running pilot which then runs multiple user
jobs inside it, but my understanding is that deploying such a service
is extremely non-trivial. I've also thought about merging several
Condor jobs into a single SLURM "job array" (which behaves somewhat
like a Condor cluster from the scheduling side), but it appears the
Job Router operates on a job-by-job basis, and there's no good place
to "coalesce" individual jobs into larger arrays.
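To sketch what I mean by the job-array idea: a single sbatch submission like the one below would cover a whole Condor cluster with one scheduler entry, with $SLURM_ARRAY_TASK_ID playing roughly the role of Condor's $(Process). (The script name, resource requests, and the run_condor_job.sh helper are all hypothetical; the hard part is exactly the missing "coalesce" step that would generate this from queued Condor jobs.)

```shell
#!/bin/bash
# Hypothetical sketch: one SLURM submission standing in for a 100-job
# Condor cluster. The backfill scheduler sees one array, not 100 jobs.
#SBATCH --job-name=condor-bridge      # hypothetical name
#SBATCH --array=0-99                  # indices ~ Condor's $(Process)
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1

# Each array task would pick up the matching Condor job's sandbox and
# arguments; run_condor_job.sh is a placeholder for that mapping.
exec ./run_condor_job.sh "$SLURM_ARRAY_TASK_ID"
```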

Does anyone have good ideas on how to approach this?

Andrew Melo