Re: [HTCondor-users] Using condor to "bulk-submit" jobs to SLURM

HEPCloud uses glideinWMS and BOSCO/blahp to submit glideins to a number of large SLURM systems, including NERSC, TACC Stampede2, TACC Frontera, PSC Bridges2, and SDSC Expanse. We have found that scheduling works best (as long as we have load) when each glidein is submitted as a multi-node job, i.e. one SLURM job -> 100 full nodes, all calling back to HTCondor. Of course, we have the whole CMS global pool production to feed it.


Hello,

We run SLURM as our local batch system, but have many users who are
experienced with Condor from using it at other institutions. We
would like to configure/deploy something to allow these users to
submit Condor jobs which will run on our SLURM nodes.

We run CMS and LIGO jobs through the OSG, so we're familiar with
HTCondor-CE, but we're wary of allowing users to directly submit this
way, since there is a 1:1 mapping of Condor jobs to SLURM jobs, and
particularly with SLURM's backfill scheduler, many and/or short jobs
showing up at once can severely hamper the responsiveness of the
scheduler. Additionally, for a reason I think we're close to
diagnosing, blahp submits/generates enough RPCs to account for >90% of
the entire scheduler load, even though CMS and LIGO account for less
than one third of the total job count on the cluster.

Ideally, we could have a system like glideinWMS where the local
scheduler receives a long-running pilot that then executes multiple
user jobs inside, but it's my understanding that deploying such a
service is extremely non-trivial. I've also thought about possibly
merging several condor jobs into a single SLURM "job array" (which
somewhat behaves like a Condor cluster from the scheduling side), but
it appears the job router operates on a job-per-job basis and there's
not a good place to "coalesce" individual jobs into larger arrays.
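
To make the "coalescing" idea concrete, here is a rough sketch of what
such a step might produce. It is not something the job router or blahp
does today, and build_array_script, max_concurrent and time_limit are
made-up names for illustration. It only relies on standard sbatch
behaviour: --array=0-N%M and the SLURM_ARRAY_TASK_ID environment
variable.

```python
import shlex


def build_array_script(commands, max_concurrent=None, time_limit="01:00:00"):
    """Return the text of one sbatch script that runs commands[i] as array task i.

    commands is a list of argv lists, one per coalesced Condor job.
    (Hypothetical helper; nothing in HTCondor provides this.)
    """
    if not commands:
        raise ValueError("need at least one command")

    # One array job instead of len(commands) separate SLURM jobs.
    array_spec = f"0-{len(commands) - 1}"
    if max_concurrent is not None:
        # "%M" caps how many array tasks run at the same time.
        array_spec += f"%{max_concurrent}"

    lines = [
        "#!/bin/bash",
        f"#SBATCH --array={array_spec}",
        f"#SBATCH --time={time_limit}",
        "",
        'case "$SLURM_ARRAY_TASK_ID" in',
    ]
    for task_id, argv in enumerate(commands):
        # shlex.join quotes each argument so it survives the shell.
        lines.append(f"  {task_id}) {shlex.join(argv)} ;;")
    lines.append('  *) echo "unexpected task id $SLURM_ARRAY_TASK_ID" >&2; exit 1 ;;')
    lines.append("esac")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    jobs = [["./analyze.sh", "run 1"], ["./analyze.sh", "run2"]]
    print(build_array_script(jobs, max_concurrent=1))
```

The open question remains where in the Condor-to-SLURM path a batch of
idle jobs could be collected before something like this is called.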

Does anyone have good ideas on how to approach this?

Thanks,
Andrew Melo
