Re: [HTCondor-users] how to change requested memory (cpus) for running job
- Date: Wed, 25 Jan 2017 19:02:10 +0000
- From: Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] how to change requested memory (cpus) for running job
I think we're in the same boat - the key is changing the machine attributes, rather than the job attributes, and I'm looking to do that for concurrency limits to deal with late-job license checkouts as opposed to memory allocations. It's starting to look like I may wind up building DAGs anyway, but it'd still be a useful trick to have.
I've tried a few things and have gotten some really weird results with condor_update_machine_ad and condor_advertise, so I'm still hunting for the proper incantations.
One of the considerations is the permissions required to change the machine ad - a job owner can't change the machine ad even for the slot in which the job is running, so there'd need to be some sort of signaling mechanism, such as a custom job attribute, to allow the job to trigger a process with the necessary permissions to validate and make the changes on the machine ad.
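To make the signaling idea a little more concrete, here is a minimal Python sketch of just the validation step such a privileged process might perform. It is only an illustration under assumptions: the attribute name WantedMemoryMB and the function validate_memory_request are hypothetical, the job ad is modeled as a plain dict, and nothing here talks to HTCondor or touches the machine ad.

```python
from typing import Optional

# Hypothetical custom job attribute the job owner would set to ask for a change.
# This name is an assumption for illustration, not an HTCondor built-in.
SIGNAL_ATTR = "WantedMemoryMB"


def validate_memory_request(job_ad: dict, slot_total_mb: int,
                            other_claimed_mb: int) -> Optional[int]:
    """Return the memory size (MB) to apply, or None if there is nothing valid to do.

    job_ad           -- the job's attributes as a plain dict (assumed representation)
    slot_total_mb    -- total memory of the machine or partitionable slot
    other_claimed_mb -- memory already claimed by other jobs on that machine
    """
    wanted = job_ad.get(SIGNAL_ATTR)
    # Reject missing, non-integer, or non-positive requests.
    if not isinstance(wanted, int) or isinstance(wanted, bool) or wanted <= 0:
        return None
    # Reject requests that would not fit next to the other jobs.
    if wanted > slot_total_mb - other_claimed_mb:
        return None
    # Nothing to change if the request matches the current value.
    if wanted == job_ad.get("RequestMemory"):
        return None
    return wanted
```

The point of keeping the check separate is that the job owner only ever writes to the job ad, while the process holding the machine-ad permissions decides whether the request is acceptable before acting on it.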
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Harald van Pee
Sent: Wednesday, January 25, 2017 1:21 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] how to change requested memory (cpus) for running job
Does no answer mean that there is no expert around these days, or is it just not possible with HTCondor to change any ClassAds for a running job?
The idea is just to change the reserved memory so that the available memory decreases and no other job with a big memory request can start, since such a job could crash the machine or a long-running job. The available memory should not go to 0 if there is enough memory available, and it should simply increase again when the job finishes.
Therefore a reread of the reservedMemory ClassAd on the start machine, without killing any job, seems to be perfect, if possible.
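The accounting I have in mind can be sketched in a few lines of Python. This is only the arithmetic, under assumptions: the function name available_memory_mb and the per-job dictionary of extra reservations are hypothetical, and this is not meant to describe how HTCondor itself computes the value.

```python
def available_memory_mb(total_mb: int, base_reserved_mb: int,
                        extra_reserved_by_job: dict) -> int:
    """Memory (MB) still on offer to new jobs.

    extra_reserved_by_job maps a job id to the additional memory (MB)
    reserved on behalf of that running job (assumed bookkeeping).
    """
    extra = sum(extra_reserved_by_job.values())
    # Never report a negative amount.
    return max(total_mb - base_reserved_mb - extra, 0)


# A running job turns out to need 4096 MB more than it requested.
extras = {"job1": 4096}
during = available_memory_mb(32768, 1024, extras)

# When the job finishes, its entry is dropped and the memory is offered again.
extras.pop("job1")
after = available_memory_mb(32768, 1024, extras)
```

Only the extra amount is withheld, so the available memory drops while the job runs but does not go to 0, and it recovers as soon as the job's entry is removed.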
We are working on checkpointing of our jobs, but for some of them it does not seem possible.
Any ideas would be welcome.