
[Condor-users] condor and cuda vm usage



Is there any way to change how Condor tracks virtual memory for a job?
I ask because we've encountered the following scenario with Condor and GPUs:

We have slots assigned to GPUs (NVIDIA Tesla cards running CUDA).

A job starts up, runs fine, and then is preempted by another user.

When the preemption occurs, Condor updates the job's ImageSize ClassAd
attribute to include the entire unified virtual address (UVA) space that
CUDA maps out.
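
For reference, this is roughly what we see in the job ad after preemption
(the job ID and exact value here are made up, but the order of magnitude is
real; ImageSize is in KB, so this is about 75 GB):

    $ condor_q -l 1234.0 | grep -i '^ImageSize'
    ImageSize = 78643200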

This means that when Condor goes to reschedule the job, it looks for a slot
with (in our case) 75 GB of memory, which we don't have.
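
Nothing in our pool advertises anywhere near that much; for example (Memory
in machine ads is in MB, and the number is again just illustrative):

    $ condor_status -constraint 'Memory >= 76800'
    # prints nothing for us -- no slot has ~75 GB, so the job sits idle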

Is there any way to prevent this?  We don't want to turn the memory/slot
checking off entirely; we just want to keep Condor from counting the CUDA
UVA mapping against the job.
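
One thing we've considered (not sure it's the right knob, which is partly
why I'm asking) is pinning the memory request in the submit file, so that
matching keys off a fixed request_memory instead of the inflated ImageSize.
Roughly:

    # submit-file sketch; the executable name and the 4 GB figure are made up
    universe        = vanilla
    executable      = run_cuda_job.sh
    # request_memory is in MB
    request_memory  = 4096
    queue

But I don't know whether that actually keeps the UVA mapping out of the
picture, or whether our version still folds ImageSize into the default
Requirements.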

I can get the jobs rescheduled by doing a condor_qedit (example below), but
I'd prefer not to have to do that each time this happens (which isn't often,
but often enough to be annoying).
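
For completeness, the manual fix we use today is basically this (the job ID
and the reset value are examples):

    $ condor_qedit 1234.0 ImageSize 2000000
    $ condor_reschedule
    # condor_reschedule just kicks off a negotiation cycle sooner;
    # the job matches again once ImageSize is back to something sane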