
[Condor-users] condor's calculated memory vs image size of jobs in queue

I'm wondering if others are seeing similar problems, how they're working with/around this, or whether I've just got something misconfigured.

I've apparently got a hole in my config, but I'm wondering how others handle this.

I'm noticing an interesting edge case in our pool. A user has lots of jobs queued up, and some get evicted after some amount of run time. When those jobs try to pick up where they left off after the checkpoint/eviction, they fail to match, because their image SIZE has grown larger than the "Memory" value the compute node determined at startup. When such a job has the lowest job id for that user in the queue, the schedd just spins from that point on, trying to schedule only that job and no others...

Since an example is far better than my description:

The compute node is an SMP box with 2048M of memory. In the condor init script we set a ulimit of 1300000 to keep any one job from running the machine into the ground. On startup Condor auto-detects 2 CPUs, with 1004M per CPU (per condor_status).
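For reference, the limit is set in the init script before the master starts. A sketch of what ours does (I'm quoting the -v form here, i.e. address space in KB):

```shell
# In the condor init script, before condor_master is started:
# cap each job's address space at ~1.3 GB (ulimit -v takes KB),
# so a runaway job can't exhaust the 2048M box
ulimit -v 1300000
```

If the per-slot split itself were the issue, I believe the MEMORY knob in condor_config can override the auto-detected physical memory, but that wouldn't change the fact that the job's image has grown.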

A job starts running on the machine and is eventually checkpointed/evicted. Its image size is reported as 1220.7 by the schedd (via condor_q). Jobs with a lower job id run to completion, but from that point on any job with a higher job id fails to match, so the queue for that user looks like this:

 ID         OWNER           SUBMITTED     RUN_TIME ST PRI  SIZE CMD
1515840.0   user            5/12 21:01   1+03:42:12 I  0   1220.7 net.sh
1517884.0   user            5/13 01:53   1+01:23:22 I  0   1025.4 net.sh
1517885.0   user            5/13 01:53   1+00:40:21 I  0   1220.7 net.sh
1582585.0   user            5/15 20:54   0+01:04:55 I  0   459.0 net.sh
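The 1220.7 job can never match a 1004M slot. If I'm reading the default requirements right, the schedd appends something like ((Memory * 1024) >= ImageSize) to every job, with slot Memory in MB and ImageSize in KB; a quick sanity check of the arithmetic (the expression is my understanding of the default, not pulled from the code):

```shell
# Back-of-the-envelope check of the memory clause the schedd is
# (I believe) appending to the job's Requirements:
#   (Memory * 1024 >= ImageSize), Memory in MB, ImageSize in KB
slot_memory_mb=1004      # per-slot Memory from condor_status
image_size_mb=1220.7     # SIZE column from condor_q

awk -v mem="$slot_memory_mb" -v img_mb="$image_size_mb" 'BEGIN {
    # compare both sides in KB, the way the expression does
    if (mem * 1024 >= img_mb * 1024)
        print "matches"
    else
        print "fails to match"
}'
```

With these numbers the check prints "fails to match" for the 1220.7 jobs and would pass for the 459.0 one, which lines up with what I see in the queue.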

From the MatchLog on the collector/negotiator machine, I only see job 1515840.0 being considered for matching; all other jobs by that user are ignored. If the user blows away jobs 1515840.0, 1517884.0, and 1517885.0, all the other jobs start getting scheduled/matched/run; until this happens again.
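Spelled out, the blow-away workaround is just removing the over-sized jobs so the schedd moves past them. (The second command is an untested guess on my part: condor_qedit can modify a queued job's ClassAd, so resetting the remembered image size might let the job match again without removing it.)

```shell
# Remove the stuck over-sized jobs (ids from the queue above)
condor_rm 1515840.0 1517884.0 1517885.0

# Untested guess at a gentler alternative: shrink the remembered
# ImageSize (in KB) below the slot's Memory so the job can match
condor_qedit 1515840.0 ImageSize 1000000
```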