
Re: [condor-users] java memory requirement


1. You can use condor_qedit to edit the ImageSize attribute of jobs in the queue.

2. If you have NEGOTIATE_ALL_JOBS_IN_CLUSTER = False in your Condor config, Condor stops negotiating for the jobs in a cluster as soon as one of them fails to match. See the comments on this setting in the config file for more detail.
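
For point 1, here is a rough sketch of what the condor_qedit call could look like, wrapped in a small Python helper so the command can be inspected before it is run. The job ID "1234.0" and the value 500000 are placeholders, not values from this thread; ImageSize is expressed in KB, so pick something that fits under the memory of your nodes. The helper name is made up for illustration.

```python
import shlex
from typing import List


def qedit_image_size_command(job_id: str, image_size_kb: int) -> List[str]:
    """Build the argument list for: condor_qedit <job_id> ImageSize <value>."""
    return ["condor_qedit", job_id, "ImageSize", str(image_size_kb)]


# Placeholder job ID and an arbitrary ImageSize (in KB) below 1 GB.
cmd = qedit_image_size_command("1234.0", 500000)
print(shlex.join(cmd))  # condor_qedit 1234.0 ImageSize 500000

# To actually apply the edit on a submit machine (not done here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the job ID and value before touching the queue.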


James Wilgenbusch wrote:
I've been running numerous Java jobs under Condor. Recently I ran into a bit of a snag. A recent power outage required that most of our dedicated compute nodes be shut down. After the power and Condor came back up, I noticed that most of my Java jobs would not start. The reason reported by condor_q's analyze is:

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = (HasJava) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)

The Memory requirement seems to be responsible for preventing the job from running. The image size for this job grew to 1.8 GB, and most of our compute nodes have only 1 GB of memory.
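
To make the arithmetic behind that clause concrete, here is a minimal sketch in Python. It assumes Memory is advertised in MB and ImageSize is recorded in KB, and it uses round figures (a node advertising exactly 1024 MB, an image of 1.8 GB) rather than values read from the pool. The function name is made up for illustration.

```python
def memory_clause_matches(machine_memory_mb: int, job_image_size_kb: int) -> bool:
    """Evaluate the clause (Memory * 1024) >= ImageSize from the Requirements expression."""
    return machine_memory_mb * 1024 >= job_image_size_kb


# Assumed round figures: ~1.8 GB image, expressed in KB.
image_size_kb = int(1.8 * 1024 * 1024)  # 1887436

print(memory_clause_matches(1024, image_size_kb))  # False: 1048576 < 1887436
print(memory_clause_matches(2048, image_size_kb))  # True:  2097152 >= 1887436
```

Under these assumptions a 1 GB node can never satisfy the clause while ImageSize stays at its current value, which matches what analyze reports.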

Is there any way that I can get the jobs in the queue to restart, even if it means losing the current image? I don't want to simply remove the jobs currently in the queue, because then I'd have to figure out which jobs finished and which need to be restarted. I'd rather just remove the ImageSize requirement and have the jobs restart from scratch.

A second issue: I have many other Java jobs in the queue that have not yet run and therefore are not constrained by the Memory requirement. Yet for some reason these jobs will not run. Here's the output from analyze:

5913.167: Run analysis summary. Of 354 machines,
20 are rejected by your job's requirements
14 reject your job because of their own requirements
2 match, but are serving users with a better priority in the pool
26 match, but prefer another specific job despite its worse user-priority
238 match, but will not currently preempt their existing job
54 are available to run your job

Any idea why these jobs will not pick up?

