Re: [condor-users] java memory requirement
- Date: Fri, 9 Apr 2004 22:33:23 -0400
- From: James Wilgenbusch <jwilgenb@xxxxxxxxxxxx>
- Subject: Re: [condor-users] java memory requirement
I applied both suggestions and things seemed to work for a while.
More recently, however, I'm running into serious problems with the
schedd. Here's a snippet from the schedlog:
4/9 22:23:03 DaemonCore: Command Socket at <18.104.22.168:34484>
4/9 22:23:18 ERROR "Error: bad record with op=103 in corrupt logfile"
at line 723 in file classad_log.C
I've now set things back to the previous state. Which log file do I
need to remove so that I can restart the schedd without running into
this error?
1. You can use condor_qedit to edit ImageSize attribute of jobs in the queue.
2. If you have NEGOTIATE_ALL_JOBS_IN_CLUSTER = False in Condor
config, then Condor will stop negotiating jobs in a cluster if one
of the jobs fails to match. Have a look at the config file's
comments about the setting for more detail.
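As a rough illustration of suggestion 1, the sketch below only builds and prints a condor_qedit command line; it does not run it. It assumes the usual "condor_qedit <job-id> <attribute> <value>" form. The job ID 1234.0 and the new ImageSize value (in KB) are hypothetical placeholders, not values from this thread.

```python
import shlex
from typing import List


def qedit_command(job_id: str, attribute: str, value: int) -> List[str]:
    """Build (but do not execute) a condor_qedit argument list."""
    return ["condor_qedit", job_id, attribute, str(value)]


# Hypothetical job ID and ImageSize value (KB); substitute real ones.
cmd = qedit_command("1234.0", "ImageSize", 500000)

# Print the command so it can be reviewed before running it by hand.
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Condor installed.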
James Wilgenbusch wrote:
I've been running numerous Java jobs under Condor. Recently I ran
into a bit of a snag. A recent power outage required that most of
our dedicated compute nodes be shut down. After the power and
Condor came back up, I noticed that most of my Java jobs would not
start. The reason reported by condor_q -analyze is:
WARNING: Be advised:
No resources matched request's constraints
Check the Requirements expression below:
Requirements = (HasJava) && (Disk >= DiskUsage) && ((Memory *
1024) >= ImageSize) && (HasFileTransfer)
The Memory requirement seems to be responsible for preventing the
job from running. The image size for this job grew to 1.8 GB, and
most of our compute nodes have only a gigabyte of memory.
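To make the failing clause concrete, here is a small sketch of the arithmetic behind (Memory * 1024) >= ImageSize. It assumes Condor's usual units (Memory in MB, ImageSize in KB); the function name and the exact KB figure for a 1.8 GB image are illustrative only.

```python
def memory_clause_matches(machine_memory_mb: int, image_size_kb: int) -> bool:
    """Evaluate the clause (Memory * 1024) >= ImageSize from Requirements."""
    return machine_memory_mb * 1024 >= image_size_kb


# Assumed figures: a node with 1 GB of memory, a job image of about 1.8 GB.
node_memory_mb = 1024
job_image_size_kb = int(1.8 * 1024 * 1024)

# The clause fails on a 1 GB node, so the job cannot match there.
print(memory_clause_matches(node_memory_mb, job_image_size_kb))
```

Under these assumptions only a machine advertising roughly 1.8 GB or more of Memory would satisfy the clause, which is why lowering ImageSize in the job ad lets the job match again.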
Is there any way that I can get the jobs in the queue to restart,
even if it means losing the current image? I don't want to simply
remove the jobs currently in the queue because then I'd have to
figure out which jobs finished and which need to be restarted. I'd
rather just remove the ImageSize requirement and have the jobs
restart from scratch.
A second issue: I have many other Java jobs in the queue that have
not yet run and therefore are not constrained by the Memory
requirement. Yet for some reason these jobs will not run. Here's
the output from analyze:
5913.167: Run analysis summary. Of 354 machines,
20 are rejected by your job's requirements
14 reject your job because of their own requirements
2 match, but are serving users with a better priority in the pool
26 match, but prefer another specific job despite its worse
238 match, but will not currently preempt their existing job
54 are available to run your job
Any idea why these jobs will not pick up?