
Re: [condor-users] java memory requirement



I applied both suggestions and things seemed to work for a while. More recently, however, I've been running into serious problems with the schedd. Here's a snippet from the SchedLog:

4/9 22:23:03 DaemonCore: Command Socket at <144.174.160.147:34484>
4/9 22:23:18 ERROR "Error: bad record with op=103 in corrupt logfile" at line 723 in file classad_log.C


I've now set things back to the previous state. Which log file do I need to remove so that I can restart the schedd without running into this issue?

Thanks,
Jim


Hello!

1. You can use condor_qedit to edit the ImageSize attribute of jobs in the queue.

2. If you have NEGOTIATE_ALL_JOBS_IN_CLUSTER = False in your Condor config, then Condor will stop negotiating jobs in a cluster as soon as one of the jobs fails to match. Have a look at the config file's comments about this setting for more detail.
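
For the first point, here is a minimal Python sketch that only builds and prints the condor_qedit command line (job id, attribute name, new value) without running it. The job id 1234.0 and the value 500000 are hypothetical placeholders, not values from your queue. ImageSize is in KB, and Condor may raise it again once the job runs and grows.

```python
def qedit_command(job_id, attribute, value):
    """Build the argument list for: condor_qedit <job_id> <attribute> <value>."""
    return ["condor_qedit", str(job_id), attribute, str(value)]


# Hypothetical job id and ImageSize (in KB); substitute your own values.
cmd = qedit_command("1234.0", "ImageSize", 500000)

# Print the command for review; pass `cmd` to subprocess.run() to execute it.
print(" ".join(cmd))
```

Printing first makes it easy to check the command before touching the queue.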

Regards,
Alexander

James Wilgenbusch wrote:
I've been running numerous java jobs under condor. Recently I ran into a bit of a snag. A recent power outage required that most of our dedicated compute nodes be shut down. After the power and condor came back up, I noticed that most of my java jobs would not start. The reason reported by condor_q's analyze is:

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = (HasJava) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)

The Memory requirement seems to be responsible for preventing the job from running. The image size for this job grew to 1.8 GB and most of our compute nodes have only a gig of memory.
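
To make the arithmetic concrete: Memory is advertised in MB and ImageSize is recorded in KB, which is why the expression multiplies Memory by 1024. Below is a rough Python sketch of just that clause. The exact KB figures are assumptions derived from the "1 GB node" and "1.8 GB image" numbers above, and the function name is made up for illustration.

```python
def memory_clause_ok(memory_mb, image_size_kb):
    """Mirror the clause ((Memory * 1024) >= ImageSize) from Requirements."""
    return memory_mb * 1024 >= image_size_kb


# Assumed values: a node with 1 GB of memory and a job image of about 1.8 GB.
node_memory_mb = 1024
job_image_size_kb = int(1.8 * 1024 * 1024)

print("node memory in KB:", node_memory_mb * 1024)
print("job image in KB:  ", job_image_size_kb)
print("clause satisfied: ", memory_clause_ok(node_memory_mb, job_image_size_kb))
```

With these numbers the clause is false for a 1 GB node, so only machines with roughly 2 GB or more would match.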

Is there any way that I can get the jobs in the queue to restart, even if it means losing the current image? I don't want to simply remove the jobs currently in the queue because then I'd have to figure out which jobs finished and which need to be restarted. I'd rather just remove the ImageSize requirement and have the jobs restart from scratch.

A second issue: I have many other java jobs in the queue that have not yet run and therefore are not constrained by the Memory requirement. Yet for some reason these jobs will not run. Here's the output from analyze:

5913.167: Run analysis summary. Of 354 machines,
20 are rejected by your job's requirements
14 reject your job because of their own requirements
2 match, but are serving users with a better priority in the pool
26 match, but prefer another specific job despite its worse user-priority
238 match, but will not currently preempt their existing job
54 are available to run your job


Any idea why these jobs will not pick up?

Thanks,
Jim




--
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>