
Re: [Condor-users] ImageSize Problems (+ documentation typo)




The manual is referring to the problem of jobs not matching to machines, because the default memory requirements reference the job's ImageSize, which may not be an accurate measure of the job's real memory requirements. The solution suggested by the manual is to override the default memory requirements.

Beyond that, you have a second problem: the job goes on hold because your Condor system is configured to hold jobs with a large image size. As you found, adjusting the requirements expression does not help, because the job still goes on hold.

Possible solutions:

1. Use condor_qedit to change ImageSize to something smaller than the value that causes jobs to go on hold. Obviously this is not a very good solution, because once the job starts running, ImageSize may grow again and put the job back on hold.

2. Adjust the SYSTEM_PERIODIC_HOLD expression. Perhaps it should look at ResidentSetSize instead of ImageSize? (ResidentSetSize is available in Condor 7.6.) Unfortunately, that is not perfect. A job might be able to cause thrashing without having a strikingly large resident set size. ResidentSetSize may also over-count memory that is shared between processes in the job. In Condor 7.7.0, there will be a new job attribute, ProportionalSetSizeKb, which measures PSS as reported by Linux. This is like ResidentSetSize (i.e. RSS), but it has the advantage of not over-counting memory that is shared between the multiple processes in a job.
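
As a rough illustration of option 2, a configuration fragment along the following lines might work. It is only a sketch: it assumes ResidentSetSize is reported in kilobytes like ImageSize, it reuses the 3048000 threshold from the hold message quoted below, and the guard against an undefined attribute is an added precaution rather than something taken from the existing configuration.

```
# Hypothetical condor_config fragment (not the site's actual setting).
# Hold running jobs (JobStatus == 2) based on resident memory
# instead of virtual image size. The threshold is assumed to be in KB.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && \
                       (ResidentSetSize =!= UNDEFINED) && \
                       (ResidentSetSize > 3048000)
```

With a limit like this, a job whose resident memory stays near 210 MB would not be held, but a job with a large, mostly swapped-out image would not be caught either, which is the trade-off described above.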

PS (Typo: where the documentation reads "You will need to change 1024 to a reasonably good estimate of the actual image size of your program, in kilobytes" I think that should be megabytes - as earlier we are told that Memory is measured in megabytes).

Unfortunately, the Memory attribute of the machine is in megabytes. The ImageSize attribute in the job is in kilobytes. Confusing!
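
To make the unit mismatch concrete, here is the kind of clause that typically appears in a job's Requirements when viewed with condor_q -l. The exact form is an assumption and may differ on your installation. The arithmetic uses the ImageSize value reported in this thread.

```
# Memory (machine attribute) is in MB; ImageSize (job attribute) is in KB,
# so the machine value is scaled by 1024 before the comparison.
((Memory * 1024) >= ImageSize)

# Example: ImageSize = 4750000 KB is about 4639 MB (4750000 / 1024),
# so this clause would only match machines advertising at least that much Memory.
```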

--Dan

On 5/20/11 1:11 PM, Dan O'Donovan wrote:
Dear Condor users,

I'm currently running into difficulties with jobs running on our local OSX 10.5 Condor 7.6.0 grid.

I found the FAQ "Why does my Linux job have an enormous ImageSize and refuse to run anymore?" http://www.cs.wisc.edu/condor/manual/v7.6/7_3Running_Condor.html#SECTION008310000000000000000
which appeared to address my issue, but unfortunately my jobs are still being held with

SYSTEM_PERIODIC_HOLD expression '(JobStatus == 2) && (ImageSize > 3048000)' evaluated to TRUE

The documentation suggests setting a ClassAd as below:

Requirements = (Memory > 250)

However, the job is still routinely held on the first SYSTEM_PERIODIC_HOLD cycle (after 5 minutes).

condor_q -l tells me that

ImageSize 	= 4750000
ImageSize_RAW 	= 4561220

so I understand why the job is being held, but not why setting Requirements would help or correct ImageSize.

I've spent some time monitoring my job, and though there are two processes (the shell script and then the actual program that it launches), the resident memory never exceeds 210 megabytes.  The virtual memory is about 1.2 GB.

It appears the hold must be due to the (ImageSize > 3048000) clause (hold jobs that are using more than 3 GB). However, I am fairly certain that my job is not using that much, and I can't seem to find any way around this.

Can anyone suggest a way to get these jobs to dodge SYSTEM_PERIODIC_HOLD when their actual resident memory (observed through top) is closer to 210 megabytes? I'm not keen on adjusting the HOLD expression, as I don't want other errant jobs thrashing swap.

Thanks in advance for any suggestions on how to debug this further or any possible fixes.

Regards,

Dan

PS (Typo: where the documentation reads "You will need to change 1024 to a reasonably good estimate of the actual image size of your program, in kilobytes" I think that should be megabytes - as earlier we are told that Memory is measured in megabytes).

Dan O'Donovan Ph.D
SBGrid Consortium
Harvard Medical School


