Re: [HTCondor-users] Memory requests increasing

Thanks for this, but I'm still not quite sure what's going on, so maybe an example will help.

I've got a job which was submitted yesterday and the requirements line in the submit file was:

Requirements = starccm == "yes" && Memory >=900

It started running but is now shown as idle; running condor_q gives me the info below. What I don't understand is why it's now asking for 2686MB of RAM. I can see where the number comes from: the .log file shows that the job was using 2686MB when it was evicted yesterday (probably because another user started to use the machine). But why does Condor assume it needs that much RAM in order to re-run?


C:\scripts>condor_q  1534 -analysebetter

-- Submitter: HTCONDOR.cc.ic.ac.uk : <> : HTCONDOR.cc.ic.ac.uk
1534.000:  Run analysis summary.  Of 2815 machines,
   2815 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
        Last successful match: Wed Feb 27 17:40:13 2013
        Last failed match: Thu Feb 28 11:55:58 2013

        Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( target.starccm == "yes" && target.Memory >= 900 ) &&
( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "WINDOWS" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.HasFileTransfer )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.Memory >= 2686 )         0                   MODIFY TO 1018
2   target.starccm == "yes"           1876
3   target.Memory >= 900              2532
4   ( TARGET.Arch == "X86_64" )       2815
5   ( TARGET.OpSys == "WINDOWS" )     2815
6   ( TARGET.Disk >= 500000 )         2815
7   ( TARGET.HasFileTransfer )        2815
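
One possible reading of condition 1, sketched in Python. It rests on an assumption I have not confirmed for this pool: the submit file sets no request_memory, so RequestMemory falls back to the documented default expression ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize+1023)/1024). The function names, the image size and the 1018MB slot are made up for illustration; only the 900 and 2686 figures come from the output above.

```python
from typing import Optional


def default_request_memory_mb(memory_usage_mb: Optional[int], image_size_kb: int) -> int:
    """Mimic the assumed default RequestMemory expression.

    memory_usage_mb is None until the job has run and reported usage
    (standing in for an undefined MemoryUsage attribute).
    """
    if memory_usage_mb is not None:
        return memory_usage_mb
    # Round the image size (KB) up to whole MB.
    return (image_size_kb + 1023) // 1024


def slot_matches(slot_memory_mb: int, request_memory_mb: int, submit_min_mb: int = 900) -> bool:
    """Both memory clauses of the job's Requirements must hold."""
    return slot_memory_mb >= submit_min_mb and slot_memory_mb >= request_memory_mb


# Hypothetical values: a 500000 KB image and a slot advertising 1018 MB.
before_first_run = default_request_memory_mb(None, 500_000)
after_eviction = default_request_memory_mb(2686, 500_000)

print(slot_matches(1018, before_first_run))  # request derived from image size
print(slot_matches(1018, after_eviction))    # request derived from measured usage
```

If that assumption holds, setting request_memory explicitly in the submit file (for example request_memory = 900) should stop the measured usage from being substituted on the re-run, though I have not tested this here.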

> We're running Condor 7.8.2 and seeing that some jobs never complete. The log file below is from a job using Abaqus. I submit the job via Condor and it gets picked up by a machine. Provided that no-one reboots the machine, the file gets processed in about 3 hours on a machine with 4GB of RAM. There's a lot of swapping to disk but it all works.
> I'm not sure that I understand what the log below is telling me; the final lines are easy - the user aborted because nothing had happened but is there anything significant about the increasing "ResidentSetSize"? 

The ResidentSetSize is just reporting the maximum RAM used by the job so far. A ResidentSetSize of 3.5GB agrees with your report that the job causes swapping on a machine with 4GB of RAM, but can run successfully (depending on what else is using memory on the machine).
When the Image size events cease, it means the job's RAM usage has plateaued or declined. The job is still running. If it was running for longer than expected, maybe additional load on the machine slowed down execution (due to contention for CPU or RAM).
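
As a rough illustration of the point above, here is a small Python sketch of a "report only when the peak grows" rule. It is a simplification, not HTCondor's actual logic (any thresholds or update intervals the real daemons apply are not modelled), and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def peak_update_events(samples_mb: List[int]) -> List[Tuple[int, int]]:
    """Return (sample_index, new_peak_mb) for each sample that sets a new peak.

    A sample that equals or falls below the current peak produces no event,
    which is why the events stop once usage plateaus or declines.
    """
    peak = 0
    events = []
    for index, rss in enumerate(samples_mb):
        if rss > peak:
            peak = rss
            events.append((index, peak))
    return events


# Hypothetical memory samples (MB) taken while the job keeps running.
samples = [512, 1400, 1400, 2686, 2500, 2100, 2600]
events = peak_update_events(samples)
print(events)
```

In this toy run the last event is at the fourth sample; the later samples are all below the peak, so nothing more is logged even though the job is still running.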

