
Re: [Condor-users] Maximum jobs on submit machine



Thanks for the ideas, Michael.  I’ll poke around and see what else I can learn.

 

Eric

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: Tuesday, January 24, 2012 4:31 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Maximum jobs on submit machine

 

Eric,

You might verify that you are changing the correct heap value (I am assuming you are). When you submit the jobs, watch the Task Manager and memory use. I would check whether jobs start executing and then die, or whether they sit idle. I have run into problems on Windows with too many open files, and depending on how you set MAX_JOBS_RUNNING and the log files in your submit files, this may be causing some problems. I am curious whether all jobs complete successfully, with the only exception being that you cannot use all the machines concurrently. You could also double-check that the pool password is stored on all machines (if it is not, jobs will not run on those machines). Use the following command; if UNDEF is returned for a machine, the password is not set correctly there.
condor_status -f "%s\t" Name -f "%s\n" ifThenElse(isUndefined(LocalCredd),\"UNDEF\",LocalCredd)
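On a large pool the output of that command can be scanned with a short script. This is only a sketch in Python: it assumes the output was captured as text with one tab-separated "Name<TAB>value" pair per line, which is what the format string above should produce. The function name and the sample host names are made up for illustration.

```python
def machines_without_credd(status_output):
    """Return the machine names whose second column reads UNDEF."""
    missing = []
    for line in status_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first tab only: name on the left, value on the right.
        name, _, value = line.partition("\t")
        if value.strip() == "UNDEF":
            missing.append(name.strip())
    return missing


# Hypothetical sample output, not taken from a real pool.
sample = "slot1@node01.example\tcredd.example\nslot1@node02.example\tUNDEF\n"
print(machines_without_credd(sample))
```

Any machine listed by the script would be a candidate for re-storing the pool password.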


1. Double check that the configuration setting for condor is not limiting the number of jobs you can run: MAX_JOBS_RUNNING = 500

2. Make sure you are changing the non-interactive desktop heap size (there are three heap size values you can change) on the submit machine:
SharedSection= 1st value, 2nd value, 3rd value
The first value defines the system heap size, the second value controls the interactive desktop heap (windows objects) and the third value controls the non-interactive desktop heap size. Change the third value following the description below.

Note:
Sixty condor_shadow daemons use 256 KB, so 120 shadows will consume the default (512 KB). To run 300 jobs, set the heap size to 1280 KB (32-bit); I use 3072 KB (64-bit).
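The arithmetic in the note can be written out as a small Python sketch. It takes the 60-shadows-per-256-KB figure from the note at face value and rounds up to whole 256 KB blocks, which is my assumption rather than anything documented. The helper names and the sample SharedSection values (1024,3072,512) are only illustrative; check the values actually present on your machine before changing anything.

```python
import math

KB_PER_BLOCK = 256       # from the note: 60 shadows use 256 KB
SHADOWS_PER_BLOCK = 60


def heap_kb_for_jobs(jobs):
    """Estimate the non-interactive desktop heap (KB) for `jobs` shadows."""
    blocks = math.ceil(jobs / SHADOWS_PER_BLOCK)  # round up to whole blocks
    return blocks * KB_PER_BLOCK


def set_third_value(shared_section, new_kb):
    """Return a SharedSection string with only its third value replaced."""
    key, _, values = shared_section.partition("=")
    parts = [v.strip() for v in values.split(",")]
    if len(parts) != 3:
        raise ValueError("expected three SharedSection values")
    parts[2] = str(new_kb)
    return key + "=" + ",".join(parts)


print(heap_kb_for_jobs(120))   # the default case from the note
print(heap_kb_for_jobs(300))   # the 300-job case from the note
print(set_third_value("SharedSection=1024,3072,512", heap_kb_for_jobs(300)))
```

This reproduces the 512 KB and 1280 KB figures above; it says nothing about the larger value I use on 64-bit.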

If you have already done this, maybe try simplifying your jobs and submitting a lot of hello world jobs or something similar. My pool is made up entirely of Windows machines (XP, 7, R2, 32-bit and 64-bit), and I have been able to run close to 300 jobs, if not more, so I am pretty sure there is no hard limit on the number of jobs that can run concurrently.

Maybe there are other ideas, but I have not seen this problem. We run most jobs in Vanilla and we require RunAsOwner. Maybe this problem is tied to the type of universe or other constraints used in your pool.

Hopefully you can get this figured out,
mike



From: Eric Abel <Eric.Abel@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 01/24/2012 05:07 PM
Subject: Re: [Condor-users] Maximum jobs on submit machine
Sent by: condor-users-bounces@xxxxxxxxxxx

 





Hi all,

A much-delayed follow-up: I tried increasing the heap size to 8 MB, but the maximum number of simultaneous jobs (for this particular machine) remains 105-110.  I checked the Microsoft website, and apparently the upper limit for this value is about 40 MB.  I'm not sure what setting it that high would do, but it doesn't matter, because increasing it to 8 MB did nothing to help my problem.  Any other ideas?  It would be nice to reach that 300-machine limit!

Thanks,

Eric

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gore, Brooklin
Sent: Thursday, January 19, 2012 7:40 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Maximum jobs on submit machine

Try some Google searches, I didn't get anything in the first hit, but this
additional tip:

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/wsisinst_config_winreg.html

I suggest trying 2048, 4096, etc. to see if this helps get you more jobs.

~B

On 1/18/12 5:12 PM, "Eric Abel" <Eric.Abel@xxxxxxxxxx> wrote:

>Thanks for the tip.  I changed the SharedSection value from 512 to 1280
>following the instructions on the link you provided, and now the number
>of jobs seems to peak at about 110.  However, I am not able to go much
>higher...is there a maximum to the value SharedSection can have?
>
>Eric
>
>-----Original Message-----
>From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gore, Brooklin
>Sent: Friday, January 13, 2012 11:49 AM
>To: Condor-Users Mail List
>Subject: Re: [Condor-users] Maximum jobs on submit machine
>
>Eric,
>
>While your maximum jobs running (50-85) is a bit lower than the 120
>usually associated with the Windows HEAP size issue, it could be related.
>
>Check the last article here:
>http://research.cs.wisc.edu/condor/manual/v6.8/7_4Condor_on.html
>
>A silly question: There are more than 50-85 machines available to actually
>run these jobs, right?
>
>Best, ~Brooklin
>
>On 1/13/12 10:17 AM, "Eric Abel" <Eric.Abel@xxxxxxxxxx> wrote:
>
>>Lukas, Michael, Matthew, and Mark,
>>
>>Thank you for your responses.  I will respond to all of you in a single
>>email if possible.
>>
>>First, this is a Windows pool.  The problem I am having is a maximum
>>number of jobs running concurrently on a submit machine.  All of the
>>execute machines are capped at the number of available CPUs, and they
>>are working fine.  Like most places, each machine is set up with
>>anti-virus software, in this case Symantec.  The anti-virus utility is
>>set up to handle the firewall, so Windows Firewall is disabled.  I have
>>had to get IT to enable exceptions for all Condor processes.  I have been
>>running the pool for about 8-9 months now, but only recently have I
>>recruited enough CPUs for this problem to surface.
>>
>>I have validated that the MaxJobsRunning value is not the limiter by
>>setting its value first to 30, which definitely capped the number of
>>running jobs at 30, then setting it to 2000, in which case the number of
>>running jobs simply floated up to the maximums of 85 and 50 that I
>>initially reported.
>>
>>Mark, if I were to temporarily disable Symantec, then this would test
>>whether or not it's a firewall issue, correct?
>>
>>Thank you all for your ideas.  Hopefully we can find a resolution here.
>>
>>Eric
>>
>>
>>-----Original Message-----
>>From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Lukas Slebodnik
>>Sent: Friday, January 13, 2012 8:03 AM
>>To: Condor-Users Mail List
>>Subject: Re: [Condor-users] Maximum jobs on submit machine
>>
>>On Fri, Jan 13, 2012 at 10:49:32AM -0500, Matthew Farrellee wrote:
>>> On 01/13/2012 10:22 AM, Eric Abel wrote:
>>> >Fellow condor users,
>>> >
>>> >I am finding that there is a limit to the number of jobs that will run
>>> >on a given submit machine, and that number is different depending on
>>> >the machine. I have already verified that this limit is well below the
>>> >default MaxJobsRunning value. For example on one machine the maximum
>>> >seems to be about 85, and on another it's about 50. Any ideas on this?
>>> >
>>> >Thanks,
>>> >
>>> >Eric
>>>
>>> [MAX_JOBS_RUNNING]
>>> default=ceiling(ifThenElse( $(DETECTED_MEMORY)*0.8*1024/800 < 10000,
>>> $(DETECTED_MEMORY)*0.8*1024/800, 10000 ))
>>>
>>> So the MaxJobsRunning is a function of RAM in the box. If you're on
>>> Windows it is more complicated. Generally, I recommend using a
>>> non-Windows machine for hosting the condor_schedd.
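To see what the default expression above works out to for a given amount of RAM, here is a minimal Python sketch. It assumes DETECTED_MEMORY is expressed in megabytes and reads the ifThenElse as a plain minimum with 10000; the function name is made up, and the sketch ignores any separate Windows-specific cap.

```python
import math


def default_max_jobs_running(detected_memory_mb):
    """Evaluate ceiling(min(mem * 0.8 * 1024 / 800, 10000)) for mem in MB."""
    estimate = detected_memory_mb * 0.8 * 1024 / 800
    return math.ceil(min(estimate, 10000))


# Assumed example memory sizes, in MB.
for mem in (2048, 4096, 16384):
    print(mem, default_max_jobs_running(mem))
```

Under these assumptions a 4 GB machine gets a default a little above 4000, far more than the 50 to 85 running jobs reported, which is consistent with MaxJobsRunning not being the limiter here.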
>>
>>You can view the values for all schedd daemons by executing the command
>>condor_status -sched -f "%s " Name -f "%s\n" MaxJobsRunning
>>
>>On Windows platforms, the number of running jobs is capped at 200.
>>A 64-bit version of Windows is recommended in order to raise the value
>>above the default.
>>
>>Details:
>>http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#18253
>>
>>Regards,
>>Lukas
>>
>>>
>>> Best,
>>>
>>>
>>> matt
>>_______________________________________________
>>Condor-users mailing list
>>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>subject: Unsubscribe
>>You can also unsubscribe by visiting
>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>>The archives can be found at:
>>https://lists.cs.wisc.edu/archive/condor-users/