
Re: [Condor-users] Maximum jobs on submit machine



Eric,

How many jobs are you trying to run concurrently from the submit machine? I have a Windows pool (~300 slots) and can run jobs on all of them when they are not otherwise in use. For example, I may submit a DAG with 15k jobs, of which no more than 300 will run at once for our pool (I generally set maxjobs for the DAG because I do not have enough slots to support more); the remaining jobs stay idle. So at this pool size I have not had a problem.
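For what it's worth, throttling a DAG that way is just a flag on the submit command; a sketch (the DAG file name here is made up):

```shell
# Cap the DAG so no more than 300 of its node jobs are submitted/running at once;
# pipeline.dag is a hypothetical DAG description file.
condor_submit_dag -maxjobs 300 pipeline.dag
```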

We also use Symantec, but I have added exceptions for the Condor processes and install directory. After making these changes I have not had any problems on this front. Desktop firewalls are disabled within our LAN. I would guess this is not your problem, since it sounds like you have already addressed it.

Can you describe what Task Manager is doing when you submit these jobs? What are your machine's memory and CPU usage doing, and what changes do you see when the jobs die off?

You could also try looking at the job log files or the Condor daemon logs. These may contain information about why the jobs die.
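A few concrete places to look, as a sketch (the job ID is hypothetical, and log locations assume a stock install):

```shell
# Ask the schedd why a specific job isn't running (123.0 is a made-up job ID).
condor_q -better-analyze 123.0

# The schedd and shadow logs on the submit machine often record why jobs die.
tail -n 100 "$(condor_config_val LOG)/SchedLog"
tail -n 100 "$(condor_config_val LOG)/ShadowLog"
```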

mike






From: Mark Calleja <mc321@xxxxxxxxx>
To: condor-users@xxxxxxxxxxx
Date: 01/13/2012 10:14 AM
Subject: Re: [Condor-users] Maximum jobs on submit machine
Sent by: condor-users-bounces@xxxxxxxxxxx





Hi Eric,

I'm not familiar with Windows+Symantec (I'm from a Linux+iptables
background), but if you're sure that Symantec is the only application
that's limiting port availability then I'd certainly be minded to try
relaxing its settings on at least one of the submit hosts (is there no
way to determine what port range it's currently configured to use?).
However, please clear this with your IT department first, i.e. please
don't sue me if something gets exploited while the test is taking place!
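As a starting point, Condor's own port-range knobs can at least be read back with condor_config_val, which tells you what range the firewall would need to allow (a sketch; "Not defined" output means no restriction is configured):

```shell
# Port ranges Condor is configured to use, if any.
condor_config_val LOWPORT HIGHPORT
condor_config_val IN_LOWPORT IN_HIGHPORT OUT_LOWPORT OUT_HIGHPORT
```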

Best wishes,
Mark

On 13/01/2012 16:17, Eric Abel wrote:
> Lukas, Michael, Matthew, and Mark,
>
> Thank you for your responses.  I will respond to all of you in a single email if possible.
>
> First, this is a Windows pool.  The problem I am having is a cap on the number of jobs running concurrently on a submit machine.  All of the execute machines are capped at the number of available CPUs, and they are working fine.  Like most places, each machine is set up with anti-virus software, in this case Symantec.  The anti-virus utility is set up to handle the firewall, so the Windows firewall is disabled.  I have had to get IT to enable exceptions for all Condor processes.  I have been running the pool for about 8-9 months now, but only recently have I recruited enough CPUs for this problem to surface.
>
> I have validated that the MaxJobsRunning value is not the limiter by first setting it to 30, which definitely capped the number of running jobs at 30, and then setting it to 2000, in which case the number of running jobs simply floated up to the maximums of 85 and 50 that I initially reported.
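The validation described above, sketched as commands (the config edit itself is made by hand in the submit machine's condor_config):

```shell
# After editing condor_config on the submit machine, e.g.
#   MAX_JOBS_RUNNING = 30      # later raised to: MAX_JOBS_RUNNING = 2000
condor_reconfig -schedd              # tell the schedd to re-read its configuration
condor_config_val MAX_JOBS_RUNNING   # verify the value actually took effect
```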
>
> Mark, if I were to temporarily disable Symantec, then this would test whether or not it's a firewall issue, correct?
>
> Thank you all for your ideas.  Hopefully we can find a resolution here.
>
> Eric
>
>
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Lukas Slebodnik
> Sent: Friday, January 13, 2012 8:03 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Maximum jobs on submit machine
>
> On Fri, Jan 13, 2012 at 10:49:32AM -0500, Matthew Farrellee wrote:
>> On 01/13/2012 10:22 AM, Eric Abel wrote:
>>> Fellow condor users,
>>>
>>> I am finding that there is a limit to the number of jobs that will run
>>> on a given submit machine, and that number is different depending on the
>>> machine. I have already verified that this limit is well below the
>>> default MaxJobsRunning value. For example on one machine the maximum
>>> seems to be about 85, and on another it’s about 50. Any ideas on this?
>>>
>>> Thanks,
>>>
>>> Eric
>> [MAX_JOBS_RUNNING]
>> default=ceiling(ifThenElse( $(DETECTED_MEMORY)*0.8*1024/800 < 10000,
>>                             $(DETECTED_MEMORY)*0.8*1024/800, 10000 ))
>>
>> So the MaxJobsRunning is a function of RAM in the box. If you're on
>> Windows it is more complicated. Generally, I recommend using a
>> non-Windows machine for hosting the condor_schedd.
> You can view values for all schedd daemons by executing command
> condor_status -sched -f "%s " Name -f "%s\n" MaxJobsRunning
>
> On Windows platforms, the number of running jobs is capped at 200.
> A 64-bit version of Windows is recommended in order to raise the value above
> the default.
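That memory-based default can be sanity-checked with plain integer arithmetic: 0.8*1024/800 simplifies to 1024/1000, so for a hypothetical 8 GiB (8192 MiB) submit machine:

```shell
# Compute ceiling(DETECTED_MEMORY * 0.8 * 1024 / 800), i.e. ceiling(mem * 1024/1000),
# capped at 10000 -- the default MAX_JOBS_RUNNING expression quoted earlier.
detected_memory_mb=8192                                # assumption: an 8 GiB box
jobs=$(( (detected_memory_mb * 1024 + 999) / 1000 ))   # integer ceiling division
if [ "$jobs" -gt 10000 ]; then jobs=10000; fi
echo "$jobs"                                           # prints 8389
```

So a machine of that size would default well above the Windows cap of 200 that Lukas mentions.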
>
> Details:
> http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#18253
>
> Regards,
> Lukas
>
>> Best,
>>
>>
>> matt
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/

--
Mark Calleja - Scientific Computing Group
University of Cambridge Computing Service
New Museums Site, Pembroke Street
Cambridge CB2 3QH, UK
Tel. (+44/0) 1223 761254
