Re: [Condor-users] Computing speed slows down gradually.
- Date: Tue, 26 Jan 2010 19:16:38 -0500
- From: Mag Gam <magawake@xxxxxxxxx>
- Subject: Re: [Condor-users] Computing speed slows down gradually.
My jobs were slow simply because they were all being assigned to the same hosts.
I had 100 hosts with 16 cores each, and when I submitted 500 jobs, those jobs
were assigned to hosts 0, 1, 2, 3, 4, 5, 6, etc. in order. That's the reason
for the slowdown.
I am not sure if this is possible but I would like to have something
like this in my ad:
And I would like the scheduler to prefer this job over any other
"JobType" and then place it on different hosts.
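The placement arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Condor's matchmaking logic: it assumes "in order" means every core on one host is filled before the next host is used, and the function names (fill_first, spread, summarize) are made up for this sketch.

```python
from collections import Counter


def fill_first(num_jobs, num_hosts, cores_per_host):
    """Depth-first placement: fill every core on host 0, then host 1, and so on."""
    if num_jobs > num_hosts * cores_per_host:
        raise ValueError("more jobs than cores in the pool")
    placement = Counter()
    for job in range(num_jobs):
        placement[job // cores_per_host] += 1
    return placement


def spread(num_jobs, num_hosts, cores_per_host):
    """Breadth-first placement: hand out one job per host in turn."""
    if num_jobs > num_hosts * cores_per_host:
        raise ValueError("more jobs than cores in the pool")
    placement = Counter()
    for job in range(num_jobs):
        placement[job % num_hosts] += 1
    return placement


def summarize(placement):
    """Return (number of hosts in use, largest job count on any one host)."""
    return len(placement), max(placement.values())


# Numbers taken from the message: 100 hosts, 16 cores each, 500 jobs.
packed = fill_first(500, 100, 16)
even = spread(500, 100, 16)
print(summarize(packed))  # (32, 16)
print(summarize(even))    # (100, 5)
```

Under that assumption, 500 jobs land on only 32 of the 100 hosts, with up to 16 jobs competing on each, whereas spreading them would put 5 jobs on every host.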
On Tue, Jan 26, 2010 at 12:49 PM, Genie Jhang <geniejhang@xxxxxxxxxxx> wrote:
> Thanks, Ian.
> I think I may have found what the problem is.
> The source code of my program, for Geant4, was somewhat wrong.
> It gives me a "Segmentation fault" after it's done.
> So I rewrote all the code and fixed it.
> I'm now testing my code.
> I expect it to be done about 16 hours from now.
> Thank you all.
> I'll try to answer other people's problems if they're ones I already know about.
> Thanks again.
> 2010/1/26 Ian Chesal <ian.chesal@xxxxxxxxx>
>> On Sun, Jan 24, 2010 at 3:05 PM, Genie Jhang <geniejhang@xxxxxxxxxxx>
>>> Thanks, Ian.
>>> After I read your reply, I checked the machines.
>>> Each has 2 dual-core 3.0GHz Xeon CPUs and 2GB of memory,
>>> and when a machine is working, it only uses about 160MB of memory.
>>> I first thought that when a job ends, it doesn't fully return the
>>> memory it occupied.
>>> So I also checked that, but no memory leak was found.
>> Clendon Gibson pointed out that the OS will reclaim the memory after the
>> job terminates. You need to observe it while it's running.
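One way to watch a job's memory while it runs is to sample its resident set size at intervals. The sketch below assumes a Linux host, where /proc/<pid>/status exposes a VmRSS line; the thread does not say which OS is in use, and the helper names are hypothetical.

```python
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def sample_rss(pid, samples=5, interval_s=1.0):
    """Read the process's resident memory several times while it is running."""
    readings = []
    for _ in range(samples):
        try:
            with open(f"/proc/{pid}/status") as f:
                readings.append(parse_vm_rss_kb(f.read()))
        except (FileNotFoundError, ProcessLookupError):
            break  # the process has exited
        time.sleep(interval_s)
    return readings


def is_growing(readings):
    """True if every reading is strictly larger than the one before it."""
    values = [r for r in readings if r is not None]
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))
```

Readings that keep climbing over the life of a job would point toward a leak inside the job itself; a flat profile would point elsewhere.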
>> Also: what OS are you running? If it's Windows, and you happen to be doing
>> a lot of CIFS or SMB file transfers with your jobs you can run into system
>> resource limits pretty quickly with 8 jobs on one machine doing a lot of
>> CIFS back-and-forth stuff. We see our Windows boxes go AWOL every few days
>> just from running out of network descriptors. They need to be rebooted
>> periodically to reclaim them. We'll also kill them doing a lot of junction
>> and hard link stuff with jobs. Again: reboot solves it every time. MKS
>> blames MS, MS blames MKS -- welcome to the world of "it's someone else's
>> problem" tech support. :)
>>> And I also set each job to require one CPU, so it can't come from sharing
>> No, but you're running 8 jobs in parallel aren't you? Even if each job is
>> only claiming one CPU.
>>> Still, as I submit jobs, the computing speed gets slower.
>>> The first job ends within about 1 day, but the larger the job number, the
>>> longer it takes to finish.
>>> What do you think is the problem?
>> What do your jobs do? Are they getting input data from some external
>> source? Did you check the external sources? Are they completely
>> self-contained? If so: check on the memory leak situation. Do they do
>> cumulative processing? Where one job produces input for a later job? If yes:
>> could you have a runaway loop in the code for the later job? Can you watch
>> the jobs running? Are they using 100% of the CPU (or whatever % you expect)?
>> Or are they spending their time idle or blocked for I/O? Do your jobs do
>> lots of local disk reading/writing? Could you be severely fragmenting your
>> local disk causing IO times to skyrocket?
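A rough way to tell a CPU-bound job from one that sits idle or blocked on I/O is to compare the CPU time it consumed with the wall-clock time it took. The sketch below is a minimal illustration, assuming a Unix-like system (the child CPU counters from os.times() are always zero on Windows); cpu_vs_wall is a made-up helper and the command is a placeholder for the real job.

```python
import os
import subprocess
import sys
import time


def cpu_vs_wall(cmd):
    """Run cmd to completion and return (CPU seconds used, wall-clock seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user / children_system accumulate CPU time of reaped child processes.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return cpu, wall


if __name__ == "__main__":
    # Placeholder command; substitute the real job's command line.
    cpu, wall = cpu_vs_wall([sys.executable, "-c", "sum(range(10**6))"])
    print(f"cpu={cpu:.2f}s wall={wall:.2f}s ratio={cpu / wall:.2f}")
```

For a single-threaded job, a ratio near 1 suggests it is compute-bound, while a ratio well below 1 suggests it spends much of its time waiting.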
>> You've got to give us far more information here. I'm just making some
>> guesses based on the vague bit of info I've got here.
>> What happens if you run your jobs serially, outside of Condor? One after
>> the other, on a single machine. Do they slow down then? Serially but with 8
>> in parallel -- slow down?
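The serial-versus-parallel experiment can be scripted outside of Condor. The following is a minimal sketch: run_serial and run_parallel are hypothetical helpers, and JOB_CMD is a placeholder for the real job's command line.

```python
import subprocess
import sys
import time

# Placeholder command; substitute the real job's command line.
JOB_CMD = [sys.executable, "-c", "sum(range(10**6))"]


def run_serial(cmd, count):
    """Run the job count times back to back; return each run's wall time."""
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def run_parallel(cmd, count):
    """Start count copies at once; return the wall time until all have finished."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(count)]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


if __name__ == "__main__":
    serial = run_serial(JOB_CMD, 3)
    print("serial runs:", [f"{d:.2f}s" for d in serial])
    print(f"8 in parallel: {run_parallel(JOB_CMD, 8):.2f}s")
```

If later serial runs take longer than earlier ones, the slowdown is in the job or the machine rather than in the scheduler; if only the parallel batch is slow, contention for CPU, memory, or I/O is the more likely cause.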
>>> Does anyone else, besides Mag Gam, have problems like this?
>> I've certainly seen my fair share of buggy code bog down machines. It's
>> never been because it was running under Condor though. And Condor has been
>> solid for years now.
>> - Ian
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> The archives can be found at: