
Re: [Condor-users] Computing speed slows down gradually.

On Sun, Jan 24, 2010 at 3:05 PM, Genie Jhang <geniejhang@xxxxxxxxxxx> wrote:
Thanks, Ian.
After I read your reply, I checked the machines.
It has 2 dual-core 3.0GHz Xeon CPUs, and 2GB memory,
and when the machine works, it only takes about 160MB of memory.
At first I thought that when a job ends, it doesn't fully return the memory it occupied.
So I checked that too, but no memory leak was found.

Clendon Gibson pointed out that the OS will reclaim the memory after the job terminates. You need to observe it while it's running.
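
One way to watch a job's memory while it is running is to sample its resident set size at intervals. The sketch below assumes a Linux execute node (the thread has not established which OS is in use) and reads the `VmRSS` field from `/proc/<pid>/status`. The function names, the sampling interval, and the sample count are illustrative choices, not anything provided by Condor.

```python
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:\t  163840 kB"
            return int(line.split()[1])
    return None


def sample_rss_kb(pid, interval_s=60.0, count=10):
    """Sample a running process's resident memory (assumes Linux /proc)."""
    samples = []
    for i in range(count):
        # Raises FileNotFoundError if the process has already exited.
        with open(f"/proc/{pid}/status") as f:
            samples.append(parse_vmrss_kb(f.read()))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

A series that climbs steadily from sample to sample would point toward a leak inside the job; a series that stays flat near the 160MB mentioned above would point elsewhere.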

Also: what OS are you running? If it's Windows, and you happen to be doing a lot of CIFS or SMB file transfers with your jobs, you can run into system resource limits pretty quickly with 8 jobs on one machine doing a lot of CIFS back-and-forth. We see our Windows boxes go AWOL every few days just from running out of network descriptors. They need to be rebooted periodically to reclaim them. We'll also kill them by doing a lot of junction and hard link stuff with jobs. Again: a reboot solves it every time. MKS blames MS, MS blames MKS -- welcome to the world of "it's someone else's problem" tech support. :)
And I also set each job to require one CPU, so it can't come from sharing CPUs.

No, but you're running 8 jobs in parallel, aren't you, even if each job is only claiming one CPU?
Still, as I submit jobs, the computing speed keeps getting slower.
The first job ends within about 1 day, but the larger the job number, the longer it takes to finish.
What do you think the problem is?

What do your jobs do? Are they getting input data from some external source? Did you check the external sources? Or are the jobs completely self-contained? If so: check on the memory leak situation. Do they do cumulative processing, where one job produces input for a later job? If yes: could you have a runaway loop in the code for the later job? Can you watch the jobs running? Are they using 100% of the CPU (or whatever % you expect)? Or are they spending their time idle or blocked for I/O? Do your jobs do lots of local disk reading/writing? Could you be severely fragmenting your local disk, causing I/O times to skyrocket?
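
A rough way to tell a CPU-bound job from one that sits idle or blocked is to compare the CPU time it consumed with the wall-clock time it took. The sketch below assumes a Unix-like system, where `os.times()` reports CPU time for child processes that have been waited for (on Windows those fields are zero). `cpu_fraction` is a hypothetical helper name, and `cmd` stands in for whatever command launches the job.

```python
import os
import subprocess
import time


def cpu_fraction(cmd):
    """Run cmd and return (child CPU seconds) / (wall-clock seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time charged to child processes (user + system); Unix only.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return cpu / wall
```

For a single-threaded job, a value near 1.0 suggests it is compute-bound, while a value well below 1.0 suggests it spends much of its time waiting on I/O, the network, or something else. Multi-threaded jobs can exceed 1.0.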

You've got to give us far more information here. I'm just making some guesses based on the vague bit of info I've got here.

What happens if you run your jobs serially, outside of Condor, one after the other on a single machine? Do they slow down then? What about 8 such serial streams running in parallel -- do they slow down?
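
The serial test can be scripted so the per-run times are easy to compare. This is a minimal sketch: `time_serial_runs` and `slowdown_ratio` are hypothetical helper names, and `cmd` is whatever command starts one job outside of Condor.

```python
import subprocess
import time


def time_serial_runs(cmd, runs=5):
    """Run cmd `runs` times back to back; return wall-clock seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def slowdown_ratio(durations):
    """Ratio of the last run's duration to the first run's duration."""
    return durations[-1] / durations[0]
```

If the ratio stays close to 1 when the jobs run one at a time but grows when 8 copies of this loop run side by side, the slowdown more likely comes from contention on the machine than from any single job.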

Does anyone else, besides Mag Gam, have problems like this?

I've certainly seen my fair share of buggy code bog down machines. It's never been because it was running under Condor though. And Condor has been solid for years now.

- Ian