Re: [Condor-users] Computing speed slows down gradually.



My jobs were slow simply because they were all being assigned to the same
few hosts.

I have 100 hosts with 16 cores each, and when I submit 500 jobs, those jobs
get assigned to hosts 0, 1, 2, 3, 4, 5, 6, etc. in order, filling up each
host instead of spreading out. That's the reason why.
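
A rough sketch of the arithmetic, in Python. This is only a toy model and
not how the Condor negotiator actually works: the function names are made
up for illustration, and it assumes every core on a host is filled before
the next host gets a job. It compares that in-order packing with a simple
round-robin spread for 500 jobs on 100 hosts with 16 cores each.

```python
from collections import Counter


def fill_in_order(num_jobs, num_hosts, cores_per_host):
    """Toy model: fill every core on host 0, then host 1, and so on."""
    if num_jobs > num_hosts * cores_per_host:
        raise ValueError("more jobs than available cores")
    placement = Counter()
    for job in range(num_jobs):
        placement[job // cores_per_host] += 1
    return placement


def spread_round_robin(num_jobs, num_hosts, cores_per_host):
    """Toy model: hand out one job per host in turn before doubling up."""
    if num_jobs > num_hosts * cores_per_host:
        raise ValueError("more jobs than available cores")
    placement = Counter()
    for job in range(num_jobs):
        placement[job % num_hosts] += 1
    return placement


if __name__ == "__main__":
    for name, place in (("in order", fill_in_order),
                        ("round robin", spread_round_robin)):
        jobs_per_host = place(500, 100, 16)
        print(name, "hosts used:", len(jobs_per_host),
              "busiest host:", max(jobs_per_host.values()))
```

Under these assumptions the in-order packing uses only 32 of the 100 hosts,
with 16 jobs competing on each full host, while the round-robin spread puts
5 jobs on every host.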

I am not sure if this is possible but I would like to have something
like this in my ad:

+JobType="physics"

And I would like the scheduler to prefer jobs with this "JobType" over
any other, and then place them on different hosts.



On Tue, Jan 26, 2010 at 12:49 PM, Genie Jhang <geniejhang@xxxxxxxxxxx> wrote:
> Thanks, Ian.
>
> I think I may have found the problem.
>
> The source code of my program, for geant4, was somewhat wrong.
>
> It gives me a "Segmentation fault" after it's done.
>
> So I rewrote the code and fixed it.
>
> I'm now testing my code.
>
> I anticipate it will be done about 16 hours from now.
>
> Thank you all.
>
> I'll try to answer others' problems if there are any I already know about.
>
> Thanks again.
>
> 2010/1/26 Ian Chesal <ian.chesal@xxxxxxxxx>
>>
>> On Sun, Jan 24, 2010 at 3:05 PM, Genie Jhang <geniejhang@xxxxxxxxxxx>
>> wrote:
>>>
>>> Thanks, Ian.
>>>
>>> After I read your reply, I checked the machines.
>>>
>>> Each has 2 dual-core 3.0GHz Xeon CPUs and 2GB of memory,
>>>
>>> and when the machine is working, it only uses about 160MB of memory.
>>>
>>> I first thought that when a job ends, it doesn't fully return the
>>> memory it occupied.
>>>
>>> So I checked that as well, but no memory leak was found.
>>
>> Clendon Gibson pointed out that the OS will reclaim the memory after the
>> job terminates. You need to observe it while it's running.
>> Also: what OS are you running? If it's Windows, and you happen to be doing
>> a lot of CIFS or SMB file transfers with your jobs you can run into system
>> resource limits pretty quickly with 8 jobs on one machine doing a lot of
>> CIFS back-and-forth stuff. We see our Windows boxes go AWOL every few days
>> just from running out of network descriptors. They need to be rebooted
>> periodically to reclaim them. We'll also kill them doing a lot of junction
>> and hard link stuff with jobs. Again: reboot solves it every time. MKS
>> blames MS, MS blames MKS -- welcome to the world of "it's someone else's
>> problem" tech support. :)
>>
>>>
>>> I also set each job to require one CPU, so the slowdown can't come from
>>> jobs sharing CPUs.
>>
>> No, but you're running 8 jobs in parallel aren't you? Even if each job is
>> only claiming one CPU.
>>
>>>
>>> Still, as I submit jobs, the computing speed gets slower.
>>>
>>> The first job ends within about 1 day, but the larger the job number, the
>>> longer it takes to finish.
>>>
>>> What do you think is the problem?
>>
>> What do your jobs do? Are they getting input data from some external
>> source? Did you check the external sources? Are they completely
>> self-contained? If so: check on the memory leak situation. Do they do
>> cumulative processing? Where one job produces input for a later job? If yes:
>> could you have a runaway loop in the code for the later job? Can you watch
>> the jobs running? Are they using 100% of the CPU (or whatever % you expect)?
>> Or are they spending their time idle or blocked for I/O? Do your jobs do
>> lots of local disk reading/writing? Could you be severely fragmenting your
>> local disk causing IO times to skyrocket?
>> You've got to give us far more information here. I'm just making some
>> guesses based on the vague bit of info I've got here.
>> What happens if you run your jobs serially, outside of Condor? One after
>> the other, on a single machine. Do they slow down then? Serially but with 8
>> in parallel -- slow down?
>>>
>>> Does anyone else, besides Mag Gam, have problems like this?
>>
>> I've certainly seen my fair share of buggy code bog down machines. It's
>> never been because it was running under Condor though. And Condor has been
>> solid for years now.
>> - Ian
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/
>>