
Re: [Condor-users] Problems on Mac OS X



dautret wrote:

1. I use Condor on a cluster of G5s running Mac OS X. After a few hours, condor_master crashes on the submit machine, which causes the jobs to fail. This problem seems to occur with many applications on Mac OS X, but no one on the Mac forums has had an idea of how to solve it.
The problem occurs with Condor 6.6.5, 6.6.7 and 6.7.2…
So, has anyone used Condor on Mac OS X, and have you seen these crashes?


I run Condor 6.7.3 on a PowerBook G4 with 1 GB of memory and Condor seems very stable. It was also stable under 6.7.2. I have it running jobs where a single node computes 400 processes in a row without a crash. My jobs are also fairly memory intensive; they run Apple's Shake program.
If you update to Condor 6.7.3, beware that there is a slight bug: you will need to add the following to your local config file, or the ClassAd attribute for the OS type will be wrong:


OpSys="OSX"
STARTD_EXPRS=$(STARTD_EXPRS) OpSys


Here are some questions which might help debug the problem:

1) Do all programs cause crashes? Try to monitor the RAM usage on the machines. Are they running out of memory, or is the CPU activity suddenly stuck at 100%, which would indicate some sort of crash or hang? Perhaps the program you are running has a bug in it, such as a memory leak, which eventually causes the Macs to crash. This was your error message: Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000000

2) Do all your G5s crash consistently, or is it a specific set of machines? Perhaps a few of your machines have bad memory chips. That happened to me on a fairly new computer, and it caused the system to crash many times.
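
For question 1, a rough way to check for a memory leak is to sample a process's resident memory at intervals and see whether it only ever grows. The Python sketch below is only an illustration: the function names, the 10 MB threshold and the sampling interval are my own assumptions, not anything provided by Condor. It relies on the standard `ps -o rss= -p PID` command, which prints the resident set size in kilobytes.

```python
import subprocess
import time


def sample_rss_kb(pid):
    """Return the resident set size of `pid` in kilobytes, as reported by ps."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def looks_like_leak(samples_kb, min_growth_kb=10240):
    """Heuristic (assumed threshold): memory never shrinks between samples
    and grows by at least `min_growth_kb` overall."""
    if len(samples_kb) < 3:
        return False  # too few samples to say anything
    never_shrinks = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    return never_shrinks and samples_kb[-1] - samples_kb[0] >= min_growth_kb


def watch(pid, interval_s=60, count=10):
    """Collect `count` samples, `interval_s` seconds apart, and apply the heuristic."""
    samples = []
    for _ in range(count):
        samples.append(sample_rss_kb(pid))
        time.sleep(interval_s)
    return looks_like_leak(samples)
```

You would call something like `watch(pid_of_your_job)` on an execute machine while a job runs. A steady climb is only a hint of a leak, not proof, since some programs legitimately grow as they load data.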


2. Sometimes only the manager machine crashes, but not the submit machine… At that point, Condor stops… but when I launch condor_master again on the manager machine, the jobs restart, although I have a vanilla configuration…!

The jobs should not restart from the first process in the job cluster. I think they should continue from the process number just before the crash. Is this what is happening, or do your jobs restart from the first process in the cluster of jobs?