Re: [Condor-users] Jobs become idle after some time



Raman Sehgal <sc.ramansehgal@xxxxxxxxx> wrote:
> We have a small Condor cluster, and it works fine.
> But it has a problem: when I submit some jobs,
> say 100 jobs, all the jobs start running, but after some time
> around 30% of the jobs become idle. I don't know why!

Have you checked the individual job log files, the ones specified
with log=filename in the submit file?  If a job goes from Running
to Idle, the reason should be recorded there.
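If there are too many log files to read by hand, a short script can
pull out the events of interest.  The following is only a rough sketch:
it assumes the classic plain-text user log format, where each event
starts with a three-digit event number and a (cluster.proc.subproc)
job id.  The function name find_interruptions, the choice of event
numbers to flag, and the sample log text are illustrative assumptions,
not anything taken from your cluster.

```python
import re

# Header line of an event in a Condor user log, for example:
#   004 (123.000.000) 07/10 12:05:00 Job was evicted.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (.*)$")

# Event numbers that often explain a Running -> Idle transition.
# Which events to flag is an assumption made for this sketch.
INTERESTING = {
    "004": "evicted",
    "007": "shadow exception",
    "010": "suspended",
    "022": "disconnected",
    "024": "reconnect failed",
}


def find_interruptions(log_text):
    """Return (cluster, proc, label, rest_of_header) for each flagged event."""
    hits = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m and m.group(1) in INTERESTING:
            cluster, proc = int(m.group(2)), int(m.group(3))
            hits.append((cluster, proc, INTERESTING[m.group(1)], m.group(5)))
    return hits


# Made-up sample log text, only to show the expected shape of the input.
SAMPLE = """\
000 (123.000.000) 07/10 12:00:00 Job submitted from host: <192.0.2.1:9618>
...
001 (123.000.000) 07/10 12:01:00 Job executing on host: <192.0.2.2:9618>
...
004 (123.000.000) 07/10 12:05:00 Job was evicted.
\t(0) Job was not checkpointed.
...
"""

for hit in find_interruptions(SAMPLE):
    print(hit)
```

To use it on a real log, read the file named by log=filename and pass
its contents to find_interruptions.  The indented lines that follow each
event header usually carry the details, so it is still worth opening
the log around any event the script reports.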

If that doesn't help, a few things:

- Is your submit computer rebooting, or are your daemons crashing
  and restarting?  The MasterLog on the submit computer should give
  you some clues.

- Check the SchedLog and ShadowLog on the submit computer to see
  if there are any errors.
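As a starting point for going through those daemon logs, something
like the following could list the suspicious lines.  This is a sketch
under assumptions: the keywords are guesses at what is worth flagging
and should be adjusted to what your logs actually contain, the helper
name scan_daemon_logs is made up, and log_dir must be the directory
your daemons log to (condor_config_val LOG prints it).

```python
import os

# Daemon logs mentioned above; all normally live in the LOG directory.
DAEMON_LOGS = ("MasterLog", "SchedLog", "ShadowLog")

# Substrings to flag, matched case-insensitively.  These are assumptions.
KEYWORDS = ("error", "except")


def scan_daemon_logs(log_dir, names=DAEMON_LOGS, keywords=KEYWORDS):
    """Return (file name, line number, text) for each line containing a keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for name in names:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # this log may not exist on this machine
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                text = line.rstrip("\n")
                if any(k in text.lower() for k in lowered):
                    hits.append((name, lineno, text))
    return hits
```

Calling scan_daemon_logs with the log directory returns the matching
lines in file order, so the line numbers can be used to jump to the
surrounding context in each log.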

> Another problem is that sometimes Condor shows the job status as
> running, but in reality the jobs are not running at all; they are
> stuck at some point.

When you say the job is not running, what are you observing?

Again, the job's log may have further information about what is
going on.

Is there any possibility that the execute computer has suspended
the job?  This would be controlled by the SUSPEND and CONTINUE
settings in your configuration file on the execute computer.

You might also try condor_ssh_to_job (assuming you're on Linux)
to observe the running job directly.

-- 
Alan De Smet                 Center for High Throughput Computing
adesmet@xxxxxxxxxxx                       http://chtc.cs.wisc.edu