
Re: [Condor-users] DAGMAN memory



On Tue, 10 Jun 2008, Aengus McCullough wrote:

I have been running large DAGMan job collections comprising 500-1500 individual jobs running concurrently. On initial runs of the job I noticed that several of these jobs were failing. I have managed to resolve the issue by restricting the maximum number of concurrent jobs to 80 and setting the maximum number of retries to 3. I understand that this issue is a result of DAGMan memory limitations; can anyone confirm this? Is this a limitation on the central manager or elsewhere? Is there any way to resolve this issue aside from restricting the maximum number of concurrent jobs?

Hmm, I'd be really surprised if this problem was a result of memory limitations in DAGMan itself -- other users are successfully running DAGs with several hundred thousand nodes. It could be the result of some other resource limitation, though.
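Incidentally, the throttling and retry behavior you describe can be expressed directly to DAGMan. Here is a minimal sketch -- the file and node names are made up, not taken from your setup:

    # my.dag (hypothetical DAG file)
    JOB A nodeA.submit
    JOB B nodeB.submit
    # Re-run a node up to 3 times if it exits with a nonzero status
    RETRY A 3
    RETRY B 3

    # Keep at most 80 node jobs in the queue at any one time
    condor_submit_dag -maxjobs 80 my.dag

Using -maxjobs (or, if your version supports it, the DAGMAN_MAX_JOBS_SUBMITTED configuration macro) keeps the throttle in DAGMan itself, so you don't have to change the individual node submit files.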

When you say that jobs are failing, by "job" you mean an individual node job in the DAG, right? (As opposed to DAGMan itself crashing.) If that is the case, you need to look at the user log(s) from those jobs, and any other info you may have (stdout, stderr, etc.). When a job is submitted by DAGMan, there is *very* little difference between that and just submitting the job by hand. So the real question is exactly what is causing the jobs to fail -- once you narrow that down, you can attack the problem.
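If the node submit files aren't already capturing that information, something along these lines will give you per-job output to inspect (names here are illustrative only):

    # nodeA.submit (hypothetical submit description file)
    universe   = vanilla
    executable = my_program
    # The user log is what DAGMan watches; stdout/stderr are often
    # the quickest way to see why a particular job failed
    log        = nodeA.log
    output     = nodeA.$(Cluster).$(Process).out
    error      = nodeA.$(Cluster).$(Process).err
    queue

With separate output/error files per process, a failing node leaves behind something concrete to look at instead of just vanishing from the queue.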

Kent Wenger
Condor Team