
Re: [Condor-users] DAGMAN memory



On Tue, 10 Jun 2008, Aengus McCullough wrote:

I have been running large DAGMan job collections comprising 500-1500 individual jobs running concurrently. On initial runs of the job I noticed that several of these jobs were failing. I have managed to resolve the issue by restricting the maximum number of concurrent jobs to 80 and setting the maximum number of retries to 3. I understand that this issue is a result of DAGMan memory limitations; can anyone confirm this? Is this a limitation on the central manager or elsewhere? Is there any way to resolve this issue aside from restricting the maximum number of concurrent jobs?

Hmm, I'd be really surprised if this problem was a result of memory limitations in DAGMan itself -- other users are successfully running DAGs with several hundred thousand nodes. It could be the result of some other resource limitation, though.
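Incidentally, the throttling and retry behavior you describe can be expressed directly to DAGMan. Here is a minimal sketch -- the file and node names are made up, not taken from your setup:

    # my.dag (hypothetical DAG file)
    JOB A nodeA.submit
    JOB B nodeB.submit
    # Re-run a node up to 3 times if it exits with a nonzero status
    RETRY A 3
    RETRY B 3

    # Keep at most 80 node jobs in the queue at any one time
    condor_submit_dag -maxjobs 80 my.dag

Using -maxjobs (or, if your version supports it, the DAGMAN_MAX_JOBS_SUBMITTED configuration macro) keeps the throttle in DAGMan itself, so you don't have to change the individual node submit files.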

When you say that jobs are failing, by "job" you mean an individual node job in the DAG, right? (As opposed to DAGMan itself crashing.) If that is the case, you need to look at the user log(s) from those jobs, and any other info you may have (stdout, stderr, etc.). When a job is submitted by DAGMan, there is *very* little difference between that and just submitting the job by hand. So the real question is exactly what is causing the jobs to fail -- once you narrow that down, you can attack the problem.
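If the node submit files aren't already capturing that information, something along these lines will give you per-job output to inspect (names here are illustrative only):

    # nodeA.submit (hypothetical submit description file)
    universe   = vanilla
    executable = my_program
    # The user log is what DAGMan watches; stdout/stderr are often
    # the quickest way to see why a particular job failed
    log        = nodeA.log
    output     = nodeA.$(Cluster).$(Process).out
    error      = nodeA.$(Cluster).$(Process).err
    queue

With separate output/error files per process, a failing node leaves behind something concrete to look at instead of just vanishing from the queue.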

Kent Wenger
Condor Team