
[Condor-users] DAGMAN memory

Hi all,
I have been running large DAGMAN job collections consisting of 500 to 1500 individual jobs running concurrently. On the initial runs I noticed that several of these jobs were failing. I managed to resolve the issue by restricting the maximum number of concurrent jobs to 80 and setting the maximum number of retries to 3. My understanding is that this issue is a result of DAGMAN memory limitations; can anyone confirm this? Is this a limitation on the central manager or elsewhere? Is there any way to resolve it other than restricting the maximum number of concurrent jobs?