[Condor-users] DAGMAN memory
I have been running large DAGMAN job collections comprising 500 to 1500 individual jobs running concurrently. On the initial runs I noticed that several of these jobs were failing. I have managed to work around the issue by restricting the maximum number of concurrent jobs to 80 and setting the maximum number of retries to 3. My understanding is that the failures are a result of DAGMAN memory limitations; can anyone confirm this? Is this a limitation on the central manager or elsewhere? Is there any way to resolve the issue other than restricting the maximum number of concurrent jobs?
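For reference, here is a rough sketch of the workaround described above, in case the details matter. It is a small Python helper that writes a flat DAG input file with a RETRY line of 3 for every node. The node names (job0, job1, ...), the submit file name job.sub, and the function name build_dag are placeholders for illustration, not the actual names from my setup.

```python
def build_dag(num_jobs, retries=3, submit_file="job.sub"):
    """Return the text of a flat DAG file with one JOB and one RETRY line per node.

    The node names and the submit file name are hypothetical placeholders.
    """
    lines = []
    for i in range(num_jobs):
        name = f"job{i}"
        # JOB <name> <submit description file> declares a node in the DAG.
        lines.append(f"JOB {name} {submit_file}")
        # RETRY <name> <n> asks DAGMan to re-run a failed node up to n times.
        lines.append(f"RETRY {name} {retries}")
    return "\n".join(lines) + "\n"
```

The returned text would be written to a file such as jobs.dag (again a placeholder name) and submitted with something like `condor_submit_dag -maxjobs 80 jobs.dag`, where -maxjobs caps the number of jobs DAGMAN keeps submitted at one time.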