
Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed

The disk is not full. All files (log, err, out, and the files created by each job) are written to an NTFS SAN. I/O was a problem, so I used maxjobs to throttle the number of concurrently running jobs.

Permissions are not a problem for either Condor or the jobs. The jobs also run via RunAsOwner.

The dprintf_failure.DAGMAN file is empty; it contains no information.


From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 06/15/2011 10:39 AM
Subject: Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
Sent by: condor-users-bounces@xxxxxxxxxxx

On Wed, 15 Jun 2011, Michael O'Donnell wrote:

> Thank you for your comments. It looks like I am going to have to spend a
> lot more time investigating this because it is not evident what has
> happened. Most of the jobs did complete, but something happened to the
> communication between the jobs and the condor_dagman.exe. I do not know
> the communication process yet, but I did not see any errors in the dagman
> log or anything. Basically the dagman went into recovery mode and could
> never exit this recovery loop.  When it went into recovery mode it
> generated this file: dprintf_failure.DAGMAN. If I delete the file it would
> generate it again on the next recovery attempt.
> When I released the condor_dagman job, a recovery file was not generated.
> I then tried to rerun the dag and the following happened:
> dprintf_failure.DAGMAN was generated again
> condor_dagman job went into idle
> no dag jobs were submitted
> condor_dagman.exe would not exit without forcing it

Hmm, something else to check:  is your disk full?  And are file
permissions set to reasonable values?  (DAGMan monitors the node jobs by
reading their user log files.)

Also, what are the contents of the dprintf_failure.DAGMAN file?
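
For anyone following along, the three checks asked about above (free disk space, access to the node job log, and the contents of dprintf_failure.DAGMAN) could be scripted roughly as below. This is only a sketch, not part of Condor: the function name, its parameters, and the 1 MB threshold are made up for illustration, and on Windows os.access() may not reflect NTFS ACLs, so a passing result there is a hint rather than proof.

```python
import os
import shutil


def check_dag_environment(directory, log_path, failure_path,
                          min_free_bytes=1024 * 1024):
    """Run three basic sanity checks and return the results as a dict.

    All names and the default threshold are hypothetical; adjust them
    to match the actual DAG directory and files.
    """
    # Free space on the volume that holds the DAG directory.
    usage = shutil.disk_usage(directory)

    # The failure file counts as "empty" only if it exists with size 0.
    failure_empty = (os.path.exists(failure_path)
                     and os.path.getsize(failure_path) == 0)

    return {
        "free_bytes": usage.free,
        "disk_ok": usage.free >= min_free_bytes,
        # DAGMan reads the node jobs' user log, so it must be readable.
        "log_readable": os.access(log_path, os.R_OK),
        # New files (rescue DAGs, logs) need a writable directory.
        "dir_writable": os.access(directory, os.W_OK),
        "failure_file_empty": failure_empty,
    }
```

Run it as the same account that runs condor_dagman, since the access checks are only meaningful for that user.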

Kent Wenger
Condor Team