
Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed



Hi Michael,

On Tuesday, June 14, 2011 at 10:09 AM, Michael O'Donnell wrote:
> I am running a DAG that has taken approximately 6 days. All the submit
> jobs completed last night, but the condor_dagman.exe is not exiting and it
> is in idle.

That's definitely not right. Your condor_dagman.exe job should stay running until the entire DAG is complete. If it gets restarted, DAGMan goes into recovery mode and tries to resync with the state of your jobs. Based on the dagman.out information you posted, that appears to be what it's doing.

Typically you'd see additional messages in dagman.out like so:

1/29 17:19:53 Bootstrapping...
1/29 17:19:53 Number of pre-completed nodes: 0
1/29 17:19:53 Running in RECOVERY mode...
1/29 17:19:53 Event: ULOG_SUBMIT for Condor Node Setup (20295.0)
1/29 17:19:53 Number of idle job procs: 1
1/29 17:19:53 Event: ULOG_EXECUTE for Condor Node Setup (20295.0)
1/29 17:19:53 Number of idle job procs: 0
1/29 17:19:53 Event: ULOG_JOB_TERMINATED for Condor Node Setup (20295.0)
1/29 17:19:53 Node Setup job proc (20295.0) completed successfully.
1/29 17:19:53 Node Setup job completed
1/29 17:19:53 Number of idle job procs: 0
1/29 17:19:53     ------------------------------
1/29 17:19:53        Condor Recovery Complete
1/29 17:19:53     ------------------------------

Are you not getting anything in the dagman.out file after the "Running in RECOVERY mode..." message?
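
If the dagman.out file is large, a small script can answer that question faster than scrolling. This is only a rough sketch: it assumes the two marker strings appear exactly as in the excerpt above ("Running in RECOVERY mode" and "Recovery Complete"), and the function name recovery_status is made up for illustration.

```python
RECOVERY_START = "Running in RECOVERY mode"  # marker text taken from the excerpt above
RECOVERY_DONE = "Recovery Complete"          # marker text taken from the excerpt above


def recovery_status(lines):
    """Return (started, completed, lines_after_start) for dagman.out lines.

    Only the most recent recovery pass is reported, because every restart
    of DAGMan begins a fresh pass.
    """
    started = False
    completed = False
    after = 0
    for line in lines:
        if RECOVERY_START in line:
            # A new recovery pass begins here; forget any earlier one.
            started, completed, after = True, False, 0
        elif started and line.strip():
            # Count non-blank lines logged after the recovery banner.
            after += 1
            if RECOVERY_DONE in line:
                completed = True
    return started, completed, after
```

Passing it an open file works, for example recovery_status(open("my.dag.dagman.out")), where the file name is a placeholder. A result of (True, False, 0) would mean nothing at all was logged after recovery started.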

It looks like your DAG manager might be having trouble syncing up the state of the jobs. If you look at the process tree, is condor_dagman using any CPU at this point? Can you view the file handles the process has open (using Process Explorer, perhaps)? It should be trying to read the job logs for the submissions to sync up -- are they showing up in the open file handles list for the dagman process? Have those files disappeared from disk? Are there error messages in the *.dagman.log file for the run?

You could try running condor_rm against the condor_dagman job -- this should trigger it to write out a rescue DAG if it thinks there's still work to be done. That might shed some light on which part of the DAG the manager thinks hasn't completed.

You can then re-submit the original *.dag file, and DAGMan should enter rescue mode and resync.
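
To check whether a rescue DAG was actually written after the condor_rm, something like the sketch below could help. It assumes the rescue file lands next to the DAG file with a name that starts with the DAG file name followed by ".rescue" (for example my.dag.rescue001). That naming is my assumption about your Condor version, so adjust the pattern if yours differs. The function name find_rescue_dags is hypothetical.

```python
import glob


def find_rescue_dags(dag_path):
    """Return rescue files that sit next to dag_path, sorted by name.

    Assumes rescue files are named "<dag file>.rescue<suffix>"; the suffix
    may be empty or a number depending on the Condor version.
    """
    # glob.escape protects any glob metacharacters in the DAG path itself.
    pattern = glob.escape(dag_path) + ".rescue*"
    return sorted(glob.glob(pattern))
```

An empty list would suggest DAGMan either wrote no rescue DAG or put it somewhere else.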

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools