
[Condor-users] dagman aborts without creating a rescue dag



Hi,

I was running a DAG from my submit machine (Red Hat Enterprise Linux AS release 3, Condor version 6.7.8), with all jobs to be executed on a remote machine (Fedora Core release 3 (Heidelberg), Condor version 6.7.8). Almost the entire DAG completed, but then DAGMan aborted. Here are the last few lines of the dagman.out file:

8/4 22:04:14 Job submit try 5/6 failed, will try again in >= 16 seconds.
8/4 22:04:32 Submitting Condor Job new_rc_tx_lalapps_inca_ID000261_0 ...
8/4 22:04:32 submitting: condor_submit -a 'dag_node_name = new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGManJobID = 1897' -a 'submit_event_notes = DAG Node: new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGParentNodeNames = "lalapps_inca_ID000261"' new_rc_tx_lalapps_inca_ID000261_0.sub 2>&1
8/4 22:04:32 failed while reading from pipe.
8/4 22:04:32 Read so far: Submitting job(s)ERROR: can't determine proxy filenamex509 user proxy is required for globus, gt2, gt3, gt4 or nordugrid jobs
8/4 22:04:32 condor_submit try failed
8/4 22:04:32 submit command was: condor_submit -a 'dag_node_name = new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGManJobID = 1897' -a 'submit_event_notes = DAG Node: new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGParentNodeNames = "lalapps_inca_ID000261"' new_rc_tx_lalapps_inca_ID000261_0.sub 2>&1
8/4 22:04:32 Job submit failed after 6 tries.
8/4 22:04:32 Running POST script of Job new_rc_tx_lalapps_inca_ID000261_0...
8/4 22:04:32 Of 1024 nodes total:
8/4 22:04:32      Done     Pre   Queued    Post   Ready   Un-Ready   Failed
8/4 22:04:32       ===     ===      ===     ===     ===        ===      ===
8/4 22:04:32      1002       0        3       1       0         18        0
8/4 22:04:37 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job lalapps_inspiral_ID000224 (-1.-1)
8/4 22:04:37 ERROR "Assertion ERROR on (job->GetStatus() == Job::STATUS_POSTRUN || recovery)" at line 772 in file dag.C
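
As a sanity check on the node status summary in the log above, here is a minimal sketch that parses the table and confirms that the per-state counts add up to the reported total. The function name `parse_node_summary` is hypothetical and not part of Condor; the sketch assumes the summary always consists of the "Of N nodes total:" line followed by a header row, a separator row, and a row of counts, as in this excerpt.

```python
import re

# Sample lines copied from the dagman.out excerpt above.
LOG = """\
8/4 22:04:32 Of 1024 nodes total:
8/4 22:04:32 Done Pre Queued Post Ready Un-Ready Failed
8/4 22:04:32 === === === === === === ===
8/4 22:04:32 1002 0 3 1 0 18 0
"""


def parse_node_summary(text):
    """Return (total, {state: count}) for the first status summary in text.

    Hypothetical helper: assumes the header row and the counts row sit
    one and three lines after the "Of N nodes total:" line.
    """
    # Strip the leading "M/D HH:MM:SS " timestamp from every line.
    lines = [re.sub(r"^\d+/\d+ \d+:\d+:\d+\s+", "", ln)
             for ln in text.splitlines()]
    for i, ln in enumerate(lines):
        m = re.match(r"Of (\d+) nodes total:", ln)
        if m:
            names = lines[i + 1].split()                     # column names
            counts = [int(x) for x in lines[i + 3].split()]  # column values
            return int(m.group(1)), dict(zip(names, counts))
    raise ValueError("no node status summary found")


total, counts = parse_node_summary(LOG)
print(total, counts)
print("consistent:", sum(counts.values()) == total)
```

For this excerpt the seven columns sum to the reported 1024 nodes, with 18 nodes still Un-Ready and none marked Failed at the time of the assertion.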



The user proxies on both machines were still valid for a long time, yet DAGMan aborted without creating a rescue DAG. Is there possibly a bug in dag.C, or what is going on?


Regards
Alexander Dietz