
[Condor-users] Dagman with jobs queued >1 fail



Hello,

I have a problem with DAGMan when a node's submit file queues more than one job (queue >1). DAGMan then reports a failure, even though the jobs are actually submitted.
A DAG whose submit file uses queue 1 works fine, without any problems.


My DAG file:
job main /opt/blast_workspace/admin/results/pov_thomas/thomas44#2004Nov19-1330/pov.sub
script post main /opt/condor_jobs/pov_thomas/post


All I want is to run a script on the submit/master machine once pov.sub is done. That is the only reason I use a DAG.

Log file from the DAG with pov.sub (queue 1), which works fine:
11/18 16:10:42 Submitting Condor Job main ...
11/18 16:10:42 submitting: condor_submit -a 'dag_node_name = main' -a '+DAGManJobID = 660.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' /opt/blast_workspace/admin/results/pov_thomas/test_der_architekten#2004Nov18-1610/pov.sub 2>&1
11/18 16:10:42 assigned Condor ID (661.0.0)
11/18 16:10:42 Just submitted 1 job this cycle...
11/18 16:10:42 Event: ULOG_SUBMIT for Condor Job main (661.0.0)
11/18 16:10:42 Of 1 nodes total:
11/18 16:10:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/18 16:10:42   ===     ===      ===     ===     ===        ===      ===
11/18 16:10:42     0       0        1       0       0          1        0
11/18 16:11:07 Event: ULOG_EXECUTE for Condor Job main (661.0.0)
11/18 16:11:17 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 16:12:27 Event: ULOG_JOB_EVICTED for Condor Job main (661.0.0)
11/18 16:12:27 Event: ULOG_JOB_HELD for Condor Job main (661.0.0)
11/18 16:12:37 Event: ULOG_JOB_RELEASED for Condor Job main (661.0.0)
11/18 16:12:42 Event: ULOG_EXECUTE for Condor Job main (661.0.0)
11/18 16:12:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 16:32:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 16:52:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 17:12:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 17:32:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 17:52:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 18:12:52 Event: ULOG_IMAGE_SIZE for Condor Job main (661.0.0)
11/18 18:19:47 Event: ULOG_JOB_TERMINATED for Condor Job main (661.0.0)
11/18 18:19:47 Job main completed successfully.


And now a log file from the DAG with pov.sub (queue 2), which fails:
11/19 13:30:33 Submitting Condor Job main ...
11/19 13:30:33 submitting: condor_submit -a 'dag_node_name = main' -a '+DAGManJobID = 670.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' /opt/blast_workspace/admin/results/pov_thomas/thomas44#2004Nov19-1330/pov.sub 2>&1
11/19 13:30:33 ERROR: condor_submit failed:
2 job(s) submitted to cluster 671.
11/19 13:30:33 condor_submit failed after 1 try.
11/19 13:30:33 submit command was: condor_submit -a 'dag_node_name = main' -a '+DAGManJobID = 670.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' /opt/blast_workspace/admin/results/pov_thomas/thomas44#2004Nov19-1330/pov.sub 2>&1
11/19 13:30:33 Event: ULOG_SUBMIT for Condor Job main (671.0.0)
11/19 13:30:33 Unrecognized submit event (for job "main") found in log (none expected)
11/19 13:30:33 Event: ULOG_SUBMIT for Condor Job main (671.1.0)
11/19 13:30:33 Unrecognized submit event (for job "main") found in log (none expected)
11/19 13:30:33 Of 1 nodes total:
11/19 13:30:33  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/19 13:30:33   ===     ===      ===     ===     ===        ===      ===
11/19 13:30:33     0       0        0       0       0          0        1
11/19 13:30:33 ERROR: the following job(s) failed:



As you can see, DAGMan reports that condor_submit failed, yet the 2 jobs are submitted (and complete) successfully. Because of the reported failure the whole DAG fails and my post script doesn't start.
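
To make the contradiction easier to see, here is a small sketch that scans a DAGMan log excerpt for the "condor_submit failed" message, the condor_submit summary line, and the ULOG_SUBMIT events. The helper name `summarize` and the regular expressions are my own assumptions based only on the log lines quoted above; they are not part of Condor, and the line formats may differ in other versions.

```python
import re

# Hypothetical helper, not part of Condor. The patterns below are guessed
# from the log lines quoted in this message and may not fit other versions.
SUBMITTED_RE = re.compile(r"^(\d+) job\(s\) submitted to cluster (\d+)\.")
EVENT_RE = re.compile(
    r"Event: ULOG_SUBMIT for Condor Job (\S+) \((\d+)\.(\d+)\.(\d+)\)"
)


def summarize(log_text):
    """Collect what the log says about one submit attempt."""
    submit_failed = False   # did DAGMan print "condor_submit failed"?
    reported = None         # (job count, cluster) from condor_submit output
    submit_events = []      # (node name, cluster, proc) per ULOG_SUBMIT event
    for line in log_text.splitlines():
        line = line.strip()
        if "condor_submit failed" in line:
            submit_failed = True
        match = SUBMITTED_RE.search(line)
        if match:
            reported = (int(match.group(1)), int(match.group(2)))
        match = EVENT_RE.search(line)
        if match:
            submit_events.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return {
        "submit_failed": submit_failed,
        "reported": reported,
        "submit_events": submit_events,
    }


# Lines copied from the failing (queue 2) log above.
EXCERPT = """\
11/19 13:30:33 ERROR: condor_submit failed:
2 job(s) submitted to cluster 671.
11/19 13:30:33 condor_submit failed after 1 try.
11/19 13:30:33 Event: ULOG_SUBMIT for Condor Job main (671.0.0)
11/19 13:30:33 Event: ULOG_SUBMIT for Condor Job main (671.1.0)
"""

if __name__ == "__main__":
    print(summarize(EXCERPT))
```

For the failing log this reports a failed submit together with a summary line for 2 jobs in cluster 671 and two submit events for node main, which is exactly the mismatch I am asking about.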


Does anyone have an idea?

Thanks/greetz
Thomas