
Re: [Condor-users] Dagman only one queue command?



Steve,

This is a documented constraint of DAGMan (see [1]). DAGMan cannot be used to manage Condor submit files which queue more than one job at a time.

This is necessary since DAGMan needs to be able to resubmit any individual job in the event of a failure -- which it cannot do if the submit file can only submit them in groups.

The workaround is to define a distinct node for each job. (And if your jobs are identical, you can reference the same submit file for each one -- you don't need to define N identical submit files.)
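As a rough illustration of that workaround, here is a small Python sketch that writes out a DAG input file with one node per job, every node pointing at the same submit file. The node names (XMLGEN_0, XMLGEN_1, ...) and the helper build_dag are made up for the example; condor_submit.cmd is the submit file name from the log below, and it is assumed to have been edited down to a single queue command. How each node picks up its own input / output pair is not covered here.

```python
from typing import Sequence


def build_dag(submit_file: str, node_names: Sequence[str]) -> str:
    """Return DAG file text with one JOB line per node.

    Every node references the same submit file, which is assumed
    to contain exactly one queue command (one job per node).
    """
    lines = [f"JOB {name} {submit_file}" for name in node_names]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names; one node per job that used to be queued.
    names = [f"XMLGEN_{i}" for i in range(3)]
    print(build_dag("condor_submit.cmd", names), end="")
```

Since each node then maps to a single job, DAGMan can resubmit any one of them individually if it fails.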

-Peter

[1] http://www.cs.wisc.edu/condor/manual/v6.7/2_12DAGMan_Applications.html#4590


On Sep 21, 2005, at 2:16 PM, Steve Gertz wrote:

Hello all,

I was working with Dagman to schedule out work and I'm running into a little
oddity. It seems that the submit file Dagman is submitting can have only
one queue statement; very inconvenient when submitting a job to run with many
input / output pairs.


Has anyone run into this, or am I just doing something wrong? Is this fixed in
the 6.7 branch?


Thanks in advance,

Steve

Here is the dagman.out file:

9/21 12:13:00 ******************************************************
9/21 12:13:00 ** condor_scheduniv_exec.56.0 (CONDOR_DAGMAN) STARTING UP
9/21 12:13:00 ** /data/condor/home/spool/cluster56.ickpt.subproc0
9/21 12:13:00 ** $CondorVersion: 6.6.10 Jun 13 2005 $
9/21 12:13:00 ** $CondorPlatform: I386-LINUX_RH80 $
9/21 12:13:00 ** PID = 11826
9/21 12:13:00 ******************************************************
9/21 12:13:00 Using config file: /data/condor/etc/condor_config
9/21 12:13:00 Using local config files: /data/condor/home/condor_config.local
9/21 12:13:00 DaemonCore: Command Socket at <192.168.34.187:41471>
9/21 12:13:00 argv[0] == "condor_scheduniv_exec.56.0"
9/21 12:13:00 argv[1] == "-Debug"
9/21 12:13:00 argv[2] == "3"
9/21 12:13:00 argv[3] == "-Lockfile"
9/21 12:13:00 argv[4] == "run_client.dag.lock"
9/21 12:13:00 argv[5] == "-Dag"
9/21 12:13:00 argv[6] == "run_client.dag"
9/21 12:13:00 argv[7] == "-Rescue"
9/21 12:13:00 argv[8] == "run_client.dag.rescue"
9/21 12:13:00 argv[9] == "-Condorlog"
9/21 12:13:00 argv[10] == "run_client.dag.dummy_log"
9/21 12:13:00 DAG Lockfile will be written to run_client.dag.lock
9/21 12:13:00 DAG Input file is run_client.dag
9/21 12:13:00 Rescue DAG will be written to run_client.dag.rescue
9/21 12:13:00 All DAG node user log files:
9/21 12:13:00 /data/smx_customers_run_1127168781/condor.log
9/21 12:13:00 /data/smx_customers_run_1127168781/"transfer.log"
9/21 12:13:00 Parsing run_client.dag ...
9/21 12:13:00 Dag contains 2 total jobs
9/21 12:13:00 Deleting any older versions of log files...
9/21 12:13:00 ReadMultipleUserLogs: deleting older version of /data/smx_customers_run_1127168781/condor.log
9/21 12:13:00 Bootstrapping...
9/21 12:13:00 Number of pre-completed jobs: 0
9/21 12:13:00 Registering condor_event_timer...
9/21 12:13:01 Submitting Condor Job XMLGEN ...
9/21 12:13:01 submitting: condor_submit -a 'dag_node_name = XMLGEN' -a '+DAGManJobID = 56.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' condor_submit.cmd 2>&1
9/21 12:13:02 ERROR: condor_submit failed:
10 job(s) submitted to cluster 57.
9/21 12:13:02 condor_submit try 1/6 failed, will try again in 1 second
9/21 12:13:04 ERROR: condor_submit failed:
10 job(s) submitted to cluster 58.
9/21 12:13:04 condor_submit try 2/6 failed, will try again in 2 seconds
9/21 12:13:07 ERROR: condor_submit failed:
10 job(s) submitted to cluster 59.
9/21 12:13:07 condor_submit try 3/6 failed, will try again in 4 seconds
9/21 12:13:11 ERROR: condor_submit failed:
10 job(s) submitted to cluster 60.
9/21 12:13:11 condor_submit try 4/6 failed, will try again in 8 seconds


