[HTCondor-users] Tracking DAGMan jobs
- Date: Mon, 30 Dec 2013 12:16:47 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] Tracking DAGMan jobs
I wish to submit DAGs and track them in a database. When each DAG
completes, I want the database to update, and record the success/fail
status. I'm sure I can't be the first person to want to do this :-)
However, I'm having trouble working out the best way to interact with Condor.
1. I could add a FINAL node to the DAG itself - ideally a NOOP job with
a SCRIPT PRE. This gets $DAG_STATUS and $FAILED_COUNT parameters.
However, when I go to update the database, I'll want to know the
clusterID of the dagman process itself (to find the corresponding row
for the job submission). This won't be known until condor_submit_dag is
run, so I can't hardcode it in the FINAL node unless I allocate my own
independent IDs. Is there a way to get this?
$JOBID doesn't seem to help - it's the ID of an individual DAG node, not
the dagman job itself. Indeed, dagman rejects it:
12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE
12/30/13 11:23:36 ERROR: Warning is fatal error because of
Similarly, $CLUSTER and $(CLUSTER) are also rejected.
Now, I've done a bit of experimentation:
$ cat testfinal.dag
FINAL final_node /dev/null NOOP
SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
$ cat do_final.sh
#!/bin/sh
echo "Args: $@"
$ condor_submit_dag testfinal.dag
With this, I find the dagman cluster ID is in environment variable
"CONDOR_ID" (without a leading underscore). This seems to be completely
undocumented; the manual only talks about the CONDOR_IDS configuration
setting, which is a different thing entirely (the UID/GID the daemons run as).
Looking at the source, this behaviour happens *only* for scheduler
universe jobs:
// stick a CONDOR_ID environment variable in job's environment
Furthermore, I don't see any code which makes use of this value. How
safe is it to rely on this? If it's used by some well-known external
application (e.g. Pegasus) then it could be dependable.
Quick look at Pegasus source: yes, I think that's what it's there for.
$ grep -R CONDOR_ID .
./test/exitcode/largecode.out: <env key="CONDOR_ID">511497.0</env>
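Assuming CONDOR_ID really is dependable, the core of do_final.sh could
look something like this (untested sketch; the DAG_TRACKER_LOG override
and the spool path are names I've made up, and a separate loader would
push the records into the database):

```shell
#!/bin/sh
# Core of a FINAL-node PRE script. Relies on the undocumented CONDOR_ID
# variable, which the schedd puts into the environment of scheduler
# universe jobs as "<cluster>.<proc>" -- i.e. the dagman job's own ID.
# DAG_TRACKER_LOG is a made-up override so the spool path isn't hardcoded.

record_dag_result() {
    # $1 = $DAG_STATUS, $2 = $FAILED_COUNT (from the SCRIPT PRE line)
    cluster="${CONDOR_ID%%.*}"   # strip ".<proc>" to get the cluster ID
    log="${DAG_TRACKER_LOG:-/var/spool/dag-tracker/results.log}"
    printf '%s %s %s\n' "$cluster" "$1" "$2" >> "$log"
}

# The real script would end with:  record_dag_result "$@"
```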
2. I could use -append or -insert_sub_file to modify the dagman
submission file. This will have $(cluster) available, and I could try to
use +PostCmd. But the documentation says this is only for vanilla
universe jobs (and the PostCmd runs on the execute machine for the job),
whereas dagman is a scheduler universe job, and runs on the scheduler host.
Also, I can't see any way to get at the DAGman exit code in a macro
which could be passed to PostArgs.
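One workaround along these lines: since I can't learn the cluster ID
before submission, I could go the other way and inject my own tracking
key into the dagman job's ad with -append, then match on it afterwards
(untested sketch; MyTrackingId is an attribute name I invented):

```
$ condor_submit_dag -append '+MyTrackingId = "db-row-42"' testfinal.dag
$ condor_q -constraint 'MyTrackingId == "db-row-42"' -format '%d\n' ClusterId
```

That still leaves the completion-notification problem open, though.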
3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the
*.dagman.out file, and poll it periodically. Or I can poll the queue and
look for the dagman clusterID, wait for it to vanish from the queue,
then check the file. Both of these seem pretty messy to me.
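Of the messy options, waiting on the dagman job's user log is probably
the least bad, since condor_wait blocks until the job leaves the queue
rather than requiring a condor_q polling loop. Something like this
(untested sketch; dagman exits 0 on success and nonzero on failure):

```
$ cluster=$(condor_submit_dag testfinal.dag | \
      sed -n 's/.*submitted to cluster \([0-9]*\).*/\1/p')
$ condor_wait testfinal.dag.dagman.log "$cluster"
$ condor_history -format '%d\n' ExitCode "$cluster"
```

But that means keeping one waiting process alive per DAG, which doesn't
scale well either.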
4. I could send out an E-mail on completion to a special address which
triggers a handler script which parses the mail. I really really don't
want to do this.
Anybody else done something like this?
It's really only DAGs I'm worried about for now, although I suppose it
would be good to be able to track one-off jobs in the same way. They
could always be wrapped in a DAG.