
[HTCondor-users] Tracking DAGMan jobs

I wish to submit DAGs and track them in a database. When each DAG completes, I want the database to update, and record the success/fail status. I'm sure I can't be the first person to want to do this :-)

However I'm having trouble working out the best way to interact with Condor.

1. I could add a FINAL node to the DAG itself - ideally a NOOP job with a SCRIPT PRE. The script can be passed the $DAG_STATUS and $FAILED_COUNT macros as arguments.

However, when I go to update the database, I'll want to know the clusterID of the dagman process itself (to find the corresponding row for the job submission). This won't be known until condor_submit_dag is run, so I can't hardcode it in the FINAL node unless I allocate my own independent IDs. Is there a way to get this?

$JOBID doesn't seem to help - it's the ID of an individual DAG node, not the dagman job itself. Indeed, dagman rejects it:

12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE script argument!
12/30/13 11:23:36 ERROR: Warning is fatal error because of DAGMAN_USE_STRICT setting

Similarly, $CLUSTER and $(CLUSTER) are also rejected.

Now, I've done a bit of experimentation:

$ cat testfinal.dag
FINAL final_node /dev/null NOOP
SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT

$ cat do_final.sh
exec >>/tmp/do_final.out
echo "Args: $@"

$ condor_submit_dag testfinal.dag

With this, I find that the dagman cluster ID is available to the script in the environment variable "CONDOR_ID" (without a leading underscore). This seems to be completely undocumented; the manual only talks about the CONDOR_IDS setting, which is unrelated.

Looking at the source, this behaviour happens *only* for scheduler universe jobs:

# src/condor_schedd.V6/schedd.cpp
Scheduler::start_sched_universe_job(PROC_ID* job_id)
        // stick a CONDOR_ID environment variable in job's environment
        char condor_id_string[PROC_ID_STR_BUFLEN];

Furthermore, I don't see any code which makes use of this value. How safe is it to rely on this? If it's used by some well-known external application (e.g. Pegasus) then it could be dependable.

Quick look at Pegasus source: yes, I think that's what it's there for.

$ grep -R CONDOR_ID .
./bin/pegasus-dagman: arguments.insert(0, "condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
./bin/pegasus-dagman: dagman_bin=os.path.join(os.getcwd(),"condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
./test/exitcode/largecode.out:    <env key="CONDOR_ID">511497.0</env>
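
To make option 1 concrete, here is a rough sketch of what the FINAL node's PRE script might look like if it were written in Python instead of shell. It relies on the undocumented CONDOR_ID variable described above, and assumes the value has the cluster.proc form seen in the Pegasus test output (511497.0). The SQLite database, the DAG_TRACK_DB variable, and the dag_runs table with its columns are all hypothetical stand-ins for whatever tracking database is really in use.

```python
#!/usr/bin/env python3
"""Hypothetical replacement for do_final.sh: record the DAG outcome in SQLite."""
import os
import sqlite3
import sys


def record_dag_result(db_path, condor_id, dag_status, failed_count):
    """Update the row for this DAGMan job; returns the number of rows changed."""
    # Assumption: CONDOR_ID looks like "511497.0" (cluster.proc); keep the cluster part.
    cluster_id = int(condor_id.split(".")[0])
    # $DAG_STATUS is 0 when the DAG succeeded; anything else is recorded as a failure.
    outcome = "success" if int(dag_status) == 0 else "fail"
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                # dag_runs and its columns are hypothetical names.
                "UPDATE dag_runs SET outcome = ?, failed_count = ? "
                "WHERE cluster_id = ?",
                (outcome, int(failed_count), cluster_id),
            )
        return cur.rowcount
    finally:
        conn.close()


def main(argv, environ):
    # Arguments as in: SCRIPT PRE final_node <script> $DAG_STATUS $FAILED_COUNT
    dag_status, failed_count = argv[1], argv[2]
    condor_id = environ["CONDOR_ID"]
    # DAG_TRACK_DB is a hypothetical variable naming the database file.
    db_path = environ.get("DAG_TRACK_DB", "/tmp/dag_tracking.sqlite")
    changed = record_dag_result(db_path, condor_id, dag_status, failed_count)
    # Exit nonzero if no submission row matched, so the PRE script shows as failed.
    return 0 if changed == 1 else 1


# Only act when invoked with exactly the two DAGMan arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv, os.environ))
```

This presumes the row for the submission was inserted, keyed by cluster ID, right after condor_submit_dag returned.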

2. I could use -append or -insert_sub_file to modify the dagman submission file. This will have $(cluster) available, and I could try to use +PostCmd. But the documentation says this is only for vanilla universe jobs (and the PostCmd runs on the execute machine for the job), whereas dagman is a scheduler universe job, and runs on the scheduler host.


Also, I can't see any way to get at the DAGman exit code in a macro which could be passed to PostArgs.

3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the *.dagman.out file, and poll it periodically. Or I can poll the queue and look for the dagman clusterID, wait for it to vanish from the queue, then check the file. Both of these seem pretty messy to me.
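
For comparison, the queue-polling variant of option 3 might be sketched as below. It assumes condor_q is on the PATH and that the installed version supports the -af (autoformat) option, printing nothing once the cluster has left the queue. The function names and the 60-second interval are made up for illustration. It only detects that the dagman job has gone; the success/fail status would still have to come from one of the files mentioned above.

```python
import subprocess
import time


def cluster_in_queue(cluster_id):
    """Return True if condor_q still lists any job in this cluster."""
    # Assumption: "condor_q <cluster> -af ClusterId" prints one line per
    # matching job and nothing at all when the cluster is no longer queued.
    out = subprocess.run(
        ["condor_q", str(cluster_id), "-af", "ClusterId"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip() != ""


def wait_until_gone(cluster_id, in_queue=cluster_in_queue,
                    interval=60, sleep=time.sleep):
    """Block until the cluster leaves the queue; return the number of polls."""
    polls = 1
    while in_queue(cluster_id):
        sleep(interval)
        polls += 1
    return polls
```

The in_queue and sleep parameters are there only so the loop can be exercised without a running schedd.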

4. I could send out an E-mail on completion to a special address which triggers a handler script which parses the mail. I really really don't want to do this.

Anybody else done something like this?

It's really only DAGs I'm worried about for now, although I suppose it would be good to be able to track one-off jobs in the same way. They could always be wrapped in a DAG.