Re: [HTCondor-users] Tracking DAGMan jobs
- Date: Mon, 30 Dec 2013 05:17:38 -0800 (PST)
- From: nathan.panike@xxxxxxxxx
- Subject: Re: [HTCondor-users] Tracking DAGMan jobs
Wrap it in a nested DAG and this should be pretty easy: the top-level DAG will handle all the messy details.
SUBDAG EXTERNAL mydag the-original-dag.dag
SCRIPT POST mydag post.script
SCRIPT PRE mydag pre.script
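The POST script on the SUBDAG node is where the database update can go: if you pass it $JOBID and $RETURN on the SCRIPT POST line, it gets the sub-DAG's condor_dagman cluster.proc and the sub-DAG's exit status. A minimal sketch of post.script along those lines, with a flat-file append standing in for the real database call:

```shell
#!/bin/sh
# post.script (sketch): record the sub-DAG's outcome.
# Assumes the DAG line reads:  SCRIPT POST mydag post.script $JOBID $RETURN
# $1 = cluster.proc of the sub-DAG's condor_dagman job ($JOBID)
# $2 = exit status of the sub-DAG ($RETURN; 0 means success)
dagman_id="${1:-unknown}"
status="${2:-unknown}"
# Stand-in for the real database update:
echo "dag=$dagman_id status=$status" >> "${DAGTRACK_LOG:-/tmp/dagtrack.log}"
```

Note that a POST script exiting non-zero marks the node failed, so keep the recording step best-effort if you don't want a database hiccup to fail the DAG.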
> From: Brian Candler <b.candler-e+AXbWqSrlAAvxtiuMwx3w@xxxxxxxxxxxxxxxx>
> Subject: Tracking DAGMan jobs
> Date: Mon, 30 Dec 2013 12:16:47 +0000
> I wish to submit DAGs and track them in a database. When each DAG
> completes, I want the database to update, and record the success/fail
> status. I'm sure I can't be the first person to want to do this :-)
> However I'm having trouble working out the best way to interact with Condor.
> 1. I could add a FINAL node to the DAG itself - ideally a NOOP job with
> a SCRIPT PRE. This gets $DAG_STATUS and $FAILED_COUNT parameters.
> However, when I go to update the database, I'll want to know the
> clusterID of the dagman process itself (to find the corresponding row
> for the job submission). This won't be known until condor_submit_dag is
> run, so I can't hardcode it in the FINAL node unless I allocate my own
> independent IDs. Is there a way to get this?
> $JOBID doesn't seem to help - it's the ID of an individual DAG node, not
> the dagman job itself. Indeed, dagman rejects it:
> 12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE
> script argument!
> 12/30/13 11:23:36 ERROR: Warning is fatal error because of
> DAGMAN_USE_STRICT setting
> Similarly, $CLUSTER and $(CLUSTER) are also rejected.
> Now, I've done a bit of experimentation:
> $ cat testfinal.dag
> FINAL final_node /dev/null NOOP
> SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
> $ cat do_final.sh
> exec >>/tmp/do_final.out
> echo "Args: $@"
> $ condor_submit_dag testfinal.dag
> With this, I find the dagman cluster ID is in environment variable
> "CONDOR_ID" (without a leading underscore). This seems to be completely
> undocumented; the manual only talks about the CONDOR_IDS setting, which
> is unrelated.
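For what it's worth, the FINAL-node PRE script can read that environment variable itself, so nothing extra needs to appear on the SCRIPT line. A sketch of do_final.sh along those lines (the echo stands in for the database update, and the reliance on CONDOR_ID is, as noted, undocumented behaviour):

```shell
#!/bin/sh
# do_final.sh (sketch): FINAL-node PRE script, invoked as
#   SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
# Under the scheduler universe the schedd exports CONDOR_ID, which
# holds the dagman job's own cluster.proc (undocumented behaviour).
dag_status="${1:-?}"
failed_count="${2:-?}"
dagman_id="${CONDOR_ID:-unknown}"
# Stand-in for the real database update:
echo "dagman=$dagman_id status=$dag_status failed=$failed_count"
```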
> Looking at the source, this behaviour happens *only* for scheduler
> universe jobs:
> # src/condor_schedd.V6/schedd.cpp
> Scheduler::start_sched_universe_job(PROC_ID* job_id)
> // stick a CONDOR_ID environment variable in job's environment
> char condor_id_string[PROC_ID_STR_BUFLEN];
> Furthermore, I don't see any code which makes use of this value. How
> safe is it to rely on this? If it's used by some well-known external
> application (e.g. Pegasus) then it could be dependable.
> Quick look at Pegasus source: yes, I think that's what it's there for.
> $ grep -R CONDOR_ID .
> ../bin/pegasus-dagman: arguments.insert(0,
> ../test/exitcode/largecode.out: <env key="CONDOR_ID">511497.0</env>
> 2. I could use -append or -insert_sub_file to modify the dagman
> submission file. This will have $(cluster) available, and I could try to
> use +PostCmd. But the documentation says this is only for vanilla
> universe jobs (and the PostCmd runs on the execute machine for the job),
> whereas dagman is a scheduler universe job, and runs on the scheduler host.
> Also, I can't see any way to get at the DAGMan exit code in a macro
> which could be passed to PostArgs.
> 3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the
> *.dagman.out file, and poll it periodically. Or I can poll the queue and
> look for the dagman clusterID, wait for it to vanish from the queue,
> then check the file. Both of these seem pretty messy to me.
> 4. I could send out an E-mail on completion to a special address which
> triggers a handler script which parses the mail. I really really don't
> want to do this.
> Anybody else done something like this?
> It's really only DAGs I'm worried about for now, although I suppose it
> would be good to be able to track one-off jobs in the same way. They
> could always be wrapped in a DAG.
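On that last point: wrapping a one-off job so the same nested-DAG trick applies is just a one-node DAG. A sketch (the file names are placeholders):

```
# wrapper.dag (sketch) -- a one-node DAG around an ordinary submit file,
# so the nested-DAG/POST-script approach covers one-off jobs too
JOB onejob onejob.sub
SCRIPT POST onejob post.script
```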