
Re: [HTCondor-users] Tracking DAGMan jobs

Wrap it in a nested DAG and this should be pretty easy: the top-level DAG will handle all the messy details.

subdag external mydag the-original-dag.dag
script post mydag post.script
script pre mydag pre.script
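
As a rough sketch of what post.script might look like: this assumes the post line above is extended to `script post mydag post.script $JOB $RETURN`, so the script receives the node name and the exit status of the nested DAG ($JOB and $RETURN are DAGMan macros available to POST scripts). RECORD_FILE and the two helper functions are hypothetical stand-ins for the real database update.

```shell
#!/bin/sh
# post.script (sketch). Assumed invocation from the top-level DAG:
#   script post mydag post.script $JOB $RETURN
# RECORD_FILE is a hypothetical placeholder for the real database.

RECORD_FILE="${RECORD_FILE:-/tmp/dag_status.log}"

# Map the nested DAG's exit status to a success/fail label.
status_label() {
    if [ "$1" -eq 0 ]; then
        echo success
    else
        echo fail
    fi
}

# Append one record per finished DAG: node name, exit status, label.
record_result() {
    printf '%s %s %s\n' "$1" "$2" "$(status_label "$2")" >> "$RECORD_FILE"
}

# Only record when both arguments were supplied.
if [ "$#" -ge 2 ]; then
    record_result "$1" "$2"
fi
```

Note that DAGMan takes the POST script's exit status as the status of the node, so a script like this one, which ends with status 0, marks the node as successful in the top-level DAG even when the nested DAG failed. Ending with `exit "$2"` would pass the failure through instead.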

Nathan Panike

> From: Brian Candler <b.candler-e+AXbWqSrlAAvxtiuMwx3w@xxxxxxxxxxxxxxxx>
> Subject: Tracking DAGMan jobs
> Date: Mon, 30 Dec 2013 12:16:47 +0000
> I wish to submit DAGs and track them in a database. When each DAG 
> completes, I want the database to update, and record the success/fail 
> status. I'm sure I can't be the first person to want to do this :-)
> However I'm having trouble working out the best way to interact with Condor.
> 1. I could add a FINAL node to the DAG itself - ideally a NOOP job with 
> a SCRIPT PRE. This gets $DAG_STATUS and $FAILED_COUNT parameters.
> However, when I go to update the database, I'll want to know the 
> clusterID of the dagman process itself (to find the corresponding row 
> for the job submission). This won't be known until condor_submit_dag is 
> run, so I can't hardcode it in the FINAL node unless I allocate my own 
> independent IDs. Is there a way to get this?
> $JOBID doesn't seem to help - it's the ID of an individual DAG node, not 
> the dagman job itself. Indeed, dagman rejects it:
> 12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE script argument!
> 12/30/13 11:23:36 ERROR: Warning is fatal error because of 
> Similarly, $CLUSTER and $(CLUSTER) are also rejected.
> Now, I've done a bit of experimentation:
>
> $ cat testfinal.dag
> FINAL final_node /dev/null NOOP
> SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
>
> $ cat do_final.sh
> #!/bin/sh
> exec >>/tmp/do_final.out
> echo "Args: $@"
> printenv
>
> $ condor_submit_dag testfinal.dag
>
> With this, I find the dagman cluster ID is in environment variable 
> "CONDOR_ID" (without a leading underscore). This seems to be completely 
> undocumented; the manual only talks about the CONDOR_IDS setting, which 
> is unrelated.
> Looking at the source, this behaviour happens *only* for scheduler 
> universe jobs:
> # src/condor_schedd.V6/schedd.cpp
> Scheduler::start_sched_universe_job(PROC_ID* job_id)
> ....
>          // stick a CONDOR_ID environment variable in job's environment
>          char condor_id_string[PROC_ID_STR_BUFLEN];
>          ProcIdToStr(*job_id,condor_id_string);
>          envobject.SetEnv("CONDOR_ID",condor_id_string);
> Furthermore, I don't see any code which makes use of this value. How 
> safe is it to rely on this? If it's used by some well-known external 
> application (e.g. Pegasus) then it could be dependable.
> Quick look at Pegasus source: yes, I think that's what it's there for.
> $ grep -R CONDOR_ID .
> ./bin/pegasus-dagman:        arguments.insert(0, "condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
> ./bin/pegasus-dagman: dagman_bin=os.path.join(os.getcwd(),"condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
> ./test/exitcode/largecode.out:    <env key="CONDOR_ID">511497.0</env>
> 2. I could use -append or -insert_sub_file to modify the dagman 
> submission file. This will have $(cluster) available, and I could try to 
> use +PostCmd. But the documentation says this is only for vanilla 
> universe jobs (and the PostCmd runs on the execute machine for the job), 
> whereas dagman is a scheduler universe job, and runs on the scheduler host.
> http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html#80133
> Also, I can't see any way to get at the DAGman exit code in a macro 
> which could be passed to PostArgs.
> 3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the 
> *.dagman.out file, and poll it periodically. Or I can poll the queue and 
> look for the dagman clusterID, wait for it to vanish from the queue, 
> then check the file. Both of these seem pretty messy to me.
> 4. I could send out an E-mail on completion to a special address which 
> triggers a handler script which parses the mail. I really really don't 
> want to do this.
> Anybody else done something like this?
> It's really only DAGs I'm worried about for now, although I suppose it 
> would be good to be able to track one-off jobs in the same way. They 
> could always be wrapped in a DAG.
> Thanks,
> Brian.
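
For option 1 in the quoted message, here is a minimal sketch of how do_final.sh could pick up the DAGMan cluster ID. It assumes CONDOR_ID keeps behaving as described above (it is undocumented, so treat it as an assumption) and that the script is still called with $DAG_STATUS and $FAILED_COUNT. The helper names cluster_of and final_record are made up for illustration.

```shell
#!/bin/sh
# do_final.sh (sketch). Assumed invocation, as in testfinal.dag:
#   SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT

# Extract the cluster part of a "cluster.proc" ID such as 511497.0.
cluster_of() {
    echo "${1%%.*}"
}

# Build one record line; the cluster is reported as "unknown" when
# CONDOR_ID is not present in the environment.
final_record() {
    printf 'cluster=%s dag_status=%s failed_count=%s\n' \
        "$(cluster_of "${CONDOR_ID:-unknown}")" "$1" "$2"
}

# Only write a record when both arguments were supplied.
if [ "$#" -ge 2 ]; then
    final_record "$1" "$2" >> /tmp/do_final.out
fi
```

The record line is where a database update keyed on the cluster ID would go. The "unknown" fallback is there so that the script still leaves a trace if the variable ever goes missing.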