Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Tracking DAGMan jobs

Date: Mon, 30 Dec 2013 12:16:47 +0000
From: Brian Candler <b.candler@xxxxxxxxx>
Subject: [HTCondor-users] Tracking DAGMan jobs

I wish to submit DAGs and track them in a database. When each DAG completes, I want the database to update, and record the success/fail status. I'm sure I can't be the first person to want to do this :-)


However I'm having trouble working out the best way to interact with Condor.

1. I could add a FINAL node to the DAG itself - ideally a NOOP job with a SCRIPT PRE. This gets $DAG_STATUS and $FAILED_COUNT parameters.

However, when I go to update the database, I'll want to know the clusterID of the dagman process itself (to find the corresponding row for the job submission). This won't be known until condor_submit_dag is run, so I can't hardcode it in the FINAL node unless I allocate my own independent IDs. Is there a way to get this?
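
For reference, the "allocate my own independent IDs" fallback could be sketched roughly as below: generate an ID before submission, store it in the database row, and write it into the FINAL node's PRE script arguments. The names render_dag and new_tracking_id are hypothetical, the use of a UUID is just one possible choice, and a real DAG would of course contain its ordinary JOB nodes as well.

```python
import uuid


def new_tracking_id():
    """Return a fresh ID, allocated independently of HTCondor (assumption: a UUID)."""
    return uuid.uuid4().hex


def render_dag(tracking_id, final_script="do_final.sh"):
    """Return DAG text whose FINAL node passes tracking_id to the PRE script.

    Only the FINAL node is emitted here; the real JOB nodes would be
    written alongside it.
    """
    lines = [
        "FINAL final_node /dev/null NOOP",
        f"SCRIPT PRE final_node {final_script} {tracking_id} "
        "$DAG_STATUS $FAILED_COUNT",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the fragment; in practice it would be written to the .dag file
    # and the same ID stored in the database before condor_submit_dag runs.
    print(render_dag(new_tracking_id()), end="")
```

The PRE script then receives the tracking ID as its first argument, followed by the DAG status and failed count.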

$JOBID doesn't seem to help - it's the ID of an individual DAG node, not the dagman job itself. Indeed, dagman rejects it:

12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE script argument!
12/30/13 11:23:36 ERROR: Warning is fatal error because of DAGMAN_USE_STRICT setting


Similarly, $CLUSTER and $(CLUSTER) are also rejected.

Now, I've done a bit of experimentation:

$ cat testfinal.dag
FINAL final_node /dev/null NOOP
SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT

$ cat do_final.sh
#!/bin/sh
exec >>/tmp/do_final.out
echo "Args: $@"
printenv

$ condor_submit_dag testfinal.dag

With this, I find the dagman cluster ID is in environment variable "CONDOR_ID" (without a leading underscore). This seems to be completely undocumented; the manual only talks about the CONDOR_IDS setting, which is unrelated.

Looking at the source, this behaviour happens *only* for scheduler universe jobs:


# src/condor_schedd.V6/schedd.cpp
Scheduler::start_sched_universe_job(PROC_ID* job_id)
...
        // stick a CONDOR_ID environment variable in job's environment
        char condor_id_string[PROC_ID_STR_BUFLEN];
        ProcIdToStr(*job_id,condor_id_string);
        envobject.SetEnv("CONDOR_ID",condor_id_string);

Furthermore, I don't see any code which makes use of this value. How safe is it to rely on this? If it's used by some well-known external application (e.g. Pegasus) then it could be dependable.


Quick look at Pegasus source: yes, I think that's what it's there for.

$ grep -R CONDOR_ID .

./bin/pegasus-dagman: arguments.insert(0,"condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
./bin/pegasus-dagman:dagman_bin=os.path.join(os.getcwd(),"condor_scheduniv_exec."+os.getenv("CONDOR_ID"))

./test/exitcode/largecode.out:    <env key="CONDOR_ID">511497.0</env>
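
If CONDOR_ID does turn out to be dependable, the FINAL node's PRE script could look something like the sketch below. It assumes the variable holds "cluster.proc" (as in the 511497.0 value above) and that the script is called with $DAG_STATUS and $FAILED_COUNT as its two arguments. The dag_runs table, its columns, the DAG_TRACK_DB variable and the use of SQLite are all made up for illustration.

```python
#!/usr/bin/env python3
import os
import sqlite3
import sys


def record_result(conn, condor_id, dag_status, failed_count):
    """Update the row for this dagman job; return how many rows changed."""
    cluster_id = int(condor_id.split(".")[0])  # "511497.0" -> 511497
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dag_runs ("
            "cluster_id INTEGER PRIMARY KEY, "
            "dag_status INTEGER, failed_count INTEGER)"
        )
        cur = conn.execute(
            "UPDATE dag_runs SET dag_status = ?, failed_count = ? "
            "WHERE cluster_id = ?",
            (int(dag_status), int(failed_count), cluster_id),
        )
    return cur.rowcount


def main(argv, environ):
    condor_id = environ.get("CONDOR_ID")
    if condor_id is None:
        # Not started by the schedd as a scheduler universe job.
        print("CONDOR_ID is not set", file=sys.stderr)
        return 1
    # DAG_TRACK_DB is a hypothetical variable naming the database file.
    conn = sqlite3.connect(environ.get("DAG_TRACK_DB", "/tmp/dag_runs.sqlite"))
    try:
        record_result(conn, condor_id, argv[1], argv[2])
    finally:
        conn.close()
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv, os.environ))
```

It would be wired in the same way as do_final.sh above, i.e. SCRIPT PRE final_node followed by the script name and $DAG_STATUS $FAILED_COUNT.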

2. I could use -append or -insert_sub_file to modify the dagman submission file. This will have $(cluster) available, and I could try to use +PostCmd. But the documentation says this is only for vanilla universe jobs (and the PostCmd runs on the execute machine for the job), whereas dagman is a scheduler universe job, and runs on the scheduler host.


http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html#80133

Also, I can't see any way to get at the DAGman exit code in a macro which could be passed to PostArgs.

3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the *.dagman.out file, and poll it periodically. Or I can poll the queue and look for the dagman clusterID, wait for it to vanish from the queue, then check the file. Both of these seem pretty messy to me.
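
For comparison, the queue-polling variant might be sketched as below. It assumes that condor_q given a cluster ID and a -format option prints one line per matching job and nothing once the cluster has left the queue; the function names and the 30-second interval are arbitrary. Checking the status file afterwards is left out.

```python
import subprocess
import time


def cluster_in_queue(cluster_id):
    """Return True while condor_q still lists a job in this cluster.

    Assumption: condor_q prints nothing (and exits 0) when no job matches.
    check=True makes a failing condor_q raise instead of looking like "done".
    """
    result = subprocess.run(
        ["condor_q", str(cluster_id), "-format", "%d\n", "ClusterId"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() != ""


def wait_for_exit(cluster_id, in_queue=cluster_in_queue,
                  interval=30.0, sleep=time.sleep):
    """Block until the cluster vanishes from the queue; return the poll count."""
    polls = 0
    while in_queue(cluster_id):
        polls += 1
        sleep(interval)
    return polls
```

The in_queue and sleep parameters exist only so the loop can be exercised without a running schedd.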

4. I could send out an E-mail on completion to a special address which triggers a handler script which parses the mail. I really really don't want to do this.


Anybody else done something like this?

It's really only DAGs I'm worried about for now, although I suppose it would be good to be able to track one-off jobs in the same way. They could always be wrapped in a DAG.


Thanks,

Brian.
