
Re: [HTCondor-users] Python API submission of DAGs



> Submitting a job via the python bindings is really not for the faint of heart

Sure :-) The main thing I want is to submit a DAG and then reliably capture its cluster ID for further processing. I could spawn condor_submit_dag and parse its output, but that seems a pretty ugly thing to do when there's an API sitting there waiting to be used, given that I'm writing everything in Python anyway.

That is, unless condor_submit_dag has a machine-parseable output format (XML, say?), which I don't think it does.
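For completeness, here is a minimal sketch of the spawn-and-parse approach I would rather avoid. It assumes the usual "1 job(s) submitted to cluster N." line keeps its current wording at submission time, which is exactly the sort of fragility that makes me want the API instead:

import re, subprocess, sys

def submit_dag_via_cli(dag_file):
    # Run condor_submit_dag and scrape the DAGMan cluster ID out of its stdout.
    # Assumes the "... submitted to cluster N." line survives across versions.
    out = subprocess.check_output(["condor_submit_dag", dag_file])
    m = re.search(r"submitted to cluster (\d+)", out.decode())
    if m is None:
        raise RuntimeError("could not find a cluster ID in condor_submit_dag output")
    return int(m.group(1))

if __name__ == "__main__":
    print(submit_dag_via_cli(sys.argv[1]))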

> Oddly enough, I have a project which does precisely this.  The code which generates the classad is here:
>
>
> Note that I actually submit a wrapper script that does a few devious things before exec'ing condor_dagman.

I can see two submission approaches in that code:

(1) submitDirect() --> schedd.submit(dagAd, ...)

(2) schedd.submitRaw( ... jdl ... )

where jdl appears to be the contents of a regular submit description file, whereas dagAd is a ClassAd built up in code.

The decision about which approach to use is based on loc.getScheddObj(task['tm_taskname']), and I'm not sure what this is deciding; perhaps it's to do with whether the schedd is running on the same host or a remote host? But I don't understand why the same API can't be used in both cases.
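In case it helps anyone following along, this is how I currently picture the difference between the two inputs. The submitDirect/submitRaw names belong to that project, not to the bindings; the snippet below is only my own illustration, using Schedd.submit() for the ClassAd path and plain condor_submit for the jdl text, with a made-up toy sleep job:

import subprocess, tempfile
import htcondor, classad

schedd = htcondor.Schedd()

# Path (1): hand a fully-built ClassAd straight to the schedd.
ad = classad.ClassAd({
  "JobUniverse": 5,          # vanilla
  "Cmd": "/bin/sleep",
  "Arguments": "60",
  "Requirements": classad.ExprTree('true || false'),  # the hack discussed below
})
cluster = schedd.submit(ad)

# Path (2): build ordinary submit-description text ("jdl") and let
# condor_submit parse it into a ClassAd for us.
jdl = "universe = vanilla\nexecutable = /bin/sleep\narguments = 60\nqueue\n"
with tempfile.NamedTemporaryFile(suffix=".sub", delete=False) as f:
    f.write(jdl.encode())
subprocess.check_call(["condor_submit", f.name])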

> Looks like you've run afoul of HTCondor's default Requirements expression.  Try this:
>
> dagAd["Requirements"] = classad.ExprTree('true || false')


That works. That is an ugly, ugly, ugly hack. Somebody should be ashamed :-)

I can see (from condor_q -long) that the Requirements expression expands to:

Requirements = true || false && TARGET.OPSYS == "LINUX" && TARGET.ARCH == "X86_64" && ( TARGET.HasFileTransfer || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) ) && TARGET.Disk >= RequestDisk && TARGET.Memory >= RequestMemory

Basically, this relies on the fact that somebody forgot to add parentheses around the existing value when appending conditions in condor_submit: because && binds more tightly than ||, the whole thing parses as true || (false && ...), which is always true. This could clearly break in future.
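A quick check with the classad module (assuming ExprTree.eval() on literal-only expressions behaves the way I expect) shows the precedence at work:

import classad
# && binds more tightly than ||, so the appended clauses end up inside
# the right-hand operand of || and can never make the whole thing false.
print(classad.ExprTree('true || false && false').eval())    # True
print(classad.ExprTree('(true || false) && false').eval())  # False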

After adding that, I found that -CsdVersion is a mandatory argument to condor_dagman; with that supplied, the submission works.

Below is the Python code as it stands. I have still not worked out how to express +OtherJobRemoveRequirements = "DAGManJobId == $(cluster)" in ClassAd form; the presentation says to use strcat(), but I couldn't get it to work.

I am starting to conclude that the Python API is (a) not sufficiently documented, and (b) different from the built-in tools in subtle and confusing ways (e.g. the Requirements hack is not needed when you use condor_submit to submit a scheduler-universe job).

Regards,

Brian.

#!/usr/bin/env python

from __future__ import print_function
import htcondor, classad
import os, sys

DAGMAN="/usr/bin/condor_dagman"
dag = sys.argv[1]
os.stat(dag)  # test for existence
schedd = htcondor.Schedd()
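# Build by hand the scheduler-universe job ad that condor_submit_dag would
# normally write into the generated .condor.sub file.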
ad = classad.ClassAd({
  "JobUniverse": 7,
  "Cmd": DAGMAN,
  "Arguments": "-f -l . -Lockfile %s.lock -AutoRescue 1 -DoRescueFrom 0 " \
    "-Dag %s -Suppress_notification -CsdVersion '%s' -Force -Dagman %s" %
    (dag, dag, htcondor.version(), DAGMAN),
  "Env": "_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_DAGMAN_LOG=%s.dagman.out;" \
    "_CONDOR_SCHEDD_DAEMON_AD_FILE=%s;_CONDOR_SCHEDD_ADDRESS_FILE=%s" %
    (dag, htcondor.param["SCHEDD_DAEMON_AD_FILE"],
    htcondor.param["SCHEDD_ADDRESS_FILE"]),
  "EnvDelim": ";",
  "Out": "%s.lib.out" % dag,
  "Err": "%s.lib.err" % dag,
  "ShouldTransferFiles": "IF_NEEDED",
  "UserLog": os.path.abspath("%s.dagman.log" % dag),
  "KillSig": "SIGTERM",
  "RemoveKillSig": "SIGUSR1",
  #"OtherJobRemoveRequirements": classad.ExprTree('eval(strcat("DAGManJobId == ", ClusterId))'),
  "OnExitRemove": classad.ExprTree('( ExitSignal =?= 11 || ( ExitCode =!= undefined && ExitCode >= 0 && ExitCode <= 2 ) )'),
  "FileSystemDomain": htcondor.param['FILESYSTEM_DOMAIN'],
  "Requirements": classad.ExprTree('true || false'),
})
cluster = schedd.submit(ad)
print("Submitted as cluster %d" % cluster)