Re: [HTCondor-users] Python API submission of DAGs



> You can get an example here:
http://spinningmatt.wordpress.com/2011/09/16/submitting-a-dag-via-aviary-using-python/

That's pointing me in the right direction. However, when I submit a DAGMan job using the Python API, it never starts, and I am now stuck.

Here's my code:

---- 8< ----
#!/usr/bin/env python

from __future__ import print_function
import htcondor, classad
import os, sys

DAGMAN="/usr/bin/condor_dagman"
dag = sys.argv[1]
os.stat(dag)  # test for existence
schedd = htcondor.Schedd()
ad = classad.ClassAd({
  "JobUniverse": 7,
  "Cmd": DAGMAN,
  "Arguments": "-f -l . -Lockfile %s.lock -AutoRescue 1 -DoRescueFrom 0 " \
    "-Dag %s -Suppress_notification -Force -Dagman %s" % (dag, dag, DAGMAN),
  "Env": "_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_DAGMAN_LOG=%s.dagman.out;" \
    "_CONDOR_SCHEDD_DAEMON_AD_FILE=%s;_CONDOR_SCHEDD_ADDRESS_FILE=%s" %
    (dag, htcondor.param["SCHEDD_DAEMON_AD_FILE"],
    htcondor.param["SCHEDD_ADDRESS_FILE"]),
  "EnvDelim": ";",
  "Out": "%s.lib.out" % dag,
  "Err": "%s.lib.err" % dag,
  "ShouldTransferFiles": "IF_NEEDED",
  "UserLog": os.path.abspath("%s.dagman.log" % dag),
  "KillSig": "SIGTERM",
  "RemoveKillSig": "SIGUSR1",
  #"OtherJobRemoveRequirements": classad.ExprTree('eval(strcat("DAGManJobId == ", ClusterId))'),
  "OnExitRemove": classad.ExprTree('( ExitSignal =?= 11 || ( ExitCode =!= undefined && ExitCode >= 0 && ExitCode <= 2 ) )'),
  "FileSystemDomain": htcondor.param['FILESYSTEM_DOMAIN'],
  #"TransferIn": classad.ExprTree('false'),
  #"TransferInputSizeMB": 0,
})
cluster = schedd.submit(ad)
print("Submitted as cluster %d" % cluster)
---- 8< ----

This happily submits a job, but it sits in the queue in Idle (I) state indefinitely.

/var/log/condor/SchedLog shows:
12/17/13 16:16:56 (pid:20910) The Requirements attribute for job 528436.0 did not evaluate. Unable to start job

condor_q -analyze shows:

---- 8< ----
528436.000:  Request has not yet been considered by the matchmaker.

User priority for brian@xxxxxxxxxxx is not available, attempting to analyze without it.
---
528436.000:  Run analysis summary.  Of 12 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     12 are available to run your job

WARNING: Analysis is meaningless for Scheduler universe jobs.
---- 8< ----

condor_q -long shows:

Requirements = true && TARGET.OPSYS == "LINUX" && TARGET.ARCH == "X86_64" && ( TARGET.HasFileTransfer || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) ) && TARGET.Disk >= RequestDisk && TARGET.Memory >= RequestMemory

and related attributes:

RequestDisk = DiskUsage
DiskUsage = 1
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
ImageSize = 100
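
As a quick sanity check on these values, here is a minimal sketch of the arithmetic. It assumes MemoryUsage is still undefined for a job that has never run, and that the division is an integer division that truncates (modelled here with Python's `//`); neither assumption is confirmed by the output above.

```python
# Values copied from the condor_q -long output above.
disk_usage = 1
image_size = 100
memory_usage = None  # assumption: MemoryUsage is undefined for a job that has not run

# RequestDisk = DiskUsage
request_disk = disk_usage

# RequestMemory = ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024)
# Assumption: integer operands give truncating integer division.
if memory_usage is not None:
    request_memory = memory_usage
else:
    request_memory = (image_size + 1023) // 1024

print("RequestDisk = %d, RequestMemory = %d" % (request_disk, request_memory))
```

Under those assumptions both requests come out very small, so the Disk and Memory clauses look unlikely to be what stops the expression from evaluating.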

This requirements expression is slightly different from what I get if I submit the job using condor_submit, where I get:

Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory )

If I try setting, for example, "Requirements": classad.ExprTree("wombat"), then it becomes

Requirements = wombat && TARGET.OPSYS == "LINUX" && TARGET.ARCH == "X86_64" && ( TARGET.HasFileTransfer || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) ) && TARGET.Disk >= RequestDisk && TARGET.Memory >= RequestMemory

so it looks like the remainder of this expression is being set by condor at submission time. But I don't know why a job submitted via the python API should have a different requirements expression - and in any case, I can't tell if this expression is failing, or there's some other reason.
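
To see exactly which clauses differ between the two expressions quoted above, a rough sketch like the following can help. It uses only plain string handling, not the classad module: it splits each expression on top-level `&&`, strips redundant outer parentheses, and lowercases everything (relying on ClassAd attribute names being case-insensitive). The helper names are made up for illustration, and the splitter ignores the possibility of parentheses or `&&` inside string literals.

```python
def split_top_level_and(expr):
    """Split an expression on '&&' operators that are not inside parentheses."""
    clauses, depth, start, i = [], 0, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and expr.startswith("&&", i):
            clauses.append(expr[start:i])
            i += 2
            start = i
            continue
        i += 1
    clauses.append(expr[start:])
    return clauses


def outer_parens_match(s):
    """True if s is wrapped in one pair of parentheses that match each other."""
    if not (s.startswith("(") and s.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i != len(s) - 1:
                return False
    return depth == 0


def normalize(clause):
    """Collapse whitespace, drop redundant outer parentheses, lowercase."""
    s = " ".join(clause.split())
    while outer_parens_match(s):
        s = s[1:-1].strip()
    return s.lower()


def clause_set(expr):
    return set(normalize(c) for c in split_top_level_and(expr))


# Requirements as reported by condor_q -long for the Python API submission.
api_req = ('true && TARGET.OPSYS == "LINUX" && TARGET.ARCH == "X86_64" && '
           '( TARGET.HasFileTransfer || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) ) && '
           'TARGET.Disk >= RequestDisk && TARGET.Memory >= RequestMemory')

# Requirements as reported for the same job submitted with condor_submit.
submit_req = ('( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && '
              '( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory )')

only_in_api = clause_set(api_req) - clause_set(submit_req)
only_in_submit = clause_set(submit_req) - clause_set(api_req)

for clause in sorted(only_in_api):
    print("only in Python API job:", clause)
for clause in sorted(only_in_submit):
    print("only in condor_submit job:", clause)
```

This only compares the text of the clauses; it says nothing about how the schedd evaluates them.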

I have also tried:

  "Requirements": classad.ExprTree('( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory )'),

but in this case it still gets

&& ( TARGET.HasFileTransfer || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

appended when I look at condor_q -long.

Clues gratefully received. I am using condor 8.0.4-189770 under Ubuntu 12.04 x86_64.

Thanks,

Brian.