
Re: [HTCondor-users] Inconsistencies: hold versus abort



On 02/12/2014 22:05, R. Kent Wenger wrote:
On Tue, 2 Dec 2014, Brian Candler wrote:

Case (1): missing local file
...
If you set a NODE_STATUS_FILE it won't help you: it shows

 DagStatus = 3; /* "STATUS_SUBMITTED ()" */
...
  NodeStatus = 1; /* "STATUS_READY" */

It seems odd that the NODE_STATUS_FILE is not updated when dagman terminates - I'd have expected the DagStatus to show STATUS_ERROR, and probably also the node which couldn't be submitted to be marked as failed.

What version of DAGMan are you running? In 8.2.3 we fixed a bug that could cause the node status file to not get updated when DAGMan exits.
It's 8.2.4 everywhere.

Production cluster: 8.2.4-281588 package under Ubuntu 12.04 x86_64
My laptop: condor-8.2.4-x86_64_MacOSX-stripped.tar.gz

When I try this, I get the following for the node status:

[
  Type = "NodeStatus";
  Node = "NodeA";
  NodeStatus = 6; /* "STATUS_ERROR" */
  StatusDetails = "Job submit failed";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]

Hopefully this is what you would want.


Yes, that would be correct. Let me retest this.

[on OSX personal condor]

Brians-MacBook-Air:tmp $ cat t1.log
1418057606 INTERNAL *** DAGMAN_STARTED 8.0 ***
1418057619 t1 SUBMIT_FAILURE - - - 1
1418057624 t1 SUBMIT_FAILURE - - - 1
1418057629 t1 SUBMIT_FAILURE - - - 1
1418057634 t1 SUBMIT_FAILURE - - - 1
1418057645 t1 SUBMIT_FAILURE - - - 1
1418057662 t1 SUBMIT_FAILURE - - - 1
1418057662 INTERNAL *** DAGMAN_FINISHED 1 ***
Brians-MacBook-Air:tmp $ cat t1.status
[
  Type = "DagStatus";
  DagFiles = {
    "t1.dag"
  };
  Timestamp = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
  NodesTotal = 1;
  NodesDone = 0;
  NodesPre = 0;
  NodesQueued = 0;
  NodesPost = 0;
  NodesReady = 1;
  NodesUnready = 0;
  NodesFailed = 0;
  JobProcsHeld = 0;
  JobProcsIdle = 0;
]
[
  Type = "NodeStatus";
  Node = "t1";
  NodeStatus = 1; /* "STATUS_READY" */
  StatusDetails = "";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]
[
  Type = "StatusEnd";
  EndTime = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
  NextUpdate = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
]



[on Ubuntu/production cluster]

brian@proliant:~/tmp$ cat t1.log
1418057691 INTERNAL *** DAGMAN_STARTED 5274647.0 ***
1418057704 t1 SUBMIT_FAILURE - - - 1
1418057709 t1 SUBMIT_FAILURE - - - 1
1418057714 t1 SUBMIT_FAILURE - - - 1
1418057719 t1 SUBMIT_FAILURE - - - 1
1418057730 t1 SUBMIT_FAILURE - - - 1
1418057747 t1 SUBMIT_FAILURE - - - 1
1418057747 INTERNAL *** DAGMAN_FINISHED 1 ***
brian@proliant:~/tmp$ cat t1.status
[
  Type = "DagStatus";
  DagFiles = {
    "t1.dag"
  };
  Timestamp = 1418057704; /* "Mon Dec  8 16:55:04 2014" */
  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
  NodesTotal = 1;
  NodesDone = 0;
  NodesPre = 0;
  NodesQueued = 0;
  NodesPost = 0;
  NodesReady = 1;
  NodesUnready = 0;
  NodesFailed = 0;
  JobProcsHeld = 0;
  JobProcsIdle = 0;
]
[
  Type = "NodeStatus";
  Node = "t1";
  NodeStatus = 1; /* "STATUS_READY" */
  StatusDetails = "";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]
[
  Type = "StatusEnd";
  EndTime = 1418057704; /* "Mon Dec  8 16:55:04 2014" */
  NextUpdate = 1418057704; /* "Mon Dec  8 16:55:04 2014" */
]


Nope... it's definitely the same behaviour as I reported. It looks like dagman isn't writing the final version of the NODE_STATUS_FILE.

Here are all the files from the OSX run. Interestingly, t1.dag.metrics *does* have the correct DAG status (2 = DAG_STATUS_NODE_FAILED).

==> t1.dag <==
JOB t1 t1.sub
JOBSTATE_LOG t1.log
NODE_STATUS_FILE t1.status

==> t1.dag.condor.sub <==
# Filename: t1.dag.condor.sub
# Generated by condor_submit_dag t1.dag
universe    = scheduler
executable    = /usr/local/bin/condor_dagman
getenv        = True
output        = t1.dag.lib.out
error        = t1.dag.lib.err
log        = t1.dag.dagman.log
remove_kill_sig    = SIGUSR1
+OtherJobRemoveRequirements    = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool    = False
arguments = "-f -l . -Lockfile t1.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag t1.dag -Suppress_notification -CsdVersion $CondorVersion:' '8.2.4' 'Nov' '07' '2014' 'BuildID:' '281203' '$ -Dagman /usr/local/bin/condor_dagman" environment = _CONDOR_DAGMAN_LOG=t1.dag.dagman.out;_CONDOR_SCHEDD_ADDRESS_FILE=/Users/brian/Build/condor-8.2.4-x86_64_MacOSX7-stripped/local.brians-macbook-air/spool/.schedd_address;_CONDOR_SCHEDD_DAEMON_AD_FILE=/Users/brian/Build/condor-8.2.4-x86_64_MacOSX7-stripped/local.brians-macbook-air/spool/.schedd_classad;_CONDOR_MAX_DAGMAN_LOG=0
queue

==> t1.dag.dagman.log <==
000 (008.000.000) 12/08 16:53:25 Job submitted from host: <10.26.1.63:52393>
...
001 (008.000.000) 12/08 16:53:25 Job executing on host: <10.26.1.63:52393>
...
005 (008.000.000) 12/08 16:54:22 Job terminated.
    (1) Normal termination (return value 1)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

==> t1.dag.dagman.out <==
12/08/14 16:53:26 Can't open directory "/Users/brian/Build/condor-8.2.4-x86_64_MacOSX7-stripped/local.brians-macbook-air/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
12/08/14 16:53:26 Cannot open /Users/brian/Build/condor-8.2.4-x86_64_MacOSX7-stripped/local.brians-macbook-air/config: No such file or directory
12/08/14 16:53:26 ******************************************************
12/08/14 16:53:26 ** condor_scheduniv_exec.8.0 (CONDOR_DAGMAN) STARTING UP
12/08/14 16:53:26 ** /usr/local/condor/bin/condor_dagman
12/08/14 16:53:26 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
12/08/14 16:53:26 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
12/08/14 16:53:26 ** $CondorVersion: 8.2.4 Nov 07 2014 BuildID: 281203 $
12/08/14 16:53:26 ** $CondorPlatform: x86_64_MacOSX7 $
12/08/14 16:53:26 ** PID = 34242
12/08/14 16:53:26 ** Log last touched time unavailable (No such file or directory)
12/08/14 16:53:26 ******************************************************
12/08/14 16:53:26 Using config source: /etc/condor/condor_config
12/08/14 16:53:26 Using local config sources:
12/08/14 16:53:26 /Users/brian/Build/condor-8.2.4-x86_64_MacOSX7-stripped/local.Brians-MacBook-Air/condor_config.local
12/08/14 16:53:26 config Macros = 63, Sorted = 62, StringBytes = 2025, TablesBytes = 2316
12/08/14 16:53:26 CLASSAD_CACHING is ENABLED
12/08/14 16:53:26 Daemon Log is logging: D_ALWAYS D_ERROR
12/08/14 16:53:26 DaemonCore: command socket at <10.26.1.63:52409>
12/08/14 16:53:26 DaemonCore: private command socket at <10.26.1.63:52409>
12/08/14 16:53:26 DAGMAN_USE_STRICT setting: 1
12/08/14 16:53:26 DAGMAN_VERBOSITY setting: 3
12/08/14 16:53:26 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
12/08/14 16:53:26 DAGMAN_DEBUG_CACHE_ENABLE setting: False
12/08/14 16:53:26 DAGMAN_SUBMIT_DELAY setting: 0
12/08/14 16:53:26 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
12/08/14 16:53:26 DAGMAN_STARTUP_CYCLE_DETECT setting: False
12/08/14 16:53:26 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
12/08/14 16:53:26 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
12/08/14 16:53:26 DAGMAN_DEFAULT_PRIORITY setting: 0
12/08/14 16:53:26 DAGMAN_ALWAYS_USE_NODE_LOG setting: True
12/08/14 16:53:26 DAGMAN_SUPPRESS_NOTIFICATION setting: True
12/08/14 16:53:26 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
12/08/14 16:53:26 DAGMAN_RETRY_SUBMIT_FIRST setting: True
12/08/14 16:53:26 DAGMAN_RETRY_NODE_FIRST setting: False
12/08/14 16:53:26 DAGMAN_MAX_JOBS_IDLE setting: 0
12/08/14 16:53:26 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
12/08/14 16:53:26 DAGMAN_MAX_PRE_SCRIPTS setting: 0
12/08/14 16:53:26 DAGMAN_MAX_POST_SCRIPTS setting: 0
12/08/14 16:53:26 DAGMAN_ALLOW_LOG_ERROR setting: False
12/08/14 16:53:26 DAGMAN_MUNGE_NODE_NAMES setting: True
12/08/14 16:53:26 DAGMAN_PROHIBIT_MULTI_JOBS setting: False
12/08/14 16:53:26 DAGMAN_SUBMIT_DEPTH_FIRST setting: False
12/08/14 16:53:26 DAGMAN_ALWAYS_RUN_POST setting: True
12/08/14 16:53:26 DAGMAN_ABORT_DUPLICATES setting: True
12/08/14 16:53:26 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True
12/08/14 16:53:26 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
12/08/14 16:53:26 DAGMAN_AUTO_RESCUE setting: True
12/08/14 16:53:26 DAGMAN_MAX_RESCUE_NUM setting: 100
12/08/14 16:53:26 DAGMAN_WRITE_PARTIAL_RESCUE setting: True
12/08/14 16:53:26 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log
12/08/14 16:53:26 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True
12/08/14 16:53:26 DAGMAN_MAX_JOB_HOLDS setting: 100
12/08/14 16:53:26 DAGMAN_HOLD_CLAIM_TIME setting: 20
12/08/14 16:53:26 ALL_DEBUG setting:
12/08/14 16:53:26 DAGMAN_DEBUG setting:
12/08/14 16:53:26 argv[0] == "condor_scheduniv_exec.8.0"
12/08/14 16:53:26 argv[1] == "-Lockfile"
12/08/14 16:53:26 argv[2] == "t1.dag.lock"
12/08/14 16:53:26 argv[3] == "-AutoRescue"
12/08/14 16:53:26 argv[4] == "1"
12/08/14 16:53:26 argv[5] == "-DoRescueFrom"
12/08/14 16:53:26 argv[6] == "0"
12/08/14 16:53:26 argv[7] == "-Dag"
12/08/14 16:53:26 argv[8] == "t1.dag"
12/08/14 16:53:26 argv[9] == "-Suppress_notification"
12/08/14 16:53:26 argv[10] == "-CsdVersion"
12/08/14 16:53:26 argv[11] == "$CondorVersion: 8.2.4 Nov 07 2014 BuildID: 281203 $"
12/08/14 16:53:26 argv[12] == "-Dagman"
12/08/14 16:53:26 argv[13] == "/usr/local/bin/condor_dagman"
12/08/14 16:53:26 Default node log file is: </Users/brian/tmp/./t1.dag.nodes.log>
12/08/14 16:53:26 DAG Lockfile will be written to t1.dag.lock
12/08/14 16:53:26 DAG Input file is t1.dag
12/08/14 16:53:26 Ignoring value of DAGMAN_LOG_ON_NFS_IS_ERROR.
12/08/14 16:53:26 Parsing 1 dagfiles
12/08/14 16:53:26 Parsing t1.dag ...
12/08/14 16:53:26 Dag contains 1 total jobs
12/08/14 16:53:26 Sleeping for 12 seconds to ensure ProcessId uniqueness
12/08/14 16:53:38 Warning: ProcessId not confirmed unique
12/08/14 16:53:38 Bootstrapping...
12/08/14 16:53:38 Number of pre-completed nodes: 0
12/08/14 16:53:38 Of 1 nodes total:
12/08/14 16:53:38  Done     Pre   Queued    Post   Ready Un-Ready   Failed
12/08/14 16:53:38   ===     ===      ===     ===     ===      ===      ===
12/08/14 16:53:38     0       0        0       0       1        0        0
12/08/14 16:53:38 0 job proc(s) currently held
12/08/14 16:53:38 Registering condor_event_timer...
12/08/14 16:53:39 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:53:39 MultiLogFiles: truncating log file /Users/brian/tmp/./t1.dag.nodes.log
12/08/14 16:53:39 Submitting Condor Node t1 job(s)...
12/08/14 16:53:39 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:39 From submit: Submitting job(s)
12/08/14 16:53:39 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:39 failed while reading from pipe.
12/08/14 16:53:39 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:39 ERROR: submit attempt failed
12/08/14 16:53:39 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:39 Job submit try 1/6 failed, will try again in >= 1 second.
12/08/14 16:53:39 Of 1 nodes total:
12/08/14 16:53:39  Done     Pre   Queued    Post   Ready Un-Ready   Failed
12/08/14 16:53:39   ===     ===      ===     ===     ===      ===      ===
12/08/14 16:53:39     0       0        0       0       1        0        0
12/08/14 16:53:39 0 job proc(s) currently held
12/08/14 16:53:44 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:53:44 Submitting Condor Node t1 job(s)...
12/08/14 16:53:44 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:44 From submit: Submitting job(s)
12/08/14 16:53:44 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:44 failed while reading from pipe.
12/08/14 16:53:44 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:44 ERROR: submit attempt failed
12/08/14 16:53:44 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:44 Job submit try 2/6 failed, will try again in >= 2 seconds.
12/08/14 16:53:49 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:53:49 Submitting Condor Node t1 job(s)...
12/08/14 16:53:49 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:49 From submit: Submitting job(s)
12/08/14 16:53:49 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:49 failed while reading from pipe.
12/08/14 16:53:49 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:49 ERROR: submit attempt failed
12/08/14 16:53:49 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:49 Job submit try 3/6 failed, will try again in >= 4 seconds.
12/08/14 16:53:54 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:53:54 Submitting Condor Node t1 job(s)...
12/08/14 16:53:54 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:54 From submit: Submitting job(s)
12/08/14 16:53:54 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:54 failed while reading from pipe.
12/08/14 16:53:54 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:53:54 ERROR: submit attempt failed
12/08/14 16:53:54 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:53:54 Job submit try 4/6 failed, will try again in >= 8 seconds.
12/08/14 16:54:05 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:54:05 Submitting Condor Node t1 job(s)...
12/08/14 16:54:05 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:54:05 From submit: Submitting job(s)
12/08/14 16:54:05 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:54:05 failed while reading from pipe.
12/08/14 16:54:05 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:54:05 ERROR: submit attempt failed
12/08/14 16:54:05 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:54:05 Job submit try 5/6 failed, will try again in >= 16 seconds.
12/08/14 16:54:22 Unable to get log file from submit file t1.sub (node t1); using default (/Users/brian/tmp/./t1.dag.nodes.log)
12/08/14 16:54:22 Submitting Condor Node t1 job(s)...
12/08/14 16:54:22 submitting: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:54:22 From submit: Submitting job(s)
12/08/14 16:54:22 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:54:22 failed while reading from pipe.
12/08/14 16:54:22 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/08/14 16:54:22 ERROR: submit attempt failed
12/08/14 16:54:22 submit command was: condor_submit -a dag_node_name' '=' 't1 -a +DAGManJobId' '=' '8 -a DAGManJobId' '=' '8 -a submit_event_notes' '=' 'DAG' 'Node:' 't1 -a log' '=' '/Users/brian/tmp/./t1.dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a notification' '=' 'never t1.sub
12/08/14 16:54:22 Job submit failed after 6 tries.
12/08/14 16:54:22 Shortcutting node t1 retries because of submit failure(s)
12/08/14 16:54:22 Of 1 nodes total:
12/08/14 16:54:22  Done     Pre   Queued    Post   Ready Un-Ready   Failed
12/08/14 16:54:22   ===     ===      ===     ===     ===      ===      ===
12/08/14 16:54:22     0       0        0       0       0        0        1
12/08/14 16:54:22 0 job proc(s) currently held
12/08/14 16:54:22 Aborting DAG...
12/08/14 16:54:22 Writing Rescue DAG to t1.dag.rescue001...
12/08/14 16:54:22 Note: 0 total job deferrals because of -MaxJobs limit (0)
12/08/14 16:54:22 Note: 0 total job deferrals because of -MaxIdle limit (0)
12/08/14 16:54:22 Note: 0 total job deferrals because of node category throttles
12/08/14 16:54:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
12/08/14 16:54:22 Note: 0 total POST script deferrals because of -MaxPost limit (0)
12/08/14 16:54:22 Of 1 nodes total:
12/08/14 16:54:22  Done     Pre   Queued    Post   Ready Un-Ready   Failed
12/08/14 16:54:22   ===     ===      ===     ===     ===      ===      ===
12/08/14 16:54:22     0       0        0       0       0        0        1
12/08/14 16:54:22 0 job proc(s) currently held
12/08/14 16:54:22 Wrote metrics file t1.dag.metrics.
12/08/14 16:54:22 Metrics not sent because of PEGASUS_METRICS or CONDOR_DEVELOPERS setting.
12/08/14 16:54:22 **** condor_scheduniv_exec.8.0 (condor_DAGMAN) pid 34242 EXITING WITH STATUS 1

==> t1.dag.lib.err <==

==> t1.dag.lib.out <==
Executing condor dagman ...

==> t1.dag.metrics <==
{
    "client":"condor_dagman",
    "version":"8.2.4",
    "planner":"",
    "planner_version":"",
    "type":"metrics",
    "wf_uuid":"",
    "root_wf_uuid":"",
    "start_time":1418057606.034,
    "end_time":1418057662.381,
    "duration":56.347,
    "exitcode":1,
    "dagman_id":"8",
    "parent_dagman_id":"",
    "rescue_dag_number":0,
    "jobs":1,
    "jobs_failed":1,
    "jobs_succeeded":0,
    "dag_jobs":0,
    "dag_jobs_failed":0,
    "dag_jobs_succeeded":0,
    "total_jobs":1,
    "total_jobs_run":1,
    "total_job_time":0.000,
    "dag_status":2
}

==> t1.dag.nodes.log <==

==> t1.dag.rescue001 <==
# Rescue DAG file, created after running
#   the t1.dag DAG file
# Created 12/8/2014 16:54:22 UTC
# Rescue DAG version: 2.0.1 (partial)
#
# Total number of Nodes: 1
# Nodes premarked DONE: 0
# Nodes that failed: 1
#   t1,<ENDLIST>


==> t1.log <==
1418057606 INTERNAL *** DAGMAN_STARTED 8.0 ***
1418057619 t1 SUBMIT_FAILURE - - - 1
1418057624 t1 SUBMIT_FAILURE - - - 1
1418057629 t1 SUBMIT_FAILURE - - - 1
1418057634 t1 SUBMIT_FAILURE - - - 1
1418057645 t1 SUBMIT_FAILURE - - - 1
1418057662 t1 SUBMIT_FAILURE - - - 1
1418057662 INTERNAL *** DAGMAN_FINISHED 1 ***

==> t1.status <==
[
  Type = "DagStatus";
  DagFiles = {
    "t1.dag"
  };
  Timestamp = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
  NodesTotal = 1;
  NodesDone = 0;
  NodesPre = 0;
  NodesQueued = 0;
  NodesPost = 0;
  NodesReady = 1;
  NodesUnready = 0;
  NodesFailed = 0;
  JobProcsHeld = 0;
  JobProcsIdle = 0;
]
[
  Type = "NodeStatus";
  Node = "t1";
  NodeStatus = 1; /* "STATUS_READY" */
  StatusDetails = "";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]
[
  Type = "StatusEnd";
  EndTime = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
  NextUpdate = 1418057619; /* "Mon Dec  8 16:53:39 2014" */
]

==> t1.sub <==
universe = vanilla
executable = /bin/true
transfer_executable = false
transfer_input_files = /nonexistent
queue



But in both cases you don't get any indication of *why* the job was held, and it isn't in the <dag>.dagman.out file either. You have to use condor_q -analyze <jobid> and parse its output:

You could find the hold reason in the DAGMan nodes.log file, or the log file specified in the submit file (if you specify one).

OK thanks, I see this in t2.dag.nodes.log:

...
012 (4299200.000.000) 12/02 19:06:09 Job was held.
Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.6.213 failed to receive file /var/lib/condor/execute/dir_9329/nonexistent: FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
        Code 12 Subcode 0
...

Unfortunately that's not in one of the files intended to be machine-parseable. Perhaps if I keep a separate log file per DAG node, I can locate the error more easily.
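Since a held job stays in the queue, the hold reason is also exposed as job ClassAd attributes, which may be easier to consume programmatically than grepping log files. A sketch, assuming condor_q's -autoformat (-af) option is available in your version, and using the job id from the hold event above:

condor_q -af HoldReason HoldReasonCode HoldReasonSubCode 4299200.0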

Also, as far as I can tell there are no automatic retries (those would have to be done by condor_startd, presumably?)

As far as DAGMan is concerned, a job that's on hold may still eventually succeed. So if you want the job to fail, you need to put a periodic_remove expression into your submit file that removes the job after it's been on hold for a certain amount of time. Then you could add retries to your DAG node.
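For concreteness, a minimal sketch of what that could look like (the node name t2, the 30-minute threshold and the retry count are illustrative assumptions, not taken from Kent's mail):

==> (hypothetical) t2.sub addition <==
# remove the job once it has been held (JobStatus == 5) for more than 30 minutes
periodic_remove = (JobStatus == 5) && (time() - EnteredCurrentStatus > 1800)

==> (hypothetical) t2.dag addition <==
# the removal counts as a node failure, so dagman will re-run the node up to 3 times
RETRY t2 3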

What I meant was that the action which caused the job to go on hold (i.e. transferring the file) isn't retried: after one attempt the job goes straight into the hold state indefinitely. You can effectively force a "retry" by releasing the hold manually (condor_release), and if the transfer fails again the job goes back on hold; but this doesn't happen automatically.
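For the record, an automatic release-and-retry of the transfer can be approximated with a periodic_release expression. A rough sketch, matching the Code 12 shown in the hold event above and capping the number of releases (the cap and the 60-second delay are arbitrary choices, not something from this thread):

==> (hypothetical) t2.sub addition <==
# re-release the job (so the transfer is attempted again) up to 3 times,
# but only for holds carrying the reason code seen in the event above
periodic_release = (HoldReasonCode == 12) && (NumSystemHolds <= 3) && (time() - EnteredCurrentStatus > 60)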

It might be simpler to put the file transfer in a PRE script, in which case the node can be retried by dagman (although I don't think that will give exponential backoff).
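A rough sketch of that approach, with a made-up script name and URL (and no backoff; if backoff matters, the sleep would have to live inside the script itself):

==> (hypothetical) t2.dag additions <==
SCRIPT PRE t2 fetch_input.sh
RETRY t2 3

==> (hypothetical) fetch_input.sh <==
#!/bin/sh
# exit non-zero on failure so that dagman marks the node as failed and retries it
exec curl -f -s -o input.dat http://example.com/input.dat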

Regards,

Brian.