
[Condor-users] Dagman BAD EVENT: executed before submitted



We've been seeing this behavior in Condor 7.2 for the last several months (on Red Hat FC8), and now I'm testing 7.4 (on Red Hat FC12) and seeing it again. It is very sensitive to which machine is playing the role of collector: some machines hit this problem in 100% of DAGs, others in 0%. All of our machines are identical, or are intended to be. We've worked around it by no longer using certain machines as collectors.

The symptom is that the dagman job is itself evicted from the collector. The eviction is logged in the .dagman.log file just like any other eviction:

==
000 (694.000.000) 01/25 12:47:35 Job submitted from host: <192.168.129.17:53465>
...
001 (694.000.000) 01/25 12:47:35 Job executing on host: <192.168.129.17:53465>
...
004 (694.000.000) 01/25 13:26:08 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (694.000.000) 01/25 13:26:13 Job executing on host: <192.168.129.17:53465>
...
==

In the .dagman.out file I see the following (the job that triggered the problem is cluster 838; I've snipped activity unrelated to 838):

==
01/25 13:25:38 From submit: Submitting job(s).
01/25 13:26:08 From submit: Logging submit event(s).
01/25 13:26:08 From submit: 1 job(s) submitted to cluster 838.
01/25 13:26:08  assigned Condor ID (838.0)

[snip]

01/25 13:26:08 Event: ULOG_EXECUTE for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) executing, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 Number of idle job procs: 2
01/25 13:26:08 Event: ULOG_JOB_TERMINATED for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) ended, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3119 in file dag.cpp
==

Note that the ULOG_EXECUTE event came in with no preceding ULOG_SUBMIT. I see another manifestation of this in the monolithic log file that all my .condor files set as the "log =" destination:

==
001 (838.000.000) 01/25 13:25:49 Job executing on host: <192.168.131.114:52465>
...
005 (838.000.000) 01/25 13:25:55 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
000 (838.000.000) 01/25 13:25:38 Job submitted from host: <192.168.129.17:53465>
    DAG Node: MyJob
...
==

The "executing" event appears before the "submitted" event. The timestamps themselves look correct, but the events were written to the log file out of order. The job was very small and ran for just 6 seconds.
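A quick way to scan a shared user log for these inversions is a script like the following. This is a rough sketch of my own, not a Condor tool: it assumes the standard user-log format shown above, where each event header line starts with a three-digit event code followed by the (cluster.proc.subproc) job ID, and it treats codes 000 and 001 as submit and execute respectively.

```python
import re

# User-log event headers look like:
#   "001 (838.000.000) 01/25 13:25:49 Job executing on host: <...>"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")

SUBMIT, EXECUTE = "000", "001"

def find_inversions(lines):
    """Return (cluster, proc) pairs whose execute event appears
    before their submit event in file order."""
    submitted = set()
    inversions = []
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue  # continuation lines ("(0) Job was not checkpointed.", "...") are skipped
        code, cluster, proc = m.groups()
        job = (int(cluster), int(proc))
        if code == SUBMIT:
            submitted.add(job)
        elif code == EXECUTE and job not in submitted:
            inversions.append(job)
    return inversions

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        for cluster, proc in find_inversions(f):
            print(f"execute before submit: {cluster}.{proc}")
```

Run against the monolithic log, this would flag 838.0 for the excerpt above, since its 001 event precedes its 000 event in the file.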

Where can I go next to get more detail and track down the root cause? Perhaps it's some networking configuration issue in my data center, and therefore fixable. Help!
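For reference, every node's submit description points at that one shared log, along these lines (the paths and names here are illustrative, not my real ones):

```
executable = myjob
log        = /shared/condor/all_jobs.log
output     = myjob.$(Cluster).out
error      = myjob.$(Cluster).err
queue
```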

Thanks,
Greg Langmead
Language Weaver, Inc.