[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Dagman BAD EVENT: executed before submitted



On Mon, 25 Jan 2010, Greg Langmead wrote:

We've been seeing this behavior in Condor 7.2 for the last several months (on Red Had FC8), and now I'm testing 7.4 (on Red Hat FC12) and seeing it again. It is very sensitive to the exact hardware that is playing the role of collector. Some machines hit this problem in 100% of dags, others in 0%. All of our machines are identical, or intended to be so. We've worked around it by abandoning the use of some machines as collectors.

The symptom is that the dagman job is itself evicted from the collector. The eviction is logged in the .dagman.log file just like any other eviction:

I think that the fact that DAGMan is getting evicted is just a result of the assertion.

In the .dagman.out I see this -- the job that caused the problem is 838 and I snipped non-838-related activity:

==
01/25 13:25:38 From submit: Submitting job(s).
01/25 13:26:08 From submit: Logging submit event(s).
01/25 13:26:08 From submit: 1 job(s) submitted to cluster 838.
01/25 13:26:08  assigned Condor ID (838.0)

[snip]

01/25 13:26:08 Event: ULOG_EXECUTE for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) executing, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 Number of idle job procs: 2
01/25 13:26:08 Event: ULOG_JOB_TERMINATED for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) ended, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3119 in file dag.cpp
==

Note that the ULOG_EXECUTE event came in with no preliminary ULOG_SUBMIT. I see another manifestation of this in the monolithic log file that all my .condor files set as the "log = " destination:

==
001 (838.000.000) 01/25 13:25:49 Job executing on host: <192.168.131.114:52465>
...
005 (838.000.000) 01/25 13:25:55 Job terminated.
       (1) Normal termination (return value 0)
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
...
000 (838.000.000) 01/25 13:25:38 Job submitted from host: <192.168.129.17:53465>
   DAG Node: MyJob
...
==

The executing message is there before the submitted message. The timestamps look OK but they came into the log file out of order. The job was very small and took just 6 seconds.

Where can I go next to get more details to find the root cause? Maybe it's some kind of networking configuration in my data center, and therefore fixable. Help!

There are at least two issues here:
1) Why are the events written out of order?
2) Why does this cause DAGMan to assert?

As far as #1 goes, the reason that this can happen is that the submit event is written by condor_submit (after the job is actually submitted, so that you don't get a submit event if things go wrong), and the executing and terminated events are written by either the schedd or the shadow (I can't remember which off the top of my head). Anyhow, because events are written by different processes, there's a race condition for which one gets written first. In the past we've seen cases where the executing event gets written before the submit event, but it's unusual for the terminated event to come before the submit event. (I guess that's possible in your case because the jobs are so short.)

For case #2, DAGMan is supposed to tolerate this if you have the default value for DAGMAN_ALLOW_EVENTS. Can you send me the DAG file, node job log file(s), and dagman.out file for one of the cases where DAGMan asserted? I'd like to take a look at this, and see if that case is not handled correctly.

Kent Wenger
Condor Team