[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] condor-g job status unknown caused dagman to exit



To the condor experts,

I'm using Condor-G, and the job below (see logs) caused a large DAG to exit unexpectedly this morning. I understand the concept of BAD EVENTS (in this case, the job started executing after it was supposed to be done), but I want to know how to prevent this from happening.

From what I can see, the job's remote status became unknown (what causes this? I see it happen a lot, actually), and Condor eventually removed the job via its PeriodicRemove expression. But when the job later reported back, instead of ignoring the stale events, DAGMan processed them, got confused, and exited.
It seems like a bug to me.
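For reference, the periodic expressions on the job, as they show up in the log events below, would correspond to submit-file lines roughly like these (the rest of the submit file omitted):

    periodic_release = (NumGlobusSubmits <= 7)
    periodic_remove  = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)

So the job is removed once it has been marked running for more than 12000 seconds, which is exactly what fired here while the remote status was stale.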

Thanks
Peter


job log file:

000 (935774.000.000) 04/19 16:55:14 Job submitted from host: <10.0.10.39:56607>
    DAG Node: 4bbn-01360

017 (935774.000.000) 04/19 18:14:17 Job submitted to Globus
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
    JM-Contact: https://ff-grid2.unl.edu:38818/1135/1271715184/
    Can-Restart-JM: 1

027 (935774.000.000) 04/19 18:14:17 Job submitted to grid resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
    GridJobId: gt2 ff-grid2.unl.edu/jobmanager-pbs https://ff-grid2.unl.edu:38818/1135/1271715184/

001 (935774.000.000) 04/19 20:25:12 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs

029 (935774.000.000) 04/19 20:54:49 The job's remote status is unknown

020 (935774.000.000) 04/20 01:51:03 Detected Down Globus Resource
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs

026 (935774.000.000) 04/20 01:51:03 Detected Down Grid Resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs

019 (935774.000.000) 04/20 08:52:40 Globus Resource Back Up
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs

025 (935774.000.000) 04/20 08:52:40 Grid Resource Back Up
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs

012 (935774.000.000) 04/20 09:03:34 Job was held.
Globus error 31: the job manager failed to cancel the job as requested
        Code 2 Subcode 31

013 (935774.000.000) 04/20 09:07:50 Job was released.
The job attribute PeriodicRelease expression '(NumGlobusSubmits <= 7)' evaluated to TRUE

009 (935774.000.000) 04/20 09:08:17 Job was aborted by the user.
The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE

030 (935774.000.000) 04/20 09:08:42 The job's remote status is known again

001 (935774.000.000) 04/20 09:08:42 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs

009 (935774.000.000) 04/20 09:08:42 Job was aborted by the user.
The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE




dagman.out file:


04/19 16:56:13 Event: ULOG_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 18:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 18:14:20 Event: ULOG_GRID_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 20:25:14 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)

04/19 20:54:52 Event: ULOG_JOB_STATUS_UNKNOWN for Condor Node 4bbn-01360 (935774.0)

04/20 01:51:07 Event: ULOG_GLOBUS_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)

04/20 01:51:07 Event: ULOG_GRID_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)

04/20 08:52:41 Event: ULOG_GLOBUS_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)

04/20 08:52:41 Event: ULOG_GRID_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)

04/20 09:03:35 Event: ULOG_JOB_HELD for Condor Node 4bbn-01360 (935774.0)

04/20 09:07:53 Event: ULOG_JOB_RELEASED for Condor Node 4bbn-01360 (935774.0)

04/20 09:08:23 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Node 4bbn-01360 job completed
04/20 09:08:23 Unable to get log file from submit file ../.dag/4bbn.ca (node 4bbn-01360); using default (/opt/osg-shared/home/site/doherty/tmp/phaser/clean/4bbn/group/nodes/../.dag/4bbn.dag.nodes.log)
04/20 09:08:23 Running POST script of Node 4bbn-01360...

04/20 09:08:43 Event: ULOG_JOB_STATUS_KNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) executing, total end count != 0 (1)
04/20 09:08:43 ERROR: aborting DAG because of bad event (BAD EVENT: job (935774.0.0) executing, total end count != 0 (1))
04/20 09:08:43 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)
04/20 09:08:43 Continuing with DAG in spite of bad event (BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)) because of allow_events setting
04/20 09:08:43 Aborting DAG...