[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] condor-g job status unknown caused dagman to exit



To the condor experts,

I'm using Condor-G, and the job below (see logs) caused a large DAG to exit unexpectedly this morning. I understand the concept of BAD EVENTS (in this case, the job started executing after it was supposed to be done), but I want to know how to prevent this from happening.

From what I can see, the job's remote status became unknown (what causes this? I see it happen a lot, actually), and Condor eventually removed the job via its PeriodicRemove expression. But when the job later reported back, instead of ignoring the stale events, DAGMan processed them, got confused, and exited.
It seems like a bug to me.
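For reference, the periodic expressions on the job, as they show up in the log events below, would correspond to submit-file lines roughly like these (the rest of the submit file omitted):

    periodic_release = (NumGlobusSubmits <= 7)
    periodic_remove  = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)

So the job is removed once it has been marked running for more than 12000 seconds, which is exactly what fired here while the remote status was stale.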

Thanks
Peter


job log file:

000 (935774.000.000) 04/19 16:55:14 Job submitted from host: <10.0.10.39:56607>
    DAG Node: 4bbn-01360

017 (935774.000.000) 04/19 18:14:17 Job submitted to Globus
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
    JM-Contact: https://ff-grid2.unl.edu:38818/1135/1271715184/
    Can-Restart-JM: 1

027 (935774.000.000) 04/19 18:14:17 Job submitted to grid resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
    GridJobId: gt2 ff-grid2.unl.edu/jobmanager-pbs https://ff-grid2.unl.edu:38818/1135/1271715184/

001 (935774.000.000) 04/19 20:25:12 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs

029 (935774.000.000) 04/19 20:54:49 The job's remote status is unknown

020 (935774.000.000) 04/20 01:51:03 Detected Down Globus Resource
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs

026 (935774.000.000) 04/20 01:51:03 Detected Down Grid Resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs

019 (935774.000.000) 04/20 08:52:40 Globus Resource Back Up
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs

025 (935774.000.000) 04/20 08:52:40 Grid Resource Back Up
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs

012 (935774.000.000) 04/20 09:03:34 Job was held.
Globus error 31: the job manager failed to cancel the job as requested
        Code 2 Subcode 31

013 (935774.000.000) 04/20 09:07:50 Job was released.
The job attribute PeriodicRelease expression '(NumGlobusSubmits <= 7)' evaluated to TRUE

009 (935774.000.000) 04/20 09:08:17 Job was aborted by the user.
The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE

030 (935774.000.000) 04/20 09:08:42 The job's remote status is known again

001 (935774.000.000) 04/20 09:08:42 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs

009 (935774.000.000) 04/20 09:08:42 Job was aborted by the user.
The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE




dagman.out file:


04/19 16:56:13 Event: ULOG_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 18:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 18:14:20 Event: ULOG_GRID_SUBMIT for Condor Node 4bbn-01360 (935774.0)

04/19 20:25:14 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)

04/19 20:54:52 Event: ULOG_JOB_STATUS_UNKNOWN for Condor Node 4bbn-01360 (935774.0)

04/20 01:51:07 Event: ULOG_GLOBUS_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)

04/20 01:51:07 Event: ULOG_GRID_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)

04/20 08:52:41 Event: ULOG_GLOBUS_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)

04/20 08:52:41 Event: ULOG_GRID_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)

04/20 09:03:35 Event: ULOG_JOB_HELD for Condor Node 4bbn-01360 (935774.0)

04/20 09:07:53 Event: ULOG_JOB_RELEASED for Condor Node 4bbn-01360 (935774.0)

04/20 09:08:23 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Node 4bbn-01360 job completed
04/20 09:08:23 Unable to get log file from submit file ../.dag/4bbn.ca (node 4bbn-01360); using default (/opt/osg-shared/home/site/doherty/tmp/phaser/clean/4bbn/group/nodes/../.dag/4bbn.dag.nodes.log)
04/20 09:08:23 Running POST script of Node 4bbn-01360...

04/20 09:08:43 Event: ULOG_JOB_STATUS_KNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) executing, total end count != 0 (1)
04/20 09:08:43 ERROR: aborting DAG because of bad event (BAD EVENT: job (935774.0.0) executing, total end count != 0 (1))
04/20 09:08:43 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)
04/20 09:08:43 Continuing with DAG in spite of bad event (BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)) because of allow_events setting
04/20 09:08:43 Aborting DAG...