Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] Dagman lost track of one of it's nodes

Date: Fri, 19 Nov 2010 10:38:54 -0500
From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
Subject: [Condor-users] Dagman lost track of one of it's nodes

I was running a large DAG yesterday (100K nodes, on about 800 worker nodes). After all the jobs finished, the dagman process was still in the queue, but all the nodes were done. A little digging revealed the answer. Dagman thought one node was still running.


From the dagman.out file:
11/19/10 09:48:28 Pending DAG nodes:
11/19/10 09:48:28 Node e111_4805t3-2gmva1, Condor ID 9818391, status STATUS_SUBMITTED



and looking at the user job log:

000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <[ip removed]:40621>
001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <[ip removed]:37699?CCBID=[ip removed]:9639#6840>
006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728

The job never wrote out status 005 when it exited.
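For anyone who wants to check a large user job log for the same symptom, here is a minimal sketch in Python. It assumes the event header format shown in the excerpt above (a three-digit event code, then the job ID in parentheses, then a date and time), and the function name `jobs_without_terminate` is made up for illustration; it is not part of Condor. It only reports jobs that have a 000 (submitted) event but no 005 (terminated) event, so jobs that ended some other way would be listed too.

```python
import re

# Header of a user-log event as it appears in the excerpt above, e.g.
# "001 (9818391.000.000) 11/18 18:03:49 Job executing on host: ..."
# Assumption: every event starts with this "code (cluster.proc.subproc) date time" header.
EVENT_RE = re.compile(
    r"(?<!\d)(\d{3}) \((\d+)\.(\d+)\.\d+\) \d{2}/\d{2} \d{2}:\d{2}:\d{2}"
)


def jobs_without_terminate(log_text):
    """Return job IDs ("cluster.proc") with a 000 event but no 005 event."""
    submitted = set()
    terminated = set()
    for match in EVENT_RE.finditer(log_text):
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        if code == "000":
            submitted.add(job)
        elif code == "005":
            terminated.add(job)
    # Sort numerically by cluster, then proc, before formatting.
    return ["%d.%d" % job for job in sorted(submitted - terminated)]


if __name__ == "__main__":
    import sys

    # Usage: python check_log.py /path/to/user_job.log
    with open(sys.argv[1]) as log_file:
        for job_id in jobs_without_terminate(log_file.read()):
            print(job_id)
```

Run against the log excerpt above, this would list 9818391.0, since that job has 000, 001 and 006 events but no 005.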

But the Schedd saw it exit:

SchedLog:
11/18/10 18:03:47 (pid:4586) Starting add_shadow_birthdate(9818391.0)
11/18/10 18:03:47 (pid:4586) Started shadow for job 9818391.0 on machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840> for user@xxxxxxxxxxx, (shadow pid = 7471)
11/18/10 18:11:01 (pid:4586) Shadow pid 7471 for job 9818391.0 exited with status 100


ShadowLog:
11/18/10 18:03:47 Initializing a VANILLA shadow for job 9818391.0
11/18/10 18:03:47 (9818391.0) (7471): Request to run on machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840> was ACCEPTED
11/18/10 18:11:01 (9818391.0) (7471): Job 9818391.0 terminated: exited with status 0
11/18/10 18:11:01 (9818391.0) (7471): **** condor_shadow (condor_SHADOW) pid 7471 EXITING WITH STATUS 100

The job wrote out its stdout, stderr, and other job-specific files okay. Why did this job get orphaned?

I've seen similar things happen with other recent jobs.

I'm running Condor 7.5.4. My gut instinct is that it's related to running things over NFS. But it's been just one or two jobs in these 100K-node DAGs that seem to exhibit this behavior, and I haven't really seen any other odd behavior. I don't see anything in the log files that shows an error related to writing out the data.


-Peter

Follow-Ups:
- Re: [Condor-users] Dagman lost track of one of it's nodes
  - From: Cathrin Weiss
