
Re: [condor-users] Dagman stalling with shadow exception messages?



On Wednesday 07 April 2004 09:37, Dan Bradley wrote:
> Michael S. Root wrote:
> >The only workaround seems to be to delete the dag
> >job from the queue and re-submit the remaining jobs (which then proceed
> > to run fine).
>
> Do you mean that you are having to manually submit each of the remaining
> jobs?  DAGMan should be creating a rescue DAG when you remove it from
> the queue (with condor_rm).  You can run the rescue DAG and DAGMan will
> submit jobs that were not successfully finished in the first attempt.

I haven't tried to use the rescue DAG yet.  When we first started using 
Condor, I wrote a python module that allows us to easily submit jobs to 
our farm from our preexisting and somewhat extensive tool set.  These 
existing tools are already capable of determining what needs to be done, 
and will automatically skip existing output files.  The long and short of 
it is that it's easy for me to kill and restart a DAG where it left off, 
but if it gets into the weird 'stuck' state right after I leave, a whole 
night's worth of rendering gets lost...
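
For what it's worth, the guts of that module are pretty simple.  Roughly 
like the following sketch (the names, frame/file layout and output naming 
here are made up for the example; the real module has more error handling 
and talks to our in-house tools):

    import os

    def frames_to_render(frames, output_dir):
        """Keep only the frames whose output file doesn't exist yet."""
        todo = []
        for frame in frames:
            out = os.path.join(output_dir, "frame_%04d.exr" % frame)
            if not os.path.exists(out):      # skip work that's already done
                todo.append(frame)
        return todo

    def write_dag(dag_path, frames, submit_file):
        """Write one DAG node (JOB + VARS lines) per remaining frame."""
        dag = open(dag_path, "w")
        for frame in frames:
            dag.write("JOB frame%04d %s\n" % (frame, submit_file))
            dag.write("VARS frame%04d frame=\"%d\"\n" % (frame, frame))
        dag.close()

    def submit(dag_path):
        # hand the DAG off to DAGMan
        os.system("condor_submit_dag %s" % dag_path)

So re-running the same tool after a failure naturally resubmits only the 
work that is still missing.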

>
> Of course, the real problem is why the DAG is not completing in the
> first place, but I just want to make sure everything else is sane.  If
> DAGMan is in some crazy state where it can't even generate the rescue
> DAG, then this is an important point.

I checked the dagman.out log for that job that got stuck, and it appears to 
have died cleanly (I killed it with condor_rm):
------------------
4/6 21:05:20 Event: ULOG_SHADOW_EXCEPTION for Condor Job st006_comp_tk25__296-300 (22190.0.0)
4/6 21:05:40 Received SIGUSR1
4/6 21:05:40 Aborting DAG...
4/6 21:05:40 Writing Rescue DAG to /net/volatile/condor/mike_0052/CondorDAG.rescue...
4/6 21:05:40 Removing submitted jobs...
4/6 21:05:40 Removing any/all submitted Condor jobs...
4/6 21:05:40 Executing: condor_rm -const 'DAGManJobID == "22170.0"'
4/6 21:05:40 Running: condor_rm -const 'DAGManJobID == "22170.0"'
4/6 21:05:40 **** condor_scheduniv_exec.22170.0 (condor_DAGMAN) EXITING WITH STATUS 1
------------------
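
So it did write the rescue DAG before exiting.  If I understand the docs 
correctly, restarting from that point should just be a matter of 
resubmitting that file, something like

    condor_submit_dag /net/volatile/condor/mike_0052/CondorDAG.rescue

but since I haven't actually tried the rescue route yet, treat that 
invocation as my best guess.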

> >4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling
> >handler (command_handler)
> >4/6 21:00:02 Error: can't find resource with capability
> >(<192.168.1.111:32771>#7698602094)
> >----------------------
> >Note: That last line puzzles me.  I don't know what the #7698602094
> > refers to.
>
> This is perfectly normal (both the message and the puzzlement).
> Glancing at your two log files, it looks to me like the times don't
> match up, so we can't see what happened on the execution side when the
> shadow lost contact with the starter.

True, the times don't match exactly.  The logs have all wrapped by now, so 
if it happens again, I'll try and be more precise.  The same messages in 
the dagman.out log and the ShadowLog on the submit host were repeated many 
times, so I just grabbed one that looked close.

> Whatever may have happened to cause the run attempt to fail, this
> shouldn't have caused DAGMan to get stuck, but if you are seeing a
> correlation, then there may be a problem.

By "run attempt failing", do you mean Condor failing to start the remote 
job, or our render executable failing when it runs?  I've looked at all the 
logs, and I haven't found any correlation between one of our executables 
failing and dagman getting into this state.  When our executables fail, 
Condor does the right thing: it obeys the on_exit_remove rule and requeues 
the job accordingly.
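
For context, our submit descriptions use something along these lines (not 
our exact expression, but the same idea -- keep the job in the queue unless 
it exited cleanly):

    # requeue on any signal or non-zero exit; only leave the queue on success
    on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)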

> Is there any chance that the disk containing the job state log file(s)
> was ever full?

I've looked into that.  I have a script that tells me the available disk 
space on all of our machines (desktop & renderfarm); all of them have at 
least several gigabytes free locally, and no partition on our data server 
has less than 15 GB free.
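
(The check itself is nothing fancy -- essentially the following, pointed at 
the relevant mount points on each machine; the mount list here is made up:)

    import os

    def free_mb(path):
        """Free space, in MB, on the filesystem containing 'path'."""
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize // (1024 * 1024)

    for mount in ["/", "/net/volatile", "/scratch"]:
        print("%s: %d MB free" % (mount, free_mb(mount)))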

-Mike

> --Dan

Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>