Re: [Condor-users] how to kill job when output dir removed?
- Date: Wed, 17 Jan 2007 11:14:28 -0600
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] how to kill job when output dir removed?
Dr Ian C. Smith wrote:

> Unfortunately, the DAGMan bug in 6.8 (mentioned again here recently)
> is a showstopper for us.
Just to be clear, there are two manifestations of this bug. In 6.8, it
causes the DAGMan job to go on hold when it exits abnormally. In 6.9.1,
it causes the schedd to crash :(
The problem will be fixed in 6.8.4 and 6.9.2.
> You say that Condor 6.8 will continue to attempt to write to the
> submit host if the filesystem is full. Does Condor actually detect
> this error separately? In other words, if the directory is missing
> (or unwritable), will it put the job on hold or just keep trying?
In 6.8, Condor detects the error writing output, but it doesn't put the
job on hold. The job goes back to the idle state and will try to run
again. In the common case where the failure to write the output was
because the initial working directory had been deleted, the job will go
on hold when it tries to run again. However, if you are writing your
output to some other directory, and there are no problems fetching input
files, then the job will run again, and possibly fail again when it
tries to write output.
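
If the retry cycle in 6.8 is a problem, one possible workaround (a
sketch, not something from this thread, and assuming the NumJobStarts
job attribute is available in your version) is a periodic_hold
expression in the submit description that puts the job on hold after a
few restarts; the threshold of 3 is illustrative:

    # Sketch: hold the job once it has started more than 3 times,
    # e.g. because it keeps failing to write its output.
    periodic_hold = NumJobStarts > 3
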
In 6.9.1, the job goes on hold when it fails to write output.
Just to make matters more confusing (sorry), the standard, PVM, and
local universes have not yet been incorporated into the new
hold-on-error regime, so jobs in these universes currently exhibit the
traditional behavior of going back to the idle state and trying to run
again when they hit errors caused by missing input or output
files/directories.
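
For these universes, and given the subject of this thread, a sketch of
one way to kill such a job outright rather than hold it is a
periodic_remove expression in the submit description (again assuming
the NumJobStarts attribute exists in your version; the threshold is
illustrative):

    # Sketch: remove (kill) the job after more than 3 starts, on the
    # assumption that repeated restarts mean the output dir is gone.
    periodic_remove = NumJobStarts > 3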