
Re: [Condor-users] how to kill job when output dir removed ?

Dr Ian C. Smith wrote:
> Unfortunately the DAGMan bug in 6.8 (mentioned again here recently) is a
> show-stopper for us.

Just to be clear, there are two manifestations of this bug. In 6.8, it causes the DAGMan job to go on hold when DAGMan exits abnormally. In 6.9.1, it causes the schedd to crash :(

The problem will be fixed in 6.8.4 and 6.9.2.

> You say that Condor 6.8 will continue to attempt to write to the submit host
> if the filesystem is full. Does Condor actually detect this error
> separately? In other words, if the directory is missing (or unwritable),
> will it put the job on hold or just keep trying?

In 6.8, Condor detects the error when writing output, but it doesn't put the job on hold; the job goes back to the idle state and tries to run again. In the common case where the write failed because the initial working directory had been deleted, the job will go on hold when it next tries to run. However, if you are writing your output to some other directory and there are no problems fetching input files, the job will run again, and possibly fail again when it tries to write output.
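
If you want to cap that retry loop in 6.8, one workaround (a sketch of mine, not something from this thread) is a periodic_hold expression in the submit file. The executable and output names below are hypothetical, the threshold is arbitrary, and note that the run-count attribute in the 6.8 series is JobRunCount (later versions call it NumJobStarts):

  # a minimal sketch; executable and output path are hypothetical
  universe      = vanilla
  executable    = my_job
  output        = results/out
  # hold the job after it has started more than 3 times
  periodic_hold = (JobRunCount > 3)
  queue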

In 6.9.1, the job goes on hold when it fails to write output.
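
Since the subject line asks how to kill such a job rather than hold it, a periodic_remove expression can take it the rest of the way. Again this is my sketch, not built-in behavior: JobStatus == 5 is the held state, EnteredCurrentStatus is when the job entered it, and the 600-second grace period is arbitrary:

  # remove a job that has been on hold for more than 10 minutes
  periodic_remove = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 600)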

Just to make matters more confusing (sorry), the standard, PVM, and local universes have not yet been incorporated into the new hold-on-error regime, so jobs in these universes still exhibit the traditional behavior: when they hit errors caused by missing input files or directories, or by failures writing output, they go back to the idle state and try to run again.
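
Until those universes catch up, one stopgap (again a sketch, with an arbitrary threshold) is to sweep runaway jobs out of the queue by hand with a constrained condor_rm:

  % condor_rm -constraint 'JobRunCount > 5'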

--Dan