
Re: [Condor-users] Vanilla - jobs disappear without completing.

Rob Stevenson wrote:
> Dear all,
> Do you know if there is a file size limit for condor runs? If so,
> is there a line that can be added to submit files to increase this?
> (Something to do with "ImageSize"?). Or perhaps I've missed the mark
> completely?
> Our pool is set up to not allow preemption, as we are in the vanilla
> universe without the ability to compile with condor-specific libraries.
> Recently I've been seeing a few occurrences of a problem whereby some
> jobs that were running seem to be kicked off their current processor and
> then either disappear from the queue, stay in a permanent state of "H",
> or are still reported as running even though all but one or two files
> have been deleted from the /execute/dir[xxxx] directory.
> The jobs haven't successfully completed, their output isn't copied back
> to its original location, and there doesn't appear to be any log output
> to give me a clue.
> The only thing that seems to be common between the failures at the
> moment is that the jobs have all been running for more than 4 or 5 days
> and all were taking up near, or in excess of 2GB of space in the execute
> directory: ./execute/dir[xxxx].
> Does anyone have any ideas? Or any advice on how to increase logging so
> that I can catch whatever is happening?
> Many thanks to everyone for reading,
> Rob Stevenson - Systems Administrator
> IS Services

Doesn't sound familiar. If you want to figure out what was happening to
the jobs, you should check the Start[er]Log files on the execute node. If
there's not much info there, you can raise the debug level to D_FULLDEBUG
and monitor the logs. Also, you should see some mention of the jobs in
the SchedLog on the submit machine. If they completed and were removed
from the queue on purpose, you should be able to find them in the history
file, typically SPOOL/history, accessible with the condor_history command.
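As a rough sketch, raising the debug level usually means adding lines like
the following to the local config (e.g. condor_config.local) and reconfiguring;
the exact file location and which daemons you need depend on your installation,
so treat this as an illustration rather than a recipe:

```
# condor_config.local (a sketch -- adjust paths/daemons for your pool)

# On the execute node: verbose logging for the startd and starter,
# which write StartLog and StarterLog.
STARTD_DEBUG  = D_FULLDEBUG
STARTER_DEBUG = D_FULLDEBUG

# On the submit machine: verbose logging for the schedd (SchedLog).
SCHEDD_DEBUG  = D_FULLDEBUG
```

After editing, run condor_reconfig on the affected machine so the daemons
pick up the change, and use condor_history (optionally with -l for the full
ClassAd of a given cluster) to check whether the vanished jobs actually left
a completion record.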

Good luck.