
Re: [Condor-users] condor w/o shared filesystem



On 07/19/2011 12:32 PM, Ian Chesal wrote:
On Monday, July 18, 2011 at 9:14 PM, Rita wrote:
One feature we would like to see is having Condor jobs run completely
independently of the scheduler -- specifically with respect to file systems.

Currently, the job log needs to be shared between the scheduler and
execution hosts.
This is not true. Only scheduler-side processes require access to the
job log. Perhaps I'm misunderstanding your request?

Condor will try to be smart about using shared filesystems if
FILESYSTEM_DOMAIN is the same between the scheduler and the execute
nodes. You can set this to $(FULL_HOSTNAME) on every machine to let
Condor know that none of your machines share a filesystem.
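
Roughly, the no-shared-filesystem setup looks like this (a sketch only;
the executable and file names below are placeholders):

    # condor_config.local on the schedd and on every execute node
    FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

    # job.sub -- use Condor's file transfer instead of a shared filesystem
    executable              = my_job.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat
    output                  = job.out
    error                   = job.err
    log                     = job.log
    queue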
If the scheduler host dies for any reason, the execution of the jobs halts.
Condor is highly fault tolerant and the disappearance of a scheduler can
be tolerated. The default behaviour can be changed if you don't like it.
Specifically, you want to look at MAX_CLAIM_ALIVES_MISSED
(http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#17538)
and ALIVE_INTERVAL
(http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:AliveInterval)
-- these control how long a condor_startd will keep a job running when
the scheduler it came from is no longer able to connect to the startd.
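
As a rough sketch (the values are illustrative only, not recommendations,
and the daemon that reads each knob may vary between versions):

    # schedd side: how often a keepalive is sent for each claim (seconds)
    ALIVE_INTERVAL = 300

    # startd side: how many keepalives may be missed before the claim is
    # considered broken and the job is vacated
    MAX_CLAIM_ALIVES_MISSED = 6

    # With these values a startd keeps a claimed job running for roughly
    # 300 * 6 = 1800 seconds after last hearing from the schedd.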

It would be nice to place all files in spool/ (this includes output,
error, and the job itself, if possible). I am not sure if there is a
better way.
You can have this happen if you submit with the -remote option -- Condor
will spool everything in the spool directory and hold it there for you
until you fetch the files with condor_transfer_data.
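
Something like the following (the schedd name and the cluster id 42 are
placeholders; condor_submit prints the real cluster id):

    # submit to a remote schedd, spooling all input files onto it
    condor_submit -remote remote.schedd.example.com job.sub

    # ...after the job completes, pull output/error/log back from the spool
    condor_transfer_data -name remote.schedd.example.com 42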

Regards,
- Ian

If you're not using -remote/-name now, you likely only need -spool.
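
For example (42 again stands in for whatever cluster id the submit prints):

    condor_submit -spool job.sub
    # ...when the job finishes:
    condor_transfer_data 42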

Best,


matt