
[condor-users] file transferring (or not) in vanilla and mpi universes



I am running Condor 6.6.1 (6.6.0 had the same "problems") in my pool:
Linux 7.3 cluster comprising:
* dual headnode
* 8x single workernodes

The headnode is configured as the dedicated scheduler for the cluster; it is
set up (as in the manual) to run opportunistic jobs as well. The headnode
is also the master for the Condor pool (negotiator and collector run here).

There are no other schedulers in the pool.

Start daemons (condor_startd) are currently running on all nodes
(including the headnode).

Users on the headnode do not necessarily have a user account on the
workers.

Some filestore is shared between the headnode and the workers.
* /home/* (including headnode users and condor)
* /opt/<some> (including condor and mpi)

I have the following settings on my headnode condor_config.local:
FILESYSTEM_DOMAIN = ibmcluster
UID_DOMAIN = $(HOSTNAME).dl.ac.uk

I have the following settings in my worker nodes' condor_config.local:
FILESYSTEM_DOMAIN = ibmcluster
UID_DOMAIN = $(HOSTNAME).ibmcluster

I have then done some experiments to get to grips with
the file transferring options for vanilla and mpi universes.

I tried varying the following:

* whether the lines:
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
  were commented out or not

* whether the log, output and error files existed before
  submitting

* if the log, output and error files already existed, whether
  they had rw (666) permissions for group+world as well as owner.
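
A minimal sketch of how the pre-existing files for such a test matrix could be prepared before each submit. It assumes Python is available on the headnode; the helper names `prepare_job_files` and `permission_bits` are illustrative and were not part of the original experiments.

```python
import os
import stat


def prepare_job_files(directory, ids, mode=0o666):
    """Create an empty log plus <id>.output and <id>.error files with the given mode.

    `directory` corresponds to $(UNIVERSE) in the submit files and `ids`
    to the $(PROCESS) or $(NODE) numbers (hypothetical helper).
    """
    os.makedirs(directory, exist_ok=True)
    names = ["log"]
    for job_id in ids:
        names += [f"{job_id}.output", f"{job_id}.error"]
    created = []
    for name in names:
        path = os.path.join(directory, name)
        # Append mode creates the file if missing without truncating an existing one.
        with open(path, "a"):
            pass
        # chmod explicitly: a mode given at creation time would be masked by the umask.
        os.chmod(path, mode)
        created.append(path)
    return created


def permission_bits(path):
    """Return only the permission bits of a file, e.g. 0o644."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `prepare_job_files("vanilla", range(4), 0o644)` would set up the "exist 644 perms" case for a 4-process vanilla job; deleting the files instead gives the "don't exist" case.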

As I have a shared filesystem, I expected to be able either to use
file transfer or to leave the files where they were.

Here are the results:
=============================================
VANILLA

File transfer requested
    log, etc files: exist 666 perms - SUCCESS

    log, etc files: exist 644 perms - SUCCESS

    log, etc files: don't exist - SUCCESS

File transfer not requested
    log, etc files: exist 666 perms - SUCCESS

    log, etc files: exist 644 perms - FAILURE
        (log file:)
        Error from starter on node3.ibmcluster:
        Failed to open standard output file
        '/home/jktest/test/vanilla/2.output':
        Permission denied (errno 13)
        [ad inf]

    log, etc files: don't exist - FAILURE
        First it creates all the files correctly, but leaves them as
        644, then it fails as follows:
        (log file:)
        Error from starter on node3.ibmcluster:
        Failed to open standard output file
        '/home/jktest/test/vanilla/2.output':
        Permission denied (errno 13)
        [ad inf]

OK, so far no major surprises, except that it'd be nice
if Condor didn't create files with permissions that give it
no chance of writing to them later. BTW it made no
difference whether the user had an account on the other machine (presumably
because I had set the 2 UID_DOMAINs to be separate).
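
The vanilla results are consistent with the job on the worker running as a different user from the submitter, so that only world-writable files can be opened. That cause is an assumption, not something confirmed above. Under that assumption, a pre-submit check could look like the sketch below; `files_lacking_world_write` is a hypothetical helper name.

```python
import os
import stat


def files_lacking_world_write(paths):
    """Return the existing paths whose 'other' write bit is not set.

    Assumption: the job runs under a UID that is neither the file owner
    nor in the file's group, so only the 'other' bits apply to it.
    """
    blocked = []
    for path in paths:
        if not os.path.exists(path):
            # Missing files are skipped; they are a separate test case.
            continue
        mode = os.stat(path).st_mode
        if not mode & stat.S_IWOTH:
            blocked.append(path)
    return blocked
```

For the "exist 644 perms" case this would list every output and error file, matching the "Permission denied (errno 13)" failures; for the "exist 666 perms" case it would return an empty list.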

I then expected my MPI programs to behave in a similar way,

...


=============================================
MPI

File transfer requested
    log, etc files: exist 666 perms - FAILURE
    log, etc files: exist 644 perms - FAILURE
    log, etc files: don't exist - FAILURE
        All fail in the same way: the job appears to have succeeded
        (according to the email message) but the output files are not right.
        Condor initially sets 0.output and 0.error to empty,
            leaving permissions as before,
        and eventually creates:
---xr--r--    1 jmk27    dlarcg          0 Feb 18 14:25 #MpInOdE#.error
-rw---x---    1 jmk27    dlarcg         51 Feb 18 14:25 #MpInOdE#.output
--w---x---    1 jmk27    dlarcg         51 Feb 18 14:25 #MpInOdE#.output
-r----x---    1 jmk27    dlarcg         51 Feb 18 14:25 #MpInOdE#.output
-r----x--T    1 jmk27    dlarcg         51 Feb 18 14:25 #MpInOdE#.output
--w---x--T    1 jmk27    dlarcg         51 Feb 18 14:25 #MpInOdE#.output
        and then completes. The above output file is consistent with
        one of the jobs having completed successfully (presumably all
        write to the same file)


File transfer not requested
    log, etc files: exist 666 perms - SUCCESS

    log, etc files: exist 644 perms - FAILURE
    as analogous vanilla test above

    log, etc files: don't exist - FAILURE
    as analogous vanilla test above

=============================================

submit files

-----------------------------
UNIVERSE = vanilla
EXECUTABLE = vanillatest
REQUIREMENTS = ( OpSys == "LINUX" )

LOG = $(UNIVERSE)/log
ERROR = $(UNIVERSE)/$(PROCESS).error
INPUT = $(UNIVERSE)/$(PROCESS).input
OUTPUT = $(UNIVERSE)/$(PROCESS).output

# Following 2 lines are not needed for a shared filesystem
# as long as either:
# a) output and error files already exist with 666 permissions, or
# b) (presumably) same uid_domain
#SHOULD_TRANSFER_FILES = YES
#WHEN_TO_TRANSFER_OUTPUT = ON_EXIT

QUEUE 4
-----------------------------
UNIVERSE = mpi
EXECUTABLE = mpitest
REQUIREMENTS = ( OpSys == "LINUX" )

LOG = $(UNIVERSE)/log
ERROR = $(UNIVERSE)/$(NODE).error
INPUT = $(UNIVERSE)/$(NODE).input
OUTPUT = $(UNIVERSE)/$(NODE).output

MACHINE_COUNT = 4

SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT

QUEUE
-----------------------------

So, finally (sorry!):

* is the above behaviour what people would expect for the semantics
  of file transfer and non file transfer modes?

* Should there be a difference in this between mpi and vanilla universes?

* Why are the 0.output and 0.error files created, but not the others,
  and why aren't they written to?

Cheers

JK

John Kewley