
[Condor-users] Idle Jobs & an Authentication Issue



Hello - I am setting up a Condor pool to do a demo for our Grid class this
Friday afternoon.  I am currently experimenting with 3 laptops in my pool.
 All 3 laptops see each other.

I have 2 issues:
1)  Jobs that are submitted to run locally remain idle for about 20
minutes before executing.

2)  Jobs that are submitted to run remotely receive a warning that the
output and error files are not writable by condor.

Below is the lab handout, which includes the changes I have made so far to
the config files.

                        Condor Scheduler
                        Install Procedure


1.  As user root
    Turn off Torque
        $ /etc/init.d/pbs stop
        $ chkconfig --level 3 pbs off
        $ chkconfig --level 4 pbs off
        $ chkconfig --level 5 pbs off
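
    A quick optional check that PBS will no longer start at boot (chkconfig
    ships with RHEL-style systems; runlevels 3, 4, and 5 should now show "off"):
        $ chkconfig --list pbs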


2.  As user root
    Copy the rpm for the install over to your local machine
        $ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/condor-6.8.2-linux-x86-rhel3-dynamic-1.i386.rpm \
          /globus_ins/


3.  As user root
    Create the condor user
        $ useradd condor


4.  As user root
    Create a directory for condor
        $ mkdir /usr/local/condor


5.  As user root
    Move to the directory that has the condor rpm
        $ cd /globus_ins


6.  As user root
    Install the condor package
        $ rpm -i condor-6.8.2-linux-x86-rhel3-dynamic-1.i386.rpm \
          --prefix=/usr/local/condor

        Unable to find a valid Java installation
Java Universe will not work properly until the JAVA
(and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!

Condor has been installed into:
    /usr/local/condor

In order for Condor to work properly you must set your
CONDOR_CONFIG environment variable to point to your
Condor configuration file:
    /usr/local/condor/etc/condor_config
before running Condor commands/daemons.
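
    The Java warning above can be dealt with later by pointing the JAVA
    parameter at a local JDK in condor_config (this is item 1g under CHANGES
    PENDING below); a sketch, reusing the path from that note:

        ##  Hypothetical edit in /usr/local/condor/etc/condor_config
        JAVA = /opt/jdk1.5.0_08/bin/java

    Only the Java universe needs this, so the vanilla-universe demo below
    runs without it.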


7.  As user root
    Set the CONDOR and CONDOR_CONFIG environment variables.
        (remember to add the Condor bin and sbin directories to your PATH)
        (remember to add the new variables to the list of exported variables)
        $ cd ~
        $ vi .bash_profile
                CONDOR=/usr/local/condor
                CONDOR_CONFIG=/usr/local/condor/etc/condor_config
                PATH=$PATH:$CONDOR/bin:$CONDOR/sbin:$TORQUE_ROOT/bin:$TORQUE_ROOT/sbin
                export CONDOR CONDOR_CONFIG
        $ source .bash_profile
        $ cp .bash_profile /home/globus/.bash_profile
                cp: overwrite `/home/globus/.bash_profile'? y
        $ cp .bash_profile /home/griduserxx/.bash_profile
                cp: overwrite `/home/griduserxx/.bash_profile'? y
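
    A quick sanity check that the shell and Condor now agree on where the
    configuration lives (condor_config_val is part of the Condor release):

        $ echo $CONDOR_CONFIG
        /usr/local/condor/etc/condor_config
        $ condor_config_val RELEASE_DIR
        /usr/local/condor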


8.  As user root
    Modify the Condor configuration file to allow other hosts to submit
jobs to your local host
        $ cp $CONDOR/etc/condor_config $CONDOR/etc/condor_config.SAV
        $ vi $CONDOR/etc/condor_config
           a) remove the comment from the line below (found around line 212)
                212 ##  HOSTALLOW_WRITE = *

              so that it reads
                212     HOSTALLOW_WRITE = *

           b) comment out the line below (found around line 215)
                215    HOSTALLOW_WRITE = YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE

              so that it reads
                215 ## HOSTALLOW_WRITE = YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE
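
    To confirm the edit took effect (assumes CONDOR_CONFIG is set as in step 7):

        $ condor_config_val HOSTALLOW_WRITE
        *

    Note that HOSTALLOW_WRITE = * lets any host submit to this machine; that
    is fine for a classroom demo, but on an open network it should be narrowed
    to the class hosts.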


9.  As user root
    Start Condor
        $ condor_master
        $ ps ax | grep condor
 5204 ?        Ss     0:00 condor_master
 5205 ?        Ss     0:00 condor_collector -f
 5206 ?        Ss     0:00 condor_negotiator -f
 5207 ?        Ss     0:00 condor_schedd -f
 5208 ?        Ss     0:05 condor_startd -f
 5217 pts/1    S+     0:00 grep condor

                You should see the following condor daemons executing:
                master, collector, negotiator, schedd, and startd


10.  As user root
    Check the status of your local condor pool - wait a few minutes after
    initially starting the Condor master before checking the status, to
    allow all the daemons to initialize
        $ condor_status
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

gridxx.local  LINUX       INTEL  Unclaimed  Idle       0.180  1002  0+00:00:04

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     1     0       0         1       0          0        0

               Total     1     0       0         1       0          0        0


11.  As user griduserxx
     Create a user working directory for condor and copy some condor files
     from gridpresent
        $ mkdir condor
        $ cd condor
        $ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/runner* .
        $ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/looper* .


12.  As user griduserxx
    Notice the command script used by condor
        $ more runner.cmd
####################
##
## Test Condor command file
##
####################

universe        = vanilla
executable      = runner.exe
output          = runner.out
error           = runner.err
log             = runner.log
queue
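
    For jobs that are matched to a remote laptop, a variant of this file with
    Condor's file transfer enabled may be worth trying, since it keeps Condor
    from assuming a shared filesystem between the submit and execute machines
    (a sketch, not part of the original handout; should_transfer_files and
    when_to_transfer_output are standard submit commands in this release):

        universe                = vanilla
        executable              = runner.exe
        output                  = runner.out
        error                   = runner.err
        log                     = runner.log
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        queue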


13.  As user griduserxx
    Compile the two Fortran 77 programs with condor_compile
        $ condor_compile f77 -o runner.exe runner.f
LINKING FOR CONDOR : /usr/bin/ld -L/usr/local/condor/lib -Bstatic
--eh-frame-hdr . . .

        $  condor_compile f77 -o looper.exe looper.f
LINKING FOR CONDOR : /usr/bin/ld -L/usr/local/condor/lib -Bstatic
--eh-frame-hdr -m elf_i386 -dyn . . .
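
    Note that condor_compile relinks these binaries for Condor's standard
    universe (checkpointing and remote system calls); the runner.cmd file
    above uses the vanilla universe, which also runs them, but the checkpoint
    and "remote system calls disabled" notices in runner.err below come from
    that relinking. A submit file that actually uses the standard universe
    would only differ in the universe line (a sketch, not part of the handout):

        universe        = standard
        executable      = runner.exe
        output          = runner.out
        error           = runner.err
        log             = runner.log
        queue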

14.  As user griduserxx
     Submit a job in condor
        $ condor_submit runner.cmd
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 4.


15.  As user griduserxx
     Notice the files generated by the job
        $  more runner.out
 OUTPUT FROM PGM RUNNER

        $  more runner.err
Condor: Notice: Will checkpoint to condor_exec.exe.ckpt
Condor: Notice: Remote system calls disabled.

        $ more runner.log
000 (004.000.000) 11/06 16:59:16 Job submitted from host: <130.70.83.6:33007>
...
001 (004.000.000) 11/06 16:59:18 Job executing on host: <130.70.83.6:33008>
...
005 (004.000.000) 11/06 16:59:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job


16.  As user root
     Stop Condor
        $ condor_master -off

     Check and make sure all the condor daemons were stopped
             $ ps ax | grep condor
5204 ?        Ss     0:00 condor_master
5205 ?        Ss     0:00 condor_collector -f
5206 ?        Ss     0:00 condor_negotiator -f
5207 ?        Ss     0:00 condor_schedd -f
5208 ?        Ss     0:05 condor_startd -f
5217 pts/1    S+     0:00 grep condor

     If the condor daemons are still running, as shown above, simply kill
     the master (its child daemons exit when the master does)
             $ kill 5204

     All the condor daemons should now be off
             $ ps ax | grep condor
 5537 pts/1    S+     0:00 grep condor

    Notice that none of the daemons are now listed as active
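
     (If condor_master -off leaves the daemons running, as it did here,
     condor_off from the same release may be the intended shutdown command;
     a sketch:

             $ condor_off -master

     which asks the master to shut down all of its daemons and then exit, so
     the kill step above should not be needed.)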


17.  As user root
     Modify the global config file to designate gridpresent.local as the
     pool manager
        $ vi $CONDOR/etc/condor_config
           a) add one line at line 51: CONDOR_HOST = gridpresent.local
              ##  What machine is your central manager?
          51  CONDOR_HOST        = gridpresent.local
##--------------------------------------------------------------------
##  Pathnames:
##--------------------------------------------------------------------
##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR                = /usr/local/condor


18.  As user root
     Modify the local config file to designate gridpresent.local as the
     pool manager
        $ vi $CONDOR/local.gridxx/condor_config.local
           a) Modify line 4
           2   ##  What machine is your central manager?
           3
           4 CONDOR_HOST = gridxx.local

           and replace gridxx.local with gridpresent.local
           4 CONDOR_HOST = gridpresent.local
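
     Since the local config file is read after the global one and overrides
     it, it is worth confirming which value wins before restarting:

        $ condor_config_val CONDOR_HOST
        gridpresent.local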


19.  As user root
    Start Condor
        $ condor_master
        $ ps ax | grep condor
 5204 ?        Ss     0:00 condor_master
 5205 ?        Ss     0:00 condor_collector -f
 5206 ?        Ss     0:00 condor_negotiator -f
 5207 ?        Ss     0:00 condor_schedd -f
 5208 ?        Ss     0:05 condor_startd -f
 5217 pts/1    S+     0:00 grep condor

                You should see the following condor daemons executing:
                master, collector, negotiator, schedd, and startd


20.  As user root
    Check the status of the class-wide condor pool - wait a few minutes
    after initially starting the Condor master before checking the status,
    to allow all the daemons to initialize
        $ condor_status
Name               OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

gridxx.local       LINUX       INTEL  Unclaimed  Idle       0.180  1002  0+00:00:04
gridxx.local       LINUX       INTEL  Unclaimed  Idle       0.180  1002  0+00:00:04
   .                  .          .        .         .          .     .        .
   .                  .          .        .         .          .     .        .
gridpresent.local  LINUX       INTEL  Unclaimed  Idle       0.180  1002  0+00:00:04


                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     x     0       0         x       0          0        0

               Total     x     0       0         x       0          0        0

You should now see all the other laptops in the class that have joined the
pool.

21.  As user root
     Open up the necessary ports for the Condor scheduler by adding 2 lines
     to iptables (the two ACCEPT rules under the '# Condor Scheduler' comment
     below; the surrounding rules show the existing context in the file)
        $ cp /etc/sysconfig/iptables /etc/sysconfig/iptables.SAV
        $ vi /etc/sysconfig/iptables

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 60000:60500 -j ACCEPT

# Condor Scheduler
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 9618 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32000:33000 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
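
     Opening 32000:33000 only helps if Condor actually uses ports in that
     range; by default the daemons pick arbitrary ephemeral ports, so a
     matching restriction is normally added to condor_config as well (a
     sketch; LOWPORT and HIGHPORT are the standard macros for this):

        ##  Hypothetical addition to condor_config, matching the range above
        LOWPORT  = 32000
        HIGHPORT = 33000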


22.  As user root
     Apply the changes in iptables to the firewall
        $ service iptables stop
        $ service iptables start
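
     An optional check that the new rules are in place after the restart:

        $ iptables -L RH-Firewall-1-INPUT -n | grep 9618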



More information may be found in the Condor Version 6.8.2 manuals:
http://www.cs.wisc.edu/condor/manual/v6.8/


CHANGES PENDING:
1.  As user root
    Modify the global config file (a sketch of the result appears after this list)
    a) uncomment UID_DOMAIN = $(FULL_HOSTNAME)
    b) comment out UID_DOMAIN = your.domain
    c) uncomment FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
    d) comment out FILESYSTEM_DOMAIN = your.domain
    e) comment out COLLECTOR_NAME = My Pool
    f) add COLLECTOR_NAME = gridpresent.local
    g) modify JAVA = /opt/jdk1.5.0_08/bin/java
    h) uncomment REQUIRE_LOCAL_CONFIG_FILE = TRUE

2.  As user root
    Configure the local node as a submit and execution node and point to
    the central manager
        $ condor_configure --type=submit,execute \
          --central-manager=gridpresent.local --owner=condor

3.  As user root
    BOTTOM OF CONDOR_CONFIG
    TESTING SECTION

4.  As user root
        $ chmod o=+rwx /home/griduser06/condor
        $ chmod o=+rwx /home/griduser06/condor/runner.err
        $ chmod o=+rwx /home/griduser06/condor/runner.out
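
A sketch of what the edits in item 1 above might look like once applied
(values taken from the notes; UID_DOMAIN and FILESYSTEM_DOMAIN control
whether Condor treats the pool as sharing users and a filesystem, which is
often where the "not writable by condor" warning in issue 2 comes from, and
is also why item 4 loosens the permissions on the condor directory):

    ##  Hypothetical condor_config fragment after applying item 1
    UID_DOMAIN                = $(FULL_HOSTNAME)
    FILESYSTEM_DOMAIN         = $(FULL_HOSTNAME)
    ##COLLECTOR_NAME          = My Pool
    COLLECTOR_NAME            = gridpresent.local
    JAVA                      = /opt/jdk1.5.0_08/bin/java
    REQUIRE_LOCAL_CONFIG_FILE = TRUE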

-- NOTES --
>> use condor_reschedule when a job remains in the idle state
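
   A sketch of how that looks in practice, along with the config knobs that
   control how often the pool does matchmaking on its own (NEGOTIATOR_INTERVAL
   and SCHEDD_INTERVAL are standard macros; the 60-second values below are
   illustrative, not from the handout, and the defaults are on the order of
   minutes, which is one reason newly submitted jobs can sit idle for a while):

        $ condor_reschedule

        ##  Hypothetical condor_config settings to shorten the wait between
        ##  negotiation cycles
        NEGOTIATOR_INTERVAL = 60
        SCHEDD_INTERVAL     = 60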



Thanks for the help,
Denvil...