
[Condor-users] Debugging interactive jobs (was Re: Condor and batchMatlab problem)



On Wed, Jul 14, 2004 at 11:29:25AM -0600, bgore@xxxxxxxxxx wrote:
> It could be a security/permissions thing. We had this happen on another matlab-like program. Every time it ran it wanted to update a file in the install directory. However, without modification the default condor local account did not have permission to update this file. So, run matlab interactively and look for recently modified files in the matlab install directory. Give update permissions to the local condor account for those files and see if that fixes your issue. ~B
> 
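
If the install lives on a Unix-like system (or you have a Cygwin-style shell
on the Windows box), the "look for recently modified files" step above can be
sketched roughly as below. The function name recently_modified and the
stamp-file approach are assumptions for illustration, not anything Condor or
Matlab provides: touch a stamp file, run the program once interactively, then
list whatever changed since.

```shell
#!/bin/sh
# recently_modified DIR STAMP
# List regular files under DIR modified more recently than the file STAMP.
# (Hypothetical helper for illustration; only standard find(1) is used.)
recently_modified() {
    find "$1" -type f -newer "$2"
}

# Example use (paths are placeholders):
#   touch /tmp/before-run          # create the stamp
#   ...run the program interactively once...
#   recently_modified /path/to/install /tmp/before-run
```

Anything it prints is a candidate for the permission change described above.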

When you run it interactively, make sure that you're using the same
environment that the job will see. The best way to do that is to set 

USE_VISIBLE_DESKTOP = True 

on an _EXECUTE_ machine. 

When a job starts up, it will open a DOS prompt on the desktop and
start executing there if that is set. You can watch it run if you'd like!

Make your submit file be something like:

#Executable = matlab.bat
Executable = cmd.exe
Universe = vanilla
#Requirements = ((Arch == "INTEL" && OpSys == "WINNT51")) 
Requirements = machine == "host-with-use-visible-desktop-set.your.domain"
should_transfer_files = YES 
transfer_executable = false
whenToTransferOutput = ON_EXIT 
#transfer_input_files = a.dat,b.dat,test.m 
transfer_input_files = a.dat,b.dat,test.m,matlab.bat
environment = PATH=c:\matlab_sv13\bin\win32 
#arguments = /r test /logfile log.txt 
arguments = /K
log = mat.log 
Output = mat.out 
Error = mat.err 
Queue 1

Submit your job, then wait at host-with-use-visible-desktop-set.your.domain
and you'll get a cmd.exe window. Now you can debug the job exactly as the
job will see it. 

I'm not 100% sure about the /K argument - I know you need to give something 
to cmd.exe to get it to stick around so you can actually type on it. 
Hopefully someone who uses Windows can confirm. 

Make sure you watch out for your START expression - if you've got it set
so that typing on the keyboard suspends your job, as soon as you try and
use your window the startd will suspend it. Best to set START=true :)

We use this approach all the time to debug why jobs won't run under
Condor - it turns out that there are a number of "console" apps that will
decide to pop up a window waiting for the user to click "OK" before they'll
start. 

You can pull a similar approach on Unix with X-Windows:
universe = vanilla
executable = /usr/X11R6/bin/xterm
arguments = -display submitting-host.your.domain:0
queue

Run 'xhost +' on the submitting host and then submit your job. When it
runs, you'll get an xterm on your screen that's running on the remote
machine, and you can work interactively on the remote machine. Add in any
file transfer options you need to set things up for your job and you can
get a start on debugging "it runs outside of Condor, but not inside" problems.
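
To avoid hand-editing the display host each time, the xterm submit file above
could be generated with a small shell function. This is only a sketch:
xterm_submit is a made-up helper name, it assumes display :0 on the submitting
host and xterm at /usr/X11R6/bin/xterm unless told otherwise, and it relies on
xterm's standard -display option.

```shell
#!/bin/sh
# xterm_submit DISPLAY_HOST [XTERM_PATH]
# Print a vanilla-universe submit description that starts an xterm on the
# execute machine and points it at display :0 on DISPLAY_HOST.
# (Hypothetical helper; the submit commands mirror the example above.)
xterm_submit() {
    display_host=$1
    xterm_path=${2:-/usr/X11R6/bin/xterm}
    cat <<EOF
universe = vanilla
executable = ${xterm_path}
arguments = -display ${display_host}:0
queue
EOF
}

# Example use:
#   xterm_submit "$(hostname)" > xterm.sub
#   condor_submit xterm.sub
```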

And one last trick - sometimes a program just insists on having an X server
to run under - Open Office comes to mind - there's no way to disable its
screen. In order to get it to work, we used the X Virtual Frame Buffer -
it's like /dev/null for X. (http://www.xfree86.org/4.0.1/Xvfb.1.html)

To get it to work, I used this submit file:
universe = vanilla
executable = xvfb-run
WhenToTransferOutput = always
transfer_files = always
arguments = xlr8r_linux filename
transfer_input_files = xlr8r_linux, xlr8r_linux.rdb, filename 
environment = LD_LIBRARY_PATH=/p/condor/workspaces/epaulson/739/xlr8r_libraries;PATH=/usr/bin:/bin:/s/std/bin:.;HOME=/u/e/p/epaulson
requirements = IsC2Cluster 
output = filename.out
error = filename.err
queue

(xlr8r_linux was a program that invoked OpenOffice)

xvfb-run is a shell script I found on the net (I think it's from Debian):


#!/bin/sh

chmod 755 xlr8r_linux
set -o xtrace
set -e

ulimit -c 0
# xvfb-run - run the specified command in a virtual X server

# This script starts an instance of Xvfb, the "fake" X server, runs a
# command with that server available, and kills the X server when
# done.  The return value of the command becomes the return value of
# this script.
#
# If anyone is using this to build a Debian package, make sure the
# package Build-Depends on xvfb, xbase-clients and xfonts-base.

set -e

DISPLAYNUM=99
AUTHFILE=$(pwd)/Xauthority
STARTWAIT=3
LISTENTCP="-nolisten tcp"
#unset AUTODISPLAYNUM


usage()
{
  echo "Usage: $0 [OPTION]... [command]"
  echo
  echo "run specified X client or command in a virtual X server environment"
  echo
  echo "  -a --auto-displaynum      Try to get a free display number, starting at --display-num"
  echo "  -f --auth-file=FILE       File to store auth cookie (default:./Xauthority)"
  echo "  -n --display-num=NUM      Display number to use (default:$DISPLAYNUM)"
  echo "  -l --listen-tcp           Enable TCP port listening in the X server"
  echo "  -w --wait=DELAY           Delay in seconds to wait for Xvfb to start (default:$STARTWAIT)"
  echo "  -h --help                 Display this help and exit"
}


# Parse command line
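# Long option names must match the case labels below (--auth-file, not
# --authority-file), otherwise getopt rejects or mislabels the option.
ARGS=`getopt --options +af:n:lw:h \
	--long auto-displaynum,auth-file:,display-num:,listen-tcp,wait:,help \
	--name "$0" -- "$@"`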
if [ $? != 0 ] ; then echo "Terminating..." >&2 ; exit 1 ; fi

eval set -- "$ARGS"
while true ; do
    case "$1" in
      '-a'|'--auto-displaynum')
      	    AUTODISPLAYNUM=y
      	    ;;
      '-f'|'--auth-file')
	    AUTHFILE="$2"
	    shift
	    ;;
      '-n'|'--display-num')
	    DISPLAYNUM="$2"
	    shift
	    ;;
      '-l'|'--listen-tcp')
	    LISTENTCP=
	    ;;
      '-w'|'--wait')
	    STARTWAIT="$2"
	    shift
	    ;;
      '-h'|'--help')
	    usage
	    exit 1
	    ;;
      '--')
	    # end of options
	    shift
	    break
	    ;;
      *)
            echo "Internal error!"; exit 1;;
    esac

    shift
done

echo "Starting"
i=$DISPLAYNUM
while [ -f /tmp/.X$i-lock ]; do
  echo "Checking $i"
  i=$(($i+1))
done
echo $i
DISPLAYNUM=$i

echo $DISPLAYNUM
# start Xvfb
rm -f "$AUTHFILE"
MCOOKIE=$(mcookie)
XAUTHORITY="$AUTHFILE" xauth add :$DISPLAYNUM . $MCOOKIE > /dev/null
XAUTHORITY="$AUTHFILE" Xvfb :$DISPLAYNUM -screen 0 640x480x8 $LISTENTCP \
	> /dev/null  &
XVFBPID=$!
sleep $STARTWAIT

set +e

# Check that server has not exited
if ! kill -0 $XVFBPID; then
  echo "Xvfb server has died" >&2
  exit 1
fi

# start the command and save its exit status
# Quote "$@" so arguments containing spaces reach the command intact
echo "$@"
DISPLAY=:$DISPLAYNUM XAUTHORITY="$AUTHFILE" "$@" 2>&1
RETVAL=$?
set -e

# kill Xvfb and clean up
kill $XVFBPID
XAUTHORITY="$AUTHFILE" xauth remove :$DISPLAYNUM > /dev/null
rm "$AUTHFILE"

# return the executed command's exit status
exit $RETVAL


# Find free display number by looking at .X-lock files in /tmp
#find-free-display()
#{
#}
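
The find-free-display stub at the end of the script was never filled in; the
inline loop after "Starting" does the work instead. A minimal sketch of what
that function might look like, following the same .X<n>-lock check, is below.
The name find_free_display (underscores, since hyphens are not portable in sh
function names) and the optional lock-directory argument are assumptions added
for illustration and testing.

```shell
#!/bin/sh
# find_free_display START [LOCKDIR]
# Print the first display number >= START that has no .X<n>-lock file in
# LOCKDIR (default /tmp). Same check as the inline loop in the script above.
# (Hypothetical helper; not part of the original xvfb-run.)
find_free_display() {
    num=$1
    lockdir=${2:-/tmp}
    while [ -f "${lockdir}/.X${num}-lock" ]; do
        num=$((num + 1))
    done
    echo "$num"
}

# Example use inside the script:
#   DISPLAYNUM=$(find_free_display "$DISPLAYNUM")
```

Note that this only checks for lock files, so two jobs starting at the same
moment on one machine could still pick the same number.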