
Re: [Condor-users] Trouble running multithreaded job in vanilla universe



More likely, the bash script exits right after spawning the actual app. Make the bash script wait for your app to exit, e.g.

 sleep 10 &
 wait %1
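
A fuller sketch of what launchapp.sh could look like with that change (the
aligner path and options below are placeholders, not the real command line):

 #!/bin/bash
 # launchapp.sh: wrapper that does not return until the app has finished.
 # /path/to/aligner and its options are hypothetical stand-ins.

 /path/to/aligner --threads 4 reads.fq > out.sam &   # start the binary in the background
 app_pid=$!
 wait "$app_pid"     # block until the aligner (and all of its threads) exits
 exit $?             # hand the aligner's exit status back to Condor

If the script can simply run the binary in the foreground, no explicit wait
is needed; the point is that launchapp.sh must not return while the app is
still running.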

Best,


matt

On 11/24/2010 10:59 AM, Ian Chesal wrote:
Condor wouldn't kill a sleeping process. It's smarter than that. The
output from your StarterLog makes it clear the top-level process exited
in what Condor deemed a normal way, with an exit code of zero.

Have you tried looking at the environment the job is running under?
Perhaps your executable is looking for something environment-specific to
indicate it can spawn threads? You could try adding:

getenv = true

to your submit description file. This would propagate your environment to the job.
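
For example, slotting it into a submit description like the one you posted
below (file names taken from your message; only the getenv line is new):

  Executable   = launchapp.sh
  Universe     = vanilla
  # copy the submitting shell's environment into the job's environment
  getenv       = true
  output       = job_out/launchapp.out
  error        = job_out/launchapp.error
  Log          = /tmp/whelanch_condor.log
  Notification = Never

  Initialdir   = .
  Queue

To see what the job actually receives, you could also have the wrapper script
dump its environment (e.g. env | sort > job_out/job_env.txt) and compare that
with an interactive shell on the execute machine.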

Regards,
- Ian

On Wed, Nov 24, 2010 at 10:54 AM, Christopher Whelan
<whelanch@xxxxxxxxxxxx> wrote:

    Hi all,

    I'm a new Condor user after our cluster switched from SGE, and I was
    hoping someone might be able to help me out with some trouble I'm having
    running one of my jobs. I've run a few jobs successfully so far, but I'm
    having a lot of trouble getting one of my processes to run, and I'm
    wondering if it's because of the multithreading the application I'm
    running uses. My Condor executable is a bash script that launches a
    binary application (a short read aligner, in case any of you are in
    bioinformatics). My problem is that the job appears to be picked up and
    run, but terminates immediately. The job output looks the same as it
    would if I executed it from the command line and then pressed ^C
    immediately. I've tried executing it manually on the machine it's being
    run on, and it works there. As I said before, the application is
    multithreaded, and I'm wondering if maybe the top-level thread goes to
    sleep while it waits for its worker threads, and Condor thinks it's done
    and interrupts the job?

    Any advice anyone might have would be much appreciated - even tips on
    where to look to diagnose the problem would be very helpful. Details are
    below.

    Thanks in advance,

    Chris

    Unfortunately I don't have access to the application source, so I can't
    see exactly what it's doing threadwise. Here's my job description file:

      Executable = launchapp.sh
      Universe   = vanilla
      output     = job_out/launchapp.out
      error      = job_out/launchapp.error
      Log        = /tmp/whelanch_condor.log
      Notification = Never

      Initialdir = .
      Queue

    In my job output file I get this, which is the same message I see if I
    manually kill the application right after launching it from the command
    prompt:

    Interrupted..11
    Obtained 0 stack frames.

    The StarterLog looks like this:

    11/23 18:15:57 ******************************************************
    11/23 18:15:57 ** condor_starter (CONDOR_STARTER) STARTING UP
    11/23 18:15:57 ** /usr/sbin/condor_starter
    11/23 18:15:57 ** SubsystemInfo: name=STARTER type=STARTER(8)
    class=DAEMON(1)
    11/23 18:15:57 ** Configuration: subsystem:STARTER local:<NONE>
    class:DAEMON
    11/23 18:15:57 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
    11/23 18:15:57 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
    11/23 18:15:57 ** PID = 3185
    11/23 18:15:57 ** Log last touched 11/23 17:55:10
    11/23 18:15:57 ******************************************************
    11/23 18:15:57 Using config source: /etc/condor/condor_config
    11/23 18:15:57 Using local config sources:
    11/23 18:15:57    /l2/condor/condor_config.cluster
    11/23 18:15:57    /l2/condor/condor_config.eagle1
    11/23 18:15:57 DaemonCore: Command Socket at <129.95.39.41:41009>
    11/23 18:15:57 Done setting resource limits
    11/23 18:15:57 Communicating with shadow <129.95.39.73:41785>
    11/23 18:15:57 Submitting machine is "ostrich3.csee.ogi.edu"
    11/23 18:15:57 setting the orig job name in starter
    11/23 18:15:57 setting the orig job iwd in starter
    11/23 18:15:57 Job 24.0 set to execute immediately
    11/23 18:15:57 Starting a VANILLA universe job with ID: 24.0
    11/23 18:15:57 IWD: /l2/users/whelanch/scripts/.
    11/23 18:15:57 Output file:
    /l2/users/whelanch/scripts/./job_out/launchapp.out
    11/23 18:15:57 Error file:
    /l2/users/whelanch/scripts/./job_out/launchapp.error
    11/23 18:15:57 About to exec
    /l2/users/whelanch/scripts/launchapp.sh
    11/23 18:15:57 Create_Process succeeded, pid=3186
    11/23 18:15:57 Process exited, pid=3186, status=0
    11/23 18:15:57 Got SIGQUIT.  Performing fast shutdown.
    11/23 18:15:57 ShutdownFast all jobs.
    11/23 18:15:57 **** condor_starter (condor_STARTER) pid 3185 EXITING WITH STATUS 0

    Anywhere else I should look?