
Re: [Condor-users] Trouble running multithreaded job in vanilla universe



Condor wouldn't kill a sleeping process. It's smarter than that. The output from your StarterLog makes it clear the top level process exited in what Condor deemed a normal way, with an error code of zero.

Have you tried looking at the environment the job is running under? Perhaps your executable is looking for something environment-specific to indicate it can spawn threads? You could try adding:

getenv = true

to your submit description file. This propagates the environment of the shell you submit from to the job.
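To compare the environment the job sees under Condor with the one in your interactive shell, something along these lines could go into the wrapper script. This is only a minimal sketch: the function names (dump_env, run_and_report) and the file names in the usage note are made up for illustration, and it assumes the wrapper is run by bash.

```shell
#!/bin/bash
# Hypothetical diagnostic helpers for a job wrapper script.
# None of these names come from Condor itself.

# Write the sorted environment to the file named by $1,
# so that two runs can be compared with diff.
dump_env() {
    env | sort > "$1"
}

# Run the given command, then report on stderr how it ended.
# In bash, a status above 128 conventionally means the command
# was terminated by signal (status - 128).
run_and_report() {
    "$@"
    local status=$?
    if [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))" >&2
    else
        echo "exit status: $status" >&2
    fi
    return "$status"
}
```

In the wrapper you would call, for example, `dump_env env.condor.txt` followed by `run_and_report` and the usual command line of the binary. Running the same wrapper by hand with a different file name and diffing the two files shows which variables differ, and the message on stderr ends up in the job's error file.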

Regards,
- Ian

On Wed, Nov 24, 2010 at 10:54 AM, Christopher Whelan <whelanch@xxxxxxxxxxxx> wrote:
Hi all,

I'm a new condor user after our cluster switched from SGE, and I was
hoping someone might be able to help me out with some trouble I'm having
running one of my jobs. I've run a few jobs successfully so far, but I'm
having a lot of trouble getting one of my processes to run, and I'm
wondering if it's because of the multithreading the application I'm
running uses. My condor executable is a bash script that launches a binary
application (a short read aligner, in case any of you are in
bioinformatics.) My problem is that the job appears to be picked up and
run, but terminates immediately. The job output looks the same as it would
if I executed it from the command line and then pressed ^C immediately.
I've tried executing it manually on the machine it's being run on, and it
works there. As I said before, the application is multithreaded, and I'm
wondering if maybe the top level thread goes to sleep while it waits for its
worker threads, and condor thinks it's done and interrupts the job?

Any advice anyone might have would be much appreciated - even tips on where to look to
diagnose the problem would be very helpful. Details are below.

Thanks in advance,

Chris

Unfortunately I don't have access to the application source so I can't see
exactly what it's doing threadwise. Here's my job description file:

 Executable = launchapp.sh
 Universe   = vanilla
 output     = job_out/launchapp.out
 error      = job_out/launchapp.error
 Log        = /tmp/whelanch_condor.log
 Notification = Never

 Initialdir = .
 Queue

In my job output file I get this, which is the same message I see if I
manually kill the application right after launching it from the command
prompt:

Interrupted..11
Obtained 0 stack frames.

The StarterLog looks like this:

11/23 18:15:57 ******************************************************
11/23 18:15:57 ** condor_starter (CONDOR_STARTER) STARTING UP
11/23 18:15:57 ** /usr/sbin/condor_starter
11/23 18:15:57 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
11/23 18:15:57 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
11/23 18:15:57 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
11/23 18:15:57 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
11/23 18:15:57 ** PID = 3185
11/23 18:15:57 ** Log last touched 11/23 17:55:10
11/23 18:15:57 ******************************************************
11/23 18:15:57 Using config source: /etc/condor/condor_config
11/23 18:15:57 Using local config sources:
11/23 18:15:57    /l2/condor/condor_config.cluster
11/23 18:15:57    /l2/condor/condor_config.eagle1
11/23 18:15:57 DaemonCore: Command Socket at <129.95.39.41:41009>
11/23 18:15:57 Done setting resource limits
11/23 18:15:57 Communicating with shadow <129.95.39.73:41785>
11/23 18:15:57 Submitting machine is "ostrich3.csee.ogi.edu"
11/23 18:15:57 setting the orig job name in starter
11/23 18:15:57 setting the orig job iwd in starter
11/23 18:15:57 Job 24.0 set to execute immediately
11/23 18:15:57 Starting a VANILLA universe job with ID: 24.0
11/23 18:15:57 IWD: /l2/users/whelanch/scripts/.
11/23 18:15:57 Output file: /l2/users/whelanch/scripts/./job_out/launchapp.out
11/23 18:15:57 Error file: /l2/users/whelanch/scripts/./job_out/launchapp.error
11/23 18:15:57 About to exec /l2/users/whelanch/scripts/launchapp.sh
11/23 18:15:57 Create_Process succeeded, pid=3186
11/23 18:15:57 Process exited, pid=3186, status=0
11/23 18:15:57 Got SIGQUIT.  Performing fast shutdown.
11/23 18:15:57 ShutdownFast all jobs.
11/23 18:15:57 **** condor_starter (condor_STARTER) pid 3185 EXITING WITH STATUS 0

Anywhere else I should look?
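
For checking what exit status the starter recorded over several runs, a small helper like the following may be handy. It is a sketch only: the function name starter_exit_statuses is made up, and the pattern assumes the "Process exited, pid=..., status=..." line format shown in the StarterLog above.

```shell
#!/bin/bash
# Hypothetical helper: print the pid and exit status of every job process
# recorded in a StarterLog, based on lines of the form
#   "Process exited, pid=3186, status=0"
starter_exit_statuses() {
    sed -n 's/.*Process exited, pid=\([0-9]*\), status=\([0-9]*\).*/pid \1 status \2/p' "$1"
}
```

Pass it the path of the StarterLog on the execute machine. The log directory can be printed with `condor_config_val LOG`.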