
[Condor-users] Trouble running multithreaded job in vanilla universe



Hi all,

I'm a new Condor user (our cluster switched from SGE), and I was hoping 
someone could help me with a job that won't run. Other jobs of mine have 
run successfully, so I'm wondering whether the problem is the 
multithreading this application uses. My Condor executable is a bash 
script that launches a binary application (a short read aligner, in case 
any of you are in bioinformatics). The job appears to be picked up and 
run, but it terminates immediately. The job output looks the same as it 
would if I ran the application from the command line and pressed ^C right 
away. I've tried running it manually on the same machine the job runs on, 
and it works there. Since the application is multithreaded, could it be 
that the top-level thread goes to sleep while it waits for its worker 
threads, and Condor thinks the job is done and interrupts it? 

Any advice anyone might have would be much appreciated - even tips on where to look to 
diagnose the problem would be very helpful. Details are below.

Thanks in advance,

Chris
 
Unfortunately I don't have access to the application source, so I can't 
see exactly what it's doing with its threads. Here's my job description file:

  Executable = launchapp.sh
  Universe   = vanilla
  output     = job_out/launchapp.out
  error      = job_out/launchapp.error
  Log        = /tmp/whelanch_condor.log
  Notification = Never

  Initialdir = .
  Queue

In my job output file I get this, which is the same message I see if I 
manually kill the application right after launching it from the command 
prompt:

Interrupted..11
Obtained 0 stack frames.
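
A side note on that "11": if it is a signal number (an assumption on my 
part, the application doesn't say), the shell can translate it to a name. 
A minimal sketch, where signal_name is just a hypothetical helper I made up:

```shell
#!/bin/bash
# Hypothetical helper: translate a signal number into its name
# using the shell's built-in kill -l.
signal_name() {
    kill -l "$1"
}

# Assumes the "11" in the job output is a signal number.
signal_name 11
```

On Linux, signal 11 is SIGSEGV, which would point at a crash inside the 
application rather than at Condor interrupting it, but that is only a 
guess based on the message.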

The StarterLog looks like this:

11/23 18:15:57 ******************************************************
11/23 18:15:57 ** condor_starter (CONDOR_STARTER) STARTING UP
11/23 18:15:57 ** /usr/sbin/condor_starter
11/23 18:15:57 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
11/23 18:15:57 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
11/23 18:15:57 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
11/23 18:15:57 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
11/23 18:15:57 ** PID = 3185
11/23 18:15:57 ** Log last touched 11/23 17:55:10
11/23 18:15:57 ******************************************************
11/23 18:15:57 Using config source: /etc/condor/condor_config
11/23 18:15:57 Using local config sources: 
11/23 18:15:57    /l2/condor/condor_config.cluster
11/23 18:15:57    /l2/condor/condor_config.eagle1
11/23 18:15:57 DaemonCore: Command Socket at <129.95.39.41:41009>
11/23 18:15:57 Done setting resource limits
11/23 18:15:57 Communicating with shadow <129.95.39.73:41785>
11/23 18:15:57 Submitting machine is "ostrich3.csee.ogi.edu"
11/23 18:15:57 setting the orig job name in starter
11/23 18:15:57 setting the orig job iwd in starter
11/23 18:15:57 Job 24.0 set to execute immediately
11/23 18:15:57 Starting a VANILLA universe job with ID: 24.0
11/23 18:15:57 IWD: /l2/users/whelanch/scripts/.
11/23 18:15:57 Output file: /l2/users/whelanch/scripts/./job_out/launchapp.out
11/23 18:15:57 Error file: /l2/users/whelanch/scripts/./job_out/launchapp.error
11/23 18:15:57 About to exec /l2/users/whelanch/scripts/launchapp.sh
11/23 18:15:57 Create_Process succeeded, pid=3186
11/23 18:15:57 Process exited, pid=3186, status=0
11/23 18:15:57 Got SIGQUIT.  Performing fast shutdown.
11/23 18:15:57 ShutdownFast all jobs.
11/23 18:15:57 **** condor_starter (condor_STARTER) pid 3185 EXITING WITH STATUS 0

Anywhere else I should look?
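
In case it helps anyone suggest something, one thing I could try is 
wrapping the application call inside launchapp.sh so that the limits, 
environment, and exit status the job actually sees end up in the error 
file. This is only a sketch: run_logged and its messages are names I made 
up, and it assumes bash's convention of reporting death by signal N as 
exit status 128+N.

```shell
#!/bin/bash
# Hypothetical diagnostic wrapper (run_logged is an illustrative name).
# Runs the given command and writes diagnostics to stderr.
run_logged() {
    # Record the node, limits, and environment the job actually sees.
    echo "node: $(uname -n)" >&2
    echo "limits:" >&2
    ulimit -a >&2
    echo "environment:" >&2
    env | sort >&2

    # Run the command exactly as given and capture its exit status.
    "$@"
    status=$?

    # Assumption: an exit status above 128 means death by signal (status - 128).
    if [ "$status" -gt 128 ]; then
        echo "command died from signal $(kill -l $((status - 128)))" >&2
    else
        echo "command exited with status $status" >&2
    fi
    return "$status"
}
```

Inside the launcher this would be called as run_logged followed by the 
real application command line. Comparing the limits and environment from 
a Condor run against an interactive run on the same machine might show 
what differs between the two.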