
[Condor-users] getting segfault while trying to run CUDA jobs

I'm having an issue with Condor where all jobs that are compiled to use CUDA segfault. The executables run fine when started by hand on any of the local machines, but when submitted through Condor I get this:
...
005 (246.000.000) 10/14 14:32:37 Job terminated.
       (0) Abnormal termination (signal 11)
       (0) No core file
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
...
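For what it's worth, my current theory is that some CUDA runtime call is failing on the execute node (for instance because the Condor-spawned environment can't reach the driver or the /dev/nvidia* devices) and leaving a pointer NULL that gets dereferenced later; the fault address of 0 reported in /var/log/messages (quoted further down) would fit that. To test the theory I'm wrapping every runtime call in an error check so a failure prints a message instead of segfaulting. This is just a minimal sketch; the CUDA_CHECK macro and the toy allocation are my own test code, not from the actual matrixMul job:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Abort with a readable message on any CUDA runtime error instead of
   carrying a NULL pointer forward into a segfault. */
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(void)
{
    int count = 0;
    /* Reports a runtime error (rather than segfaulting) if the job
       can't see the driver or any CUDA-capable device. */
    CUDA_CHECK(cudaGetDeviceCount(&count));
    printf("visible CUDA devices: %d\n", count);

    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}

My thinking is that if a test binary like this prints an error under Condor but runs clean by hand, the problem is the job's runtime environment rather than the code itself.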

Here is a snippet from the starter logs:

10/14 14:24:47 ******************************************************
10/14 14:24:47 ** condor_starter (CONDOR_STARTER) STARTING UP
10/14 14:24:47 ** /usr/sbin/condor_starter
10/14 14:24:47 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/14 14:24:47 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/14 14:24:47 ** $CondorVersion: 7.2.1 Jul 2 2009 BuildID: RH-7.2.2-0.9.el5 $
10/14 14:24:47 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
10/14 14:24:47 ** PID = 27515
10/14 14:24:47 ** Log last touched 10/14 14:23:12
10/14 14:24:47 ******************************************************
10/14 14:24:47 Using config source: /etc/condor/condor_config
10/14 14:24:47 Using local config sources:
10/14 14:24:47    /var/lib/condor/condor_config.local
10/14 14:24:47 DaemonCore: Command Socket at <192.168.1.1:57784>
10/14 14:24:47 Done setting resource limits
10/14 14:24:47 Communicating with shadow <192.168.1.100:36573>
10/14 14:24:47 Submitting machine is "tesla"
10/14 14:24:47 setting the orig job name in starter
10/14 14:24:47 setting the orig job iwd in starter
10/14 14:24:47 Job 244.0 set to execute immediately
10/14 14:24:47 Starting a VANILLA universe job with ID: 244.0
10/14 14:24:47 IWD: /home/nlawrence3/matrixMul
10/14 14:24:47 Output file: /home/nlawrence3/matrixMul/out.0
10/14 14:24:47 Error file: /home/nlawrence3/matrixMul/err.0
10/14 14:24:47 About to exec /home/nlawrence3/matrixMul/a.out
10/14 14:24:47 Create_Process succeeded, pid=27516
10/14 14:24:50 Process exited, pid=27516, signal=11
10/14 14:24:50 Got SIGQUIT.  Performing fast shutdown.
10/14 14:24:50 ShutdownFast all jobs.
10/14 14:24:50 **** condor_starter (condor_STARTER) pid 27515 EXITING WITH STATUS 0

I have also been unable to get a core file, despite core dumps being enabled and ulimit being set to 0. You can see the segfault in /var/log/messages on the execute machine:

Oct 14 11:02:43 node1 kernel: condor_exec.exe[20659]: segfault at 0000000000000000 rip 00002acfebcce980 rsp 00007fff474f1c68 error 4
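Since the job process may not inherit the limits from my shell (and a core-size limit of 0 would itself suppress the dump, which could explain the missing file), I'm considering having the job raise the limit itself at startup and print what it actually got. Again just a sketch of the idea, assuming a Linux execute node; none of this is in the real job yet:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Ask for unlimited core dumps, in case the limits the
       condor_starter passed down to the job differ from my shell's.
       A non-root process can only raise the soft limit up to the
       hard limit, so EPERM here would mean the hard limit is clamped. */
    struct rlimit rl;
    rl.rlim_cur = RLIM_INFINITY;
    rl.rlim_max = RLIM_INFINITY;
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        perror("setrlimit(RLIMIT_CORE)");

    /* Print the limit the job actually sees, so it shows up in out.0. */
    if (getrlimit(RLIMIT_CORE, &rl) == 0)
        printf("core limit: soft=%lu hard=%lu (RLIM_INFINITY=%lu)\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max,
               (unsigned long)RLIM_INFINITY);

    /* ... rest of the job would go here ... */
    return 0;
}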

I also noticed that the first few lines of the code are executing, because several print statements show up before the segfault, which leads me to believe it's not related to file permissions. That said, on some jobs I have noticed the following warning:

10/12 13:41:28 warning: unable to chmod condor_exec.exe to ensure execute bit is set: Operation not permitted
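One more thing I'm trying in order to pin down exactly which statement faults: since Condor redirects stdout to the out.N file, it ends up block-buffered, and output written just before a crash can be lost. Disabling buffering should make every print land on disk immediately. A sketch; the checkpoint strings are just placeholders:

#include <stdio.h>

int main(void)
{
    /* stdout is block-buffered when redirected to a file, as the
       starter does for the job's Output, so turn buffering off. */
    setvbuf(stdout, NULL, _IONBF, 0);

    printf("checkpoint A\n");  /* reaches out.0 even if we crash next */
    /* ... suspect code ... */
    printf("checkpoint B\n");
    return 0;
}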

Thanks in advance for any assistance you can provide.