
[Condor-users] getting segfault while trying to run CUDA jobs

I'm having an issue with Condor where all jobs that are compiled to use CUDA segfault. The executables run fine when started by hand on any of the local machines, but when submitted through Condor I get this:
...
005 (246.000.000) 10/14 14:32:37 Job terminated.
       (0) Abnormal termination (signal 11)
       (0) No core file
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
...
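For what it's worth, my current theory is that some CUDA runtime call is failing on the execute node (for instance because the Condor-spawned environment can't reach the driver or the /dev/nvidia* devices) and leaving a pointer NULL that gets dereferenced later; the fault address of 0 reported in /var/log/messages (quoted further down) would fit that. To test the theory I'm wrapping every runtime call in an error check so a failure prints a message instead of segfaulting. This is just a minimal sketch; the CUDA_CHECK macro and the toy allocation are my own test code, not from the actual matrixMul job:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Abort with a readable message on any CUDA runtime error instead of
   carrying a NULL pointer forward into a segfault. */
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(void)
{
    int count = 0;
    /* Reports a runtime error (rather than segfaulting) if the job
       can't see the driver or any CUDA-capable device. */
    CUDA_CHECK(cudaGetDeviceCount(&count));
    printf("visible CUDA devices: %d\n", count);

    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}

My thinking is that if a test binary like this prints an error under Condor but runs clean by hand, the problem is the job's runtime environment rather than the code itself.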

Here is a snippet from the starter logs:

10/14 14:24:47 ******************************************************
10/14 14:24:47 ** condor_starter (CONDOR_STARTER) STARTING UP
10/14 14:24:47 ** /usr/sbin/condor_starter
10/14 14:24:47 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/14 14:24:47 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/14 14:24:47 ** $CondorVersion: 7.2.1 Jul 2 2009 BuildID: RH-7.2.2-0.9.el5 $
10/14 14:24:47 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
10/14 14:24:47 ** PID = 27515
10/14 14:24:47 ** Log last touched 10/14 14:23:12
10/14 14:24:47 ******************************************************
10/14 14:24:47 Using config source: /etc/condor/condor_config
10/14 14:24:47 Using local config sources:
10/14 14:24:47    /var/lib/condor/condor_config.local
10/14 14:24:47 DaemonCore: Command Socket at <192.168.1.1:57784>
10/14 14:24:47 Done setting resource limits
10/14 14:24:47 Communicating with shadow <192.168.1.100:36573>
10/14 14:24:47 Submitting machine is "tesla"
10/14 14:24:47 setting the orig job name in starter
10/14 14:24:47 setting the orig job iwd in starter
10/14 14:24:47 Job 244.0 set to execute immediately
10/14 14:24:47 Starting a VANILLA universe job with ID: 244.0
10/14 14:24:47 IWD: /home/nlawrence3/matrixMul
10/14 14:24:47 Output file: /home/nlawrence3/matrixMul/out.0
10/14 14:24:47 Error file: /home/nlawrence3/matrixMul/err.0
10/14 14:24:47 About to exec /home/nlawrence3/matrixMul/a.out
10/14 14:24:47 Create_Process succeeded, pid=27516
10/14 14:24:50 Process exited, pid=27516, signal=11
10/14 14:24:50 Got SIGQUIT.  Performing fast shutdown.
10/14 14:24:50 ShutdownFast all jobs.
10/14 14:24:50 **** condor_starter (condor_STARTER) pid 27515 EXITING WITH STATUS 0

I have also been unable to get a core file, despite core dumps being enabled and ulimit being set to 0. You can see the segfault in /var/log/messages on the execute machine:

Oct 14 11:02:43 node1 kernel: condor_exec.exe[20659]: segfault at 0000000000000000 rip 00002acfebcce980 rsp 00007fff474f1c68 error 4
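Since the job process may not inherit the limits from my shell (and a core-size limit of 0 would itself suppress the dump, which could explain the missing file), I'm considering having the job raise the limit itself at startup and print what it actually got. Again just a sketch of the idea, assuming a Linux execute node; none of this is in the real job yet:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Ask for unlimited core dumps, in case the limits the
       condor_starter passed down to the job differ from my shell's.
       A non-root process can only raise the soft limit up to the
       hard limit, so EPERM here would mean the hard limit is clamped. */
    struct rlimit rl;
    rl.rlim_cur = RLIM_INFINITY;
    rl.rlim_max = RLIM_INFINITY;
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        perror("setrlimit(RLIMIT_CORE)");

    /* Print the limit the job actually sees, so it shows up in out.0. */
    if (getrlimit(RLIMIT_CORE, &rl) == 0)
        printf("core limit: soft=%lu hard=%lu (RLIM_INFINITY=%lu)\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max,
               (unsigned long)RLIM_INFINITY);

    /* ... rest of the job would go here ... */
    return 0;
}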

I also noticed that the first few lines of the code are executing, because several print statements show up before the segfault, which leads me to believe it's not related to file permissions. That said, on some jobs I have noticed the following warning:

10/12 13:41:28 warning: unable to chmod condor_exec.exe to ensure execute bit is set: Operation not permitted
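One more thing I'm trying in order to pin down exactly which statement faults: since Condor redirects stdout to the out.N file, it ends up block-buffered, and output written just before a crash can be lost. Disabling buffering should make every print land on disk immediately. A sketch; the checkpoint strings are just placeholders:

#include <stdio.h>

int main(void)
{
    /* stdout is block-buffered when redirected to a file, as the
       starter does for the job's Output, so turn buffering off. */
    setvbuf(stdout, NULL, _IONBF, 0);

    printf("checkpoint A\n");  /* reaches out.0 even if we crash next */
    /* ... suspect code ... */
    printf("checkpoint B\n");
    return 0;
}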

Thanks in advance for any assistance you can provide.