
Re: [Condor-users] condor 7.5.6 schedd crash issue



Hi Kristian,

Sorry to hear you had problems with 7.5.6.

I haven't been able to reproduce the crash, unfortunately.  If you have time to help us track down this bug, we'd appreciate the help.

One thing that would help is a core file.  To generate one, you would need to append the following to your condor_config.local:

CREATE_CORE_FILES = true

The next time the schedd crashes, it should leave a file named 'core' or 'core.####' in the log directory.  Please send a copy to condor-admin@xxxxxxxxxxx or send us a URL where we can download it.
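
If it helps, here is a small sketch for checking whether a core file has appeared.  It assumes the log directory is /var/log/condor (the path shown in the SchedLog excerpt of the report) and that core files are named exactly 'core' or 'core.' followed by digits; the function name find_core_files is only illustrative and not part of Condor.

```python
import re
from pathlib import Path

# Matches 'core' or 'core.<digits>' (for example core.10864); this naming
# pattern is an assumption based on the 'core' / 'core.####' description.
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def find_core_files(log_dir):
    """Return core files found directly in log_dir, newest first."""
    directory = Path(log_dir)
    if not directory.is_dir():
        # Nothing to report if the directory does not exist.
        return []
    cores = [
        p for p in directory.iterdir()
        if p.is_file() and CORE_NAME.match(p.name)
    ]
    return sorted(cores, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Assumed log directory; adjust if your LOG setting points elsewhere.
    for path in find_core_files("/var/log/condor"):
        print(path, path.stat().st_size, "bytes")
```

If nothing shows up after a crash, the core file size limit of the environment the daemons run in may be worth checking, though that is a guess rather than a diagnosis.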

Another thing that would help is more verbose logging in the schedd.  This can be enabled by appending the following to your condor_config.local:

SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_FULLDEBUG D_PROTOCOL D_MACHINE
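
Both settings can also be appended with a short script.  The following is only a sketch: it assumes the local config file is /etc/condor/condor_config.local (the path listed in the SchedLog excerpt), and the helper name append_settings is made up for illustration.  It skips any line that is already present, so running it twice is harmless.

```python
from pathlib import Path

# The two settings suggested in this message.
DEBUG_SETTINGS = [
    "CREATE_CORE_FILES = true",
    "SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_FULLDEBUG D_PROTOCOL D_MACHINE",
]


def append_settings(config_path, settings=DEBUG_SETTINGS):
    """Append each setting not already in the file; return the lines added."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [s for s in settings if s not in present]
    if missing:
        with path.open("a") as fh:
            # Start on a fresh line if the file lacks a trailing newline.
            if text and not text.endswith("\n"):
                fh.write("\n")
            for setting in missing:
                fh.write(setting + "\n")
    return missing


if __name__ == "__main__":
    # Assumed path; editing it normally requires root privileges.
    target = Path("/etc/condor/condor_config.local")
    if target.exists():
        for line in append_settings(target):
            print("added:", line)
```

After the file changes, the daemons still have to re-read their configuration (condor_reconfig or a restart of Condor should do it) before the new settings apply.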

Thanks in advance,
--Dan

On 3/22/11 5:54 PM, Kristian Kvilekval wrote:
Since upgrading to 7.5.6 from the apt repository, I am getting a schedd
crash when I submit the following simple test, sleep.sub.
This is a critical bug for me, since I am no longer able to submit jobs.

Downgrading to 7.5.5 allows the same submit file to work correctly.


==========================================================================

System: Linux loup 2.6.32-5-openvz-amd64 #1 SMP Wed Jan 12 04:22:50 UTC 2011 x86_64 GNU/Linux
OS    : Debian squeeze


universe   = vanilla
executable = /bin/sleep
requirements =  Memory >= 4096
request_cpus = 4
request_memory = 4096
arguments = 30

log        = condor.log
output     = condor.out
error      = condor.error

transfer_executable   = False
should_transfer_files = YES
when_to_transfer_output =  ON_EXIT

notification = NEVER

queue

================================================================================

This is an automated email from the Condor system
on machine "loup.ece.ucsb.edu".  Do not reply.

"/usr/sbin/condor_schedd" on "loup.ece.ucsb.edu" died due to signal 11 (Segmentation fault).
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/SchedLog:
03/22/11 15:29:42 (pid:10864) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
03/22/11 15:29:42 (pid:10864) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
03/22/11 15:29:42 (pid:10864) ** $CondorVersion: 7.5.6 Mar 13 2011 BuildID: 319722 $
03/22/11 15:29:42 (pid:10864) ** $CondorPlatform: x86_64_deb_5.0 $
03/22/11 15:29:42 (pid:10864) ** PID = 10864
03/22/11 15:29:42 (pid:10864) ** Log last touched 3/22 15:28:40
03/22/11 15:29:42 (pid:10864) ******************************************************
03/22/11 15:29:42 (pid:10864) Using config source: /etc/condor/condor_config
03/22/11 15:29:42 (pid:10864) Using local config sources: 
03/22/11 15:29:42 (pid:10864)    /etc/condor/condor_config.local
03/22/11 15:29:42 (pid:10864) DaemonCore: command socket at <128.111.185.149:37280>
03/22/11 15:29:42 (pid:10864) DaemonCore: private command socket at <128.111.185.149:37280>
03/22/11 15:29:42 (pid:10864) Setting maximum accepts per cycle 4.
03/22/11 15:29:42 (pid:10864) History file rotation is enabled.
03/22/11 15:29:42 (pid:10864)   Maximum history file size is: 20971520 bytes
03/22/11 15:29:42 (pid:10864)   Number of rotated history files is: 2
03/22/11 15:29:48 (pid:10864) Sent ad to central manager for kgk@xxxxxxxxxxxx
03/22/11 15:29:48 (pid:10864) Sent ad to 1 collectors for kgk@xxxxxxxxxxxx
03/22/11 15:30:02 (pid:10864) Negotiating for owner: kgk@xxxxxxxxxxxx
03/22/11 15:30:02 (pid:10864) AutoCluster:config() significant atttributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory
*** End of file SchedLog

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/