
[Condor-users] Jobs returning blank output -- Schedd dying on master after job submission



I have found a potential problem which may be related. It seems the application was not executing
successfully because my dataset file was not being transferred along with it.
 
So I have added the option:
 
transfer_input_files = dataset.dat
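
For context, this option sits in an otherwise ordinary vanilla-universe submit file; a minimal sketch of what I mean is below (the executable and file names here are placeholders, not my exact file):

universe                = vanilla
executable              = myprog
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = dataset.dat
output                  = myprog.out
error                   = myprog.err
log                     = myprog.log
queue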
 
I then submit my job again... This time it crashes something in Condor.
 
 
-- Failed to fetch ads from: <192.168.1.1:43028> : thebeast.cluster.int
CEDAR:6001:Failed to connect to <192.168.1.1:43028>
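
To double-check whether the schedd has really died (rather than just being unreachable over the network), one can look at the process list and ask the collector for schedd ads, e.g.:

ps -fe | grep condor_schedd
condor_status -schedd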
 
 
I kill the master and daemons.
 
condor@thebeast:~/jobs/som-oct-5th> killall condor_master
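
(As an aside, I believe the gentler equivalent is to ask the master to shut its daemons down itself, e.g.:

condor_off -master

killall sends SIGTERM, which condor_master also treats as a graceful shutdown request, so the effect should be much the same.)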
 
 
condor@thebeast:~/jobs/som-oct-5th> ps -fe | grep condor
root      5532  3653  0 15:56 ?        00:00:00 sshd: condor [priv]
condor    5535  5532  0 15:56 ?        00:00:00 sshd: condor@pts/3
condor    5536  5535  0 15:56 pts/3    00:00:01 -bash
condor    6368  5536  0 16:28 pts/3    00:00:00 ps -fe
condor    6369  5536  0 16:28 pts/3    00:00:00 grep condor
condor@thebeast:~/jobs/som-oct-5th>
 
I now execute the master again.
 
The Scheduler is not starting up.
 
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_master
condor@thebeast:~/jobs/som-oct-5th> ps -fe | grep condor
root      5532  3653  0 15:56 ?        00:00:00 sshd: condor [priv]
condor    5535  5532  0 15:56 ?        00:00:00 sshd: condor@pts/3
condor    5536  5535  0 15:56 pts/3    00:00:01 -bash
condor    6371     1  0 16:29 ?        00:00:00 /home/condor/condor/sbin/condor_master
condor    6372  6371  0 16:29 ?        00:00:00 condor_collector -f
condor    6373  6371  0 16:29 ?        00:00:01 condor_startd -f
condor    6375  6371  0 16:29 ?        00:00:00 condor_negotiator -f
condor    6389  5536  0 16:29 pts/3    00:00:00 ps -fe
condor    6390  5536  0 16:29 pts/3    00:00:00 grep condor
 
The SchedLog reports the following:
 
10/11 16:13:37 ******************************************************
10/11 16:13:37 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/11 16:13:37 ** /home/condor/condor/sbin/condor_schedd
10/11 16:13:37 ** $CondorVersion: 6.7.10 Aug  3 2005 $
10/11 16:13:37 ** $CondorPlatform: I386-LINUX_RH9 $
10/11 16:13:37 ** PID = 6412
10/11 16:13:37 ******************************************************
10/11 16:13:37 Using config file: /home/condor/condor_config
10/11 16:13:37 Using local config files: /home/condor/condor/hosts/thebeast/condor_config.local
10/11 16:13:37 DaemonCore: Command Socket at <192.168.1.1:43225>
10/11 16:13:37 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
10/11 16:13:37 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
10/11 16:13:37 Will use UDP to update collector thebeast.cluster.int <192.168.1.1:9618>
10/11 16:13:37 Using name: thebeast.cluster.int
10/11 16:13:37 No Accountant host specified in config file
10/11 16:13:37 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
10/11 16:13:37 JOB_START_COUNT is undefined, using default value of 1
10/11 16:13:37 MAX_JOBS_SUBMITTED is undefined, using default value of 2147483647
10/11 16:13:37 STARTD_CONTACT_TIMEOUT is undefined, using default value of 45
10/11 16:13:37 initLocalStarterDir: /home/condor/condor/hosts/thebeast/spool/local_univ_execute already exists, deleting old contents
10/11 16:13:37 JOB_IS_FINISHED_INTERVAL is undefined, using default value of 0
10/11 16:13:37 Period for SelfDrainingQueue job_is_finished_queue set to 0
10/11 16:13:37 Queue Management Super Users:
10/11 16:13:37  root
10/11 16:13:37  condor
10/11 16:13:37 CronMgr: Constructing 'schedd'
10/11 16:13:37 CronMgr: Setting name to 'schedd'
10/11 16:13:37 CronMgr: Setting parameter base to 'schedd'
10/11 16:13:37 CronMgr: Doing config (initial)
10/11 16:13:37 About to truncate log /home/condor/condor/hosts/thebeast/spool/job_queue.log
10/11 16:13:37 entering FileTransfer::SimpleInit
 
This is everything logged *before* it dies; there is no information after it.
 
None of the other logs report anything out of the ordinary.
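
If it helps with debugging, the schedd's log verbosity can be raised in the local config file, e.g.:

SCHEDD_DEBUG = D_FULLDEBUG

followed by a condor_reconfig (or, in this case, simply restarting the master), which should give more detail about where exactly it dies.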
 
So I kill the daemons again.
 
And delete the log, spool, and execute directories in the central manager's host directory:
 
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/log/
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/spool/
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/execute/
And recreate them:
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_init
/home/condor/condor_config already exists.
Creating /home/condor/condor/hosts/thebeast/log
Creating /home/condor/condor/hosts/thebeast/spool
Creating /home/condor/condor/hosts/thebeast/execute
/home/condor/condor/hosts/thebeast/condor_config.local already exists.
Condor has been initialized, but not started.
 
And execute the master again... This time the scheduler starts up.
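
(Since the SchedLog shows the schedd dying right after "About to truncate log .../spool/job_queue.log", I suspect the persisted job queue was what it was choking on, and a less drastic reset might have been enough, something like:

killall condor_master
rm ~/condor/hosts/thebeast/spool/job_queue.log
/home/condor/condor/sbin/condor_master

That would discard the queued jobs but keep the logs. Only a guess, though.)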
 
Why is this happening?