
[Condor-users] Jobs returning blank output -- Schedd dying on master after job submission



I have found a potential problem which may be related. It seems the application was not executing
successfully because my dataset file was not being transferred along with it.
 
So I have added the option:
 
transfer_input_files = dataset.dat
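
For context, this option sits in an otherwise ordinary vanilla-universe submit file; a minimal sketch of what I mean is below (the executable and file names here are placeholders, not my exact file):

universe                = vanilla
executable              = myprog
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = dataset.dat
output                  = myprog.out
error                   = myprog.err
log                     = myprog.log
queue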
 
I then submit my job again... This time it crashes something in Condor.
 
 
-- Failed to fetch ads from: <192.168.1.1:43028> : thebeast.cluster.int
CEDAR:6001:Failed to connect to <192.168.1.1:43028>
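
To double-check whether the schedd has really died (rather than just being unreachable over the network), one can look at the process list and ask the collector for schedd ads, e.g.:

ps -fe | grep condor_schedd
condor_status -schedd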
 
 
I kill the master and daemons.
 
condor@thebeast:~/jobs/som-oct-5th> killall condor_master
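
(As an aside, I believe the gentler equivalent is to ask the master to shut its daemons down itself, e.g.:

condor_off -master

killall sends SIGTERM, which condor_master also treats as a graceful shutdown request, so the effect should be much the same.)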
 
 
condor@thebeast:~/jobs/som-oct-5th> ps -fe | grep condor
root      5532  3653  0 15:56 ?        00:00:00 sshd: condor [priv]
condor    5535  5532  0 15:56 ?        00:00:00 sshd: condor@pts/3
condor    5536  5535  0 15:56 pts/3    00:00:01 -bash
condor    6368  5536  0 16:28 pts/3    00:00:00 ps -fe
condor    6369  5536  0 16:28 pts/3    00:00:00 grep condor
condor@thebeast:~/jobs/som-oct-5th>
 
I now execute the master again.
 
The Scheduler is not starting up.
 
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_master
condor@thebeast:~/jobs/som-oct-5th> ps -fe | grep condor
root      5532  3653  0 15:56 ?        00:00:00 sshd: condor [priv]
condor    5535  5532  0 15:56 ?        00:00:00 sshd: condor@pts/3
condor    5536  5535  0 15:56 pts/3    00:00:01 -bash
condor    6371     1  0 16:29 ?        00:00:00 /home/condor/condor/sbin/condor_master
condor    6372  6371  0 16:29 ?        00:00:00 condor_collector -f
condor    6373  6371  0 16:29 ?        00:00:01 condor_startd -f
condor    6375  6371  0 16:29 ?        00:00:00 condor_negotiator -f
condor    6389  5536  0 16:29 pts/3    00:00:00 ps -fe
condor    6390  5536  0 16:29 pts/3    00:00:00 grep condor
 
The SchedLog reports the following:
 
10/11 16:13:37 ******************************************************
10/11 16:13:37 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/11 16:13:37 ** /home/condor/condor/sbin/condor_schedd
10/11 16:13:37 ** $CondorVersion: 6.7.10 Aug  3 2005 $
10/11 16:13:37 ** $CondorPlatform: I386-LINUX_RH9 $
10/11 16:13:37 ** PID = 6412
10/11 16:13:37 ******************************************************
10/11 16:13:37 Using config file: /home/condor/condor_config
10/11 16:13:37 Using local config files: /home/condor/condor/hosts/thebeast/condor_config.local
10/11 16:13:37 DaemonCore: Command Socket at <192.168.1.1:43225>
10/11 16:13:37 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
10/11 16:13:37 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
10/11 16:13:37 Will use UDP to update collector thebeast.cluster.int <192.168.1.1:9618>
10/11 16:13:37 Using name: thebeast.cluster.int
10/11 16:13:37 No Accountant host specified in config file
10/11 16:13:37 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
10/11 16:13:37 JOB_START_COUNT is undefined, using default value of 1
10/11 16:13:37 MAX_JOBS_SUBMITTED is undefined, using default value of 2147483647
10/11 16:13:37 STARTD_CONTACT_TIMEOUT is undefined, using default value of 45
10/11 16:13:37 initLocalStarterDir: /home/condor/condor/hosts/thebeast/spool/local_univ_execute already exists, deleting old contents
10/11 16:13:37 JOB_IS_FINISHED_INTERVAL is undefined, using default value of 0
10/11 16:13:37 Period for SelfDrainingQueue job_is_finished_queue set to 0
10/11 16:13:37 Queue Management Super Users:
10/11 16:13:37  root
10/11 16:13:37  condor
10/11 16:13:37 CronMgr: Constructing 'schedd'
10/11 16:13:37 CronMgr: Setting name to 'schedd'
10/11 16:13:37 CronMgr: Setting parameter base to 'schedd'
10/11 16:13:37 CronMgr: Doing config (initial)
10/11 16:13:37 About to truncate log /home/condor/condor/hosts/thebeast/spool/job_queue.log
10/11 16:13:37 entering FileTransfer::SimpleInit
 
This is everything logged *before* it dies; there is no information after it.
 
None of the other logs report anything out of the ordinary.
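
If it helps with debugging, the schedd's log verbosity can be raised in the local config file, e.g.:

SCHEDD_DEBUG = D_FULLDEBUG

followed by a condor_reconfig (or, in this case, simply restarting the master), which should give more detail about where exactly it dies.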
 
So I kill the daemons again.
 
And delete the log, spool, and execute directories in the central manager's host directory:
 
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/log/
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/spool/
condor@thebeast:~/jobs/som-oct-5th> rm -R ~/condor/hosts/thebeast/execute/
And recreate them:
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_init
/home/condor/condor_config already exists.
Creating /home/condor/condor/hosts/thebeast/log
Creating /home/condor/condor/hosts/thebeast/spool
Creating /home/condor/condor/hosts/thebeast/execute
/home/condor/condor/hosts/thebeast/condor_config.local already exists.
Condor has been initialized, but not started.
 
And execute the master again... This time the scheduler starts up.
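
(Since the SchedLog shows the schedd dying right after "About to truncate log .../spool/job_queue.log", I suspect the persisted job queue was what it was choking on, and a less drastic reset might have been enough, something like:

killall condor_master
rm ~/condor/hosts/thebeast/spool/job_queue.log
/home/condor/condor/sbin/condor_master

That would discard the queued jobs but keep the logs. Only a guess, though.)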
 
Why is this happening?