
[Condor-users] Condor 6.7 Series Job Issue



Hi,

Before upgrading from Condor 6.6.10 to the Condor 6.7 series, I am testing 6.7 with BirdBath.

The test pool has two machines: the central manager (which is also the job-submitting host) and the execution node.

I followed the BirdBath configuration instructions, got the Schedd and Collector WSDL files, put them in the web directory, and then configured the Schedd to use port 9600.

The jobs run forever and keep hitting shadow exceptions. The job is a simple shell script. Here are the log messages:

 

Job Submitting Host

***********************

ShadowLog

*************

6/6 16:54:08 (10.0) (27048): JobLeaseDuration remaining: 930

6/6 16:54:08 (10.0) (27048): Scheduling another attempt to reconnect in 128 seconds

6/6 16:56:16 (10.0) (27048): Attempting to reconnect to starter <x.x.x.x:9607>

6/6 16:56:16 (10.0) (27048): getpeername failed so connect must have failed

6/6 16:56:46 (10.0) (27048): Connect failed for 30 seconds; returning FALSE

6/6 16:56:46 (10.0) (27048): Attempt to reconnect failed: Failed to connect to starter <x.x.x.x:9607>

6/6 16:56:46 (10.0) (27048): JobLeaseDuration remaining: 772

6/6 16:56:46 (10.0) (27048): Scheduling another attempt to reconnect in 256 seconds

6/6 17:01:02 (10.0) (27048): Attempting to reconnect to starter <x.x.x.x:9607>

6/6 17:01:02 (10.0) (27048): getpeername failed so connect must have failed

6/6 17:01:32 (10.0) (27048): Connect failed for 30 seconds; returning FALSE

6/6 17:01:32 (10.0) (27048): Attempt to reconnect failed: Failed to connect to starter <x.x.x.x:9607>

6/6 17:01:32 (10.0) (27048): JobLeaseDuration remaining: 486

6/6 17:01:32 (10.0) (27048): Scheduling another attempt to reconnect in 300 seconds
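The numbers in the ShadowLog excerpt above can be cross-checked with a short Python sketch. It only re-does the arithmetic on values copied from the log: the drop in "JobLeaseDuration remaining" between two entries should equal the wall-clock time between them, and the reconnect delays (128, 256, 300) look like a doubling backoff with a ceiling. The `next_delay` rule and its 300-second cap are assumptions inferred from these three values, not Condor's documented behavior, and all function names are hypothetical.

```python
# (timestamp, JobLeaseDuration remaining) pairs copied from the ShadowLog above.
LEASE_SAMPLES = [
    ("6/6 16:54:08", 930),
    ("6/6 16:56:46", 772),
    ("6/6 17:01:32", 486),
]

# Reconnect delays reported by the shadow, in seconds.
OBSERVED_DELAYS = [128, 256, 300]


def seconds_of_day(stamp):
    """Convert a 'M/D HH:MM:SS' log stamp to seconds since midnight (date ignored)."""
    clock = stamp.split()[1]
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def lease_drops_match_elapsed(samples):
    """True if each drop in remaining lease equals the wall-clock time elapsed."""
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        if seconds_of_day(t1) - seconds_of_day(t0) != r0 - r1:
            return False
    return True


def next_delay(previous, cap=300):
    """Assumed backoff rule: double the previous delay, never exceeding cap."""
    return min(previous * 2, cap)


if __name__ == "__main__":
    print("lease consistent with clock:", lease_drops_match_elapsed(LEASE_SAMPLES))
    print("delays fit doubling with cap:",
          [next_delay(d) for d in OBSERVED_DELAYS[:-1]] == OBSERVED_DELAYS[1:])
```

Both checks hold for the values shown, which suggests the shadow is simply counting down the lease while the starter at port 9607 is no longer there to answer.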

 

Execution Node

******************

StarterLog

************

6/6 16:49:38 ** Log last touched 6/6 16:19:38

6/6 16:49:38 ******************************************************

6/6 16:49:38 Using config file: /home/condor/condor_config

6/6 16:49:38 Using local config files: /home/condor/condor_config.local

6/6 16:49:38 DaemonCore: Command Socket at <x.x.x.x:9607>

6/6 16:49:38 Done setting resource limits

6/6 16:49:38 Communicating with shadow <x.x.x.x:9619>

6/6 16:49:38 Submitting machine is "machine.name.edu"

6/6 16:49:38 couldn't create dir /home/condor/execute/dir_25145: Permission denied

6/6 16:49:38 Failed to initialize JobInfoCommunicator, aborting

6/6 16:49:38 Unable to start job.

6/6 16:49:38 **** condor_starter (condor_STARTER) EXITING WITH STATUS 1

 

 

The Condor services on both the central manager and the execution node are started as root, so I don’t understand why the starter reports "Permission denied".
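One thing that may be worth checking on the execution node is who owns the execute directory and whether the account running the check can actually create entries in it. The Python sketch below is only a diagnostic aid: the path is copied from the StarterLog, the helper name `describe_dir` is hypothetical, and it assumes a Unix host (it uses the `pwd` module). It reports access for whichever user runs it, so it would need to be run as the user the starter operates under, which the logs above do not show.

```python
import os
import pwd
import stat

# Path copied from the StarterLog; adjust if the pool uses a different directory.
EXECUTE_DIR = "/home/condor/execute"


def describe_dir(path):
    """Return owner name, permission string, and write access for the calling user."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # uid has no passwd entry on this host
    return {
        "owner": owner,
        "mode": stat.filemode(info.st_mode),
        # Creating a subdirectory needs both write and search permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    if os.path.isdir(EXECUTE_DIR):
        print(describe_dir(EXECUTE_DIR))
    else:
        print(EXECUTE_DIR, "does not exist on this host")
```

If `writable` comes back false for the relevant user, that would be consistent with the "couldn't create dir" message, although it would not by itself explain the cause.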

Could you please let me know what is going on and how to get the job to run?

Thanks,

Senthil