
[Condor-users] Lost Connection?



All,

 

I am submitting a job (written in Java) to the Central Manager. When I submit it for execution on the Central Manager itself everything works fine, but when I submit it to another machine (using: Rank = (machine == "<target_machine>")) I get the error below in the shadow log.

 

7/25 11:13:13 ******************************************************

7/25 11:13:13 ** condor_shadow (CONDOR_SHADOW) STARTING UP

7/25 11:13:13 ** D:\condor\bin\condor_shadow.exe

7/25 11:13:13 ** $CondorVersion: 6.8.0 Jul 19 2006 $

7/25 11:13:13 ** $CondorPlatform: INTEL-WINNT50 $

7/25 11:13:13 ** PID = 2668

7/25 11:13:13 ** Log last touched 7/25 11:13:11

7/25 11:13:13 ******************************************************

7/25 11:13:13 Using config source: D:\condor\condor_config

7/25 11:13:13 Using local config sources:

7/25 11:13:13    D:\condor/condor_config.local

7/25 11:13:13 DaemonCore: Command Socket at <192.168.16.23:1637>

7/25 11:13:13 Initializing a JAVA shadow for job 23.0

7/25 11:13:13 (23.0) (2668): Request to run on <192.168.16.37:3345> was ACCEPTED

7/25 11:13:13 (22.0) (1316): Job 22.0 terminated: exited with status 0

7/25 11:13:14 (22.0) (1316): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

7/25 11:13:15 (23.0) (2668): condor_read(): recv() returned -1, errno = 10054, assuming failure.

7/25 11:13:15 (23.0) (2668): Can no longer talk to condor_starter <192.168.16.37:3345>

7/25 11:13:15 (23.0) (2668): Trying to reconnect to disconnected job

7/25 11:13:15 (23.0) (2668): LastJobLeaseRenewal: 1153818795 Tue Jul 25 11:13:15 2006

7/25 11:13:15 (23.0) (2668): JobLeaseDuration: 1200 seconds

7/25 11:13:15 (23.0) (2668): JobLeaseDuration remaining: 1200

7/25 11:13:15 (23.0) (2668): Attempting to reconnect to starter <192.168.16.37:3372>
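
The shadow log above interleaves entries for jobs 22.0 and 23.0, so it can help to isolate the lines for the failing job. What follows is only a minimal Python sketch and not part of Condor: the function name `job_lines` and the regular expression are assumptions based solely on the line format shown above. On Windows, errno 10054 is the Winsock code WSAECONNRESET (connection reset by peer), which the sketch appends to matching lines.

```python
import re

# Winsock error codes worth annotating; 10054 is WSAECONNRESET.
WINSOCK_ERRORS = {10054: "WSAECONNRESET (connection reset by peer)"}

# Assumed line shape: "<date> <time> (<job id>) (<pid>): <message>"
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \((\d+\.\d+)\) \((\d+)\): (.*)$")


def job_lines(log_text, job_id):
    """Return (timestamp, message) pairs for one job id.

    Lines without a job id (such as the startup banner) are skipped.
    Messages that contain a known errno get a short annotation appended.
    """
    result = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(2) != job_id:
            continue
        message = match.group(4)
        err = re.search(r"errno = (\d+)", message)
        if err and int(err.group(1)) in WINSOCK_ERRORS:
            message += "  [" + WINSOCK_ERRORS[int(err.group(1))] + "]"
        result.append((match.group(1), message))
    return result


# A few lines copied from the shadow log above.
sample = """7/25 11:13:13 (22.0) (1316): Job 22.0 terminated: exited with status 0
7/25 11:13:15 (23.0) (2668): condor_read(): recv() returned -1, errno = 10054, assuming failure.
7/25 11:13:15 (23.0) (2668): Can no longer talk to condor_starter <192.168.16.37:3345>"""

for timestamp, message in job_lines(sample, "23.0"):
    print(timestamp, message)
```

Filtering on the job id leaves only the 23.0 entries, which makes the sequence (request accepted, connection reset, reconnect attempt) easier to follow.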

 

 

On the target machine I get:

7/25 15:10:22 ******************************************************

7/25 15:10:22 ** condor_starter (CONDOR_STARTER) STARTING UP

7/25 15:10:22 ** C:\condor\bin\condor_starter.exe

7/25 15:10:22 ** $CondorVersion: 6.8.0 Jul 19 2006 $

7/25 15:10:22 ** $CondorPlatform: INTEL-WINNT50 $

7/25 15:10:22 ** PID = 2012

7/25 15:10:22 ** Log last touched 7/25 14:40:20

7/25 15:10:22 ******************************************************

7/25 15:10:22 Using config source: C:\condor\condor_config

7/25 15:10:22 Using local config sources:

7/25 15:10:22    C:\condor/condor_config.local

7/25 15:10:22 DaemonCore: Command Socket at <192.168.16.37:4226>

7/25 15:10:22 Setting resource limits not implemented!

7/25 15:10:22 Communicating with shadow <192.168.16.23:4181>

7/25 15:10:22 Submitting machine is "RONEN01.sbs.local"

7/25 15:10:23 Initialized IO Proxy.

7/25 15:10:23 File transfer completed successfully.

7/25 15:10:24 Starting a JAVA universe job with ID: 23.0

7/25 15:10:24 JavaProc: Cmd="C:\Program Files\Java\jdk1.5.0_06\bin\JAVA.EXE"

7/25 15:10:24 JavaProc: Args=-classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. -Xmx300m -Dchirp.config=C:\condor\execute\dir_2012\chirp.config CondorJavaWrapper C:\condor\execute\dir_2012\jvm.start C:\condor\execute\dir_2012\jvm.end com.test.Hello

7/25 15:10:24 IWD: C:\condor/execute\dir_2012

7/25 15:10:24 Output file: C:\condor/execute\dir_2012\_condor_stdout

7/25 15:10:24 Error file: C:\condor/execute\dir_2012\_condor_stderr

7/25 15:10:24 Renice expr "10" evaluated to 10

7/25 15:10:24 About to exec C:\condor/execute\dir_2012\"C:\Program Files\Java\jdk1.5.0_06\bin\JAVA.EXE" -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. -Xmx300m -Dchirp.config=C:\condor\execute\dir_2012\chirp.config CondorJavaWrapper C:\condor\execute\dir_2012\jvm.start C:\condor\execute\dir_2012\jvm.end com.test.Hello

 

 

Although everything seems OK on the target machine, the Central Manager never gets the job execution results.

Any suggestions?

 

Thanks,

Ronen