
[Condor-users] no correct recognition of finish job



Hello!

Here is the problem:

A Java job was submitted under Condor 6.8.2 and finished successfully; all resulting files were transferred back to the submit host.
Nevertheless, condor_q still reports this job as running.

Here are some log file snippets:

From starter.log on the remote (job-executing) machine:

11/13 11:40:08 ******************************************************
11/13 11:40:08 ** condor_starter (CONDOR_STARTER) STARTING UP
11/13 11:40:08 ** C:\condor\bin\condor_starter.exe
11/13 11:40:08 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/13 11:40:08 ** $CondorPlatform: INTEL-WINNT50 $
11/13 11:40:08 ** PID = 2636
11/13 11:40:08 ** Log last touched 11/10 08:45:04
11/13 11:40:08 ******************************************************
11/13 11:40:08 Using config source: C:\condor\condor_config
11/13 11:40:08 Using local config sources: 
11/13 11:40:08    C:\condor/condor_config.local
11/13 11:40:08 DaemonCore: Command Socket at <10.1.101.182:3447>
11/13 11:40:08 Setting resource limits not implemented!
11/13 11:40:08 Communicating with shadow <10.1.101.43:1724>
11/13 11:40:08 Submitting machine is "pc05100101kbv.kbv.int"
11/13 11:40:08 Initialized IO Proxy.
11/13 11:40:44 File transfer completed successfully.
11/13 11:40:45 Starting a JAVA universe job with ID: 63.0
11/13 11:40:45 JavaProc: Cmd=C:\Programme\Java\jre1.5.0_06\bin\JAVA.EXE
11/13 11:40:45 JavaProc: Args=-classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;.;C:\condor\execute\dir_2636\vdx_frequenzbewerter.jar;C:\condor\execute\dir_2636\vdx_framework.jar;C:\condor\execute\dir_2636\vdx_frequenz.jar;C:\condor\execute\dir_2636\vdx_support.jar;C:\condor\execute\dir_2636\vdx_keytabs.jar;C:\condor\execute\dir_2636\xbean.jar;C:\condor\execute\dir_2636\jsr173_1.0_api.jar;C:\condor\execute\dir_2636\jug-asl-2.0rc4.jar;C:\condor\execute\dir_2636\log4j-1.2.13.jar;C:\condor\execute\dir_2636\hxsqlmains4.jar;C:\condor\execute\dir_2636\kbvjpom.jar;C:\condor\execute\dir_2636\JGlobal.jar -Xmx1000m -Dchirp.config=C:\condor\execute\dir_2636\chirp.config CondorJavaWrapper C:\condor\execute\dir_2636\jvm.start C:\condor\execute\dir_2636\jvm.end de.kbv.vdx.frequenz.bewerter.FrequenzBewerterApp -v -n -b -q -iF08_1.07_kv61_tf+2005q2.xml -k74E05201.100 -lpruefliste -mFB3 -obewertet.xml -sbeispiel.bsd -x -zvdx_stammdaten_test.zip
11/13 11:40:45 IWD: C:\condor/execute\dir_2636
11/13 11:40:45 Output file: C:\condor/execute\dir_2636\bewerter.output
11/13 11:40:45 Error file: C:\condor/execute\dir_2636\bewerter.error
11/13 11:40:45 Renice expr "10" evaluated to 10
11/13 11:40:45 About to exec C:\Programme\Java\jre1.5.0_06\bin\JAVA.EXE -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;.;C:\condor\execute\dir_2636\vdx_frequenzbewerter.jar;C:\condor\execute\dir_2636\vdx_framework.jar;C:\condor\execute\dir_2636\vdx_frequenz.jar;C:\condor\execute\dir_2636\vdx_support.jar;C:\condor\execute\dir_2636\vdx_keytabs.jar;C:\condor\execute\dir_2636\xbean.jar;C:\condor\execute\dir_2636\jsr173_1.0_api.jar;C:\condor\execute\dir_2636\jug-asl-2.0rc4.jar;C:\condor\execute\dir_2636\log4j-1.2.13.jar;C:\condor\execute\dir_2636\hxsqlmains4.jar;C:\condor\execute\dir_2636\kbvjpom.jar;C:\condor\execute\dir_2636\JGlobal.jar -Xmx1000m -Dchirp.config=C:\condor\execute\dir_2636\chirp.config CondorJavaWrapper C:\condor\execute\dir_2636\jvm.start C:\condor\execute\dir_2636\jvm.end de.kbv.vdx.frequenz.bewerter.FrequenzBewerterApp -v -n -b -q -iF08_1.07_kv61_tf+2005q2.xml -k74E05201.100 -lpruefliste -mFB3 -obewertet.xml -sbeispiel.bsd -x -zvdx_stammdaten_test.zip
11/13 11:40:46 Create_Process succeeded, pid=3160
11/13 12:28:44 Suspending all jobs.
11/13 12:37:10 Continuing all jobs.
11/13 12:38:31 Suspending all jobs.
11/13 12:43:37 Continuing all jobs.
11/13 12:45:03 Suspending all jobs.
11/13 12:50:09 Continuing all jobs.
11/13 12:58:31 Process exited, pid=3160, status=0
11/13 12:58:31 JavaProc: JVM pid 3160 has finished
11/13 12:58:31 JavaProc: JVM exited normally with code 0
11/13 12:58:31 JavaProc: Wrapper left start record C:\condor\execute\dir_2636\jvm.start
11/13 12:58:31 JavaProc: Wrapper left end record C:\condor\execute\dir_2636\jvm.end
11/13 12:58:31 JavaProc: Job called System.exit(0)
11/13 12:58:31 JavaProc: unlinking C:\condor\execute\dir_2636\jvm.start and C:\condor\execute\dir_2636\jvm.end


From startd.log on the remote (job-executing) machine:

11/13 11:40:03 DaemonCore: Command received via TCP from host <10.1.101.43:1723>
11/13 11:40:03 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/13 11:40:03 Request accepted.
11/13 11:40:03 Remote owner is fweiler@xxxxxx
11/13 11:40:03 State change: claiming protocol successful
11/13 11:40:03 Changing state: Unclaimed -> Claimed
11/13 11:40:03 DaemonCore: Command received via UDP from host <10.1.101.32:35742>
11/13 11:40:03 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/13 11:40:03 match_info called
11/13 11:40:07 DaemonCore: Command received via TCP from host <10.1.101.43:1728>
11/13 11:40:07 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/13 11:40:07 Got activate_claim request from shadow (<10.1.101.43:1728>)
11/13 11:40:07 Remote job ID is 63.0
11/13 11:40:07 Got universe "JAVA" (10) from request classad
11/13 11:40:07 State change: claim-activation protocol successful
11/13 11:40:07 Changing activity: Idle -> Busy
11/13 12:28:44 State change: SUSPEND is TRUE
11/13 12:28:44 Changing activity: Busy -> Suspended
11/13 12:37:10 State change: CONTINUE is TRUE
11/13 12:37:10 Changing activity: Suspended -> Busy
11/13 12:38:31 State change: SUSPEND is TRUE
11/13 12:38:31 Changing activity: Busy -> Suspended
11/13 12:43:37 State change: CONTINUE is TRUE
11/13 12:43:37 Changing activity: Suspended -> Busy
11/13 12:45:03 State change: SUSPEND is TRUE
11/13 12:45:03 Changing activity: Busy -> Suspended
11/13 12:50:09 State change: CONTINUE is TRUE
11/13 12:50:09 Changing activity: Suspended -> Busy
11/13 12:53:40 DaemonCore: Command received via TCP from host <10.1.101.182:3724>
11/13 12:53:40 DaemonCore: received command 448 (GIVE_STATE), calling handler (command_give_state)
11/13 13:10:43 State change: SUSPEND is TRUE
11/13 13:10:43 Changing activity: Busy -> Suspended
11/13 13:17:09 State change: CONTINUE is TRUE
11/13 13:17:09 Changing activity: Suspended -> Busy
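
Comparing the two logs by hand is tedious, so here is a minimal Python sketch (not part of Condor; the function names parse, exit_time and activity_after are made up for illustration) that finds the "Process exited" record in the starter log and lists every startd activity change logged after it. It assumes each log line starts with an "MM/DD HH:MM:SS" timestamp, as in the snippets above, and that both logs fall within one calendar year, so the timestamps can be compared as plain strings.

```python
import re

# Assumption: every relevant line starts with "MM/DD HH:MM:SS ", as in the
# snippets above. No year is logged, so string comparison of timestamps is
# only valid within a single calendar year.
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse(lines):
    """Yield (timestamp, message) for each line that starts with a timestamp."""
    for line in lines:
        match = STAMP.match(line.strip())
        if match:
            yield match.group(1), match.group(2)


def exit_time(starter_lines):
    """Return the timestamp of the first 'Process exited' record, or None."""
    for stamp, message in parse(starter_lines):
        if message.startswith("Process exited"):
            return stamp
    return None


def activity_after(startd_lines, when):
    """Yield startd 'Changing activity' records logged strictly after `when`."""
    for stamp, message in parse(startd_lines):
        if stamp > when and message.startswith("Changing activity:"):
            yield stamp, message


if __name__ == "__main__":
    # Two short excerpts copied from the logs above.
    starter = [
        "11/13 12:50:09 Continuing all jobs.",
        "11/13 12:58:31 Process exited, pid=3160, status=0",
    ]
    startd = [
        "11/13 12:50:09 Changing activity: Suspended -> Busy",
        "11/13 13:10:43 Changing activity: Busy -> Suspended",
        "11/13 13:17:09 Changing activity: Suspended -> Busy",
    ]
    finished = exit_time(starter)
    for stamp, message in activity_after(startd, finished):
        print(stamp, message)
```

Run against the snippets above, it shows that the starter records the exit at 12:58:31, while the startd still changes activity at 13:10:43 and 13:17:09, i.e. the slot is still treated as claimed and busy after the JVM has gone.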


So I can't see what the problem is, because the job finished correctly.
Why doesn't Condor mark this job as finished?

Regards,
Frank Weiler, Software Developer
____________________________
Kassenärztliche Bundesvereinigung Berlin