
[condor-users] Shadow exiting with status 100



Hello!

I found a similar post from October 2003 (http://www.cs.wisc.edu/~lists/archive/condor-users/msg00129.html), but it was never answered.

My Java jobs are behaving strangely on Condor. The jobs run in a different pool via flocking. When a job completes within a few minutes, everything is fine. When a job runs longer (about an hour in the logs below), it still completes, but no output files are transferred back, even though Condor reports that everything went fine.

The submit machine runs Condor 6.5.5, whereas the execute machine and its central manager run Condor 6.6.0. All machines run Linux.

Could this be caused by the execute machine's local time being one hour ahead of the submit machine's? Apart from the flocking and the time difference, I can't see anything suspicious.

Any ideas or suggestions would be appreciated.


The relevant portions of the logs follow.


IPs are as follows:
* 1.1.1.1 -- submit machine;
* 2.2.2.2 -- execute machine;
* 3.3.3.3 -- central manager of the pool which the execute machine belongs to.


Here's the ShadowLog of the submit machine:
1/7 10:19:57 ******************************************************
1/7 10:19:57 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/7 10:19:57 ** $CondorVersion: 6.5.5 Sep 15 2003 $
1/7 10:19:57 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
1/7 10:19:57 ** PID = 5383
1/7 10:19:57 ******************************************************
1/7 10:19:57 Using config file: /usr/local/condor/etc/condor_config
1/7 10:19:57 Using local config files: /opt/condor/condor_config.local
1/7 10:19:57 DaemonCore: Command Socket at <1.1.1.1:33713>
1/7 10:19:58 Initializing a JAVA shadow
1/7 10:19:58 (2765.0) (5383): Request to run on <2.2.2.2:34942> was ACCEPTED
1/7 11:20:08 (2765.0) (5383): DC_AUTHENTICATE: attempt to open invalid session klyubin:5383:1073470807:1, failing.
1/7 11:24:46 (2765.0) (5383): DC_AUTHENTICATE: attempt to open invalid session klyubin:5383:1073470798:0, failing.
1/7 11:24:46 (2765.0) (5383): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
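
One thing that may help when reading the two DC_AUTHENTICATE lines: the rejected session IDs look like colon-separated fields, and the third field resembles a Unix timestamp. The sketch below decodes them under that assumption. The `user:pid:unix_time:counter` layout is a guess based only on the strings in this log (the second field matches the shadow PID 5383), not on Condor documentation, and `SessionId` and `parse_session_id` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class SessionId(NamedTuple):
    user: str
    pid: int
    created: datetime  # decoded from the third field, assumed to be Unix time
    counter: int


def parse_session_id(text: str) -> SessionId:
    # Assumed layout: user:pid:unix_time:counter (a guess from the log lines,
    # not a documented Condor format).
    user, pid, epoch, counter = text.split(":")
    created = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return SessionId(user, int(pid), created, int(counter))


# The two session IDs copied from the ShadowLog excerpt above.
for raw in ("klyubin:5383:1073470807:1", "klyubin:5383:1073470798:0"):
    sid = parse_session_id(raw)
    print(sid.user, sid.pid, sid.created.isoformat(), sid.counter)
```

If the guess is right, both sessions date from 10:19:58 and 10:20:07 UTC on 7 January 2004, which is when the shadow started, and they were refused about an hour later.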



StartLog of the execute machine:
1/7 10:28:07 DaemonCore: Command received via UDP from host <3.3.3.3:33457>
1/7 10:28:07 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
1/7 11:18:28 DaemonCore: Command received via UDP from host <3.3.3.3:33457>
1/7 11:18:28 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/7 11:18:28 match_info called
1/7 11:18:28 Received match <2.2.2.2:34942>#7246952077
1/7 11:18:28 State change: match notification protocol successful
1/7 11:18:28 Changing state: Unclaimed -> Matched
1/7 11:18:28 DaemonCore: Command received via TCP from host <1.1.1.1:36761>
1/7 11:18:28 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/7 11:18:28 Request accepted.
1/7 11:18:28 Remote owner is nice-user.klyubin@xxxxxxxxxxxxxxxxxxxxxxxx
1/7 11:18:28 State change: claiming protocol successful
1/7 11:18:28 Changing state: Matched -> Claimed
1/7 11:18:32 DaemonCore: Command received via TCP from host <1.1.1.1:30073>
1/7 11:18:32 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/7 11:18:32 Got activate_claim request from shadow (<1.1.1.1:30073>)
1/7 11:18:32 Remote job ID is 2765.0
1/7 11:18:32 Got universe "JAVA" (10) from request classad
1/7 11:18:32 State change: claim-activation protocol successful
1/7 11:18:32 Changing activity: Idle -> Busy
1/7 12:23:19 DaemonCore: Command received via TCP from host <1.1.1.1:38267>
1/7 12:23:19 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/7 12:23:19 Called deactivate_claim_forcibly()
1/7 12:23:19 Starter pid 13466 exited with status 0
1/7 12:23:19 State change: starter exited
1/7 12:23:19 Changing activity: Busy -> Idle
1/7 12:23:20 DaemonCore: Command received via UDP from host <1.1.1.1:33566>
1/7 12:23:20 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/7 12:23:20 State change: received RELEASE_CLAIM command
1/7 12:23:20 Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/7 12:23:20 State change: No preempting claim, returning to owner
1/7 12:23:20 Changing state and activity: Preempting/Vacating -> Owner/Idle
1/7 12:23:20 State change: IS_OWNER is false
1/7 12:23:20 Changing state: Owner -> Unclaimed
1/7 12:23:20 DaemonCore: Command received via UDP from host <1.1.1.1:33829>
1/7 12:23:20 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/7 12:23:20 Error: can't find resource with capability (<2.2.2.2:34942>#7246952077)



StarterLog of the execute machine:
1/7 11:18:32 passwd_cache: getpwnam() failed at line 150 with error Inappropriate ioctl for device
1/7 11:18:32 passwd_cache: getpwnam() failed at line 150 with error Inappropriate ioctl for device
1/7 11:18:32 ******************************************************
1/7 11:18:32 ** condor_starter (CONDOR_STARTER) STARTING UP
1/7 11:18:32 ** $CondorVersion: 6.6.0 Nov 13 2003 $
1/7 11:18:32 ** $CondorPlatform: INTEL-LINUX-GLIBC23 $
1/7 11:18:32 ** PID = 13466
1/7 11:18:32 ******************************************************
1/7 11:18:32 Using config file: /usr/local/condor/etc/condor_config
1/7 11:18:32 Using local config files: /var/condor/condor_config.local
1/7 11:18:32 DaemonCore: Command Socket at <2.2.2.2:37201>
1/7 11:18:32 Done setting resource limits
1/7 11:18:32 Starter communicating with condor_shadow <1.1.1.1:33713>
1/7 11:18:32 Submitting machine is "Alexander"
1/7 11:18:32 Initialized IO Proxy.
1/7 11:18:32 File transfer completed successfully.
1/7 11:18:33 Starting a JAVA universe job with ID: 2765.0
1/7 11:18:33 JavaProc: Cmd=/tools/j2sdk1.3.1/bin/java
1/7 11:18:33 JavaProc: Args=Args="-Xmx501m -classpath /usr/local/condor/lib:/usr/local/condor/lib/scimark2lib.jar:.:Test.jar -Dchirp.config=/var/condor/execute/dir_13466/chirp.config CondorJavaWrapper /var/condor/execute/dir_13466/jvm.start /var/condor/execute/dir_13466/jvm.end Test"
1/7 11:18:33 IWD: /var/condor/execute/dir_13466
1/7 11:18:33 Output file: /var/condor/execute/dir_13466/Test.out.part0
1/7 11:18:33 Error file: /var/condor/execute/dir_13466/Test.err.part0
1/7 11:18:33 Renice expr "19" evaluated to 19
1/7 11:18:33 About to exec /tools/j2sdk1.3.1/bin/java -Xmx501m -classpath /usr/local/condor/lib:/usr/local/condor/lib/scimark2lib.jar:.:Test.jar -Dchirp.config=/var/condor/execute/dir_13466/chirp.config CondorJavaWrapper /var/condor/execute/dir_13466/jvm.start /var/condor/execute/dir_13466/jvm.end Test
1/7 11:18:33 Create_Process succeeded, pid=13468
1/7 12:23:19 Process exited, pid=13468, status=0
1/7 12:23:19 JavaProc: JVM pid 13468 has finished
1/7 12:23:19 JavaProc: JVM exited normally with code 0
1/7 12:23:19 JavaProc: Wrapper left start record /var/condor/execute/dir_13466/jvm.start
1/7 12:23:19 JavaProc: Wrapper left end record /var/condor/execute/dir_13466/jvm.end
1/7 12:23:19 JavaProc: Job returned from main()
1/7 12:23:19 JavaProc: unlinking /var/condor/execute/dir_13466/jvm.start and /var/condor/execute/dir_13466/jvm.end
1/7 12:23:19 Got SIGQUIT. Performing fast shutdown.
1/7 12:23:19 ShutdownFast all jobs.
1/7 12:23:19 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
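
Because the execute machine's clock is one hour ahead, the logs are easier to compare once the timestamps are on one clock. Here is a minimal sketch that shifts a few execute-machine timestamps onto the submit machine's clock. The year 2004 is an assumption (the log lines carry only month and day), the one-hour offset is the approximate figure from this message, and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta

YEAR = 2004                      # assumption: the logs do not record the year
EXEC_AHEAD = timedelta(hours=1)  # execute clock minus submit clock, per this message


def parse(stamp: str) -> datetime:
    # Condor log timestamps look like "1/7 10:19:58" (month/day, no year).
    return datetime.strptime(f"{YEAR}/{stamp}", "%Y/%m/%d %H:%M:%S")


def to_submit_clock(exec_stamp: str) -> datetime:
    # Shift an execute-machine timestamp onto the submit machine's clock.
    return parse(exec_stamp) - EXEC_AHEAD


# Timestamps copied from the ShadowLog and StarterLog excerpts.
shadow_accepted = parse("1/7 10:19:58")
shadow_first_auth_failure = parse("1/7 11:20:08")
shadow_exit = parse("1/7 11:24:46")
job_started = to_submit_clock("1/7 11:18:33")
job_exited = to_submit_clock("1/7 12:23:19")

print("job run time:               ", job_exited - job_started)
print("accepted -> first failure:  ", shadow_first_auth_failure - shadow_accepted)
print("job exit -> shadow exit:    ", shadow_exit - job_exited)
print("failure before job exit by: ", job_exited - shadow_first_auth_failure)
```

With these values the job ran for 1:04:46, and the first DC_AUTHENTICATE failure came 1:00:10 after the request was accepted, roughly three minutes before the JVM exited. The shadow exit lands 87 seconds after the shifted job exit, so the real offset between the clocks is probably a little under an hour.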



Best Regards, Alexander Klyubin
