
[Condor-users] Slave machine produces no output; return value 203

I am attempting to render a Maya scene file.  My pool contains 3 physical computers and 12 virtual machines, and the machines have identical hardware.  I'm using condor_render.exe to generate and submit the jobs to Condor.  If I render more than 4 frames, some of the rendered images never show up; for example, if I render 45 frames, only about 20 images appear.  I have narrowed the problem down to the jobs rendered on the slave computers: these jobs return a value of 203, as shown in the log files below, while jobs rendered on the master return a value of 0.
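
For context, condor_render queues one vanilla-universe job per frame.  I don't have the generated submit file in front of me, but judging from the starter log below it is along these lines (a rough sketch only: render_frame.bat is a placeholder name, and the transfer syntax is from memory of the 6.6 manual):

# per-frame submit description (sketch, not the literal generated file)
# Condor renames the transferred executable to condor_exec.bat on the
# execute machine, which is the name that shows up in the starter log
universe       = vanilla
executable     = render_frame.bat
arguments      = -rd . -im Frame -s 4.0000 -e 4.0000 -b 1.0000 ~Test1.mb
transfer_files = ONEXIT
output         = cr.out
error          = cr.err
log            = cr.log
queue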
 
According to the starter log on the slave machine (included below), everything appears the same as in the master's starter log until we reach this line, fourth from the bottom:
 
5/7 05:27:15 Process exited, pid=2176, status=203
Can anyone tell me what return value 203 means?  What steps should I take to correct the problem?
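
If it helps, the failing invocation can be retried by hand in a console on one of the slaves, along these lines (C:\temp\render_test is a made-up scratch directory; the job's input files have to be copied into it first, because Condor deletes its execute directory, e.g. C:\Condor\execute\dir_1156, as soon as the job ends):

rem stage the scene file and wrapper script into a scratch directory, then:
cd C:\temp\render_test
rem run the exact command line the starter logged
C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat -rd . -im Frame -s 4.0000 -e 4.0000 -b 1.0000 ~Test1.mb
rem print the exit code cmd returned; presumably 203 on the failing slaves
echo %ERRORLEVEL%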
 
Any help greatly appreciated.  Thank you.
 
-Mike
 
 
excerpt from the Condor job log file:
...
005 (002.003.000) 05/07 05:27:16 Job terminated.
 (1) Normal termination (return value 203)
  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
 33  -  Run Bytes Sent By Job
 63428  -  Run Bytes Received By Job
 33  -  Total Bytes Sent By Job
 63428  -  Total Bytes Received By Job
...
005 (002.000.000) 05/07 05:27:17 Job terminated.
 (1) Normal termination (return value 0)
  Usr 0 00:00:01, Sys 0 00:00:00  -  Run Remote Usage
  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
  Usr 0 00:00:01, Sys 0 00:00:00  -  Total Remote Usage
  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
 20594  -  Run Bytes Sent By Job
 63428  -  Run Bytes Received By Job
 20594  -  Total Bytes Sent By Job
 63428  -  Total Bytes Received By Job
...
 
 
excerpt from StarterLog.vm1 on the slave machine:
5/7 05:27:14 ******************************************************
5/7 05:27:14 ** condor_starter (CONDOR_STARTER) STARTING UP
5/7 05:27:14 ** C:\Condor\bin\condor_starter.exe
5/7 05:27:14 ** $CondorVersion: 6.6.9 Mar 10 2005 $
5/7 05:27:14 ** $CondorPlatform: INTEL-WINNT40 $
5/7 05:27:14 ** PID = 1156
5/7 05:27:14 ******************************************************
5/7 05:27:14 Using config file: C:\Condor\condor_config
5/7 05:27:14 Using local config files: C:\Condor/condor_config.local
5/7 05:27:14 DaemonCore: Command Socket at <10.100.4.8:4247>
5/7 05:27:14 Setting resource limits not implemented!
5/7 05:27:14 Starter communicating with condor_shadow <10.100.4.8:4244>
5/7 05:27:14 Submitting machine is "anim2"
5/7 05:27:14 File transfer completed successfully.
5/7 05:27:15 Starting a VANILLA universe job with ID: 2.3
5/7 05:27:15 IWD: C:\Condor/execute\dir_1156
5/7 05:27:15 Output file: C:\Condor/execute\dir_1156\cr.out
5/7 05:27:15 Error file: C:\Condor/execute\dir_1156\cr.err
5/7 05:27:15 Renice expr "10" evaluated to 10
5/7 05:27:15 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat -rd . -im Frame -s 4.0000 -e 4.0000 -b 1.0000 ~Test1.mb
5/7 05:27:15 Create_Process succeeded, pid=2176
5/7 05:27:15 Process exited, pid=2176, status=203
5/7 05:27:16 Got SIGQUIT.  Performing fast shutdown.
5/7 05:27:16 ShutdownFast all jobs.
5/7 05:27:16 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
 
 
excerpt from the SchedLog on the slave:
5/7 05:27:02 DaemonCore: Command received via UDP from host <10.100.4.8:4217>
5/7 05:27:02 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
5/7 05:27:02 Sent ad to central manager for Anim@xxxxxxxxxx
5/7 05:27:02 Called reschedule_negotiator()
5/7 05:27:02 Activity on stashed negotiator socket
5/7 05:27:02 Negotiating for owner: Anim@xxxxxxxxxx
5/7 05:27:02 Checking consistency running and runnable jobs
5/7 05:27:02 Tables are consistent
5/7 05:27:04 Out of jobs - 5 jobs matched, 0 jobs idle, flock level = 0
5/7 05:27:07 Started shadow for job 2.0 on "<10.100.4.6:4472>", (shadow pid = 2628)
5/7 05:27:07 Sent ad to central manager for Anim@xxxxxxxxxx
5/7 05:27:09 Started shadow for job 2.1 on "<10.100.4.6:4472>", (shadow pid = 2208)
5/7 05:27:11 Started shadow for job 2.2 on "<10.100.4.6:4472>", (shadow pid = 2920)
5/7 05:27:13 Started shadow for job 2.3 on "<10.100.4.8:2484>", (shadow pid = 3816)
5/7 05:27:15 Started shadow for job 2.4 on "<10.100.4.6:4472>", (shadow pid = 4076)
5/7 05:27:15 Sent ad to central manager for Anim@xxxxxxxxxx
5/7 05:27:16 DaemonCore: Command received via UDP from host <10.100.4.8:4264>
5/7 05:27:16 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
5/7 05:27:16 Shadow pid 3816 for job 2.3 exited with status 100
5/7 05:27:16 match (<10.100.4.8:2484>#2627232347) out of jobs (cluster id 2); relinquishing
5/7 05:27:16 Sent RELEASE_CLAIM to startd on <10.100.4.8:2484>
5/7 05:27:16 Match record (<10.100.4.8:2484>, 2, -1) deleted
5/7 05:27:16 DaemonCore: Command received via TCP from host <10.100.4.8:4267>
5/7 05:27:16 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
5/7 05:27:16 Got VACATE_SERVICE from <10.100.4.8:4267>
5/7 05:27:17 DaemonCore: Command received via UDP from host <10.100.4.8:4274>
5/7 05:27:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
5/7 05:27:17 Shadow pid 2628 for job 2.0 exited with status 100
5/7 05:27:17 match (<10.100.4.6:4472>#2785286360) out of jobs (cluster id 2); relinquishing
5/7 05:27:17 Sent RELEASE_CLAIM to startd on <10.100.4.6:4472>
5/7 05:27:17 Match record (<10.100.4.6:4472>, 2, -1) deleted