
Re: [Condor-users] Jobs Still not returning any output





Dan Bradley wrote:

Chris Miles wrote:



I have started completely fresh: I reinstalled and began with no log files whatsoever.

The job file (hello.sub) contains:

executable = helloworld
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (Arch == "X86_64") && (OpSys == "LINUX")
output  = output_$(Process).out
error   = error_$(Process).out
log     = log.out
Queue 5





<snip> Of the logs you sent, only the shadow logs are relevant. The starter logs on the execute machine (not the submit machine) would also be useful.




ShadowLog

10/12 01:38:27 ******************************************************
10/12 01:38:27 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/12 01:38:27 ** /home/condor/release/sbin/condor_shadow
10/12 01:38:27 ** $CondorVersion: 6.7.10 Aug 3 2005 $
10/12 01:38:27 ** $CondorPlatform: I386-LINUX_RH9 $
10/12 01:38:27 ** PID = 12878
10/12 01:38:27 ******************************************************
10/12 01:38:27 Using config file: /home/condor/etc/condor_config
10/12 01:38:27 Using local config files:
/home/condor/hosts/thebeast/condor_config.local
10/12 01:38:27 DaemonCore: Command Socket at <192.168.1.1:45639>
10/12 01:38:27 SEC_DEFAULT_SESSION_DURATION is undefined, using default
value of 3600
10/12 01:38:27 Reading job ClassAd from STDIN
10/12 01:38:27 Initializing a VANILLA shadow for job 1.0
10/12 01:38:27 (1.0) (12878): ENABLE_USERLOG_LOCKING is undefined, using
default value of True
10/12 01:38:27 (1.0) (12878): UserLog = /home/condor/jobs/helloworld/log.out
10/12 01:38:27 (1.0) (12878): *** Reserved Swap = 0
10/12 01:38:27 (1.0) (12878): *** Free Swap = 787168
10/12 01:38:27 (1.0) (12878): in RemoteResource::initStartdInfo()
10/12 01:38:27 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:27 (1.0) (12878): Entering DCStartd::activateClaim()
10/12 01:38:27 (1.0) (12878): DCStartd::activateClaim: successfully sent
command, reply is: 1
10/12 01:38:27 (1.0) (12878): Request to run on <192.168.1.101:35193> was
ACCEPTED
10/12 01:38:27 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from PRE to STARTUP
10/12 01:38:27 (1.0) (12878): Getting monitoring info for pid 12878
10/12 01:38:27 (1.0) (12878): entering FileTransfer::Init
10/12 01:38:27 (1.0) (12878): entering FileTransfer::SimpleInit
10/12 01:38:27 (1.0) (12878): entering FileTransfer::HandleCommands
10/12 01:38:27 (1.0) (12878): FileTransfer::HandleCommands read
transkey=1#434c5b036fe0c01059a0454b
10/12 01:38:27 (1.0) (12878): entering FileTransfer::Upload
10/12 01:38:27 (1.0) (12878): entering FileTransfer::DoUpload
10/12 01:38:27 (1.0) (12878): DoUpload: send file
/home/condor/hosts/thebeast/spool/cluster1.ickpt.subproc0
10/12 01:38:27 (1.0) (12878): ReliSock::put_file_with_permissions(): going
to send permissions 100755
10/12 01:38:27 (1.0) (12878): put_file: going to send from filename
/home/condor/hosts/thebeast/spool/cluster1.ickpt.subproc0
10/12 01:38:27 (1.0) (12878): put_file: Found file size 10457
10/12 01:38:27 (1.0) (12878): put_file: senting 10457 bytes
10/12 01:38:27 (1.0) (12878): ReliSock: put_file: sent 10457 bytes
10/12 01:38:27 (1.0) (12878): DoUpload: exiting at 1605
10/12 01:38:28 (1.0) (12878): DaemonCore: in SendAliveToParent()
10/12 01:38:28 (1.0) (12878): DaemonCore: attempting to connect to
'<192.168.1.1:45580>'
10/12 01:38:28 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:28 (1.0) (12878): SEC_TCP_SESSION_TIMEOUT is undefined, using
default value of 20
10/12 01:38:28 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from STARTUP to EXECUTING
10/12 01:38:28 (1.0) (12878): SHADOW_QUEUE_UPDATE_INTERVAL is undefined,
using default value of 900
10/12 01:38:28 (1.0) (12878): QmgrJobUpdater: started timer to update queue
(tid=7)
10/12 01:38:28 (1.0) (12878): Inside RemoteResource::updateFromStarter()
10/12 01:38:28 (1.0) (12878): Inside RemoteResource::resourceExit()
10/12 01:38:28 (1.0) (12878): setting exit reason on vm1@xxxxxxxxxxxxxxxxx
to 100
10/12 01:38:28 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from EXECUTING to FINISHED
10/12 01:38:28 (1.0) (12878): Entering DCStartd::deactivateClaim(forceful)
10/12 01:38:28 (1.0) (12878): SEC_DEBUG_PRINT_KEYS is undefined, using
default value of False
10/12 01:38:28 (1.0) (12878): DCStartd::deactivateClaim: successfully sent
command
10/12 01:38:28 (1.0) (12878): Killed starter (fast) at <192.168.1.101:35193>
10/12 01:38:28 (1.0) (12878): Job 1.0 terminated: exited with status 0
10/12 01:38:28 (1.0) (12878): Forking Mailer process...
10/12 01:38:28 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:28 (1.0) (12878): AUTHENTICATE_FS: used file /tmp/qmgr_Kl41Hy,
status: 1
10/12 01:38:28 (1.0) (12878): Updating Job Queue:
SetAttribute(LastJobLeaseRenewal = 1129077508)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(ExitBySignal
= FALSE)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(ExitCode = 0)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(BytesSent =
0.000000)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(BytesRecvd =
10457.000000)
10/12 01:38:28 (1.0) (12878): **** condor_shadow (condor_SHADOW) EXITING
WITH STATUS 100
10/12 01:38:29 PASSWD_CACHE_REFRESH is undefined, using default value of 300





I see now file downloads happening. There are log messages about put_file, but no get_file. Therefore, it seems to me that either your job did not produce output, or something is going wrong on the execute machine. Please send StarterLog from a machine that is executing one of these jobs.
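
One quick way to check a ShadowLog for this pattern is to count the transfer messages. The following is only a minimal Python sketch and not part of Condor: the function name count_transfers is hypothetical, and it assumes uploads appear in the log as put_file and downloads as get_file, as described above. The exact message text may differ between Condor versions.

```python
import re
import sys


def count_transfers(log_text):
    """Count log lines that mention put_file (upload) or get_file (download).

    Assumption: the shadow logs transfers under these two names. Longer
    identifiers such as put_file_with_permissions are deliberately not counted.
    """
    counts = {"put_file": 0, "get_file": 0}
    for line in log_text.splitlines():
        for name in counts:
            # \b keeps put_file from matching inside put_file_with_permissions
            if re.search(r"\b" + name + r"\b", line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_transfers.py /path/to/ShadowLog
    with open(sys.argv[1]) as log_file:
        result = count_transfers(log_file.read())
    print(result)
    if result["put_file"] > 0 and result["get_file"] == 0:
        print("Uploads were logged but no downloads; check the StarterLog "
              "on the execute machine.")
```

A result with put_file lines but zero get_file lines matches the situation in the log above: the executable was sent to the execute machine, but nothing came back.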


--Dan



Oops. I meant to write, "I see no file downloads happening", not "I see now downloads happening".


--Dan