
Re: [Condor-users] Problem using schedd web service



Matthew Farrellee wrote:

All the files that are going to be input or output (including Out/Err) should be declared.

If you still have trouble after you declare Out and Err, try adding StageInStart and StageInFinish to the job ad, both set to some non-zero integer. CreateJobTemplate will add those attributes for you in future versions of Condor.
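A minimal sketch of what adding those two attributes could look like on the client side. The Attr class, the "INTEGER" type label, and the withStageIn helper are hypothetical stand-ins for illustration, not the types generated from the schedd WSDL; only the attribute names StageInStart and StageInFinish come from the advice above.

```java
import java.util.ArrayList;
import java.util.List;

public class StageInAttrs {
    // Hypothetical stand-in for a job ad attribute (name, type, value).
    // The real generated stub type may differ.
    static final class Attr {
        final String name;
        final String type;
        final String value;

        Attr(String name, String type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    // Returns a copy of the job ad with StageInStart and StageInFinish
    // appended as integer-typed attributes. Both values must be non-zero.
    static List<Attr> withStageIn(List<Attr> jobAd, int start, int finish) {
        if (start == 0 || finish == 0) {
            throw new IllegalArgumentException("stage-in values must be non-zero");
        }
        List<Attr> out = new ArrayList<>(jobAd);
        out.add(new Attr("StageInStart", "INTEGER", Integer.toString(start)));
        out.add(new Attr("StageInFinish", "INTEGER", Integer.toString(finish)));
        return out;
    }

    public static void main(String[] args) {
        List<Attr> ad = withStageIn(new ArrayList<>(), 10, 20);
        for (Attr a : ad) {
            System.out.println(a.name + " (" + a.type + ") = " + a.value);
        }
    }
}
```

The point of the sketch is only that the values are declared as integers rather than strings; how the resulting list is handed to the submit call depends on the generated stub.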


matt


I'm sorry to bother you again, but I'm still having a problem, although at least it's a different problem now! I am declaring the stdout and stderr files like this:


    // Declare the stdout and stderr files.
    Status retval = stub.declareFile(
        txn, clusterId, jobId,
        OUTPUT_FILENAME,
        Integer.MAX_VALUE, // Also tried with -1
        HashType.NOHASH, null);
    retval = stub.declareFile(
        txn, clusterId, jobId,
        ERROR_FILENAME,
        Integer.MAX_VALUE,
        HashType.NOHASH, null);

and I have also included StageInStart ("10") and StageInFinish ("20") in the JobAd. However, I am getting the following in the ShadowLog:

10/28 09:08:00 (140.0) (8981):Requesting Primary Starter
10/28 09:08:00 (140.0) (8981):Shadow: Request to run a job was ACCEPTED
10/28 09:08:00 (140.0) (8981):Shadow: RSC_SOCK connected, fd = 17
10/28 09:08:00 (140.0) (8981):Shadow: CLIENT_LOG connected, fd = 18
10/28 09:08:00 (140.0) (8981):My_Filesystem_Domain = "ixico.net"
10/28 09:08:00 (140.0) (8981):My_UID_Domain = "ixico.net"
10/28 09:08:00 (140.0) (8981): Entering pseudo_get_file_stream
10/28 09:08:00 (140.0) (8981): file = "/opt/condor-6.6.10/examples/env.remote"
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8983 exited with status 0
10/28 09:08:00 (140.0) (8981):Read: condor_restart:
10/28 09:08:00 (140.0) (8981):Read: Checkpoint file name is "/home/condor/spool/cluster140.proc0.subproc0"
10/28 09:08:00 (140.0) (8981): Entering pseudo_get_file_stream
10/28 09:08:00 (140.0) (8981): file = "/home/condor/spool/cluster140.proc0.subproc0"
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981):Read: Opened "/home/condor/spool/cluster140.proc0.subproc0" via file stream
10/28 09:08:00 (140.0) (8987):Failed to transfer 96 bytes (only sent -1)
10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8987 exited with status 1
10/28 09:08:00 (140.0) (8981):Shadow: Job 140.0 exited, termsig = 9, coredump = 0, retcode = 0
10/28 09:08:00 (140.0) (8981):Shadow: Job was kicked off without a checkpoint
10/28 09:08:00 (140.0) (8981):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster140.proc0.subproc0.tmp'
10/28 09:08:00 (140.0) (8981):Trying to unlink /home/condor/spool/cluster140.proc0.subproc0.tmp
10/28 09:08:00 (140.0) (8981):user_time = 1 ticks
10/28 09:08:00 (140.0) (8981):sys_time = 2 ticks
10/28 09:08:00 (140.0) (8981):********** Shadow Exiting(107) **********



It looks like Condor is trying to write to a directory (/home/condor/spool/cluster140.proc0.subproc0) as if it were a file, but I have no idea why. Any suggestions?


Thanks,

Peter