
[Condor-users] Problem using schedd web service



Hi there,

I am currently trying to submit jobs via the schedd web service, but I have run into problems with the "Out" and "Err" JobAd properties. If neither of these properties is present, the job runs just fine. However, as soon as either one is added to the JobAd, the job fails. The only unusual thing I have found in the log files is in ShadowLog, which reports that it failed to open '/dev/null':

10/27 11:38:55 (?.?) (13074):******* Standard Shadow starting up *******
10/27 11:38:55 (?.?) (13074):** $CondorVersion: 6.7.12 Sep 24 2005 $
10/27 11:38:55 (?.?) (13074):** $CondorPlatform: I386-LINUX_RH9 $
10/27 11:38:55 (?.?) (13074):*******************************************
10/27 11:38:55 (?.?) (13074):uid=0, euid=19419, gid=0, egid=100
10/27 11:38:55 (?.?) (13074):RemoveNewShadowDroppings(): Old shadow removed new shadow ckpt directory: /home/condor/spool/cluster125.proc0.subproc0
10/27 11:38:55 (?.?) (13074):RemoveNewShadowDroppings(): Old shadow removed new shadow ckpt directory: /home/condor/spool/cluster125.proc0.subproc0.tmp
10/27 11:38:55 (?.?) (13074):Hostname = "<xxx.xxx.xxx.xxx:nnnnn>", Job = 125.0
10/27 11:38:55 (125.0) (13074):Requesting Primary Starter
10/27 11:38:55 (125.0) (13074):Shadow: Request to run a job was ACCEPTED
10/27 11:38:55 (125.0) (13074):Shadow: RSC_SOCK connected, fd = 17
10/27 11:38:55 (125.0) (13074):Shadow: CLIENT_LOG connected, fd = 18
10/27 11:38:55 (125.0) (13074):My_Filesystem_Domain = "ixico.net"
10/27 11:38:55 (125.0) (13074):My_UID_Domain = "ixico.net"
10/27 11:38:55 (125.0) (13074): Entering pseudo_get_file_stream
10/27 11:38:55 (125.0) (13074): file = "/opt/condor-6.6.10/examples/env.remote"
10/27 11:38:55 (125.0) (13074): Weird 0xc0a8010b
10/27 11:38:55 (125.0) (13074): Weird 0xc0a8010b
10/27 11:38:56 (125.0) (13074):Reaped child status - pid 13076 exited with status 0
10/27 11:38:56 (125.0) (13074):Read: User Job - $CondorPlatform: I386-LINUX_RH9 $
10/27 11:38:56 (125.0) (13074):Read: User Job - $CondorVersion: 6.6.10 Jun 13 2005 $
10/27 11:38:56 (125.0) (13074):Read: Checkpoint file name is "/home/condor/spool/cluster125.proc0.subproc0"
10/27 11:38:56 (125.0) (13074):error: Error: Couldn't open standard file '/dev/null'
10/27 11:38:56 (125.0) (13074):Shadow: Job 125.0 exited, termsig = 9, coredump = 0, retcode = 0
10/27 11:38:56 (125.0) (13074):Shadow: Job was kicked off without a checkpoint
10/27 11:38:56 (125.0) (13074):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster125.proc0.subproc0.tmp'
10/27 11:38:56 (125.0) (13074):Trying to unlink /home/condor/spool/cluster125.proc0.subproc0.tmp
10/27 11:38:56 (125.0) (13074):user_time = 1 ticks
10/27 11:38:56 (125.0) (13074):sys_time = 4 ticks
10/27 11:38:56 (125.0) (13074):********** Shadow Exiting(107) **********
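
For anyone wanting to pull lines like the '/dev/null' failure out of a longer ShadowLog, here is a minimal sketch of a filter. It assumes the line layout shown above (date, time, job id in parentheses, pid in parentheses, then the message); the class name ShadowLogErrors and the regular expression are illustrative and not part of Condor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShadowLogErrors {
    // Assumed layout, based on the excerpt above:
    //   MM/DD HH:MM:SS (job) (pid):error: message
    private static final Pattern ERROR_LINE = Pattern.compile(
        "^(\\d{1,2}/\\d{1,2} \\d{2}:\\d{2}:\\d{2}) \\(([^)]*)\\) \\((\\d+)\\):\\s*error:\\s*(.*)$");

    /** Returns "job: message" for every line whose message starts with "error:". */
    public static List<String> errors(List<String> lines) {
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ERROR_LINE.matcher(line);
            if (m.matches()) {
                // group 2 is the job id, group 4 is the text after "error:"
                found.add(m.group(2) + ": " + m.group(4));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Two lines copied from the log excerpt above.
        List<String> sample = Arrays.asList(
            "10/27 11:38:56 (125.0) (13074):Read: Checkpoint file name is \"/home/condor/spool/cluster125.proc0.subproc0\"",
            "10/27 11:38:56 (125.0) (13074):error: Error: Couldn't open standard file '/dev/null'");
        for (String e : errors(sample)) {
            System.out.println(e);
        }
    }
}
```

The pattern only matches messages that begin with "error:", so lines such as the "termsig = 9" exit report are not picked up; widening the match would be needed to catch those.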


Does anyone have any pointers on how to fix this problem? I have added my test code to the end of the e-mail. It is currently trying to run the "env.remote" example. Using "condor_submit" works fine.

Thanks in advance,

Peter

-------------
public int submitJob(String command, List<String> arguments, List<File> inputFiles, List<File> outputFiles) throws IOException{
    try{
        // Create a transaction, a cluster, and a new job.
        CondorScheddPortType stub = this.scheddService.getcondorSchedd(this.wsUrl);
        Transaction txn = stub.beginTransaction(TRANSACTION_DURATION).getTransaction();
        int clusterId = stub.newCluster(txn).getInteger();
        int jobId = stub.newJob(txn, clusterId).getInteger();

        // Convert the arguments into a single string.
        StringBuilder buffer = new StringBuilder();
        for (String arg : arguments){
            buffer.append(arg).append(' ');
        }

        // Send over the input files.
        for (File file : inputFiles){
            Status retval = stub.declareFile(txn, clusterId, jobId, file.getName(), (int) file.length(), HashType.NOHASH, null);
            System.out.println("Declaring file " + file + ": " + retval.getCode());
            sendFile(stub, txn, clusterId, jobId, file);
        }
        stub.commitTransaction(txn);

        // Now submit the job.
        txn = stub.beginTransaction(TRANSACTION_DURATION).getTransaction();
        ClassAdStructAttr[] templ = stub.createJobTemplate(
            clusterId, jobId, "user", UniverseType.STANDARD, command, buffer.toString(), "").getClassAd();
        Map<String, ClassAdStructAttr> jobAd = new HashMap<String, ClassAdStructAttr>();
        for (ClassAdStructAttr attribute : templ){
            jobAd.put(attribute.getName(), attribute);
        }

        // Customise the template.
        jobAd.put("Iwd", new ClassAdStructAttr("Iwd", ClassAdAttrType.value3, "/tmp/test-submit"));
        jobAd.put("UserLog", new ClassAdStructAttr("UserLog", ClassAdAttrType.value3, "/tmp/test-submit/log"));
        jobAd.put("LeaveJobInQueue", new ClassAdStructAttr("LeaveJobInQueue", ClassAdAttrType.value5, "FALSE"));
        jobAd.put("WantCheckpoint", new ClassAdStructAttr("WantCheckpoint", ClassAdAttrType.value5, "TRUE"));
        jobAd.put("WantRemoteSyscalls", new ClassAdStructAttr("WantRemoteSyscalls", ClassAdAttrType.value5, "TRUE"));
        jobAd.put("Err", new ClassAdStructAttr("Err", ClassAdAttrType.value3, "job.err"));
        jobAd.put("Out", new ClassAdStructAttr("Out", ClassAdAttrType.value3, "job.out"));
        // jobAd.put("ShouldTransferFiles", new ClassAdStructAttr("ShouldTransferFiles", ClassAdAttrType.value3, "NO"));
        // jobAd.put("TransferIn", new ClassAdStructAttr("TransferIn", ClassAdAttrType.value5, "TRUE"));
        // jobAd.put("In", new ClassAdStructAttr("In", ClassAdAttrType.value3, "cmd.in"));
        // jobAd.put("TransferFiles", new ClassAdStructAttr("TransferFiles", ClassAdAttrType.value3, "NEVER"));
        // jobAd.put("WhenToTransferOutput", new ClassAdStructAttr("WhenToTransferOutput", ClassAdAttrType.value3, "ON_EXIT"));

        RequirementsAndStatus retval = stub.submit(txn, clusterId, jobId, jobAd.values().toArray(new ClassAdStructAttr[0]));
        System.out.println("Submit status: " + retval.getStatus().getCode());
        stub.commitTransaction(txn);

        // Try to get the file.
        // stub.getFile(null, clusterId, jobId,
        return jobId;
    }
    catch (ServiceException ex){
        // TODO Auto-generated catch block
        ex.printStackTrace();
    }
    catch (RemoteException ex){
        // TODO Auto-generated catch block
        ex.printStackTrace();
    }
    return 0;
}
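
The "Customise the template" step relies on the map being keyed by attribute name, so each put() replaces whatever the template supplied under that name. A minimal, self-contained sketch of that behaviour follows. The Attr class is a hypothetical stand-in for the generated ClassAdStructAttr stub, and the template values are made up for illustration; they are not what createJobTemplate actually returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JobAdOverrides {
    /** Hypothetical stand-in for the generated ClassAdStructAttr stub class. */
    public static final class Attr {
        public final String name;
        public final String value;
        public Attr(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Copies the template into a map keyed by name, then applies the overrides on top. */
    public static Map<String, Attr> merge(Attr[] template, Attr... overrides) {
        Map<String, Attr> jobAd = new LinkedHashMap<>();
        for (Attr a : template) {
            jobAd.put(a.name, a);
        }
        for (Attr a : overrides) {
            // An override with an existing name replaces the template entry;
            // a new name is simply added.
            jobAd.put(a.name, a);
        }
        return jobAd;
    }

    public static void main(String[] args) {
        // Made-up template values, for illustration only.
        Attr[] template = { new Attr("Iwd", "/tmp"), new Attr("Cmd", "env.remote") };
        Map<String, Attr> jobAd = merge(template,
            new Attr("Iwd", "/tmp/test-submit"),
            new Attr("Out", "job.out"));
        for (Attr a : jobAd.values()) {
            System.out.println(a.name + " = " + a.value);
        }
    }
}
```

Printing the merged map just before submitting is a cheap way to see exactly which attributes differ between the run that works and the run that fails.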