Re: [Condor-users] Web Services JDL Parsing





Sean Manning wrote:
Hi, I appreciate that my last email was somewhat lengthy, and I have made some progress since then. I now have a very specific question about how to stage back output in a grid environment.

Again, I am working on Web Services code using the birdbath and condor Java packages. I can submit a job (see the attached JDL) using my Web Services interface from my account, and see it appear in the condor queue of the grid metascheduler. The input files get transferred correctly from my client machine to the metascheduler (they go to the folder /opt/condor/local.babargt4/spool/cluster1234.proc0.subproc0 or similar), but the folder and its contents belong to root (the user who is running Condor), not to me (the user who submitted the job). Unless I change the owner of the files to myself by hand, I get the error HoldReason = "Failed to get expiration time of Proxy", because the job and the proxy certificate must be owned by the same user.

When we changed the owner of the spool/cluster folder and its contents to myself, the job was able to create a gridftp wrapper and start running. We can see it on the head node of one of our clusters, and see it create a scratch folder (in /hepuser/gcprod01/.globus/scratch on our NFS) and store the output and error there. But the output does not get staged back from the head node to the metascheduler and then to the client, and the job hangs in state C (Completed). We have tried several variant JDL files without success.

  In other words, we have two problems:

(i) How can we run the jobs as the user who submits them, not the user who owns condor?

(ii) How can we get output to stage back from the cluster to the metascheduler and the client machine?

  Can anyone advise how to solve either of these problems?

Thanks,

Sean Manning

Is your JDL parser setting StageInStart and StageInFinish?

From src/condor_schedd.V6/soap_scheddStub.C, in createJobTemplate:
   // It is kinda scary but if ATTR_STAGE_IN_START/FINISH are
   // present and non-zero in a Job Ad the Schedd will do the
   // right thing, when run as root, and chown the job's spool
   // directory, thus fixing a long standing permissions problem.
   job->Assign(ATTR_STAGE_IN_START, 1);
   job->Assign(ATTR_STAGE_IN_FINISH, 1);

$ grep STAGE_IN src/condor_c++_util/condor_attributes.C
const char *ATTR_STAGE_IN_START           = "StageInStart";
const char *ATTR_STAGE_IN_FINISH          = "StageInFinish";
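
If your parser builds the job ad itself, something along these lines on the Java side might be enough. This is only a rough sketch: the class StageInAttributes and the method withStageIn are made-up names, and the job ad is modeled as a plain map from attribute name to expression text rather than with the actual birdbath types, so it would need adapting to whatever ad structure your client really uses. It also treats attribute names as case-sensitive, which is a simplification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of birdbath or Condor. The job ad is modeled
// as a map from attribute name to ClassAd expression text (an assumption).
final class StageInAttributes {
    // Attribute names as defined in condor_attributes.C.
    static final String STAGE_IN_START = "StageInStart";
    static final String STAGE_IN_FINISH = "StageInFinish";

    // Returns a copy of the ad in which both attributes are present and non-zero.
    static Map<String, String> withStageIn(Map<String, String> jobAd) {
        Map<String, String> result = new LinkedHashMap<>(jobAd);
        ensureNonZero(result, STAGE_IN_START);
        ensureNonZero(result, STAGE_IN_FINISH);
        return result;
    }

    // Sets the attribute to 1 when it is missing or literally 0;
    // any other existing value is left untouched.
    private static void ensureNonZero(Map<String, String> ad, String name) {
        String value = ad.get(name);
        if (value == null || value.trim().equals("0")) {
            ad.put(name, "1");
        }
    }
}
```

You would call withStageIn on the parsed ad just before handing it to the submit call, so the schedd sees both attributes when the job is created.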

Best,


matt