
Re: [Condor-users] Web Services JDL Parsing

Matthew Farrellee wrote:


>
>
>Sean Manning wrote:
>> Hi,  
>> 
>>   I appreciate that my last email was somewhat lengthy, and I have made
>> some progress since then.  I now have a very specific question about 
>> how to stage back output in a grid environment.
>> 
>>   Again, I am working on Web Services code using the birdbath and 
>> condor Java packages.  I can submit a job (see the attached JDL) using 
>> my Web Services interface from my account, and see it appear in the 
>> condor queue of the grid metascheduler.  The input files get 
>> transferred correctly from my client machine to the metascheduler (they
>> go to the folder 
>> /opt/condor/local.babargt4/spool/cluster1234.proc0.subproc0 or 
>> similar), but the folder and its contents belong to root (the user who 
>> is running Condor) not myself (the user who submitted the job).  Unless
>> I change the owner of the files to myself by hand, I get an error 
>> HoldReason = "Failed to get expiration time of Proxy" because the job 
>> and the proxy certificate must be owned by the same user.
>> 
>>   When we changed the owner of the spool/cluster folder and its 
>> contents to myself, the job can create a gridftp wrapper and start 
>> running.  We can see it on the head node of one of our clusters, and 
>> see it create a scratch folder (in /hepuser/gcprod01/.globus/scratch on
>> our NFS) and store the output and error there.  But the output does not
>> get staged back from the head node to the metascheduler to the client, 
>> and the job hangs in mode C = Completed.  We have tried several variant
>> JDL files without success.
>> 
>>   In other words, we have two problems:
>> 
>> (i) How can we run the jobs as the user who submits them, not the user 
>> who owns condor?
>> 
>> (ii) How can we get output to stage back from the cluster to the 
>> metascheduler and the client machine?
>> 
>>   Can anyone advise how to solve either of these problems?
>> 
>> Thanks,
>> 
>> Sean Manning
>
>Is your JDL parser setting StageInStart and StageInFinish?
>
>from src/condor_schedd.V6/soap_scheddStub.C, in createJobTemplate:
>       // It is kinda scary but if ATTR_STAGE_IN_START/FINISH are
>       // present and non-zero in a Job Ad the Schedd will do the
>       // right thing, when run as root, and chown the job's spool
>       // directory, thus fixing a long standing permissions problem.
>    job->Assign(ATTR_STAGE_IN_START, 1);
>    job->Assign(ATTR_STAGE_IN_FINISH, 1);
>
>$ grep STAGE_IN src/condor_c++_util/condor_attributes.C
>const char *ATTR_STAGE_IN_START           = "StageInStart";
>const char *ATTR_STAGE_IN_FINISH          = "StageInFinish";
>
>Best,
>
>
>matt
>

Hi Matthew,

  They are both set to 1, but I'm not sure how, and it isn't helping.  
Jobs still halt with HoldReason = "Failed to get expiration time of 
proxy" unless I explicitly change the owner of the folder containing 
the proxy to myself.

  I think the parser is org.glite.jdl.Ad.fromFile() plus some code of
my own to get a condor.ClassAdStructAttr for each attribute in the
submit description file.  These are passed to
birdbath.Transaction.submit() to create the ClassAd which Condor-G
sees.
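
  In outline the submission path looks roughly like this.  This is a
simplified sketch rather than my exact code: the host, owner, and
executable are placeholders, and the birdbath method signatures, the
UniverseType constant, and the ClassAdAttrType constant are assumptions
about the birdbath version and Axis-generated stubs in use, so check
yours before copying anything.

import java.net.URL;

import birdbath.Schedd;
import birdbath.Transaction;
import condor.ClassAdAttrType;
import condor.ClassAdStructAttr;
import condor.UniverseType;

public class SubmitSketch
{
    public static void main(String[] args) throws Exception
    {
        // One ClassAdStructAttr per attribute from the parsed JDL, plus the
        // two StageIn* attributes that make a root-run schedd chown the
        // spool directory (see the quoted soap_scheddStub.C comment above).
        ClassAdStructAttr[] extraAttrs = new ClassAdStructAttr[] {
            intAttr("StageInStart", 1),
            intAttr("StageInFinish", 1)
            // ... plus entries built from org.glite.jdl.Ad.fromFile(jdlPath)
        };

        Schedd schedd = new Schedd(new URL("http://metascheduler.example.org:8080"));
        Transaction xact = schedd.createTransaction();
        xact.begin(60);                          // transaction lifetime, seconds
        int cluster = xact.createCluster();
        int job = xact.createJob(cluster);
        // Argument order (owner, universe, executable, arguments, requirements,
        // extra attributes, input files) differs between birdbath versions;
        // compare with the Transaction.java you have.
        xact.submit(cluster, job, "someuser", UniverseType.GRID,
                    "/bin/hostname", "", "TRUE", extraAttrs, new java.io.File[0]);
        xact.commit();
    }

    // Helper of my own (not part of birdbath): an integer-typed attribute.
    // ClassAdAttrType.value1 is assumed to be the INTEGER-ATTR enumeration
    // constant in the generated stubs -- check condor/ClassAdAttrType.java.
    private static ClassAdStructAttr intAttr(String name, int value)
    {
        ClassAdStructAttr a = new ClassAdStructAttr();
        a.setName(name);
        a.setType(ClassAdAttrType.value1);
        a.setValue(Integer.toString(value));
        return a;
    }
}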

  This need to change the owner of the spool/cluster folder by hand is 
the main problem I have left.  I can now get stdout, stderr, and other
output files back from the cluster to the metascheduler by setting
TransferOutput = {"specialOutputFile1.txt", "specialOutputFile2.txt"}; 
in the JDL, and I can stage them from the metascheduler to the client 
with some other code.
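
  For anyone attempting the same thing, one way to write that staging
code is against the schedd's SOAP operations listSpool(), getFile(),
and closeSpool().  The outline below is only a sketch: the locator and
port-type class names, the accessor names on the returned structs, and
the null transaction argument are all assumptions about what wsdl2java
generated from condorSchedd.wsdl, so compare against your own stubs.

import java.io.FileOutputStream;
import java.net.URL;

import condor.CondorScheddLocator;
import condor.CondorScheddPortType;
import condor.FileInfo;

public class FetchSpoolSketch
{
    public static void main(String[] args) throws Exception
    {
        // Locator and port accessor names are whatever Axis generated from
        // condorSchedd.wsdl; the ones here are assumptions.
        CondorScheddPortType schedd = new CondorScheddLocator()
            .getcondorSchedd(new URL("http://metascheduler.example.org:8080"));

        int cluster = 1234;   // the completed job, e.g. 1234.0 as above
        int proc = 0;

        // A null transaction is assumed to be accepted for reading the spool
        // of a completed job; otherwise wrap these calls in
        // beginTransaction()/commitTransaction().
        FileInfo[] files = schedd.listSpool(null, cluster, proc).getInfo();
        for (int i = 0; i < files.length; i++)
        {
            byte[] data = schedd.getFile(null, cluster, proc, files[i].getName(),
                                         0, (int) files[i].getSize()).getData();
            FileOutputStream out = new FileOutputStream(files[i].getName());
            out.write(data);
            out.close();
        }

        // Tell the schedd the output has been fetched so it can clean up the
        // job instead of leaving it parked in the C (Completed) state.
        schedd.closeSpool(null, cluster, proc);
    }
}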

Thanks for your help,

Sean