Jaime Frey wrote:
On Jun 23, 2008, at 6:50 PM, Sean Manning wrote:
I am working on a Web Services interface for submitting jobs to our
Globus grid. It uses the condor and birdbath Java packages. We can
successfully submit the attached JDL from the command line of a Condor
head node (the metascheduler of our grid) and see it complete, but
when we submit it with the Java program from an external Condor client
machine, the job stays Idle and then goes on hold with an error.
Running the Condor daemons as root got rid of one error, but now we
get another one: HoldReason = "Streaming not supported". I can't find
any information about this error in the user group archives. Does
anyone here have an idea what could be causing this?
For GT4 GRAM jobs, if StreamOut and StreamErr aren't explicitly set to
False in the job ad, then Condor assumes you want stdout and stderr to
be streamed, which isn't supported by Condor for GT4 GRAM jobs. This
appears to be a bug, as the default behavior for other job types is no
streaming.
If you add the following two attributes to your job ads, it should
eliminate the problem:
StreamOut = False
StreamErr = False
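
When the job ad is built in Java rather than read from a submit file,
the same two attributes can be forced just before submission. The
following is only a minimal sketch, assuming the ad is held as a plain
name-to-value map. The class JobAdDefaults and the method
withStreamingDisabled are hypothetical names used for illustration and
are not part of the condor or birdbath packages.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of the condor or birdbath packages.
final class JobAdDefaults {

    // Returns a copy of the ad with StreamOut and StreamErr set to False.
    // ClassAd attribute names are case-insensitive, so the copy uses a
    // case-insensitive map: an existing "streamout" entry is overwritten
    // instead of being duplicated under a second spelling.
    static Map<String, String> withStreamingDisabled(Map<String, String> ad) {
        Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        copy.putAll(ad);
        copy.put("StreamOut", "False");
        copy.put("StreamErr", "False");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> ad = new TreeMap<>();
        ad.put("Out", "job.out");      // illustrative values only
        ad.put("StreamOut", "True");
        System.out.println(withStreamingDisabled(ad));
    }
}
```

The original map is left untouched, so the caller can compare the ad
before and after when debugging a held job.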
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
Dear Jaime,
Thanks for the reply.
I made that change, but jobs are still being held with HoldReason =
"Streaming not supported". I can submit the new file with
condor_submit from the grid metascheduler and see it appear on the
head node of a worker cluster when condor_config has SOAP enabled. The
output and error come back to the machine I submitted the job from,
just as they are supposed to. But when I submit the same JDL to the
grid metascheduler using our Web Services code, the job always goes on
hold after a delay.
Right now, the Condor daemons are running as root. The Web Services
code is running under my personal account (seangwm) on my workstation.
The spool directory on the metascheduler
($CONDOR_LOCATION/local.babargt4/spool) belongs to condor:root. We
have been changing the owner of the job folder in the spool
($CONDOR_LOCATION/local.babargt4/spool/cluster5252.proc0.subproc0) by
hand from root:root to my personal account and group, because jobs
stay idle until I do so. I think this has to do with the fact that the
proxy file must have very specific permissions for the grid to trust
it. If I change the owner of the spool folder to root:root, I get
HoldReason = "Failed to get expiration time of proxy" instead.
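
To make the ownership and permission state easier to inspect than by
hand, a small check along the following lines could be run against the
proxy file in the job folder. This is a rough sketch, assuming the
requirement is the usual one that only the file's owner may have any
access to the proxy. The class ProxyFileCheck is a hypothetical name,
and the path is passed on the command line rather than assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical diagnostic; it only reports, it does not change anything.
final class ProxyFileCheck {

    // True if the owner can read the file and no group/other bits are set
    // (for example mode 0600 or 0400). Assumed rule, see the text above.
    static boolean isOwnerOnly(Set<PosixFilePermission> perms) {
        Set<PosixFilePermission> ownerBits = EnumSet.of(
                PosixFilePermission.OWNER_READ,
                PosixFilePermission.OWNER_WRITE,
                PosixFilePermission.OWNER_EXECUTE);
        return perms.contains(PosixFilePermission.OWNER_READ)
                && ownerBits.containsAll(perms);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ProxyFileCheck <path-to-proxy-file>");
            return;
        }
        Path proxy = Paths.get(args[0]);
        // Both calls require a POSIX file system, which is assumed here.
        System.out.println("owner:      " + Files.getOwner(proxy).getName());
        System.out.println("owner-only: "
                + isOwnerOnly(Files.getPosixFilePermissions(proxy)));
    }
}
```

Running it as the same user the Condor daemons run as shows whether
that user can read the file at all, which may help separate a
permissions problem from a problem with the proxy itself.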
In principle, if we can submit a job to the grid using condor_submit,
then the web services submission should work as well. I would be very
grateful if you have any further advice about what I am missing.
I have attached our main Java class for job submission and the JDL
that I have been trying with the Web Services code. In the attached
files, babargt4 is the grid metascheduler and ugdev07 is the head node
of one of the clusters of worker nodes.