
Re: [Condor-users] [Globus-discuss] error submitting jobs to condor pool



Hi Martin,

The staging job works:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
nano@elka-113:~/Experiments/grid$ globusrun-ws -submit -Ft Condor
-streaming -S -f globusmultijob.rsl
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:c65ee45a-18fa-11dc-adad-001676c58b92
Termination time: 06/13/2007 15:37 GMT
Current job state: StageIn
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The corresponding messages in container.log:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2007-06-12 22:37:07,324 INFO  exec.StateMachine
[RunQueueThread_9,logJobAccepted:3193] Job
c6effee0-18fa-11dc-aa0c-938be5c4dcca accepted for local user 'nano'
2007-06-12 22:37:21,206 INFO  exec.StateMachine
[RunQueueThread_10,logJobSucceeded:3204] Job
c6effee0-18fa-11dc-aa0c-938be5c4dcca finished successfully
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I think these log entries belong to the same job, since the time matches;
the job ID is different, though, and I don't know whether that is normal.


This is the job file:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
 <factoryEndpoint>
   <wsa:Address>https://elka-113.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
   <wsa:ReferenceProperties>
     <gram:ResourceID>Multi</gram:ResourceID>
   </wsa:ReferenceProperties>
 </factoryEndpoint>
 <directory>${GLOBUS_USER_HOME}/test</directory>
 <count>1</count>
 <job>
   <factoryEndpoint>
     <wsa:Address>https://elka-113.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
     <wsa:ReferenceProperties>
       <gram:ResourceID>Condor</gram:ResourceID>
     </wsa:ReferenceProperties>
   </factoryEndpoint>
   <executable>/usr/bin/java</executable>
   <argument>-classpath</argument>
   <argument>.:jai_core.jar:jai_codec.jar</argument>
   <argument>Encoder</argument>
   <argument>xrayA-00-00.bmp</argument>
   <stdout>${GLOBUS_USER_HOME}/target/stdout</stdout>
   <stderr>${GLOBUS_USER_HOME}/target/stderr</stderr>
   <fileStageIn>
     <transfer>
       <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/Encoder.class</sourceUrl>
       <destinationUrl>file:///${GLOBUS_USER_HOME}/target/Encoder.class</destinationUrl>
     </transfer>
     <transfer>
       <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/jai_core.jar</sourceUrl>
       <destinationUrl>file:///${GLOBUS_USER_HOME}/target/jai_core.jar</destinationUrl>
     </transfer>
     <transfer>
       <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/jai_codec.jar</sourceUrl>
       <destinationUrl>file:///${GLOBUS_USER_HOME}/target/jai_codec.jar</destinationUrl>
     </transfer>
     <transfer>
       <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/codebook</sourceUrl>
       <destinationUrl>file:///${GLOBUS_USER_HOME}/target/codebook</destinationUrl>
     </transfer>
     <transfer>
       <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/xrayA-00-00.bmp</sourceUrl>
       <destinationUrl>file:///${GLOBUS_USER_HOME}/target/xrayA-00-00.bmp</destinationUrl>
     </transfer>
   </fileStageIn>
   <fileCleanUp>
     <deletion><file>file:///${GLOBUS_USER_HOME}/target/Encoder.class</file></deletion>
     <deletion><file>file:///${GLOBUS_USER_HOME}/target/jai_core.jar</file></deletion>
     <deletion><file>file:///${GLOBUS_USER_HOME}/target/jai_codec.jar</file></deletion>
     <deletion><file>file:///${GLOBUS_USER_HOME}/target/codebook</file></deletion>
     <deletion><file>file:///${GLOBUS_USER_HOME}/target/xrayA-00-00.bmp</file></deletion>
   </fileCleanUp>
   <extensions>
     <condorsubmit name="universe">Java</condorsubmit>
     <condorsubmit name="should_transfer_files">YES</condorsubmit>
     <condorsubmit
name="when_to_transfer_output">ON_EXIT_OR_EVICT</condorsubmit>
     <condorsubmit name="requirements">Arch == "INTEL" &amp;&amp;
OpSys == "WINNT51" || Arch == "INTEL" &amp;&amp; OpSys ==
"LINUX"</condorsubmit>
   </extensions>
 </job>
</multiJob>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
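
A quick sanity check on a job file like this one can be sketched in Python. This is only an illustration, not part of the Globus tooling: the helper name staged_and_cleaned is made up, and the XML below is a trimmed copy of the job element with two of the five staged files. It confirms that the XML is well-formed and that every fileStageIn destination has a matching fileCleanUp deletion.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <job> element from the job file (two of the five files).
JOB_XML = """
<job>
  <executable>/usr/bin/java</executable>
  <fileStageIn>
    <transfer>
      <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/Encoder.class</sourceUrl>
      <destinationUrl>file:///${GLOBUS_USER_HOME}/target/Encoder.class</destinationUrl>
    </transfer>
    <transfer>
      <sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/codebook</sourceUrl>
      <destinationUrl>file:///${GLOBUS_USER_HOME}/target/codebook</destinationUrl>
    </transfer>
  </fileStageIn>
  <fileCleanUp>
    <deletion><file>file:///${GLOBUS_USER_HOME}/target/Encoder.class</file></deletion>
    <deletion><file>file:///${GLOBUS_USER_HOME}/target/codebook</file></deletion>
  </fileCleanUp>
</job>
"""


def staged_and_cleaned(job_xml):
    """Hypothetical helper: return (staged destinations, cleaned-up files)."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML.
    job = ET.fromstring(job_xml)
    staged = {e.text.strip() for e in job.findall("fileStageIn/transfer/destinationUrl")}
    cleaned = {e.text.strip() for e in job.findall("fileCleanUp/deletion/file")}
    return staged, cleaned


staged, cleaned = staged_and_cleaned(JOB_XML)
leftover = staged - cleaned  # staged files that no deletion entry would remove
print(sorted(leftover))
```

An empty list means nothing staged in would be left behind in the target directory after CleanUp.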

Although it works, one problem remains: when I query the job with
condor_q -better-analyze, it reports the job's requirements as only
Arch == "INTEL" && OpSys == "LINUX", even though I explicitly said in
the job file that I also want it to run on WINNT51.
My Condor pool has 4 Windows execute machines and only one Linux
machine.

Here's the result of condor_q -better-analyze:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
128.000:  Run analysis summary.  Of 7 machines,
     5 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     2 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 are available to run your job

The Requirements expression for your job is:

( target.OpSys == "LINUX" && target.Arch == "INTEL" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

   Condition                         Machines Matched    Suggestion
   ---------                         ----------------    ----------
1   target.OpSys == "LINUX"           2
2   ( TARGET.FileSystemDomain == "elka-113.ee.itb.ac.id" )
                                     2
3   target.Arch == "INTEL"            7
4   ( target.Disk >= 10000 )          7
5   ( ( 1024 * target.Memory ) >= 10000 )7
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
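
To make the mismatch concrete, here is a minimal sketch in plain Python. It is a stand-in for ClassAd evaluation, not Condor itself, and the pool list is a hypothetical one built from the description above (four Windows machines, one Linux machine, all assumed to be INTEL). The function names are made up for illustration. In both ClassAd expressions and Python, the "and" operator binds tighter than "or", so the expression from the job file groups as written below.

```python
def job_file_requirements(arch, opsys):
    # Expression from the <condorsubmit name="requirements"> element.
    return (arch == "INTEL" and opsys == "WINNT51") or (
        arch == "INTEL" and opsys == "LINUX"
    )


def reported_requirements(arch, opsys):
    # Arch/OpSys part of the expression shown by condor_q -better-analyze.
    return opsys == "LINUX" and arch == "INTEL"


# Hypothetical pool: 4 Windows execute machines and 1 Linux machine.
pool = [("INTEL", "WINNT51")] * 4 + [("INTEL", "LINUX")]

matches_job_file = sum(job_file_requirements(a, o) for a, o in pool)
matches_reported = sum(reported_requirements(a, o) for a, o in pool)
print(matches_job_file, matches_reported)
```

Under these assumptions the job-file expression accepts all five machines while the reported one accepts only the Linux machine, which suggests that the requirements value from the extensions element is not the one that ended up in the job's ClassAd.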

Actually, I need to run hundreds of jobs, so I think it's better to use
condor_submit (Condor-G), right? Can you diagnose why I previously had
problems submitting the jobs through Condor-G?


Very best regards,

--
Nano Surbakti

On 6/12/07, feller@xxxxxxxxxxx <feller@xxxxxxxxxxx> wrote:
Nano,
see
http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html#q-gram2
for how to submit a staging job.
From a first look it seems that delegation didn't work.
Please try the globusrun-ws job with staging and send the
output of the client and the relevant parts of the
container logfile then.
Martin

> Hi Martin,
>
> While I read the globusrun-ws manual, here is the container log:
>
> --------------------------------------
> 2007-06-12 10:15:45,455 INFO  exec.StateMachine
> [RunQueueThread_4,logJobAccepted:3193] Job
> 333f5540-1893-11dc-bb3f-aec5afd22587 accepted for local user 'nano'
> 2007-06-12 10:15:50,878 ERROR exec.StateMachine
> [RunQueueThread_9,fileCleanUp:2730] A secondary fault occured while
> trying to gracefully fail.
> AxisFault
>  faultCode:
> {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
>  faultSubcode:
>  faultString: java.rmi.RemoteException: Unable to create RFT resource;
> nested exception is:
>       org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
>  faultActor:
>  faultNode:
>  faultDetail:
>       {http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException:
> Unable to create RFT resource; nested exception is:
>       org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
>       at
> org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:245)
>       at sun.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
>       at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at
> org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384)
>       at
> org.globus.axis.providers.RPCProvider.invokeMethodSub(RPCProvider.java:107)
>       at
> org.globus.axis.providers.PrivilegedInvokeMethodAction.run(PrivilegedInvokeMethodAction.java:42)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.globus.gsi.jaas.GlobusSubject.runAs(GlobusSubject.java:55)
>       at org.globus.gsi.jaas.JaasSubject.doAs(JaasSubject.java:90)
>       at
> org.globus.axis.providers.RPCProvider.invokeMethod(RPCProvider.java:97)
>       at
> org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281)
>       at
> org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319)
>       at
> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>       at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>       at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>       at org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450)
>       at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285)
>       at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664)
>       at
> org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382)
>       at
> org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147)
>       at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
> Caused by: org.globus.transfer.reliable.service.exception.RftException:
> Error processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
>       at
> org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:391)
>       at
> org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:354)
>       at
> org.globus.transfer.reliable.service.ReliableFileTransferHome.create(ReliableFileTransferHome.java:134)
>       at
> org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:235)
>       ... 22 more
>
>       {http://xml.apache.org/axis/}hostname:hobitton
>
> java.rmi.RemoteException: Unable to create RFT resource; nested exception
> is:
>       org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
>       at
> org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221)
>       at
> org.apache.axis.message.SOAPFaultBuilder.endElement(SOAPFaultBuilder.java:128)
>       at
> org.apache.axis.encoding.DeserializationContext.endElement(DeserializationContext.java:1087)
>       at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>       at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown
> Source)
>       at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
>       at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>       at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>       at
> org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
>       at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
>       at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
>       at
> org.apache.axis.message.addressing.handler.AddressingHandler.processClientResponse(AddressingHandler.java:305)
>       at
> org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:110)
>       at
> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>       at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>       at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>       at org.apache.axis.client.AxisClient.invoke(AxisClient.java:190)
>       at org.apache.axis.client.Call.invokeEngine(Call.java:2727)
>       at org.apache.axis.client.Call.invoke(Call.java:2710)
>       at org.apache.axis.client.Call.invoke(Call.java:2386)
>       at org.apache.axis.client.Call.invoke(Call.java:2309)
>       at org.apache.axis.client.Call.invoke(Call.java:1766)
>       at
> org.globus.rft.generated.bindings.ReliableFileTransferFactoryPortTypeSOAPBindingStub.createReliableFileTransfer(ReliableFileTransferFactoryPortTypeSOAPBindingStub.java:874)
>       at
> org.globus.exec.service.exec.utils.StagingHelper.submitStagingRequest(StagingHelper.java:168)
>       at
> org.globus.exec.service.exec.StateMachine.fileCleanUp(StateMachine.java:2716)
>       at
> org.globus.exec.service.exec.StateMachine.processFailureFileCleanUpState(StateMachine.java:2091)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at
> org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:302)
>       at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> 2007-06-12 10:15:51,055 INFO  exec.StateMachine
> [RunQueueThread_9,logJobFailed:3212] Job
> 333f5540-1893-11dc-bb3f-aec5afd22587 failed
> --------------------------------------
> This time I submitted only one job, to minimize the log/error output.
>
> To make the log complete :) ... here's what the Condor log said about the
> same job:
> --------------------------------------
> 017 (096.000.000) 06/12 10:15:50 Job submitted to Globus
>     RM-Contact:
> https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
>     JM-Contact:
> https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?333f5540-1893-11dc-bb3f-aec5afd22587
>     Can-Restart-JM: 0
> ...
> 027 (096.000.000) 06/12 10:15:50 Job submitted to grid resource
>     GridResource: gt4
> https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
> Condor
>     GridJobId: gt4
> https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?333f5540-1893-11dc-bb3f-aec5afd22587
> ...
> 012 (096.000.000) 06/12 10:15:51 Job was held.
>       Globus error: Staging error for RSL element fileStageIn.
>       Code 0 Subcode 0
> --------------------------------------
>
> Big THANKS !!
>
> --
> Nano Surbakti
>
>
> On 6/12/07, feller@xxxxxxxxxxx <feller@xxxxxxxxxxx> wrote:
>> Ok, what does the server-side GT4 container logfile say?
>> If it's available, please post it to the list.
>> If not: Do you have the Condor's GridmanagerLog?
>> Also: please try to submit a staging job with globusrun-ws
>> (instead of condor-g). What's the output of the client and
>> what does the server-log say (if this fails too)?
>> Martin
>>
>
>