
Re: [Condor-users] [Globus-discuss] error submitting jobs to condor pool



Hi,

My mistake. I've changed it to:
--------------------------------------
grid_resource = gt4 https://elka-78.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService Condor
--------------------------------------

Now I get a different error (from Condor's log):
--------------------------------------
017 (091.000.000) 06/12 09:07:02 Job submitted to Globus
   RM-Contact: https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
   JM-Contact: https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?966c5320-1889-11dc-acaa-95396c73a41b
   Can-Restart-JM: 0
...
027 (091.000.000) 06/12 09:07:02 Job submitted to grid resource
   GridResource: gt4 https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService Condor
   GridJobId: gt4 https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?966c5320-1889-11dc-acaa-95396c73a41b
...
017 (091.001.000) 06/12 09:07:02 Job submitted to Globus
   RM-Contact: https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
   JM-Contact: https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?966c7a30-1889-11dc-acaa-95396c73a41b
   Can-Restart-JM: 0
...
027 (091.001.000) 06/12 09:07:02 Job submitted to grid resource
   GridResource: gt4 https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService Condor
   GridJobId: gt4 https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?966c7a30-1889-11dc-acaa-95396c73a41b
...
012 (091.001.000) 06/12 09:07:03 Job was held.
	Globus error: Staging error for RSL element fileStageIn.
	Code 0 Subcode 0
...
012 (091.000.000) 06/12 09:07:03 Job was held.
	Globus error: Staging error for RSL element fileStageIn.
	Code 0 Subcode 0
--------------------------------------
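
With around 400 jobs, reading the hold reasons out of the log by hand gets tedious. The following is only a minimal sketch in Python, assuming the user-log layout seen in the excerpt above (a three-digit event code, the job ID in parentheses, indented detail lines, and "..." as the event separator). The function name held_jobs and the embedded SAMPLE text are illustrative and not part of Condor.

```python
import re

# Event header as it appears in the excerpt, e.g.
# "012 (091.001.000) 06/12 09:07:03 Job was held."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")

HELD_EVENT = "012"  # event code shown next to "Job was held." in the excerpt


def held_jobs(log_text):
    """Return {"cluster.proc": [detail lines]} for every held event."""
    held = {}
    current = None  # job ID of the held event being read, if any
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match:
            code, cluster, proc = match.group(1), match.group(2), match.group(3)
            current = None
            if code == HELD_EVENT:
                current = f"{int(cluster)}.{int(proc)}"
                held[current] = []
        elif line.strip() == "...":
            current = None  # separator closes the current event
        elif current is not None and line.strip():
            held[current].append(line.strip())
    return held


# Shortened copy of the log excerpt, used only to exercise the function.
SAMPLE = """027 (091.001.000) 06/12 09:07:02 Job submitted to grid resource
   GridResource: gt4 https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService Condor
...
012 (091.001.000) 06/12 09:07:03 Job was held.
    Globus error: Staging error for RSL element fileStageIn.
    Code 0 Subcode 0
...
012 (091.000.000) 06/12 09:07:03 Job was held.
    Globus error: Staging error for RSL element fileStageIn.
    Code 0 Subcode 0
"""

if __name__ == "__main__":
    for job_id, details in held_jobs(SAMPLE).items():
        print(job_id, "|", "; ".join(details))
```

In practice the text would come from the file named by the "log" line of the submit file rather than from an embedded string.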

I guess this error is related to RFT, right? Do I need to have an RFT
server on elka-113?

My grid setup is as follows:

elka-78: the central grid host, running RFT, SimpleCA, and MyProxy
elka-113: head node of the Condor pool, which also acts as submit and
             execute node; GT4 is installed with --enable-wsgram-condor,
             but there is no RFT server
the Condor pool has 4 Windows execute nodes
the job to be distributed is a Java program, and all execute nodes have a JVM

I'm trying to submit about 400 small jobs to the grid. The jobs work
well when I submit them directly to the Condor manager on elka-113, but I
need to show that they can also be submitted through the Grid
(GT4). Currently I have only one submit machine: elka-113.
It's a little bit weird ... :)
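
For that many jobs, the per-job part of the submit file (quoted further down) could be generated rather than typed by hand. This is only a sketch in Python, assuming every job follows the same pattern as the two stanzas in that submit file; the function name job_stanzas and the list of tile names are hypothetical, and only the two file names shown appear in the original file.

```python
def job_stanzas(tile_names):
    """Build one arguments/transfer_input_files/queue stanza per tile."""
    stanzas = []
    for name in tile_names:
        stanzas.append(
            f"arguments = Encoder {name}\n"
            f"transfer_input_files = codebook,images_tiled/{name}\n"
            "queue\n"
        )
    # A blank line between stanzas, as in the hand-written file.
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Hypothetical tile list; the real one would hold all of the tile names.
    tiles = ["xrayA-00-00.bmp", "xrayA-01-00.bmp"]
    print(job_stanzas(tiles))
```

The generated text would be appended after the common settings (executable, universe, grid_resource and so on), which stay as they are.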

If you have any clue about my problem, or a suggestion for a better
"scenario", please tell me.

Thanks!!
--
Nano Surbakti


On 6/11/07, Martin Feller <feller@xxxxxxxxxxx> wrote:
Nano:
elka-78 doesn't have Condor support installed, i.e. configure
was probably run without --enable-wsgram-condor
Your globusrun-ws command was submitted to elka-113, which
seems to be another host where WS-GRAM supports Condor.
Martin


> Hi,
>
> I'm trying to submit some Java jobs to a Condor pool.
> This is my submit file:
> ------------------------------------------------------------------
> executable     = Encoder.class
> universe     = grid
> log         = report.$(Cluster).log
> getenv         = true
> globus_rsl     = (condor_submit=(universe java))
> grid-type    = gt4
> grid_resource     = gt4 https://elka-78.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService Condor
> MyProxyHost    = elka-78.ee.itb.ac.id:7512
> MyProxyServerDN    = O=Grid/OU=GlobusTest/OU=simple-CA-elka-78.ee.itb.ac.id/OU=ee.itb.ac.id/CN=Nano Surbakti
> MyProxyPassword    = "censored"
>
> arguments = Encoder xrayA-00-00.bmp
> transfer_input_files = codebook,images_tiled/xrayA-00-00.bmp
> when_to_transfer_output = ON_EXIT_OR_EVICT
> queue
>
> arguments = Encoder xrayA-01-00.bmp
> transfer_input_files = codebook,images_tiled/xrayA-01-00.bmp
> when_to_transfer_output = ON_EXIT_OR_EVICT
> queue
>
> should_transfer_files = YES
> ------------------------------------------------------------------
>
> I got this error:
> 018 (085.001.000) 06/11 19:33:44 Globus job submission failed!
>    Reason: 0 java.rmi.RemoteException: Job creation failed.; nested
> exception is:      java.rmi.RemoteException: The Managed Job Factory
> Service at
> https://167.205.65.78:8443/wsrf/services/ManagedJobFactoryService
> does not have a resource with key "Condor".
>
> I thought perhaps something was wrong with the gram-condor extension, so I
> checked with a simple job submission from a tutorial:
>
> ------------------------------------------------------------------
> nano@elka-113: globusrun-ws -submit -Ft Condor -c /bin/sleep 50
> Submitting job...Done
> Job ID: ................
> ------------------------------------------------------------------
>
> It looks like nothing is wrong. I also checked with condor_q to make sure
> the job was submitted to Condor.
>
> Please give me a clue or pointer to help solve this problem.
>
> I'm sorry if you got a double posting; I just don't know which mailing
> list is the best place to ask about this.
>
> Thanks in advance,
>
> --
> Nano Surbakti
>