RE: [condor-users] SUBMISSION OF MPI JOBS THROUGH CONDOR-G



Dear Mr. Jaime Frey,

	Thanks a lot for the reply. Here are the details you asked for:

1. For submitting MPI jobs, we used Globus as the scheduler on the remote
machines.

2. On our machines we have installed Condor-G, but Condor itself has not
been installed.

3. The error message that we obtained was:
"    Submission of subjob (label = "subjob 1") failed because authentication
failed:
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:

init.c:499: globus_gss_assist_init_sec_context_async: Error during context
initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify
remote side's credentials
globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify
remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function
SSL3_READ_BYTES: sslv3 alert bad certificate (error code 57)
"

	The above error message appeared in the output file. There was no error
message in the error file or the log file. In fact, the log file showed the
following:
"005 (421.000.000) 04/02 14:52:20 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job"

	That is, the log file indicated that the program terminated normally.
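
	Since the log reports a return value of 0 while the output file holds the
authentication failure, the return value alone cannot be trusted for this job.
A minimal sketch of a check on the output file is given below. The function
name check_job_output is hypothetical; the string it searches for is copied
from the error text in point 3.

```shell
#!/bin/sh
# Hypothetical helper: scan a job's output file for the GSS authentication
# failure, because the job log reports "Normal termination (return value 0)"
# even when the subjob submission failed.
check_job_output() {
    out_file="$1"
    if grep -q "GSS Major Status: Authentication Failed" "$out_file"; then
        echo "authentication failure found in $out_file"
        return 1
    fi
    echo "no authentication failure found in $out_file"
    return 0
}
```

	For the submit file in point 4 it would be called as
"check_job_output condor1.out" once the job has left the queue.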

4. The following is the submit file that we used:

universe = globus
Executable = /home/tools/ganesh/mpich-g2/test.sh
transfer_executable = false
globusscheduler=e1.cdacb.org.in/jobmanager
output = condor1.out
error = condor1.err
log = condor1.log
Queue

	Here test.sh is a Unix shell script that contains the following single
command:
"	/usr/local/mpich-G2/bin/mpirun -globusrsl /home/tools/ganesh/mpich-g2/hello.rsl"

	The hello.rsl file contains the following:
"	+
( &(resourceManagerContact="e1")
   (count=4)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/Globus-2.4/lib))
   (executable=/home/tools/ganesh/mpich-g2/hello2)
)"

	We got the same error when we tried to execute the script using
globus-job-run.

	When we submitted serial jobs using the same method (i.e., substituting
the name of the serial executable in place of hello.rsl inside the script),
the program executed successfully.

	Also, the Globus authentication test command, i.e., globusrun -a -r
<node-name>, was successful from the submit node to the execute node and
vice versa.
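
	A rough sketch of how that authentication test could be repeated over
several contact strings in one go is given below. Only "globusrun -a -r" comes
from our actual tests. The function name check_gsi_auth and the GLOBUSRUN
override variable are hypothetical and are there only so that the command can
be replaced by a stub.

```shell
#!/bin/sh
# GLOBUSRUN is a hypothetical override; by default the real globusrun is used.
GLOBUSRUN="${GLOBUSRUN:-globusrun}"

# Run the authentication-only test (globusrun -a -r <contact>) against each
# contact string given as an argument and report OK or FAIL for each one.
# Returns 1 if any contact failed, 0 otherwise.
check_gsi_auth() {
    failed=0
    for contact in "$@"; do
        if "$GLOBUSRUN" -a -r "$contact" >/dev/null 2>&1; then
            echo "OK   $contact"
        else
            echo "FAIL $contact"
            failed=1
        fi
    done
    return "$failed"
}
```

	It could be run, for example, as "check_gsi_auth e1 e1.cdacb.org.in" so
that both the contact string in hello.rsl and the host name in the submit file
are tested from the same node.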

	I hope I have included all the details you need. Kindly get back to me.

	Thanking you once again,

	Vineeth Simon Arackal


-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx
[mailto:owner-condor-users@xxxxxxxxxxx]On Behalf Of Jaime Frey
Sent: Thursday, April 01, 2004 9:35 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] SUBMISSION OF MPI JOBS THROUGH CONDOR-G


On Thu, 1 Apr 2004, Vineeth Simon Arackal wrote:

> 	This is in continuation of the mail that I had sent to this group
> regarding submission of MPI jobs through Condor-G. (My query was: "How to
> submit MPI jobs through Condor-G? We have installed Condor-G and Globus in
> our machines. We have been successful in submitting and executing
> sequential jobs through Condor-G in Linux, Solaris and AIX machines. But
> so far we have not been successful in executing MPI jobs. We took the
> sample MPI submit file provided in
> http://www.teragrid.org/userinfo/guide_jobs_condorg.html
> for the job submission, and made modifications necessary for our
> machines").
>
> 	I hope I have provided enough details in the above paragraph on how we
> actually tried to execute the program.

Actually, some more details would be useful.

What is the scheduler you are submitting to on the remote system? The
Globus interface to Condor isn't smart enough to handle jobType=mpi
properly. There's a way to get around that, though.

If that isn't the problem, how is your job failing and what does your
submit file look like?

+------------------------------------+-------------------------------+
|             Jaime Frey             |There are 10 types of people in|
|         jfrey@xxxxxxxxxxx          |the world: Those who understand|
|   http://www.cs.wisc.edu/~jfrey/   |  binary, and those who don't  |
+------------------------------------+-------------------------------+
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>


