
Re: [Condor-users] THE_MPI_JOB_ALWAYS_IN_"RUNNING"



Hi Han

	You must modify this line in mp1script so that it points to the bin directory of your own MPI installation; otherwise your MPI program cannot be executed.

	 MPDIR=/u/g/t/gthain/mpich-1.2.6/bin
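
One way to confirm that the MPDIR value is usable on an execute machine is a small check like the one below. This is only a sketch: check_mpdir is a hypothetical helper name, and it assumes the launcher inside that directory is called mpirun, which is what mp1script invokes.

```shell
#!/bin/sh
# Hypothetical helper (not part of mp1script): verify that a directory
# contains an executable file named mpirun.
check_mpdir() {
    dir=$1
    if [ -f "$dir/mpirun" ] && [ -x "$dir/mpirun" ]; then
        echo "ok: $dir/mpirun"
    else
        # Report the problem on stderr so it lands in the job's error file.
        echo "missing or not executable: $dir/mpirun" >&2
        return 1
    fi
}
```

Inside mp1script it could be called as `check_mpdir "$MPDIR" || exit 1` right after the MPDIR line, so a wrong path fails quickly instead of leaving the job idle in "running".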

	For debugging, you can add the following statements to mp1script just before the line "mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@":
	
	cat machines
	echo "mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@"

	Their output will appear in the job's output file.
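
As a minimal sketch, the two debugging statements can be wrapped in one shell function. The name debug_mpirun and the fallback value of 1 when _CONDOR_NPROCS is unset are assumptions made for illustration so that the function also runs outside Condor; neither is part of mp1script.

```shell
#!/bin/sh
# Hypothetical helper: show what mp1script is about to hand to mpirun.
# $1 is the MPI executable, the remaining arguments are passed through.
debug_mpirun() {
    executable=$1
    shift
    echo "== machines file =="
    if [ -r machines ]; then
        cat machines
    else
        # A missing machines file is itself a useful hint.
        echo "(no readable machines file in $(pwd))"
    fi
    echo "== command =="
    # Assumption: default to 1 process if Condor did not set _CONDOR_NPROCS.
    echo "mpirun -v -np ${_CONDOR_NPROCS:-1} -machinefile machines $executable $*"
}
```

In mp1script it would be called as `debug_mpirun "$EXECUTABLE" "$@"` immediately before the mpirun line.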


	mp1script is just an example provided by the Condor developers; it would be better to write your own script.
--------				 
			   zhaokun
        		2009-07-08

-------------------------------------------------------------
From:zhaokun zhaokun@xxxxxxxxxxxxx
Date:2009-07-08 11:45:55
To:cmesunoom; Condor-Users Mail List cmesunoom@xxxxxxxx; condor-users@xxxxxxxxxxx
cc: 
Title:Re: [Condor-users] THE_MPI_JOB_ALWAYS_IN_"RUNNING"

Hi cmesunoom, 

   1. Modify "/usr/local/condor/etc/examples/mp1script" by adding some debug statements to trace the running process.
   2. Which MPI implementation are you using?
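
A debug statement for point 1 can be as simple as a timestamped message written to stderr, which ends up in the file named by "error" in the submit description. The sketch below is an illustration only; the function name trace is hypothetical and not something mp1script provides.

```shell
#!/bin/sh
# Hypothetical helper: write a timestamped trace line to stderr,
# tagged with the node name so output from different machines can be told apart.
trace() {
    echo "[$(date '+%H:%M:%S')] $(uname -n): $*" >&2
}
```

Calls such as `trace "starting mpirun"` and `trace "mpirun exited with status $?"` placed around the mpirun line show how far the script got before the job stalled.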
   


------------------				 
			   zhaokun
			   2009-07-08

-------------------------------------------------------------
From:cmesunoom cmesunoom@xxxxxxxx
Date:2009-07-08 11:20:19
To:Condor-Users Mail List condor-users@xxxxxxxxxxx
cc: 
Title:Re: [Condor-users] THE MPI JOB ALWAYS IN "RUNNING"

Hello, zhaokun
You gave me three pieces of advice, but I still have some questions:
1. MPI runs well without Condor.
2. How do I add "echo ..." statements to trace errors? Can you explain in detail?
3. The log files show the following:
7/8 10:41:34 ******************************************************
7/8 10:41:34 ** condor_shadow (CONDOR_SHADOW) STARTING UP
7/8 10:41:34 ** /usr/local/src/condor/sbin/condor_shadow
7/8 10:41:34 ** $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $
7/8 10:41:34 ** $CondorPlatform: I386-LINUX_RH9 $
7/8 10:41:34 ** PID = 6554
7/8 10:41:34 ** Log last touched 7/8 10:33:26
7/8 10:41:34 ******************************************************
7/8 10:41:34 Using config source: /usr/local/src/condor/etc/condor_config
7/8 10:41:34 Using local config sources: 
7/8 10:41:34 /usr/local/src/condor/local.node1/condor_config.local
7/8 10:41:34 DaemonCore: Command Socket at <192.168.0.101:33644>
7/8 10:41:34 Initializing a PARALLEL shadow for job 44.0
7/8 10:41:35 (44.0) (6554): Request to run on <192.168.0.116:33302> was ACCEPTED
7/8 10:41:35 (44.0) (6554): Request to run on <192.168.0.101:32793> was ACCEPTED

7/8 10:41:35 ******************************************************
7/8 10:41:35 ** condor_starter (CONDOR_STARTER) STARTING UP
7/8 10:41:35 ** /usr/local/src/condor/sbin/condor_starter
7/8 10:41:35 ** $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $
7/8 10:41:35 ** $CondorPlatform: I386-LINUX_RH9 $
7/8 10:41:35 ** PID = 6555
7/8 10:41:35 ** Log last touched 7/8 10:32:56
7/8 10:41:35 ******************************************************
7/8 10:41:35 Using config source: /usr/local/src/condor/etc/condor_config
7/8 10:41:35 Using local config sources: 
7/8 10:41:35 /usr/local/src/condor/local.node1/condor_config.local
7/8 10:41:35 DaemonCore: Command Socket at <192.168.0.101:33651>
7/8 10:41:35 Done setting resource limits
7/8 10:41:36 Communicating with shadow <192.168.0.101:33644>
7/8 10:41:36 Submitting machine is "node1.localdomain"
7/8 10:41:36 setting the orig job name in starter
7/8 10:41:36 setting the orig job iwd in starter
7/8 10:41:36 Job has WantIOProxy=true
7/8 10:41:36 Initialized IO Proxy.
7/8 10:41:36 File transfer completed successfully.
7/8 10:41:37 Job 44.0 set to execute immediately
7/8 10:41:37 Starting a PARALLEL universe job with ID: 44.0
7/8 10:41:37 IWD: /usr/local/src/condor/local.node1/execute/dir_6555
7/8 10:41:37 Output file: /usr/local/src/condor/local.node1/execute/dir_6555/hello.out
7/8 10:41:37 Error file: /usr/local/src/condor/local.node1/execute/dir_6555/hello.err
7/8 10:41:37 About to exec /usr/local/src/condor/local.node1/execute/dir_6555/condor_exec.exe hello 2
7/8 10:41:37 Create_Process succeeded, pid=6557
7/8 10:41:37 IOProxy: accepting connection from 192.168.0.101
7/8 10:41:37 IOProxyHandler: closing connection to 192.168.0.101

What is wrong with it?
I really need help!
Any help will be appreciated.
Regards,
Han

--- On Wed, 8 Jul 2009, zhaokun <zhaokun@xxxxxxxxxxxxx> wrote:

> From: zhaokun <zhaokun@xxxxxxxxxxxxx>
> Subject: Re: [Condor-users] THE MPI JOB ALWAYS IN "RUNNING"
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
> Date: Wednesday, 8 July 2009, 10:55 AM
> Hi Condor-Users Mail List, 
> 
>    Sorry to reply so late.
>    
>    1. Check your MPI settings.
>    2. Add some "echo ..." statements to trace errors.
>    3. View the log files to get more info: SchedLog, StartLog, StarterLog ...
> ------------------
>                zhaokun
>                2009-07-08
> 
> -------------------------------------------------------------
> From:Hehe cmesunoom@xxxxxxxx
> Date:2009-07-07 09:36:01
> To:Condor-Users Mail List condor-users@xxxxxxxxxxx
> cc: 
> Title:Re: [Condor-users] THE MPI JOB ALWAYS IN "RUNNING"
> 
> Hello, zhaokun
> My MPI job submit description file is as follows:
> universe=parallel
> executable=/usr/local/condor/etc/examples/mp1script
> arguments=hello
> log=hello.log
> output=hello.out
> error=hello.err
> machine_count=2
> should_transfer_files=yes
> when_to_transfer_output=on_exit
> transfer_input_files=hello
> queue
>  
> That is all. Is there any problem with it?
> Thanks in advance.
> Han. (You are Chinese, right? If it is convenient, could we communicate directly in Chinese? My English is very poor.)
> 
> --- On Tue, 7 Jul 2009, zhaokun <zhaokun@xxxxxxxxxxxxx> wrote:
> 
> 
> From: zhaokun <zhaokun@xxxxxxxxxxxxx>
> Subject: Re: [Condor-users] THE MPI JOB ALWAYS IN "RUNNING"
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxxx>
> Date: Tuesday, 7 July 2009, 9:15 AM
> 
> 
> Hi Condor-Users Mail List, 
> 
>     Please attach your job script file so that we can find the
> reason.
> ------------------                 
>                zhaokun
>                 2009-07-07
> 
> -------------------------------------------------------------
> From:Hehe cmesunoom@xxxxxxxx
> Date:2009-07-06 18:47:50
> To:condor-users condor-users@xxxxxxxxxxx
> cc: 
> Title:[Condor-users] THE MPI JOB ALWAYS IN "RUNNING"
> 
> Hello, all
> When I submit an MPI job to Condor, the job stays in the
> "running" state all the time.
>  
> ************hello_log  file***************
> Job submitted from host:<.......>
> Node 0 executing on host:<........>
> Job executing on host:MPI_job
>  
> So I would like to know the reason for it.
> 
> Any help will be appreciated.
> Regards,
> Han
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/ 