
Re: [Condor-users] MPI job problem



Can you send us the log from the schedd and the startd?
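
In case it helps to locate them, something like the sketch below should print where each daemon writes its log and show the tail of it. SCHEDD_LOG and STARTD_LOG are the standard configuration macros; the helper name show_condor_log and the 200-line cutoff are only illustrative, and I am assuming condor_config_val is on your PATH. The schedd log lives on the submit machine (pragma001), the startd log on the execute machines (pragma002 or pragma004).

```shell
# Sketch: print the path and the last 200 lines of a Condor daemon log.
# show_condor_log is a hypothetical helper name, not a Condor command.
show_condor_log() {
    knob="$1"
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found on PATH; cannot resolve $knob"
        return 0
    fi
    # Ask Condor where this daemon's log file is configured to live.
    log_path=$(condor_config_val "$knob") || return 0
    echo "$knob = $log_path"
    # Only tail the file if it is readable by the current user.
    if [ -r "$log_path" ]; then
        tail -n 200 "$log_path"
    fi
}

show_condor_log SCHEDD_LOG   # run on the submit machine
show_condor_log STARTD_LOG   # run on an execute machine
```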

Thanks,

-greg

Li-Yung_Ho wrote:
> Hi Mark and Greg
> Thanks for your responses
> 
> I changed the START expression from Scheduler =?= $(DedicatedScheduler) to True
> in the local configuration files of pragma002 and pragma004, and indeed the
> state became "Unclaimed":
> ------------------------------------------------------------------------
> [lyho@pragma001 lyho]$ condor_status
> 
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
> 
> pragma001.gri LINUX       INTEL  Owner      Idle       0.010   469  0+00:10:04
> pragma002.gri LINUX       INTEL  Unclaimed  Idle       0.290   469  0+03:21:02
> pragma004.gri LINUX       INTEL  Unclaimed  Idle       0.150  1004  0+03:19:48
> 
>                      Machines Owner Claimed Unclaimed Matched Preempting
> 
>          INTEL/LINUX        3     1       0         2       0          0
> 
>                Total        3     1       0         2       0          0
> 
> -------------------------------------------------------------------------
> 
> but the job is still idle:
> 
> -------------------------------------------------------------------------
> [lyho@pragma001 lyho]$ condor_q
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  140.0   lyho            4/29 17:44   0+00:00:00 I  0   0.3  cpi
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> ------------------------------------------------------------------------
> 
> Then I tested a vanilla job.
> The job description file:
> ============================
> universe = vanilla
> executable = cpi
> log = logofcpi.new
> error = errofcpi.$(NODE).new
> output = outofcpi.$(NODE).new
> queue
> =============================
> 
> and it runs successfully:
> 
> ------------------------------------------------------------------------
> [lyho@pragma001 condor_test]$ condor_q
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  142.0   lyho            5/2  13:18   0+00:00:00 R  0   0.3  cpi
> 
> 1 jobs; 0 idle, 1 running, 0 held
> ---------------------------------------------------------------------
> 
> The log, error, and output files:
> 
> ---------------------------------------------------------------------
> [lyho@pragma001 condor_test]$ more *.new
> ::::::::::::::
> errofcpi..new
> ::::::::::::::
> Process 0 on pragma002.grid.sinica.edu.tw
> ::::::::::::::
> logofcpi.new
> ::::::::::::::
> 000 (142.000.000) 05/02 13:18:57 Job submitted from host: <140.109.98.21:33670>
> ...
> 001 (142.000.000) 05/02 13:19:00 Job executing on host: <140.109.98.22:48852>
> ...
> 005 (142.000.000) 05/02 13:19:00 Job terminated.
>         (1) Normal termination (return value 0)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         0  -  Run Bytes Sent By Job
>         0  -  Run Bytes Received By Job
>         0  -  Total Bytes Sent By Job
>         0  -  Total Bytes Received By Job
> ...
> ::::::::::::::
> outofcpi..new
> ::::::::::::::
> pi is approximately 3.1416009869231254, Error is 0.0000083333333323
> wall clock time = 0.000055
> 
> --------------------------------------------------------------------
> 
> So something is wrong with the MPI job.
> 
> Can anyone help me?
> 
> 
> 
> On Fri, 29 Apr 2005 12:11:53 +0300, Mark Silberstein wrote
> 
>>The problem seems to be that all your computers are in the "Owner"
>>state, i.e. Condor is NOT allowed to start any job on them.
>>Obviously you're using a START expression (in the condor_config)
>>that makes your resources reject Condor jobs when they are under
>>load or when there's some keyboard activity (the output you sent was
>>produced on pragma001, so you were working on it, and the two others
>>have a load average of 1.000). To TEST that MPI really works you
>>might want to disable this by setting START = TRUE (which would
>>allow any job to be started, regardless of the current computer
>>activity), or START = ($(START)) || (Scheduler =?= $(DedicatedScheduler)).
>>Mark
>>
> 
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users