
Re: [Condor-users] condor on Win XP & Win HPC server 2008



Dear Condor users,
     I've installed condor-7.2.3 on Windows Server 2008 HPC Edition 64-bit, on a dual-core, dual-processor system. (The Condor pool contains only this single system.)
    The submitted job stays in the idle state and never changes to the running state.
Command output details:
E:\condor723\mpi-test>condor_status
Name           OpSys    Arch   State      Activity  LoadAv  Mem   ActvtyTime
slot1@master.  WINNT60  INTEL  Unclaimed  Idle      0.000   1023  0+00:00:58
slot2@master.  WINNT60  INTEL  Unclaimed  Idle      0.000   1023  0+00:40:05
slot3@master.  WINNT60  INTEL  Unclaimed  Idle      0.000   1023  0+00:40:06
slot4@master.  WINNT60  INTEL  Unclaimed  Idle      0.060   1023  0+00:40:07
               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT60      4      0        0          4        0           0         0
        Total      4      0        0          4        0           0         0
E:\condor723\mpi-test>

 
E:\condor723\mpi-test>condor_q
-- Submitter: master.hpc.com : <10.129.150.44:49193> : master.hpc.com
 ID      OWNER           SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   administrator   5/14 13:10  0+00:00:02 I  0   0.0  ans.bat
1 jobs; 1 idle, 0 running, 0 held
E:\condor723\mpi-test>condor_q -analyze
 -- Submitter: master.hpc.com : <10.129.150.44:49193> : master.hpc.com
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 ---
001.000: Run analysis summary. Of 4 machines,
  0 are rejected by your job's requirements
  0 reject your job because of their own requirements
  0 match but are serving users with a better priority in the pool
  4 match but reject the job for unknown reasons
  0 match but will not currently preempt their existing job
  0 are available to run your job
  Last successful match: Thu May 14 13:15:38 2009
 1 jobs; 1 idle, 0 running, 0 held

 
 E:\condor723\mpi-test>type log_ansys
000 (001.000.000) 05/14 13:10:38 Job submitted from host: <10.129.150.44:49193>
 ...
022 (001.000.000) 05/14 13:10:39 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to slot1@xxxxxxxxxxxxxx <10.129.150.44:49194>
 ...
024 (001.000.000) 05/14 13:10:39 Job reconnection failed
  Job not found at execution machine
  Can not reconnect to slot1@xxxxxxxxxxxxxx, rescheduling job
 ...
022 (001.000.000) 05/14 13:15:39 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to slot1@xxxxxxxxxxxxxx <10.129.150.44:49194>
 ...
024 (001.000.000) 05/14 13:15:39 Job reconnection failed
  Job not found at execution machine
  Can not reconnect to slot1@xxxxxxxxxxxxxx, rescheduling job
 ...
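
In case it helps anyone comparing logs, the event sequence above can be tallied with a short script. This is only a minimal sketch: it assumes each event header line has the form "NNN (cluster.proc.subproc) MM/DD HH:MM:SS text", as in the log shown here, and the names summarize_events, EVENT_RE and SAMPLE are hypothetical, chosen for illustration.

```python
import re
from collections import Counter

# Assumed header format, taken from the log lines above:
#   "022 (001.000.000) 05/14 13:10:39 Job disconnected, attempting to reconnect"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def summarize_events(log_text):
    """Count how often each three-digit event code appears in a user log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:  # indented detail lines and "..." separators do not match
            counts[match.group(1)] += 1
    return counts


# A shortened copy of the log above, used only as sample input.
SAMPLE = """\
000 (001.000.000) 05/14 13:10:38 Job submitted from host: <10.129.150.44:49193>
...
022 (001.000.000) 05/14 13:10:39 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
...
024 (001.000.000) 05/14 13:10:39 Job reconnection failed
  Job not found at execution machine
...
"""

if __name__ == "__main__":
    for code, count in sorted(summarize_events(SAMPLE).items()):
        print(code, count)
```

On the full log this would show one 000 event followed by repeated 022/024 pairs, i.e. the job is matched, disconnects, and is rescheduled each time.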

Is Condor tested on 64-bit Windows systems?
Thanks,
Sangamesh


On Sat, May 2, 2009 at 11:50 AM, Sangamesh B <forum.san@xxxxxxxxx> wrote:
Dear all,
      Condor-7.0.5 is installed with the central manager on Windows XP 32-bit (single-core machine) and the execution machine on Windows Server 2008 64-bit HPC Edition (dual-core, dual-processor = 4 cores total). The job is submitted from the master node and should run on the HPC Server 2008 machine, but it fails with the following error:
E:\condor705\con-mpi-test\sleep-test1>type log
000 (080.000.000) 05/02 11:29:40 Job submitted from host: <10.129.150.82:1043>
...
022 (080.000.000) 05/02 11:29:55 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.129.150.44:56466>
...
024 (080.000.000) 05/02 11:30:00 Job reconnection failed
  Job not found at execution machine
  Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (080.000.000) 05/02 11:34:46 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.129.150.44:56466>
...
024 (080.000.000) 05/02 11:34:46 Job reconnection failed
  Job not found at execution machine
  Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
E:\condor705\con-mpi-test\sleep-test1>

-- Submitter: support-2 : <10.129.150.82:1043> : support-2
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
080.000: Run analysis summary. Of 5 machines,
  1 are rejected by your job's requirements
  0 reject your job because of their own requirements
  0 match but are serving users with a better priority in the pool
  4 match but reject the job for unknown reasons
  0 match but will not currently preempt their existing job
  0 are available to run your job
  Last successful match: Sat May 02 11:34:41 2009
1 jobs; 1 idle, 0 running, 0 held
E:\condor705\con-mpi-test\sleep-test1>

Any hint as to why it is not able to connect?
It does work with the other 32-bit XP systems.
Thanks in advance.