
[Condor-users] MPI on Windows XP



Hi,
 
Is it possible to run MPI jobs on Windows XP machines using Condor 6.8.4 and MPICH 1.2.4?
 
I have two Windows XP machines in a Windows domain environment, with common accounts on both machines. I can run MPI jobs on the two machines, and I can run simple jobs using Condor, but I cannot run MPI jobs under Condor: the job remains "idle" in the queue.
 
I have the following setup:
-----------------------------------------------------------------------------------------------------------------
central manager (nes30700.lands.resnet.qg):
 
condor_config:
the following changes were made to the default
    SEC_DEFAULT_AUTHENTICATION = OPTIONAL

    CREDD_HOST = nes30700.lands.resnet.qg
    STARTER_ALLOW_RUNAS_OWNER = True
    CREDD_CACHE_LOCALLY = True
    SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
 
    HOSTALLOW_ADMINISTRATOR = *
    HOSTALLOW_READ = *
    HOSTALLOW_WRITE = *
 
    HOSTALLOW_CONFIG = nes30700.lands.resnet.qg
    ALLOW_CONFIG = root@xxxxxxxxxxxxxxx/$(IP_ADDRESS)
 
The contents of etc/condor_config.local.credd were copied to condor_config.local:
- only one line was changed:
    CREDD.SEC_DEFAULT_AUTHENTICATION =REQUIRED was changed to:
    CREDD.SEC_DEFAULT_AUTHENTICATION =OPTIONAL
 
Pool password was set: condor_store_cred -c add
My own Windows account password was set: condor_store_cred -u jeffreysj@xxxxxxxxxxxxxxx add
 
-----------------------------------------------------------------------------------------------------------------
One execute machine (nes15300.lands.resnet.qg):
 
condor_config:
the following changes were made to the default
    SEC_DEFAULT_AUTHENTICATION = OPTIONAL

    CREDD_HOST = nes30700.lands.resnet.qg
    STARTER_ALLOW_RUNAS_OWNER = True
    CREDD_CACHE_LOCALLY = True
    SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
 
    HOSTALLOW_ADMINISTRATOR = *
    HOSTALLOW_READ = *
    HOSTALLOW_WRITE = *
 
    HOSTALLOW_CONFIG = nes15300.lands.resnet.qg
    ALLOW_CONFIG = root@xxxxxxxxxxxxxxx/$(IP_ADDRESS)
 
The contents of etc/condor_config.local.credd were copied to condor_config.local:
- only one line was changed:
    CREDD.SEC_DEFAULT_AUTHENTICATION =REQUIRED was changed to:
    CREDD.SEC_DEFAULT_AUTHENTICATION =OPTIONAL
The contents of etc/condor_config.local.dedicated.resource were appended to condor_config.local:
- run policy number 2 was selected
    DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxx" changed to DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"
 
Pool password was set: condor_store_cred -c add
My own Windows account password was set: condor_store_cred -u jeffreysj@xxxxxxxxxxxxxxx add
-----------------------------------------------------------------------------------------------------------------
 
My submit script is:
 
Universe                = parallel
Executable              = run_condor_MPICH1
arguments               = cpilog_minimal.exe
machine_count           = 1
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = \\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe
 
 
Condor appears to attempt to start the job: the execute machine nes15300 changes state to "Claimed", but the job then seems to fail some authentication test.
 
The StartLog on the execute machine contains:
-----------------------------------------------------------------------------------------------------------------
3/19 16:26:49 DaemonCore: Command received via TCP from host <131.242.63.124:3285>
3/19 16:26:49 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/19 16:26:49 Request accepted.
3/19 16:26:49 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/19 16:26:49 State change: claiming protocol successful
3/19 16:26:49 Changing state: Unclaimed -> Claimed
3/19 16:26:49 DaemonCore: Command received via UDP from host <131.242.63.124:3283>
3/19 16:26:49 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/19 16:26:49 match_info called
3/19 16:26:53 DaemonCore: Command received via TCP from host <131.242.63.124:3298>
3/19 16:26:53 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/19 16:26:53 Got activate_claim request from shadow (<131.242.63.124:3298>)
3/19 16:26:53 Remote job ID is 9.0
3/19 16:26:53 Got universe "PARALLEL" (11) from request classad
3/19 16:26:53 State change: claim-activation protocol successful
3/19 16:26:53 Changing activity: Idle -> Busy
3/19 16:26:55 DaemonCore: Command received via TCP from host <131.242.63.124:3304>
3/19 16:26:55 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/19 16:26:55 Called deactivate_claim()
3/19 16:26:55 attempt to connect to <131.242.63.162:2345> failed: connect errno = 10061 connection refused.
3/19 16:26:55 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2345> failed
 
3/19 16:26:55 Send_Signal: ERROR Connect to <131.242.63.162:2345> failed.
3/19 16:26:55 Error sending signal to starter, errno = 0 (No error)
3/19 16:26:55 attempt to connect to <131.242.63.162:2345> failed: connect errno = 10061 connection refused.
3/19 16:26:55 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2345> failed
 
3/19 16:26:55 Send_Signal: ERROR Connect to <131.242.63.162:2345> failed.
3/19 16:26:55 DaemonCore: Command received via UDP from host <131.242.63.162:2355>
3/19 16:26:55 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/19 16:26:55 Starter pid 388 exited with status 0
3/19 16:26:55 State change: starter exited
3/19 16:26:55 Changing activity: Busy -> Idle
3/19 16:31:59 DaemonCore: Command received via TCP from host <131.242.63.124:3344>
3/19 16:31:59 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/19 16:31:59 Got activate_claim request from shadow (<131.242.63.124:3344>)
3/19 16:31:59 Remote job ID is 9.0
3/19 16:31:59 Got universe "PARALLEL" (11) from request classad
3/19 16:31:59 State change: claim-activation protocol successful
3/19 16:31:59 Changing activity: Idle -> Busy
3/19 16:32:01 DaemonCore: Command received via TCP from host <131.242.63.124:3346>
3/19 16:32:01 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/19 16:32:01 Called deactivate_claim()
3/19 16:32:01 attempt to connect to <131.242.63.162:2389> failed: connect errno = 10061 connection refused.
3/19 16:32:01 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2389> failed
 
3/19 16:32:01 Send_Signal: ERROR Connect to <131.242.63.162:2389> failed.
3/19 16:32:01 Error sending signal to starter, errno = 0 (No error)
3/19 16:32:01 attempt to connect to <131.242.63.162:2389> failed: connect errno = 10061 connection refused.
3/19 16:32:01 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2389> failed
 
3/19 16:32:01 Send_Signal: ERROR Connect to <131.242.63.162:2389> failed.
3/19 16:32:01 DaemonCore: Command received via UDP from host <131.242.63.162:2399>
3/19 16:32:01 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/19 16:32:01 Starter pid 2980 exited with status 0
3/19 16:32:01 State change: starter exited
3/19 16:32:01 Changing activity: Busy -> Idle
-----------------------------------------------------------------------------------------------------------------
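 
As a cross-check, the address in the "connection refused" lines can be compared with the starter's own command socket as reported in the StarterLog excerpt further down. The following is only a rough Python sketch: the function names are made up for illustration, and the two sample lines are copied from the log excerpts in this message. If the addresses match, the refused connections may simply be the startd trying to signal a starter that has already exited, rather than a separate authentication problem.

```python
import re

# Matches "<ip:port>" as it is printed in the Condor daemon logs.
ADDR = re.compile(r"<(\d+\.\d+\.\d+\.\d+):(\d+)>")


def addresses_on_lines(log_lines, marker):
    """Return the set of (ip, port) pairs found on lines containing marker."""
    found = set()
    for line in log_lines:
        if marker in line:
            match = ADDR.search(line)
            if match:
                found.add((match.group(1), int(match.group(2))))
    return found


# Sample lines copied from the StartLog and StarterLog excerpts.
start_log = [
    "3/19 16:26:55 attempt to connect to <131.242.63.162:2345> failed: "
    "connect errno = 10061 connection refused.",
]
starter_log = [
    "3/19 16:26:53 DaemonCore: Command Socket at <131.242.63.162:2345>",
]

refused = addresses_on_lines(start_log, "connection refused")
starter_sockets = addresses_on_lines(starter_log, "Command Socket at")

# Addresses that were refused AND belonged to a starter.
overlap = refused & starter_sockets
print(overlap)
```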
 
 
 
The StarterLog indicates that the MPI job executable was successfully transferred to the execute machine, but it contains another error:
-----------------------------------------------------------------------------------------------------------------
3/19 16:26:53 ******************************************************
3/19 16:26:53 ** condor_starter (CONDOR_STARTER) STARTING UP
3/19 16:26:53 ** D:\condor-6.8.4\bin\condor_starter.exe
3/19 16:26:53 ** $CondorVersion: 6.8.4 Feb  1 2007 $
3/19 16:26:53 ** $CondorPlatform: INTEL-WINNT50 $
3/19 16:26:53 ** PID = 388
3/19 16:26:53 ** Log last touched 3/19 15:50:01
3/19 16:26:53 ******************************************************
3/19 16:26:53 Using config source: D:\condor-6.8.4\condor_config
3/19 16:26:53 Using local config sources:
3/19 16:26:53    D:\condor-6.8.4/condor_config.local
3/19 16:26:53 DaemonCore: Command Socket at <131.242.63.162:2345>
3/19 16:26:53 Setting resource limits not implemented!
3/19 16:26:53 Communicating with shadow <131.242.63.124:3289>
3/19 16:26:53 Submitting machine is "nes30700.lands.resnet.qg"
3/19 16:26:53 Job has WantIOProxy=true
3/19 16:26:53 Initialized IO Proxy.
3/19 16:26:54 File transfer completed successfully.
3/19 16:26:55 Starting a PARALLEL universe job with ID: 9.0
3/19 16:26:55 IWD: D:\condor-6.8.4/execute\dir_388
3/19 16:26:55 Renice expr "10" evaluated to 10
3/19 16:26:55 About to exec D:\condor-6.8.4\execute\dir_388\condor_exec.exe cpilog_minimal.exe
3/19 16:26:55 ERROR: D:\condor-6.8.4\execute\dir_388\condor_exec.exe is not a valid Windows executable
3/19 16:26:55 ERROR "Create_Process(D:\condor-6.8.4\execute\dir_388\condor_exec.exe,cpilog_minimal.exe, ...) failed" at line 393 in file ..\src\condor_starter.V6.1\os_proc.C
3/19 16:26:55 ShutdownFast all jobs.
-----------------------------------------------------------------------------------------------------------------
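 
The file the starter rejects, condor_exec.exe, appears to be the copy of the submit file's Executable (run_condor_MPICH1), since cpilog_minimal.exe is passed to it as an argument. Windows executables begin with the two bytes "MZ", so one quick check is to look at the first two bytes of that file. The snippet below is a minimal Python sketch of such a check; the function name and the two throwaway demo files are hypothetical, and the assumption that run_condor_MPICH1 might be a script rather than a binary is unconfirmed.

```python
import os
import tempfile


def looks_like_windows_executable(path):
    """Return True if the file starts with the 'MZ' signature used by .exe files."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"MZ"


# Demo on two throwaway files (hypothetical stand-ins, not the real job files):
# one starting with the MZ signature and one that is a plain batch-style script.
with tempfile.TemporaryDirectory() as tmp:
    exe_like = os.path.join(tmp, "fake_binary.exe")
    script_like = os.path.join(tmp, "fake_wrapper_script")
    with open(exe_like, "wb") as handle:
        handle.write(b"MZ\x90\x00")
    with open(script_like, "wb") as handle:
        handle.write(b"@echo off\r\n")
    exe_result = looks_like_windows_executable(exe_like)
    script_result = looks_like_windows_executable(script_like)

print(exe_result, script_result)
```

Running the same function against the real run_condor_MPICH1 file on the submit machine would show whether it carries the signature at all.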
 
Any advice would be greatly appreciated.
 
cheers
steve
