[Condor-users] Condor configuration for Multi-CPU machines



Everyone,
I just read Peter's email on setting up a set of dual-core/dual-CPU compute
nodes running WinXP (x64).
I am in the middle of setting up a very similar grid configuration (using
Condor 6.8.2), and his configuration script is very enlightening and
demonstrates the flexibility of Condor.

My problem is submitting simple vanilla jobs to a pool containing 16
compute nodes, each with dual-CPU/dual-core Opterons running WinXP x64.
The executable is a multi-threaded app (24 threads, with a RAM footprint
of 90 MB) that has run well on a Condor pool with a mix of single- and
dual-CPU (all single-core) compute nodes.  The problem appears to be with
Condor, but I cannot be sure...

When six jobs (job IDs 22-27) were submitted to the compute node, only
one job completed in 24 hr and the other five stayed idle.
Each of these jobs should take approximately 3 hr to run on a single
processor.

Note that I have tried the suggestions from the FAQ article "Why does my
Linux job have an enormous ImageSize and refuse to run anymore?" with no
change in the job behavior.

Are there any special considerations for XP x64 and Condor?
Sorry for all of the log files here, but I am trying to demonstrate that a
single node is working before turning on the other 15 nodes,
and I am unsure which log files people need to see.

The compute node is headless; however, we do log in via Remote Desktop.

I have included the following:
1) system info
2) output from condor_q -global -analyze run on the submit host.
3) condor logs from the 6 jobs submitted.
4) condor log/STARTLOG
5) condor log/STARTLOG.vm1 (huge), STARTLOG.vm2 (small), STARTLOG.vm3 (small),
   STARTLOG.vm4 (small)

A portion of my system info file is here:
%%%%%%%%%%%%%%%%%%%%%
OS Name     Microsoft(R) Windows(R) XP Professional x64 Edition
Version     5.2.3790 Service Pack 1 Build 3790
Other OS Description    Not Available
OS Manufacturer   Microsoft Corporation
System Name CONDOR02
System Manufacturer     RIOWORKS
System Model      C1000
System Type x64-based PC
Processor   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2405 Mhz
Processor   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2406 Mhz
Processor   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2406 Mhz
Processor   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2405 Mhz
BIOS Version/Date Phoenix Technologies Ltd. V1.13, 7/13/2006
SMBIOS Version    2.34
Windows Directory C:\WINDOWS
System Directory  C:\WINDOWS\system32
Boot Device \Device\HarddiskVolume1
Locale      United States
Hardware Abstraction Layer    Version = "5.2.3790.1830
(srv03_sp1_rtm.050324-1447)"
User Name   Not Available
Time Zone   Pacific Standard Time
Total Physical Memory   4,094.68 MB
Available Physical Memory     3.49 GB
Total Virtual Memory    5.75 GB
Available Virtual Memory      5.48 GB
Page File Space   2.00 GB
Page File   D:\pagefile.sys
%%%%%%%%%%%%%%%%%%%%%

The dump from "condor_q -global -analyze" (run from the submit host) looked
like this after running for 24 hr.
Note that job 23.0 is missing since it completed 15 hr ago.
Note also that there is NOTHING else running on the compute node (condor02,
192.168.1.165), and there are no entries in d:/condor/execute for these jobs.
The job submit host is GRID05 (192.168.1.5).
%%%%%%%%%%%%%%%%%%%%%
$ condor_q -global -analyze


-- Schedd: GRID05.RMS.local : <192.168.1.5:4897>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
022.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
024.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
025.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
026.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 11:27:09 2006
---
027.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
%%%%%%%%%%%%%%%%%%%%%
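
To compare the analysis blocks above without reading them by eye, the
output can be tabulated. The following is only a minimal Python sketch, not
a Condor tool: the function name parse_analysis is made up for
illustration, and the sample text is copied from the block for job 022.000.

```python
import re

# Sample copied from the condor_q -global -analyze output (job 022.000).
SAMPLE = """\
022.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
"""

JOB_RE = re.compile(r"^(\d+\.\d+):\s+Run analysis summary\.\s+Of (\d+) machines")
COUNT_RE = re.compile(r"^\s+(\d+) (.+)$")


def parse_analysis(text):
    """Return {job_id: {reason: machine_count}} from analyze-style text."""
    jobs = {}
    current = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            # A new per-job block starts here.
            current = jobs.setdefault(m.group(1), {})
            continue
        m = COUNT_RE.match(line)
        if m and current is not None:
            current[m.group(2).strip()] = int(m.group(1))
    return jobs


summary = parse_analysis(SAMPLE)
for job_id, reasons in summary.items():
    for reason, count in reasons.items():
        if count:
            print(job_id, count, reason)
```

For this sample, only the two non-zero rows are printed, which makes it
easier to see that all five idle jobs report the same two reasons.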

Note that job 23 completed (condor.log):
%%%%%%%%%%%%%%%%%%%%%
000 (023.000.000) 11/09 18:10:39 Job submitted from host:
<192.168.1.5:4897>
...
001 (023.000.000) 11/09 18:13:30 Job executing on host:
<192.168.1.165:1617>
...
006 (023.000.000) 11/09 18:13:38 Image size of job updated: 91684
...
006 (023.000.000) 11/09 18:33:38 Image size of job updated: 91904
...
005 (023.000.000) 11/09 20:08:10 Job terminated.
      (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
      18274022  -  Run Bytes Sent By Job
      41873792  -  Run Bytes Received By Job
      18274022  -  Total Bytes Sent By Job
      41873792  -  Total Bytes Received By Job
%%%%%%%%%%%%%%%%%%%%%
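
The wall-clock time of job 23 can be read off the event timestamps in the
log above. The following is a minimal Python sketch under a few
assumptions: the helper name event_times is hypothetical, the year is
supplied by hand because the log omits it, and only the three event header
lines are copied from the log.

```python
import re
from datetime import datetime

# Event header lines copied from the job 23 user log.
LOG = """\
000 (023.000.000) 11/09 18:10:39 Job submitted from host:
001 (023.000.000) 11/09 18:13:30 Job executing on host:
005 (023.000.000) 11/09 20:08:10 Job terminated.
"""

EVENT_RE = re.compile(r"^(\d{3}) \(\d+\.\d+\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) ")


def event_times(text, year=2006):
    """Map event code -> datetime of the first occurrence of that event.

    The user log omits the year, so the caller supplies it (an assumption).
    """
    times = {}
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            stamp = datetime.strptime(f"{year}/{m.group(2)}", "%Y/%m/%d %H:%M:%S")
            times.setdefault(m.group(1), stamp)
    return times


t = event_times(LOG)
queued = t["001"] - t["000"]   # submitted -> executing
ran = t["005"] - t["001"]      # executing -> terminated
print("queued:", queued, "ran:", ran)
```

By these timestamps the job waited just under 3 minutes and ran for just
under 2 hours.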


Note that the logs for jobs 21, 22, 24, and 25 look like this (condor.log):
%%%%%%%%%%%%%%%%%%%%%
000 (024.000.000) 11/09 18:10:51 Job submitted from host:
<192.168.1.5:4897>
%%%%%%%%%%%%%%%%%%%%%

Note that the log for job 27 looks like this (condor.log):
%%%%%%%%%%%%%%%%%%%%%
000 (027.000.000) 11/10 11:26:34 Job submitted from host:
<192.168.1.5:4897>
...
007 (027.000.000) 11/10 11:32:37 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:32:52 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:06 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:21 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:36 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
%%%%%%%%%%%%%%%%%%%%%
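
The shadow exceptions for job 27 repeat at a steady interval. The following
is a small Python sketch that computes the gaps between the five timestamps
copied from the log above; the function name gaps_in_seconds is made up for
illustration.

```python
from datetime import datetime

# Timestamps of the five "Shadow exception!" events for job 27.
STAMPS = ["11:32:37", "11:32:52", "11:33:06", "11:33:21", "11:33:36"]


def gaps_in_seconds(stamps):
    """Return the number of seconds between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(STAMPS)
print(gaps)
```

The gaps come out at 14 to 15 seconds each, so the shadow appears to be
retrying on a fixed schedule rather than failing at random.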

The STARTLOG on the controller/compute node is huge and ends like this...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
...
11/10 10:54:10 DaemonCore: Command received via UDP from host
<192.168.1.5:2708>
11/10 10:54:10 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:54:10 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 DaemonCore: Command received via TCP from host
<192.168.1.5:2685>
11/10 10:54:22 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/10 10:54:22 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.1.5:2685>, fd is 336
11/10 10:54:22 Buf::write(): condor_write() failed
11/10 10:54:22 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2694>, fd is 336
11/10 10:54:22 Buf::write(): condor_write() failed
11/10 10:54:22 SECMAN: Error sending response classad!
11/10 10:54:34 DaemonCore: Command received via UDP from host
<192.168.1.5:2714>
11/10 10:54:34 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:54:34 vm3: State change: received RELEASE_CLAIM command
11/10 10:54:34 vm3: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/10 10:54:34 vm3: State change: No preempting claim, returning to owner
11/10 10:54:34 vm3: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/10 10:54:34 vm3: State change: IS_OWNER is false
11/10 10:54:34 vm3: Changing state: Owner -> Unclaimed
11/10 10:54:47 DaemonCore: Command received via TCP from host
<192.168.1.5:2695>
11/10 10:54:47 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:54:47 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1595)
11/10 10:54:59 DaemonCore: Command received via UDP from host
<192.168.1.5:2715>
11/10 10:54:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:54:59 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1590)
11/10 10:54:59 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2700>, fd is 388
11/10 10:54:59 Buf::write(): condor_write() failed
11/10 10:54:59 SECMAN: Error sending response classad!
11/10 10:55:11 DaemonCore: Command received via TCP from host
<192.168.1.5:2701>
11/10 10:55:11 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/10 10:55:11 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1590)
11/10 10:55:11 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.1.5:2701>, fd is 388
11/10 10:55:11 Buf::write(): condor_write() failed
11/10 10:55:23 DaemonCore: Command received via TCP from host
<192.168.1.5:2704>
11/10 10:55:23 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:55:23 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:55:23 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2709>, fd is 388
11/10 10:55:23 Buf::write(): condor_write() failed
11/10 10:55:23 SECMAN: Error sending response classad!
11/10 10:55:35 DaemonCore: Command received via TCP from host
<192.168.1.5:2711>
11/10 10:55:35 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:55:35 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1590)
11/10 10:55:47 DaemonCore: Command received via TCP from host
<192.168.1.5:2719>
11/10 10:55:47 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
11/10 10:55:47 vm1: Request accepted.
11/10 10:55:59 vm1: Remote owner is bhaz@xxxxxxxxx
11/10 10:55:59 vm1: State change: claiming protocol successful
11/10 10:55:59 vm1: Changing state: Unclaimed -> Claimed
11/10 10:55:59 DaemonCore: Command received via UDP from host
<192.168.1.165:2760>
11/10 10:55:59 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/10 10:55:59 vm1: match_info called
11/10 10:56:11 DaemonCore: Command received via TCP from host
<192.168.1.5:2720>
11/10 10:56:11 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
11/10 10:56:11 vm2: Request accepted.
11/10 10:56:23 vm2: Remote owner is bhaz@xxxxxxxxx
11/10 10:56:23 vm2: State change: claiming protocol successful
11/10 10:56:23 vm2: Changing state: Unclaimed -> Claimed
11/10 10:56:23 DaemonCore: Command received via UDP from host
<192.168.1.165:2761>
11/10 10:56:23 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/10 10:56:23 vm2: match_info called
11/10 10:56:35 DaemonCore: Command received via TCP from host
<192.168.1.5:2721>
11/10 10:56:35 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
11/10 10:56:35 vm4: Request accepted.
11/10 10:56:47 vm4: Remote owner is bhaz@xxxxxxxxx
11/10 10:56:47 vm4: State change: claiming protocol successful
11/10 10:56:47 vm4: Changing state: Unclaimed -> Claimed
11/10 10:56:47 DaemonCore: Command received via UDP from host
<192.168.1.165:2762>
11/10 10:56:47 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/10 10:56:47 vm2: match_info called
11/10 10:56:47 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2730>, fd is 184
11/10 10:56:47 Buf::write(): condor_write() failed
11/10 10:56:47 SECMAN: Error sending response classad!
11/10 10:56:47 DaemonCore: Command received via UDP from host
<192.168.1.165:2763>
11/10 10:56:47 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/10 10:56:47 vm4: match_info called
11/10 10:56:59 DaemonCore: Command received via TCP from host
<192.168.1.5:2733>
11/10 10:56:59 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/10 10:56:59 vm1: Got activate_claim request from shadow
(<192.168.1.5:2733>)
11/10 10:56:59 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.1.5:2733>, fd is 184
11/10 10:56:59 Buf::write(): condor_write() failed
11/10 10:56:59 vm1: Can't send eom to shadow.
11/10 10:56:59 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2742>, fd is 240
11/10 10:56:59 Buf::write(): condor_write() failed
11/10 10:56:59 SECMAN: Error sending response classad!
11/10 10:57:12 DaemonCore: Command received via UDP from host
<192.168.1.5:2757>
11/10 10:57:12 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:57:12 vm1: State change: received RELEASE_CLAIM command
11/10 10:57:12 vm1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/10 10:57:12 vm1: State change: No preempting claim, returning to owner
11/10 10:57:12 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/10 10:57:12 vm1: State change: IS_OWNER is false
11/10 10:57:12 vm1: Changing state: Owner -> Unclaimed
11/10 10:57:12 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2743>, fd is 388
11/10 10:57:12 Buf::write(): condor_write() failed
11/10 10:57:12 SECMAN: Error sending response classad!
11/10 10:57:24 DaemonCore: Command received via UDP from host
<192.168.1.5:2758>
11/10 10:57:24 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:57:24 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1601)
11/10 10:57:36 DaemonCore: Command received via TCP from host
<192.168.1.5:2744>
11/10 10:57:36 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/10 10:57:36 vm2: Got activate_claim request from shadow
(<192.168.1.5:2744>)
11/10 10:57:36 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.1.5:2744>, fd is 240
11/10 10:57:36 Buf::write(): condor_write() failed
11/10 10:57:36 vm2: Can't send eom to shadow.
11/10 10:57:48 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2753>, fd is 240
11/10 10:57:48 Buf::write(): condor_write() failed
11/10 10:57:48 SECMAN: Error sending response classad!
11/10 10:58:12 DaemonCore: Command received via TCP from host
<192.168.1.5:2754>
11/10 10:58:12 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:58:12 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1601)
11/10 10:58:24 DaemonCore: Command received via UDP from host
<192.168.1.5:2766>
11/10 10:58:24 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:58:24 vm2: State change: received RELEASE_CLAIM command
11/10 10:58:24 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/10 10:58:24 vm2: State change: No preempting claim, returning to owner
11/10 10:58:24 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/10 10:58:24 vm2: State change: IS_OWNER is false
11/10 10:58:24 vm2: Changing state: Owner -> Unclaimed
11/10 10:58:24 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2759>, fd is 336
11/10 10:58:24 Buf::write(): condor_write() failed
11/10 10:58:24 SECMAN: Error sending response classad!
11/10 10:58:36 DaemonCore: Command received via UDP from host
<192.168.1.5:2767>
11/10 10:58:36 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:58:36 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1602)
11/10 10:58:48 DaemonCore: Command received via TCP from host
<192.168.1.5:2760>
11/10 10:58:48 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/10 10:58:48 vm4: Got activate_claim request from shadow
(<192.168.1.5:2760>)
11/10 10:58:48 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.1.5:2760>, fd is 240
11/10 10:58:48 Buf::write(): condor_write() failed
11/10 10:58:48 vm4: Can't send eom to shadow.
11/10 10:59:00 DaemonCore: Command received via TCP from host
<192.168.1.5:2763>
11/10 10:59:00 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:59:00 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1602)
11/10 10:59:13 DaemonCore: Command received via UDP from host
<192.168.1.5:2772>
11/10 10:59:13 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:59:13 vm4: State change: received RELEASE_CLAIM command
11/10 10:59:13 vm4: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/10 10:59:13 vm4: State change: No preempting claim, returning to owner
11/10 10:59:13 vm4: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/10 10:59:13 vm4: State change: IS_OWNER is false
11/10 10:59:13 vm4: Changing state: Owner -> Unclaimed
11/10 10:59:13 condor_write(): Socket closed when trying to write 287 bytes
to <192.168.1.5:2768>, fd is 336
11/10 10:59:13 Buf::write(): condor_write() failed
11/10 10:59:13 SECMAN: Error sending response classad!
11/10 10:59:25 DaemonCore: Command received via UDP from host
<192.168.1.5:2773>
11/10 10:59:25 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/10 10:59:25 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1603)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
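
Because the excerpt above was wrapped by the mail client, each ClaimId
ends up on the line after its message. The following is a minimal Python
sketch (the names unwrap and stale_claims are hypothetical) that rejoins
the wrapped lines and counts the "can't find resource" messages per
ClaimId, using a few lines copied from the excerpt.

```python
import re
from collections import Counter

# A few lines copied from the STARTLOG excerpt, wrapped as in the email.
EXCERPT = """\
11/10 10:54:10 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:34 vm3: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/10 10:54:47 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1595)
"""

STAMP_RE = re.compile(r"^\d{1,2}/\d{1,2} \d\d:\d\d:\d\d ")
CLAIM_RE = re.compile(r"can't find resource with ClaimId \((\S+)\)")


def unwrap(text):
    """Rejoin wrapped lines: a line without a leading timestamp is treated
    as a continuation of the previous log entry."""
    entries = []
    for line in text.splitlines():
        if STAMP_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def stale_claims(text):
    """Count 'can't find resource' messages per ClaimId."""
    counts = Counter()
    for entry in unwrap(text):
        m = CLAIM_RE.search(entry)
        if m:
            counts[m.group(1)] += 1
    return counts


counts = stale_claims(EXCERPT)
for claim_id, n in counts.most_common():
    print(n, claim_id)
```

Run over the full STARTLOG, a tally like this would show which claims the
submit host keeps referring to after the startd has already dropped them.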


STARTLOG.vm1 is huge, and I got a .old file after about 22 hr:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
.......
11/9 16:06:04 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/9 16:06:04 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/9 16:06:04 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:32:37 ******************************************************
11/10 10:32:37 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:37 ** D:\condor\bin\condor_starter.exe
11/10 10:32:37 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:32:37 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:32:37 ** PID = 2672
11/10 10:32:37 ** Log last touched 11/9 16:06:04
11/10 10:32:37 ******************************************************
11/10 10:32:37 Using config source: D:\condor\condor_config
11/10 10:32:37 Using local config sources:
11/10 10:32:37    D:\condor/condor_config.local
11/10 10:32:37 DaemonCore: Command Socket at <192.168.1.165:2507>
11/10 10:32:37 Setting resource limits not implemented!
11/10 10:32:37 Communicating with shadow <192.168.1.5:2348>
11/10 10:32:37 Submitting machine is "GRID05.RMS.local"
11/10 10:32:37 Error enabling account condor-reuse-vm1
11/10 10:32:37 Error setting password on account condor-reuse-vm1
11/10 10:32:37 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:32:37 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/10 10:32:37 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:32:37 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:32:52 ******************************************************
11/10 10:32:52 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:52 ** D:\condor\bin\condor_starter.exe
11/10 10:32:52 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:32:52 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:32:52 ** PID = 1584
11/10 10:32:52 ** Log last touched 11/10 10:32:37
11/10 10:32:52 ******************************************************
11/10 10:32:52 Using config source: D:\condor\condor_config
11/10 10:32:52 Using local config sources:
11/10 10:32:52    D:\condor/condor_config.local
11/10 10:32:52 DaemonCore: Command Socket at <192.168.1.165:2518>
11/10 10:32:52 Setting resource limits not implemented!
11/10 10:32:52 Communicating with shadow <192.168.1.5:2361>
11/10 10:32:52 Submitting machine is "GRID05.RMS.local"
11/10 10:32:52 Error enabling account condor-reuse-vm1
11/10 10:32:52 Error setting password on account condor-reuse-vm1
11/10 10:32:52 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:32:52 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/10 10:32:52 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:32:52 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:06 ******************************************************
11/10 10:33:06 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:06 ** D:\condor\bin\condor_starter.exe
11/10 10:33:06 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:06 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:06 ** PID = 2404
11/10 10:33:06 ** Log last touched 11/10 10:32:52
11/10 10:33:06 ******************************************************
11/10 10:33:06 Using config source: D:\condor\condor_config
11/10 10:33:06 Using local config sources:
11/10 10:33:06    D:\condor/condor_config.local
11/10 10:33:06 DaemonCore: Command Socket at <192.168.1.165:2522>
11/10 10:33:06 Setting resource limits not implemented!
11/10 10:33:06 Communicating with shadow <192.168.1.5:2373>
11/10 10:33:06 Submitting machine is "GRID05.RMS.local"
11/10 10:33:06 Error enabling account condor-reuse-vm1
11/10 10:33:06 Error setting password on account condor-reuse-vm1
11/10 10:33:06 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:06 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/10 10:33:06 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:06 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:21 ******************************************************
11/10 10:33:21 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:21 ** D:\condor\bin\condor_starter.exe
11/10 10:33:21 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:21 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:21 ** PID = 2164
11/10 10:33:21 ** Log last touched 11/10 10:33:06
11/10 10:33:21 ******************************************************
11/10 10:33:21 Using config source: D:\condor\condor_config
11/10 10:33:21 Using local config sources:
11/10 10:33:21    D:\condor/condor_config.local
11/10 10:33:21 DaemonCore: Command Socket at <192.168.1.165:2526>
11/10 10:33:21 Setting resource limits not implemented!
11/10 10:33:21 Communicating with shadow <192.168.1.5:2385>
11/10 10:33:21 Submitting machine is "GRID05.RMS.local"
11/10 10:33:21 Error enabling account condor-reuse-vm1
11/10 10:33:21 Error setting password on account condor-reuse-vm1
11/10 10:33:21 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:21 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/10 10:33:21 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:21 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:36 ******************************************************
11/10 10:33:36 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:36 ** D:\condor\bin\condor_starter.exe
11/10 10:33:36 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:36 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:36 ** PID = 2096
11/10 10:33:36 ** Log last touched 11/10 10:33:21
11/10 10:33:36 ******************************************************
11/10 10:33:36 Using config source: D:\condor\condor_config
11/10 10:33:36 Using local config sources:
11/10 10:33:36    D:\condor/condor_config.local
11/10 10:33:36 DaemonCore: Command Socket at <192.168.1.165:2530>
11/10 10:33:36 Setting resource limits not implemented!
11/10 10:33:36 Communicating with shadow <192.168.1.5:2397>
11/10 10:33:36 Submitting machine is "GRID05.RMS.local"
11/10 10:33:36 Error enabling account condor-reuse-vm1
11/10 10:33:36 Error setting password on account condor-reuse-vm1
11/10 10:33:36 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:36 ERROR "Failed to create a user nobody" at line 436 in file
..\src\condor_c++_util\uids.C
11/10 10:33:36 ERROR "LocalUserLog::logStarterError() called before init()"
at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:36 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
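
Every starter run in the excerpt above fails the same way. Status 1326 is
the Win32 error code ERROR_LOGON_FAILURE (the user name or password is
incorrect), which fits the "Error setting password" line just before it.
The following is a minimal Python sketch that counts starter start-ups and
LogonUser failures; the function name summarize is hypothetical, and the
lookup table holds only the one code seen in this log.

```python
import re

# Lines copied from the STARTLOG.vm1 excerpt (two of the starter runs).
EXCERPT = """\
11/10 10:32:37 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:37 Error enabling account condor-reuse-vm1
11/10 10:32:37 Error setting password on account condor-reuse-vm1
11/10 10:32:37 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:32:52 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:52 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
"""

# Only the code seen in this log; 1326 is ERROR_LOGON_FAILURE in Win32.
WIN32_ERRORS = {1326: "ERROR_LOGON_FAILURE"}

LOGON_RE = re.compile(r"LogonUser\((\S+?), .*\) failed with status (\d+)")


def summarize(text):
    """Return (number of starter start-ups, {(account, status): count})."""
    starts = 0
    failures = {}
    for line in text.splitlines():
        if "STARTING UP" in line:
            starts += 1
        m = LOGON_RE.search(line)
        if m:
            key = (m.group(1), int(m.group(2)))
            failures[key] = failures.get(key, 0) + 1
    return starts, failures


starts, failures = summarize(EXCERPT)
print("starter start-ups:", starts)
for (account, status), n in failures.items():
    print(account, status, WIN32_ERRORS.get(status, "unknown"), "x", n)
```

If the start-up count and the failure count match over the whole file,
then no starter on vm1 got as far as launching a job.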

The files STARTLOG.vm2, STARTLOG.vm3, and STARTLOG.vm4 are relatively small
and look like this...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
11/9 10:34:38 ******************************************************
11/9 10:34:38 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 10:34:38 ** D:\condor\bin\condor_starter.exe
11/9 10:34:38 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 10:34:38 ** $CondorPlatform: INTEL-WINNT50 $
11/9 10:34:38 ** PID = 1476
11/9 10:34:38 ** Log last touched time unavailable (No such file or
directory)
11/9 10:34:38 ******************************************************
11/9 10:34:38 Using config source: D:\condor\condor_config
11/9 10:34:38 Using local config sources:
11/9 10:34:38    D:\condor/condor_config.local
11/9 10:34:38 DaemonCore: Command Socket at <192.168.1.165:2669>
11/9 10:34:38 Setting resource limits not implemented!
11/9 10:34:38 Communicating with shadow <192.168.1.165:2664>
11/9 10:34:38 Submitting machine is "condor02.RMS.local"
11/9 10:34:38 File transfer completed successfully.
11/9 10:34:39 Starting a VANILLA universe job with ID: 2.0
11/9 10:34:39 IWD: D:\condor/execute\dir_1476
11/9 10:34:39 Output file: streaming from remote file std_output.log
11/9 10:34:39 Error file: D:\condor/execute\dir_1476\std_error.log
11/9 10:34:39 Renice expr "10" evaluated to 10
11/9 10:34:39 About to exec D:\condor\execute\dir_1476\condor_exec.exe
11/9 10:34:39 Create_Process succeeded, pid=2352
11/9 14:05:34 Got SIGQUIT.  Performing fast shutdown.
11/9 14:05:34 ShutdownFast all jobs.
11/9 14:05:34 Process exited, pid=2352, status=2
11/9 14:05:34 Last process exited, now Starter is exiting
11/9 14:05:34 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/9 14:05:36 ******************************************************
11/9 14:05:36 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 14:05:36 ** D:\condor\bin\condor_starter.exe
11/9 14:05:36 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 14:05:36 ** $CondorPlatform: INTEL-WINNT50 $
11/9 14:05:36 ** PID = 2008
11/9 14:05:36 ** Log last touched 11/9 14:05:34
11/9 14:05:36 ******************************************************
11/9 14:05:36 Using config source: D:\condor\condor_config
11/9 14:05:36 Using local config sources:
11/9 14:05:36    D:\condor/condor_config.local
11/9 14:05:36 DaemonCore: Command Socket at <192.168.1.165:2388>
11/9 14:05:36 Setting resource limits not implemented!
11/9 14:05:36 Communicating with shadow <192.168.1.165:2379>
11/9 14:05:36 Submitting machine is "condor02.RMS.local"
11/9 14:05:37 File transfer completed successfully.
11/9 14:05:38 Starting a VANILLA universe job with ID: 1.0
11/9 14:05:38 IWD: D:\condor/execute\dir_2008
11/9 14:05:38 Output file: streaming from remote file std_output.log
11/9 14:05:38 Error file: D:\condor/execute\dir_2008\std_error.log
11/9 14:05:38 Renice expr "10" evaluated to 10
11/9 14:05:38 About to exec D:\condor\execute\dir_2008\condor_exec.exe
ads_scenario.if1
11/9 14:05:38 Create_Process succeeded, pid=1256
11/9 14:11:10 Process exited, pid=1256, status=0
11/9 14:11:10 Got SIGQUIT.  Performing fast shutdown.
11/9 14:11:10 ShutdownFast all jobs.
11/9 14:11:10 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/9 15:34:33 ******************************************************
11/9 15:34:33 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 15:34:33 ** D:\condor\bin\condor_starter.exe
11/9 15:34:33 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 15:34:33 ** $CondorPlatform: INTEL-WINNT50 $
11/9 15:34:33 ** PID = 260
11/9 15:34:33 ** Log last touched 11/9 14:11:10
11/9 15:34:33 ******************************************************
11/9 15:34:33 Using config source: D:\condor\condor_config
11/9 15:34:33 Using local config sources:
11/9 15:34:33    D:\condor/condor_config.local
11/9 15:34:33 DaemonCore: Command Socket at <192.168.1.165:4716>
11/9 15:34:33 Setting resource limits not implemented!
11/9 15:34:33 Communicating with shadow <192.168.1.5:1378>
11/9 15:34:33 Submitting machine is "GRID05.RMS.local"
11/9 15:34:34 File transfer completed successfully.
11/9 15:34:35 Starting a VANILLA universe job with ID: 7.0
11/9 15:34:35 IWD: D:\condor/execute\dir_260
11/9 15:34:35 Output file: streaming from remote file std_output.log
11/9 15:34:35 Error file: D:\condor/execute\dir_260\std_error.log
11/9 15:34:35 Renice expr "10" evaluated to 10
11/9 15:34:35 About to exec D:\condor\execute\dir_260\condor_exec.exe
11/9 15:34:35 Create_Process succeeded, pid=696
11/9 16:09:03 Got SIGQUIT.  Performing fast shutdown.
11/9 16:09:03 ShutdownFast all jobs.
11/9 16:09:03 Process exited, pid=696, status=2
11/9 16:09:03 Last process exited, now Starter is exiting
11/9 16:09:03 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Bradford L Hazzard
Senior Software Engineer
Raytheon/RMS

Email: bhazzard@xxxxxxxxxxxx , bhazzard@xxxxxxx