Re: [Condor-users] Condor configuration for Multi-CPU machines



Hi there,

I am by no means an expert at interpreting your situation, but two
things stand out to me. First, your start logs seem to show that Condor
is unable to use the user "nobody" or the user "condor-reuse-vm1". See
if you can find a solution to this problem on WinXP in the archives of
the user list.

The second problem appears to be an inability to talk to the shadow
daemons or the collector on the master server/computer, but I'm not sure
of that.

I am unsure of both of these comments, so take them with a barrel of
salt....

Peter


Dr Peter Myerscough-Jackopson  -  Engineer
MULTIPLE ACCESS COMMUNICATIONS LIMITED
Delta House, The University of Southampton Science Park, Southampton,
SO16 7NS,
United Kingdom.
Tel: +44 (0)23 8076 7808 Fax: +44 (0)23 8076 0602
Web: http://www.macltd.com/  Email:
peter.myerscough-jackopson@xxxxxxxxxx

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Bradford Hazzard
Sent: 10 November 2006 22:37
To: Condor-Users Mail List
Cc: Amzie L Mcwhorter
Subject: [Condor-users] Condor configuration for Multi-CPU machines

Everyone,
I just read Peter's email on setting up a set of dual-core/dual-CPU
compute nodes running WinXP (x64).
I am in the middle of setting up a very similar grid configuration
(using Condor 6.8.2), and his configuration script is very enlightening
and demonstrates the flexibility of Condor.

My problem is submitting simple vanilla jobs to a pool containing 16
compute nodes, each with dual-CPU/dual-core Opterons running WinXP x64.
The executable is a multi-threaded app (24 threads, and the exe has a RAM
footprint of 90MB) that has run well on a Condor pool with a combination
of single/dual CPU (all single core) compute nodes.  The problem appears
to be Condor but I cannot be sure...

When several (6) jobs (job id 22-27) were submitted to the compute node,
only one job completed in 24hr and the other 5 jobs have been
continuously idle.
These jobs should each take approx 3hr to run on a single
processor.

Note I have tried to use the suggestions from the FAQ article "Why does
my Linux job have an enormous ImageSize and refuse to run anymore?" with
no change in the job behavior...

Are there any special considerations for XP x64 and Condor?
Sorry for all of the log files here, but I am trying to demonstrate that
a single node is working before turning on the other 15 nodes, and I am
unsure what log files people need to see...

The compute node is headless however we do login via RemoteDesktop.

I have included the following:
1) system info
2) output from condor_q -global -analyze run on the submit host.
3) condor logs from the 6 jobs submitted.
4) condor log/STARTLOG
5) condor log/STARTLOG.vm1 (huge), STARTLOG.vm2(small),
STARTLOG.vm3(small) , STARTLOG.vm4 (small)

A portion of my system info file is here:
%%%%%%%%%%%%%%%%%%%%%
OS Name                     Microsoft(R) Windows(R) XP Professional x64 Edition
Version                     5.2.3790 Service Pack 1 Build 3790
Other OS Description        Not Available
OS Manufacturer             Microsoft Corporation
System Name                 CONDOR02
System Manufacturer         RIOWORKS
System Model                C1000
System Type                 x64-based PC
Processor                   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2405 Mhz
Processor                   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2406 Mhz
Processor                   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2406 Mhz
Processor                   AMD64 Family 15 Model 33 Stepping 2 AuthenticAMD ~2405 Mhz
BIOS Version/Date           Phoenix Technologies Ltd. V1.13, 7/13/2006
SMBIOS Version              2.34
Windows Directory           C:\WINDOWS
System Directory            C:\WINDOWS\system32
Boot Device                 \Device\HarddiskVolume1
Locale                      United States
Hardware Abstraction Layer  Version = "5.2.3790.1830 (srv03_sp1_rtm.050324-1447)"
User Name                   Not Available
Time Zone                   Pacific Standard Time
Total Physical Memory       4,094.68 MB
Available Physical Memory   3.49 GB
Total Virtual Memory        5.75 GB
Available Virtual Memory    5.48 GB
Page File Space             2.00 GB
Page File                   D:\pagefile.sys
%%%%%%%%%%%%%%%%%%%%%

The dump from "condor_q -global -analyze" (run from the submit host)
looked like this after running for 24hr.
Note that job 23.0 is missing since it completed 15hr ago.
Note there is NOTHING else running on the compute node (condor02
192.168.1.165),
and there are no entries in d:/condor/execute for these jobs...
The job submit host is GRID05 (192.168.1.5)
%%%%%%%%%%%%%%%%%%%%%
$ condor_q -global -analyze


-- Schedd: GRID05.RMS.local : <192.168.1.5:4897>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
022.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
024.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
025.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
---
026.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 11:27:09 2006
---
027.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
%%%%%%%%%%%%%%%%%%%%%
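
For anyone who would rather tally these summaries than read them by eye, here is a minimal Python sketch. It assumes only the text layout shown in the dump above (a "Run analysis summary" header per job followed by indented count lines); the function name parse_analysis and the dictionary layout are made up for illustration and are not part of Condor.

```python
import re

# Sample copied from the condor_q -global -analyze dump above (job 22.0).
SAMPLE = """\
022.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 10 12:00:23 2006
"""

# "022.000:  Run analysis summary.  Of 4 machines,"
HEADER = re.compile(r"^(\d+\.\d+):\s+Run analysis summary\.\s+Of (\d+) machines,")
# "      2 match but reject the job for unknown reasons"
COUNT = re.compile(r"^\s+(\d+) (.+)$")


def parse_analysis(text):
    """Return {job_id: {"total": machines, "reasons": {reason: count}}}."""
    jobs = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"total": int(header.group(2)), "reasons": {}}
            jobs[header.group(1)] = current
            continue
        count = COUNT.match(line)
        if count and current is not None:
            current["reasons"][count.group(2).strip()] = int(count.group(1))
    return jobs


summary = parse_analysis(SAMPLE)
for job_id, info in summary.items():
    # Show only the reasons that actually apply to at least one machine.
    active = {reason: n for reason, n in info["reasons"].items() if n}
    print(job_id, "of", info["total"], "machines:", active)
```

For the dump above this would single out the two non-zero lines for every idle job: two slots serving a user with better priority and two rejecting the job for unknown reasons.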

Note Job 23 completed (condor.log)
%%%%%%%%%%%%%%%%%%%%%
000 (023.000.000) 11/09 18:10:39 Job submitted from host: <192.168.1.5:4897>
...
001 (023.000.000) 11/09 18:13:30 Job executing on host: <192.168.1.165:1617>
...
006 (023.000.000) 11/09 18:13:38 Image size of job updated: 91684
...
006 (023.000.000) 11/09 18:33:38 Image size of job updated: 91904
...
005 (023.000.000) 11/09 20:08:10 Job terminated.
      (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
      18274022  -  Run Bytes Sent By Job
      41873792  -  Run Bytes Received By Job
      18274022  -  Total Bytes Sent By Job
      41873792  -  Total Bytes Received By Job
%%%%%%%%%%%%%%%%%%%%%


Note Job 21,22,24,25   look like this (condor.log)
%%%%%%%%%%%%%%%%%%%%%
000 (024.000.000) 11/09 18:10:51 Job submitted from host: <192.168.1.5:4897>
%%%%%%%%%%%%%%%%%%%%%

Note Job 27  looks like this (condor.log)
%%%%%%%%%%%%%%%%%%%%%
000 (027.000.000) 11/10 11:26:34 Job submitted from host: <192.168.1.5:4897>
...
007 (027.000.000) 11/10 11:32:37 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:32:52 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:06 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:21 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:33:36 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
%%%%%%%%%%%%%%%%%%%%%
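
To see at a glance how often each job hit a given event, a short Python sketch along these lines may help. It assumes only the event-line layout visible in the condor.log excerpts above (a three-digit event code, then the job id in parentheses, then a timestamp); count_events is a hypothetical helper name, not a Condor tool.

```python
import re
from collections import Counter

# Event header line, e.g. "007 (027.000.000) 11/10 11:32:37 Shadow exception!"
EVENT = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d+/\d+ \d+:\d+:\d+) (.*)$")


def count_events(log_text):
    """Count user-log events, keyed by ("cluster.proc", event_code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if match:
            code = match.group(1)
            job = f"{int(match.group(2))}.{int(match.group(3))}"
            counts[(job, code)] += 1
    return counts


# Sample copied from the job 27 excerpt above (first two exceptions only).
SAMPLE = """\
000 (027.000.000) 11/10 11:26:34 Job submitted from host: <192.168.1.5:4897>
...
007 (027.000.000) 11/10 11:32:37 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
...
007 (027.000.000) 11/10 11:32:52 Shadow exception!
      Can no longer talk to condor_starter <192.168.1.165:1617>
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
"""

counts = count_events(SAMPLE)
for (job, code), n in sorted(counts.items()):
    print(f"job {job}: event {code} seen {n} time(s)")
```

In the excerpts above, code 000 marks a submission and 007 a shadow exception, so a job whose count for 007 keeps growing while it never logs 001 (executing) is failing at start-up rather than waiting for a match.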

The STARTLOG on the controller/compute node is huge and ends like
this...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
...
11/10 10:54:10 DaemonCore: Command received via UDP from host <192.168.1.5:2708>
11/10 10:54:10 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:54:10 Warning: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 DaemonCore: Command received via TCP from host <192.168.1.5:2685>
11/10 10:54:22 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/10 10:54:22 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 condor_write(): Socket closed when trying to write 13 bytes to <192.168.1.5:2685>, fd is 336
11/10 10:54:22 Buf::write(): condor_write() failed
11/10 10:54:22 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2694>, fd is 336
11/10 10:54:22 Buf::write(): condor_write() failed
11/10 10:54:22 SECMAN: Error sending response classad!
11/10 10:54:34 DaemonCore: Command received via UDP from host <192.168.1.5:2714>
11/10 10:54:34 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:54:34 vm3: State change: received RELEASE_CLAIM command
11/10 10:54:34 vm3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
11/10 10:54:34 vm3: State change: No preempting claim, returning to owner
11/10 10:54:34 vm3: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/10 10:54:34 vm3: State change: IS_OWNER is false
11/10 10:54:34 vm3: Changing state: Owner -> Unclaimed
11/10 10:54:47 DaemonCore: Command received via TCP from host <192.168.1.5:2695>
11/10 10:54:47 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:54:47 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1595)
11/10 10:54:59 DaemonCore: Command received via UDP from host <192.168.1.5:2715>
11/10 10:54:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:54:59 Warning: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1590)
11/10 10:54:59 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2700>, fd is 388
11/10 10:54:59 Buf::write(): condor_write() failed
11/10 10:54:59 SECMAN: Error sending response classad!
11/10 10:55:11 DaemonCore: Command received via TCP from host <192.168.1.5:2701>
11/10 10:55:11 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/10 10:55:11 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1590)
11/10 10:55:11 condor_write(): Socket closed when trying to write 13 bytes to <192.168.1.5:2701>, fd is 388
11/10 10:55:11 Buf::write(): condor_write() failed
11/10 10:55:23 DaemonCore: Command received via TCP from host <192.168.1.5:2704>
11/10 10:55:23 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:55:23 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1596)
11/10 10:55:23 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2709>, fd is 388
11/10 10:55:23 Buf::write(): condor_write() failed
11/10 10:55:23 SECMAN: Error sending response classad!
11/10 10:55:35 DaemonCore: Command received via TCP from host <192.168.1.5:2711>
11/10 10:55:35 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:55:35 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1590)
11/10 10:55:47 DaemonCore: Command received via TCP from host <192.168.1.5:2719>
11/10 10:55:47 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/10 10:55:47 vm1: Request accepted.
11/10 10:55:59 vm1: Remote owner is bhaz@xxxxxxxxx
11/10 10:55:59 vm1: State change: claiming protocol successful
11/10 10:55:59 vm1: Changing state: Unclaimed -> Claimed
11/10 10:55:59 DaemonCore: Command received via UDP from host <192.168.1.165:2760>
11/10 10:55:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/10 10:55:59 vm1: match_info called
11/10 10:56:11 DaemonCore: Command received via TCP from host <192.168.1.5:2720>
11/10 10:56:11 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/10 10:56:11 vm2: Request accepted.
11/10 10:56:23 vm2: Remote owner is bhaz@xxxxxxxxx
11/10 10:56:23 vm2: State change: claiming protocol successful
11/10 10:56:23 vm2: Changing state: Unclaimed -> Claimed
11/10 10:56:23 DaemonCore: Command received via UDP from host <192.168.1.165:2761>
11/10 10:56:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/10 10:56:23 vm2: match_info called
11/10 10:56:35 DaemonCore: Command received via TCP from host <192.168.1.5:2721>
11/10 10:56:35 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/10 10:56:35 vm4: Request accepted.
11/10 10:56:47 vm4: Remote owner is bhaz@xxxxxxxxx
11/10 10:56:47 vm4: State change: claiming protocol successful
11/10 10:56:47 vm4: Changing state: Unclaimed -> Claimed
11/10 10:56:47 DaemonCore: Command received via UDP from host <192.168.1.165:2762>
11/10 10:56:47 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/10 10:56:47 vm2: match_info called
11/10 10:56:47 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2730>, fd is 184
11/10 10:56:47 Buf::write(): condor_write() failed
11/10 10:56:47 SECMAN: Error sending response classad!
11/10 10:56:47 DaemonCore: Command received via UDP from host <192.168.1.165:2763>
11/10 10:56:47 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/10 10:56:47 vm4: match_info called
11/10 10:56:59 DaemonCore: Command received via TCP from host <192.168.1.5:2733>
11/10 10:56:59 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/10 10:56:59 vm1: Got activate_claim request from shadow (<192.168.1.5:2733>)
11/10 10:56:59 condor_write(): Socket closed when trying to write 13 bytes to <192.168.1.5:2733>, fd is 184
11/10 10:56:59 Buf::write(): condor_write() failed
11/10 10:56:59 vm1: Can't send eom to shadow.
11/10 10:56:59 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2742>, fd is 240
11/10 10:56:59 Buf::write(): condor_write() failed
11/10 10:56:59 SECMAN: Error sending response classad!
11/10 10:57:12 DaemonCore: Command received via UDP from host <192.168.1.5:2757>
11/10 10:57:12 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:57:12 vm1: State change: received RELEASE_CLAIM command
11/10 10:57:12 vm1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
11/10 10:57:12 vm1: State change: No preempting claim, returning to owner
11/10 10:57:12 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/10 10:57:12 vm1: State change: IS_OWNER is false
11/10 10:57:12 vm1: Changing state: Owner -> Unclaimed
11/10 10:57:12 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2743>, fd is 388
11/10 10:57:12 Buf::write(): condor_write() failed
11/10 10:57:12 SECMAN: Error sending response classad!
11/10 10:57:24 DaemonCore: Command received via UDP from host <192.168.1.5:2758>
11/10 10:57:24 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:57:24 Warning: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1601)
11/10 10:57:36 DaemonCore: Command received via TCP from host <192.168.1.5:2744>
11/10 10:57:36 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/10 10:57:36 vm2: Got activate_claim request from shadow (<192.168.1.5:2744>)
11/10 10:57:36 condor_write(): Socket closed when trying to write 13 bytes to <192.168.1.5:2744>, fd is 240
11/10 10:57:36 Buf::write(): condor_write() failed
11/10 10:57:36 vm2: Can't send eom to shadow.
11/10 10:57:48 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2753>, fd is 240
11/10 10:57:48 Buf::write(): condor_write() failed
11/10 10:57:48 SECMAN: Error sending response classad!
11/10 10:58:12 DaemonCore: Command received via TCP from host <192.168.1.5:2754>
11/10 10:58:12 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:58:12 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1601)
11/10 10:58:24 DaemonCore: Command received via UDP from host <192.168.1.5:2766>
11/10 10:58:24 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:58:24 vm2: State change: received RELEASE_CLAIM command
11/10 10:58:24 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
11/10 10:58:24 vm2: State change: No preempting claim, returning to owner
11/10 10:58:24 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/10 10:58:24 vm2: State change: IS_OWNER is false
11/10 10:58:24 vm2: Changing state: Owner -> Unclaimed
11/10 10:58:24 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2759>, fd is 336
11/10 10:58:24 Buf::write(): condor_write() failed
11/10 10:58:24 SECMAN: Error sending response classad!
11/10 10:58:36 DaemonCore: Command received via UDP from host <192.168.1.5:2767>
11/10 10:58:36 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:58:36 Warning: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1602)
11/10 10:58:48 DaemonCore: Command received via TCP from host <192.168.1.5:2760>
11/10 10:58:48 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/10 10:58:48 vm4: Got activate_claim request from shadow (<192.168.1.5:2760>)
11/10 10:58:48 condor_write(): Socket closed when trying to write 13 bytes to <192.168.1.5:2760>, fd is 240
11/10 10:58:48 Buf::write(): condor_write() failed
11/10 10:58:48 vm4: Can't send eom to shadow.
11/10 10:59:00 DaemonCore: Command received via TCP from host <192.168.1.5:2763>
11/10 10:59:00 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/10 10:59:00 Error: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1602)
11/10 10:59:13 DaemonCore: Command received via UDP from host <192.168.1.5:2772>
11/10 10:59:13 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:59:13 vm4: State change: received RELEASE_CLAIM command
11/10 10:59:13 vm4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
11/10 10:59:13 vm4: State change: No preempting claim, returning to owner
11/10 10:59:13 vm4: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/10 10:59:13 vm4: State change: IS_OWNER is false
11/10 10:59:13 vm4: Changing state: Owner -> Unclaimed
11/10 10:59:13 condor_write(): Socket closed when trying to write 287 bytes to <192.168.1.5:2768>, fd is 336
11/10 10:59:13 Buf::write(): condor_write() failed
11/10 10:59:13 SECMAN: Error sending response classad!
11/10 10:59:25 DaemonCore: Command received via UDP from host <192.168.1.5:2773>
11/10 10:59:25 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/10 10:59:25 Warning: can't find resource with ClaimId (<192.168.1.165:1617>#1163119066#1603)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
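
The recurring "can't find resource with ClaimId" messages in the STARTLOG above can be tallied per claim with a small Python sketch like the one below. It relies only on the message text as it appears in the log; stale_claims is a hypothetical helper name, and whitespace is normalised first so the pattern also matches logs whose lines were re-wrapped in transit.

```python
import re
from collections import Counter

# Excerpt copied from the STARTLOG above, wrapped the way a mail client
# might wrap it, to show that the wrapping does not matter.
SAMPLE = """\
11/10 10:54:10 Warning: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:22 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1596)
11/10 10:54:47 Error: can't find resource with ClaimId
(<192.168.1.165:1617>#1163119066#1595)
"""

PATTERN = re.compile(
    r"(Warning|Error): can't find resource with ClaimId \((<[^>]+>#\d+#\d+)\)"
)


def stale_claims(log_text):
    """Count "can't find resource" messages per ClaimId."""
    flat = " ".join(log_text.split())  # collapse line breaks and extra spaces
    return Counter(claim for _level, claim in PATTERN.findall(flat))


counts = stale_claims(SAMPLE)
for claim, n in counts.most_common():
    print(n, claim)
```

A claim id that shows up several times suggests the submit host keeps sending commands for a claim the startd no longer knows about, which would fit the picture of starters dying right after a claim is activated.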


The STARTLOG.vm1 is huge... and I got a .old file after about 22hr...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
.......
11/9 16:06:04 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/9 16:06:04 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/9 16:06:04 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:32:37 ******************************************************
11/10 10:32:37 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:37 ** D:\condor\bin\condor_starter.exe
11/10 10:32:37 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:32:37 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:32:37 ** PID = 2672
11/10 10:32:37 ** Log last touched 11/9 16:06:04
11/10 10:32:37 ******************************************************
11/10 10:32:37 Using config source: D:\condor\condor_config
11/10 10:32:37 Using local config sources:
11/10 10:32:37    D:\condor/condor_config.local
11/10 10:32:37 DaemonCore: Command Socket at <192.168.1.165:2507>
11/10 10:32:37 Setting resource limits not implemented!
11/10 10:32:37 Communicating with shadow <192.168.1.5:2348>
11/10 10:32:37 Submitting machine is "GRID05.RMS.local"
11/10 10:32:37 Error enabling account condor-reuse-vm1
11/10 10:32:37 Error setting password on account condor-reuse-vm1
11/10 10:32:37 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:32:37 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/10 10:32:37 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:32:37 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:32:52 ******************************************************
11/10 10:32:52 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:32:52 ** D:\condor\bin\condor_starter.exe
11/10 10:32:52 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:32:52 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:32:52 ** PID = 1584
11/10 10:32:52 ** Log last touched 11/10 10:32:37
11/10 10:32:52 ******************************************************
11/10 10:32:52 Using config source: D:\condor\condor_config
11/10 10:32:52 Using local config sources:
11/10 10:32:52    D:\condor/condor_config.local
11/10 10:32:52 DaemonCore: Command Socket at <192.168.1.165:2518>
11/10 10:32:52 Setting resource limits not implemented!
11/10 10:32:52 Communicating with shadow <192.168.1.5:2361>
11/10 10:32:52 Submitting machine is "GRID05.RMS.local"
11/10 10:32:52 Error enabling account condor-reuse-vm1
11/10 10:32:52 Error setting password on account condor-reuse-vm1
11/10 10:32:52 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:32:52 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/10 10:32:52 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:32:52 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:06 ******************************************************
11/10 10:33:06 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:06 ** D:\condor\bin\condor_starter.exe
11/10 10:33:06 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:06 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:06 ** PID = 2404
11/10 10:33:06 ** Log last touched 11/10 10:32:52
11/10 10:33:06 ******************************************************
11/10 10:33:06 Using config source: D:\condor\condor_config
11/10 10:33:06 Using local config sources:
11/10 10:33:06    D:\condor/condor_config.local
11/10 10:33:06 DaemonCore: Command Socket at <192.168.1.165:2522>
11/10 10:33:06 Setting resource limits not implemented!
11/10 10:33:06 Communicating with shadow <192.168.1.5:2373>
11/10 10:33:06 Submitting machine is "GRID05.RMS.local"
11/10 10:33:06 Error enabling account condor-reuse-vm1
11/10 10:33:06 Error setting password on account condor-reuse-vm1
11/10 10:33:06 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:06 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/10 10:33:06 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:06 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:21 ******************************************************
11/10 10:33:21 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:21 ** D:\condor\bin\condor_starter.exe
11/10 10:33:21 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:21 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:21 ** PID = 2164
11/10 10:33:21 ** Log last touched 11/10 10:33:06
11/10 10:33:21 ******************************************************
11/10 10:33:21 Using config source: D:\condor\condor_config
11/10 10:33:21 Using local config sources:
11/10 10:33:21    D:\condor/condor_config.local
11/10 10:33:21 DaemonCore: Command Socket at <192.168.1.165:2526>
11/10 10:33:21 Setting resource limits not implemented!
11/10 10:33:21 Communicating with shadow <192.168.1.5:2385>
11/10 10:33:21 Submitting machine is "GRID05.RMS.local"
11/10 10:33:21 Error enabling account condor-reuse-vm1
11/10 10:33:21 Error setting password on account condor-reuse-vm1
11/10 10:33:21 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:21 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/10 10:33:21 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:21 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
11/10 10:33:36 ******************************************************
11/10 10:33:36 ** condor_starter (CONDOR_STARTER) STARTING UP
11/10 10:33:36 ** D:\condor\bin\condor_starter.exe
11/10 10:33:36 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/10 10:33:36 ** $CondorPlatform: INTEL-WINNT50 $
11/10 10:33:36 ** PID = 2096
11/10 10:33:36 ** Log last touched 11/10 10:33:21
11/10 10:33:36 ******************************************************
11/10 10:33:36 Using config source: D:\condor\condor_config
11/10 10:33:36 Using local config sources:
11/10 10:33:36    D:\condor/condor_config.local
11/10 10:33:36 DaemonCore: Command Socket at <192.168.1.165:2530>
11/10 10:33:36 Setting resource limits not implemented!
11/10 10:33:36 Communicating with shadow <192.168.1.5:2397>
11/10 10:33:36 Submitting machine is "GRID05.RMS.local"
11/10 10:33:36 Error enabling account condor-reuse-vm1
11/10 10:33:36 Error setting password on account condor-reuse-vm1
11/10 10:33:36 LogonUser(condor-reuse-vm1, ... ) failed with status 1326
11/10 10:33:36 ERROR "Failed to create a user nobody" at line 436 in file ..\src\condor_c++_util\uids.C
11/10 10:33:36 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C
11/10 10:33:36 Error disabling account condor-reuse-vm1 (INVALID PARAMETER)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
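
The "LogonUser(...) failed with status 1326" lines in the STARTLOG.vm1 excerpt above carry a Windows system error code, and 1326 is ERROR_LOGON_FAILURE (the user name or password is incorrect). Below is a minimal Python sketch that pulls the account and status out of such a log and looks the code up. The lookup table is deliberately tiny and would need extending for other codes, logon_failures is a hypothetical helper name, and the sketch says nothing about why the logon is being refused on this machine.

```python
import re

# Windows system error codes seen in the log; extend as needed.
WIN32_ERRORS = {
    1326: "ERROR_LOGON_FAILURE (the user name or password is incorrect)",
}

# Matches e.g. "LogonUser(condor-reuse-vm1, ... ) failed with status 1326"
PATTERN = re.compile(r"LogonUser\(([^,]+), \.\.\. \) failed with status (\d+)")


def logon_failures(log_text):
    """Return the distinct (account, status) pairs found in a starter log."""
    flat = " ".join(log_text.split())  # tolerate re-wrapped log lines
    return sorted({(acct, int(code)) for acct, code in PATTERN.findall(flat)})


# Two of the failures from the excerpt above, wrapped as a mail client might.
SAMPLE = """\
11/10 10:32:37 Error setting password on account condor-reuse-vm1 11/10 10:32:37
LogonUser(condor-reuse-vm1, ... ) failed with status 1326 11/10 10:32:37
11/10 10:32:52 Error setting password on account condor-reuse-vm1 11/10 10:32:52
LogonUser(condor-reuse-vm1, ... ) failed with status 1326 11/10 10:32:52
"""

failures = logon_failures(SAMPLE)
for account, code in failures:
    print(account, code, WIN32_ERRORS.get(code, "unknown code"))
```

Since the log shows "Error setting password" immediately before each failed logon, the local password or account policy on the compute node might be a place to look, but that is only a guess from the ordering of the messages.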

The files  STARTLOG.vm2, vm3, and VM4 are relatively small and look like
this...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
11/9 10:34:38 ******************************************************
11/9 10:34:38 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 10:34:38 ** D:\condor\bin\condor_starter.exe
11/9 10:34:38 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 10:34:38 ** $CondorPlatform: INTEL-WINNT50 $
11/9 10:34:38 ** PID = 1476
11/9 10:34:38 ** Log last touched time unavailable (No such file or directory)
11/9 10:34:38 ******************************************************
11/9 10:34:38 Using config source: D:\condor\condor_config
11/9 10:34:38 Using local config sources:
11/9 10:34:38    D:\condor/condor_config.local
11/9 10:34:38 DaemonCore: Command Socket at <192.168.1.165:2669>
11/9 10:34:38 Setting resource limits not implemented!
11/9 10:34:38 Communicating with shadow <192.168.1.165:2664>
11/9 10:34:38 Submitting machine is "condor02.RMS.local"
11/9 10:34:38 File transfer completed successfully.
11/9 10:34:39 Starting a VANILLA universe job with ID: 2.0
11/9 10:34:39 IWD: D:\condor/execute\dir_1476
11/9 10:34:39 Output file: streaming from remote file std_output.log
11/9 10:34:39 Error file: D:\condor/execute\dir_1476\std_error.log
11/9 10:34:39 Renice expr "10" evaluated to 10
11/9 10:34:39 About to exec D:\condor\execute\dir_1476\condor_exec.exe
11/9 10:34:39 Create_Process succeeded, pid=2352
11/9 14:05:34 Got SIGQUIT.  Performing fast shutdown.
11/9 14:05:34 ShutdownFast all jobs.
11/9 14:05:34 Process exited, pid=2352, status=2
11/9 14:05:34 Last process exited, now Starter is exiting
11/9 14:05:34 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/9 14:05:36 ******************************************************
11/9 14:05:36 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 14:05:36 ** D:\condor\bin\condor_starter.exe
11/9 14:05:36 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 14:05:36 ** $CondorPlatform: INTEL-WINNT50 $
11/9 14:05:36 ** PID = 2008
11/9 14:05:36 ** Log last touched 11/9 14:05:34
11/9 14:05:36 ******************************************************
11/9 14:05:36 Using config source: D:\condor\condor_config
11/9 14:05:36 Using local config sources:
11/9 14:05:36    D:\condor/condor_config.local
11/9 14:05:36 DaemonCore: Command Socket at <192.168.1.165:2388>
11/9 14:05:36 Setting resource limits not implemented!
11/9 14:05:36 Communicating with shadow <192.168.1.165:2379>
11/9 14:05:36 Submitting machine is "condor02.RMS.local"
11/9 14:05:37 File transfer completed successfully.
11/9 14:05:38 Starting a VANILLA universe job with ID: 1.0
11/9 14:05:38 IWD: D:\condor/execute\dir_2008
11/9 14:05:38 Output file: streaming from remote file std_output.log
11/9 14:05:38 Error file: D:\condor/execute\dir_2008\std_error.log
11/9 14:05:38 Renice expr "10" evaluated to 10
11/9 14:05:38 About to exec D:\condor\execute\dir_2008\condor_exec.exe ads_scenario.if1
11/9 14:05:38 Create_Process succeeded, pid=1256
11/9 14:11:10 Process exited, pid=1256, status=0
11/9 14:11:10 Got SIGQUIT.  Performing fast shutdown.
11/9 14:11:10 ShutdownFast all jobs.
11/9 14:11:10 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/9 15:34:33 ******************************************************
11/9 15:34:33 ** condor_starter (CONDOR_STARTER) STARTING UP
11/9 15:34:33 ** D:\condor\bin\condor_starter.exe
11/9 15:34:33 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/9 15:34:33 ** $CondorPlatform: INTEL-WINNT50 $
11/9 15:34:33 ** PID = 260
11/9 15:34:33 ** Log last touched 11/9 14:11:10
11/9 15:34:33 ******************************************************
11/9 15:34:33 Using config source: D:\condor\condor_config
11/9 15:34:33 Using local config sources:
11/9 15:34:33    D:\condor/condor_config.local
11/9 15:34:33 DaemonCore: Command Socket at <192.168.1.165:4716>
11/9 15:34:33 Setting resource limits not implemented!
11/9 15:34:33 Communicating with shadow <192.168.1.5:1378>
11/9 15:34:33 Submitting machine is "GRID05.RMS.local"
11/9 15:34:34 File transfer completed successfully.
11/9 15:34:35 Starting a VANILLA universe job with ID: 7.0
11/9 15:34:35 IWD: D:\condor/execute\dir_260
11/9 15:34:35 Output file: streaming from remote file std_output.log
11/9 15:34:35 Error file: D:\condor/execute\dir_260\std_error.log
11/9 15:34:35 Renice expr "10" evaluated to 10
11/9 15:34:35 About to exec D:\condor\execute\dir_260\condor_exec.exe
11/9 15:34:35 Create_Process succeeded, pid=696
11/9 16:09:03 Got SIGQUIT.  Performing fast shutdown.
11/9 16:09:03 ShutdownFast all jobs.
11/9 16:09:03 Process exited, pid=696, status=2
11/9 16:09:03 Last process exited, now Starter is exiting
11/9 16:09:03 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Bradford L Hazzard
Senior Software Engineer
Raytheon/RMS

Email: bhazzard@xxxxxxxxxxxx , bhazzard@xxxxxxx


