
[condor-users] Condor falling over overnight



Hi,

We are having some problems, probably very basic ones, getting Condor
running at our site. This is the first time I have tried to set up
Condor. The OS is Windows XP Professional and the Condor version is
6.6.0.

Short jobs that run during the day have worked, but longer overnight
jobs are failing. The following sequence appears regularly in the
StarterLog on each machine in the pool: the starter is shut down about
15 minutes after it creates the job process, and I don't know why.

1/20 23:28:42 File transfer completed successfully.
1/20 23:28:43 Starting a VANILLA universe job with ID: 4.1
1/20 23:28:43 IWD: C:\Condor/execute\dir_3892
1/20 23:28:43 Output file: C:\Condor/execute\dir_3892\test.out
1/20 23:28:43 Renice expr "10" evaluated to 10
1/20 23:28:43 About to exec C:\WINDOWS\System32\cmd.exe /Q /C condor_exec.bat 1
1/20 23:28:43 Create_Process succeeded, pid=1304
1/20 23:43:46 Got SIGQUIT.  Performing fast shutdown.
1/20 23:43:46 ShutdownFast all jobs.
1/20 23:44:42 Got SIGTERM. Performing graceful shutdown.
1/20 23:44:42 ShutdownGraceful all jobs.
1/20 23:44:46 Our Parent process (pid 1780) exited; shutting down
1/20 23:44:46 Process exited, pid=1304, status=0
1/20 23:44:46 condor_write(): send() returned -1, timeout=300, errno=10054.  Assuming failure.
1/20 23:44:46 Buf::write(): condor_write() failed
1/20 23:44:46 ERROR "Assertion ERROR on (result)" at line 266 in file ..\src\condor_starter.V6.1\NTsenders.C
1/20 23:44:46 ShutdownFast all jobs.
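
To measure the gap rather than eyeball it, something like the following
could be run over a StarterLog. This is only a sketch: the function names
are made up for illustration, the log carries no year so 2004 is assumed,
and any entry whose message begins with "Got SIG" is treated as a shutdown
signal.

```python
import re
from datetime import datetime

# A log entry starts with "M/D HH:MM:SS "; anything else is a wrapped continuation.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_stamp(stamp, year=2004):
    # The log has no year, so one is supplied (2004 is an assumption).
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def seconds_until_shutdown(lines, year=2004):
    """Seconds from each 'Create_Process succeeded' to the next 'Got SIG...' entry."""
    gaps = []
    started = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # wrapped continuation line, no timestamp of its own
        when = parse_stamp(match.group(1), year)
        message = match.group(2)
        if "Create_Process succeeded" in message:
            started = when
        elif started is not None and message.startswith("Got SIG"):
            gaps.append((when - started).total_seconds())
            started = None  # only the first signal after a start is counted
    return gaps


SAMPLE = [
    "1/20 23:28:43 Starting a VANILLA universe job with ID: 4.1",
    "1/20 23:28:43 Create_Process succeeded, pid=1304",
    "1/20 23:43:46 Got SIGQUIT.  Performing fast shutdown.",
    "1/20 23:44:42 Got SIGTERM. Performing graceful shutdown.",
]

if __name__ == "__main__":
    print(seconds_until_shutdown(SAMPLE))
```

For the excerpt above this gives a single gap of 903 seconds, i.e. 15
minutes and 3 seconds between Create_Process and the SIGQUIT. Runs that
cross a year boundary are not handled.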

The following also appears regularly in the ShadowLog:

1/29 03:27:24 ******************************************************
1/29 03:27:24 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/29 03:27:24 ** $CondorVersion: 6.6.0 Nov 24 2003 $
1/29 03:27:24 ** $CondorPlatform: INTEL-WINNT40 $
1/29 03:27:24 ** PID = 3928
1/29 03:27:24 ******************************************************
1/29 03:27:24 Using config file: C:\Condor\condor_config
1/29 03:27:24 Using local config files: C:\Condor/condor_config.local
1/29 03:27:24 DaemonCore: Command Socket at <192.168.0.74:3222>
1/29 03:27:25 Initializing a VANILLA shadow
1/29 03:27:25 (5.1) (3928): Request to run on <192.168.0.74:1033> was ACCEPTED
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:26 ******************************************************
1/29 03:27:26 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/29 03:27:26 ** $CondorVersion: 6.6.0 Nov 24 2003 $
1/29 03:27:26 ** $CondorPlatform: INTEL-WINNT40 $
1/29 03:27:26 ** PID = 3172
1/29 03:27:26 ******************************************************
1/29 03:27:26 Using config file: C:\Condor\condor_config
1/29 03:27:26 Using local config files: C:\Condor/condor_config.local
1/29 03:27:26 DaemonCore: Command Socket at <192.168.0.74:3239>
1/29 03:27:27 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:27 Initializing a VANILLA shadow
1/29 03:27:27 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): Request to run on <192.168.0.36:2603> was ACCEPTED
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:31 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
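
To see which job IDs the failed account lookups belong to, the entries can
be tallied with a short script. Again this is a sketch with hypothetical
names. It assumes each entry begins with a timestamp and that any line
without one is a wrapped continuation of the previous entry. For reference,
Windows error 1722 is RPC_S_SERVER_UNAVAILABLE ("The RPC server is
unavailable"), and the Winsock errno 10054 in the StarterLog is
WSAECONNRESET.

```python
import re
from collections import Counter

# An entry starts with "M/D HH:MM:SS "; other lines continue the previous entry.
STAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")

# Captures the job ID, e.g. "5.1", and the numeric error code.
LOOKUP_FAILED = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): perm::init: Lookup Account Name \S+ "
    r"failed \(err=(\d+)\)"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines back onto their log entry."""
    entries = []
    for line in lines:
        line = line.rstrip()
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def count_lookup_failures(lines):
    """Count perm::init lookup failures per (job ID, error code)."""
    counts = Counter()
    for entry in unwrap(lines):
        match = LOOKUP_FAILED.search(entry)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


SAMPLE = [
    "1/29 03:27:25 (5.1) (3928): Request to run on <192.168.0.74:1033> was",
    "ACCEPTED",
    "1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle",
    "failed (err=1722), using Everyone",
    "1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle",
    "failed (err=1722), using Everyone",
    "1/29 03:27:27 (5.1) (3928): perm::init: Lookup Account Name shoyle",
    "failed (err=1722), using Everyone",
    "1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle",
    "failed (err=1722), using Everyone",
]

if __name__ == "__main__":
    for (job, err), n in sorted(count_lookup_failures(SAMPLE).items()):
        print(f"job {job}: err={err} x{n}")
```

The script works the same on unwrapped logs, since unwrap() leaves
single-line entries untouched.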

We're also seeing all of the Condor daemons exiting on the central
manager overnight whenever a large job is submitted. 

Messages 597 and 137 on this list also had (err=1722), but the list has
no information about how the problems were resolved. 

I sent a query about this problem to condor-admin over a week ago, but
have had no reply apart from the automatic one. 

Hope someone can help, thanks,
Simon


Simon Hoyle, 
Inter-American Tropical Tuna Commission
Scripps Institution of Oceanography
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027   Fax: (858) 546-7133 
