
[Condor-users] multiple dags and schedd dying




Hello,

Should I be able to have 2 DAGMan jobs executing at the same time? I assumed I should, as I couldn't find any info to the contrary.
If I'm wrong then ignore my rambling....

I have 2 separate DAGs, and I submit them one after another with condor_submit_dag. Everything starts up fine: I see two separate
jobs running condor_dagman.exe, each processing its DAG and submitting more jobs.
When the first DAGMan job finishes, it takes the schedd down with it.

$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: INTEL-WINNT40 $

dagman.out of first dag
6/2 14:39:49 POST Script of Job F completed successfully.
6/2 14:39:49 Of 154 nodes total:
6/2 14:39:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/2 14:39:49   ===     ===      ===     ===     ===        ===      ===
6/2 14:39:49   154       0        0       0       0          0        0
6/2 14:39:49 All jobs Completed!
6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING WITH STATUS 0

MasterLog
6/2 14:39:47 ProcAPI: pid # 3356 was not found
6/2 14:39:48 ProcAPI: pid # 3388 was not found
6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
....
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:39:49 ProcAPI: pid # 5268 was not found
6/2 14:39:49 ProcAPI: pid # 1716 was not found
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87). Maybe it exited already?
6/2 14:44:58 ProcAPI: pid # 4756 was not found
6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87). Maybe it exited already?
6/2 14:45:06 ProcAPI: pid # 2324 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 2968 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4788 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 5056 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4704 was not found
6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds


SchedLog
6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with status 0
6/2 14:39:49 Writing record to user logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
6/2 14:39:49 init_user_ids: want user 'user@machine', current is '(null)@(null)'
6/2 14:39:49 init_user_ids: Already have handle for user@machine, so returning.
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value of True
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 Unknown user notification selection
6/2 14:39:49         Notify user with subject: Condor Job 2563.0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file ..\src\condor_schedd.V6\schedd.C
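
The exit and ERROR lines can be lined up across the two logs with a small filter. Below is a minimal sketch in Python, assuming the "M/D HH:MM:SS message" line format seen in the excerpts above. The function name fatal_lines and the sample string are hypothetical, put together from the log lines quoted in this message; nothing here ships with Condor.

```python
import re

# Matches daemon log lines of the form "6/2 14:39:49 message"
# (format assumed from the excerpts quoted in this message).
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) (.*)$")


def fatal_lines(log_text):
    """Return (timestamp, message) pairs for lines reporting an exit or a fatal ERROR."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip continuation lines such as "user@machine"
        stamp, msg = m.groups()
        # Keep messages that begin with ERROR, plus any process exit report.
        if msg.startswith("ERROR ") or "exited with status" in msg:
            hits.append((stamp, msg))
    return hits


# Sample assembled from the SchedLog and MasterLog excerpts above.
sample = """\
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with status 0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file ..\\src\\condor_schedd.V6\\schedd.C
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
"""

for stamp, msg in fatal_lines(sample):
    print(stamp, msg)
```

On the sample this keeps the DAGMan exit, the schedd's "Could not open mail to user!" ERROR, and the master's report that the SCHEDD exited with status 4, all stamped 14:39:49.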

Eventually the master restarts the schedd, but the restart wreaks havoc on jobs that are already running.

Any help appreciated,
Pawel
