
Re: [Condor-users] multiple dags and schedd dieing



Pawel,

There is no problem running multiple concurrent DAGMan jobs, provided they each have a unique DAG input file.

Your problem below is two-fold:

1) For some reason you don't have SMTP_SERVER defined in your config file. If you set this correctly, your problem should go away.

2) While the condor_master is coolly reporting this config problem and moving on, the condor_schedd is overreacting and exiting, which is dumb. I'm considering this a bug and will fix it for the next stable and development releases.
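
For item 1, a minimal sketch of the config entry, assuming a hypothetical mail host named smtp.example.com (substitute whatever SMTP server your site actually uses; the hostname here is only an illustration):

```
## Local Condor config file (Windows)
## SMTP_SERVER names the host that Condor routes notification email through.
## "smtp.example.com" is a placeholder, not a real default.
SMTP_SERVER = smtp.example.com
```

After editing the file, running condor_reconfig (or restarting Condor) should make the daemons pick up the change, and condor_config_val SMTP_SERVER should then print the value you set.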

Thanks for the report!

-Peter


On Jun 2, 2005, at 2:33 PM, Pawel.Micun@xxxxxxxxxxxxxxxx wrote:


Hello,

Should I be able to have 2 DAGMan jobs executing at the same time? I assumed I should, as I couldn't find any info to the contrary.
If I'm wrong then ignore my rambling....


I have 2 separate DAGs, and I submit one after another with condor_submit_dag. Everything starts up fine: I see two separate
jobs with condor_dagman.exe executing, and each is processing its DAG and submitting more jobs.
When the first DAGMan job finishes, it takes the schedd down with it.


$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: INTEL-WINNT40 $

dagman.out of first dag
6/2 14:39:49 POST Script of Job F completed successfully.
6/2 14:39:49 Of 154 nodes total:
6/2 14:39:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/2 14:39:49   ===     ===      ===     ===     ===        ===      ===
6/2 14:39:49   154       0        0       0       0          0        0
6/2 14:39:49 All jobs Completed!
6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING WITH STATUS 0


MasterLog
6/2 14:39:47 ProcAPI: pid # 3356 was not found
6/2 14:39:48 ProcAPI: pid # 3388 was not found
6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
....
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:39:49 ProcAPI: pid # 5268 was not found
6/2 14:39:49 ProcAPI: pid # 1716 was not found
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87). Maybe it exited already?
6/2 14:44:58 ProcAPI: pid # 4756 was not found
6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87). Maybe it exited already?
6/2 14:45:06 ProcAPI: pid # 2324 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 2968 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4788 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 5056 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4704 was not found
6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds



SchedLog
6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with status 0
6/2 14:39:49 Writing record to user logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
6/2 14:39:49 init_user_ids: want user 'user@machine', current is '(null)@(null)'
6/2 14:39:49 init_user_ids: Already have handle for user@machine, so returning.
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value of True
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 Unknown user notification selection
6/2 14:39:49         Notify user with subject: Condor Job 2563.0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file ..\src\condor_schedd.V6\schedd.C


Eventually the master restarts the schedd, but this wreaks havoc with already-running jobs.

Any help appreciated,
Pawel

--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685