
Re: [Condor-users] multiple dags and schedd dieing




Peter,

Thanks, setting SMTP_SERVER fixed it.

Pawel




"Peter F. Couvares" <pfc@xxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx

06/02/2005 03:58 PM
Please respond to Condor-Users Mail List

       
        To:        Condor-Users Mail List <condor-users@xxxxxxxxxxx>
        cc:        Pawel.Micun@xxxxxxxxxxxxxxxx
        Subject:        Re: [Condor-users] multiple dags and schedd dieing



Pawel,

There is no problem running multiple concurrent DAGMan jobs, provided
they each have a unique DAG input file.

Your problem below is two-fold:

1) For some reason you don't have SMTP_SERVER defined in your config
file.  If you set this correctly, your problem should go away.

2) While the condor_master is coolly reporting this config problem and
moving on, the condor_schedd is over-reacting and exiting, which is
dumb.  I'm considering this a bug and will fix it for the next stable
and development releases.
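
For item 1, a minimal sketch of the config entry, assuming a mail relay
reachable at the placeholder host smtp.example.com (that host name is
hypothetical; substitute your site's actual SMTP server):

```
# Condor config file entry (placeholder value, not from the original report).
# SMTP_SERVER is the setting the MasterLog and SchedLog say is missing.
SMTP_SERVER = smtp.example.com
```

The daemons have to re-read the configuration before the change takes
effect, for instance via condor_reconfig or a restart.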

Thanks for the report!

-Peter


On Jun 2, 2005, at 2:33 PM, Pawel.Micun@xxxxxxxxxxxxxxxx wrote:

>
> Hello,
>
> Should I be able to have 2 DAGMan jobs executing at the same time? I
> assumed I should, as I couldn't find any info to the contrary.
> If I'm wrong then ignore my rambling....
>
> I have 2 separate dags, and I submit one after another with
> condor_submit_dag. Everything starts up fine: I see two separate
> jobs with condor_dagman.exe executing, each processing its dag and
> submitting more jobs.
> When the first DAGMan job finishes, it takes the schedd down with it.
>
> $CondorVersion: 6.6.9 Mar 10 2005 $
> $CondorPlatform: INTEL-WINNT40 $
>
> dagman.out of first dag
> 6/2 14:39:49 POST Script of Job F completed successfully.
> 6/2 14:39:49 Of 154 nodes total:
> 6/2 14:39:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 6/2 14:39:49   ===     ===      ===     ===     ===        ===      ===
> 6/2 14:39:49   154       0        0       0       0          0        0
> 6/2 14:39:49 All jobs Completed!
> 6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING
> WITH STATUS 0
>
> MasterLog
> 6/2 14:39:47 ProcAPI: pid # 3356 was not found
> 6/2 14:39:48 ProcAPI: pid # 3388 was not found
> 6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> value of 0
> 6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
> 6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed
> with err=87
> 6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
> 6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed
> with err=87
> 6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
> 6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed
> with err=87
> 6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
> 6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed
> with err=87
> 6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
> ....
> 6/2 14:39:49 ProcAPI: pid # 5060 was not found
> 6/2 14:39:49 ProcAPI: pid # 5268 was not found
> 6/2 14:39:49 ProcAPI: pid # 1716 was not found
> 6/2 14:39:49 ProcAPI: pid # 5060 was not found
> 6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87).
> Maybe it exited already?
> 6/2 14:44:58 ProcAPI: pid # 4756 was not found
> 6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87).
> Maybe it exited already?
> 6/2 14:45:06 ProcAPI: pid # 2324 was not found
> 6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87).
> Maybe it exited already?
> 6/2 14:45:10 ProcAPI: pid # 2968 was not found
> 6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87).
> Maybe it exited already?
> 6/2 14:45:10 ProcAPI: pid # 4788 was not found
> 6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87).
> Maybe it exited already?
> 6/2 14:45:10 ProcAPI: pid # 5056 was not found
> 6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87).
> Maybe it exited already?
> 6/2 14:45:10 ProcAPI: pid # 4704 was not found
> 6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
> 6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config
> file
> 6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds
>
>
> SchedLog
> 6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default
> value of 0
> 6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with
> status 0
> 6/2 14:39:49 Writing record to user
> logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
> 6/2 14:39:49 init_user_ids: want user 'user@machine', current is
> '(null)@(null)'
> 6/2 14:39:49 init_user_ids: Already have handle for user@machine, so
> returning.
> 6/2 14:39:49 TokenCache contents:
> user@machine
> 6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value
> of True
> 6/2 14:39:49 TokenCache contents:
> user@machine
> 6/2 14:39:49 Unknown user notification selection
> 6/2 14:39:49         Notify user with subject: Condor Job 2563.0
> 6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config
> file
> 6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file
> ..\src\condor_schedd.V6\schedd.C
>
> Eventually the master restarts the schedd, but this causes havoc with
> already-running jobs.
>
> Any help appreciated,
> Pawel
>
--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685



