
Re: [Condor-users] Problem with multiple schedds in 7.7+



Hi Matt,

The workaround is simple, but the extent of the damage is not easily
evident. Since the different schedds keep overwriting the same job queue
log file, each schedd loses track of its queue completely! We kept
suffering from corrupted queues in the test environment until I found out
what was actually going on.
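
For reference, the explicit per-schedd setting that restores isolated
queues is a one-liner (a sketch reusing the SCHEDDJOBS2 names from the
config quoted below; adapt to your own local names):

```
# Give each additional schedd its own job queue log explicitly,
# inside its own per-schedd spool directory
SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG = $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log
```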

It would be useful to have information in the manual on how defaults are
selected for config variables. Sometimes it is not obvious.
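
To make the shadowing in Matt's snippet below concrete, here is a minimal
self-contained C++ sketch with param() stubbed out. The stub and the paths
are illustrative assumptions, not the real Condor configuration machinery;
only the if/else structure mirrors schedd_main.cpp:

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Stub of param(): mimics 7.7.5+, where JOB_QUEUE_LOG has a compiled-in
// default, so the lookup never returns NULL. The path is illustrative:
// the default expands against the *global* SPOOL, not a per-schedd one.
static char *param(const char *name) {
    if (std::strcmp(name, "JOB_QUEUE_LOG") == 0) {
        return strdup("/var/lib/condor/spool/job_queue.log");
    }
    return nullptr;
}

// Same decision structure as schedd_main.cpp: the NULL branch is the
// pre-7.7.5 per-schedd fallback, now unreachable.
static std::string resolve_job_queue_log(const char *Spool) {
    std::string job_queue_name;
    char *job_queue_param_name = param("JOB_QUEUE_LOG");
    if (job_queue_param_name == nullptr) {
        // intended fallback: this schedd's own spool -- dead code post-7.7.5
        job_queue_name = std::string(Spool) + "/job_queue.log";
    } else {
        job_queue_name = job_queue_param_name;
        free(job_queue_param_name);
    }
    return job_queue_name;
}
```

Because the stubbed param() never returns NULL for JOB_QUEUE_LOG, every
schedd resolves to the same global path regardless of its per-schedd SPOOL,
which is exactly how the queues end up shared.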

-- 
Thanks & Regards
+==========================================================
| Parag Mhashilkar
| Fermi National Accelerator Laboratory, MS 120
| Wilson & Kirk Road, Batavia, IL - 60510
|----------------------------------------------------------
| Phone: 1 (630) 840-6530 Fax: 1 (630) 840-2783
|----------------------------------------------------------
| Wilson Hall, 867E (Nov 17, 2010 - To date)
| Wilson Hall, 863E (Apr 24, 2007 - Nov 16, 2010)
| Wilson Hall, 856E (Mar 21, 2005 - Apr 23, 2007)
+==========================================================


On Thu, 2012-08-16 at 07:33 -0400, Matthew Farrellee wrote:
> Nice find!
> 
> This looks like a bug (and regression) IMHO.
> 
> src/condor_utils/param_info.in:
> [JOB_QUEUE_LOG]
> default=$(SPOOL)/job_queue.log
> 
> src/condor_schedd.V6/schedd_main.cpp:
>     // Initialize the job queue
>     char *job_queue_param_name = param("JOB_QUEUE_LOG");
> 
>     if (job_queue_param_name == NULL) {
>        // the default place for the job_queue.log is in spool
>        job_queue_name.sprintf( "%s/job_queue.log", Spool);
>     } else {
>        job_queue_name = job_queue_param_name; // convert char * to MyString
>        free(job_queue_param_name);
>     }
> 
> Because of the default the Spool/job_queue.log code won't be hit.
> 
> $ env _CONDOR_MATT.SPOOL=/tmp strace -e open condor_schedd -t -f 
> -local-name matt 2>&1 | grep -e spool -e tmp
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) 
> = -1 ENOENT (No such file or directory)
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 
> 0644) = 7
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) 
> = -1 ENOENT (No such file or directory)
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 
> 0644) = 7
> 08/16/12 07:27:27 (pid:14896) initLocalStarterDir: 
> /home/matt/Documents/CondorInstallation/spool/local_univ_execute already 
> exists, deleting old contents
> open("/tmp/spool_version", O_RDONLY)    = 11
> open("/home/matt/Documents/CondorInstallation/spool/job_queue.log", 
> O_RDWR) = 11
> 
> A workaround is to set JOB_QUEUE_LOG=
> 
> $ env _CONDOR_MATT.SPOOL=/tmp _CONDOR_JOB_QUEUE_LOG= strace -e open 
> condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) 
> = -1 ENOENT (No such file or directory)
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 
> 0644) = 7
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) 
> = -1 ENOENT (No such file or directory)
> open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 
> 0644) = 7
> 08/16/12 07:28:31 (pid:14908) initLocalStarterDir: 
> /home/matt/Documents/CondorInstallation/spool/local_univ_execute already 
> exists, deleting old contents
> open("/tmp/spool_version", O_RDONLY)    = 11
> open("/tmp/job_queue.log", O_RDWR)      = -1 ENOENT (No such file or 
> directory)
> open("/tmp/job_queue.log", O_RDWR|O_CREAT|O_EXCL, 0600) = 11
> 
> Note, SCHEDD_ADDRESS_FILE also has a default (defined in condor_config) 
> of $(SPOOL)/.schedd_address
> 
> Best,
> 
> 
> matt
> 
> On 08/13/2012 11:15 AM, John Weigand wrote:
> > Matt,
> >
> > You were correct that the problem was the job_queue.log.
> >
> > A JOB_QUEUE_LOG attribute was introduced in Condor 7.7.5
> > .. ticket 2598 https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2598
> >
> > http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#16343
> >
> > Prior to the introduction of this feature, a job_queue.log was always
> > maintained in the spool directory of each schedd.  With this change, it
> > appears (either a bug or by design) that the job queue log of each
> > additional schedd must be defined explicitly:
> >    SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG =
> > $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log
> >
> > If not explicitly stated, only one job_queue.log is used.  Hence, all
> > jobs are assigned to all schedd queues on a restart.
> >
> > John Weigand
> >
> >
> >
> > On 6/4/2012 7:57 PM, Matthew Farrellee wrote:
> >> On 05/21/2012 09:37 AM, John Weigand wrote:
> >>> There appears to be a change in behavior in Condor when multiple schedds
> >>> are defined. I have tested this with 7.7.5 and 7.8. It does not occur
> >>> in 7.6.6 and prior.
> >>>
> >>> Test condition:
> >>> 1. 3 schedds are defined
> >>> 2. I submit 1 job.
> >>> 3. condor_q -g shows 1 schedd queue with the job
> >>> 4. I restart condor
> >>> 5. condor_q -g shows the same job in all 3 schedd queues and treats
> >>> them as independent jobs.
> >>>
> >>> I use the same configuration for all 3 versions of Condor for the
> >>> secondary schedds:
> >>>
> >>> SCHEDDJOBS2 = $(SCHEDD)
> >>> SCHEDDJOBS2_ARGS = -local-name scheddjobs2
> >>> SCHEDD.SCHEDDJOBS2.SCHEDD_NAME = schedd_jobs2
> >>> SCHEDD.SCHEDDJOBS2.SCHEDD_LOG =
> >>> $(LOG)/SchedLog.$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
> >>> SCHEDD.SCHEDDJOBS2.LOCAL_DIR =
> >>> $(LOCAL_DIR)/$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
> >>> SCHEDD.SCHEDDJOBS2.EXECUTE = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/execute
> >>> SCHEDD.SCHEDDJOBS2.LOCK = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/lock
> >>> SCHEDD.SCHEDDJOBS2.PROCD_ADDRESS =
> >>> $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/procd_pipe
> >>> SCHEDD.SCHEDDJOBS2.SPOOL = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/spool
> >>> SCHEDD.SCHEDDJOBS2.SCHEDD_ADDRESS_FILE=$(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_address
> >>>
> >>>
> >>>
> >>> SCHEDD.SCHEDDJOBS2.SCHEDD_DAEMON_AD_FILE=$(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_classad
> >>>
> >>>
> >>>
> >>> SCHEDDJOBS2_LOCAL_DIR_STRING = "$(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)"
> >>> SCHEDD.SCHEDDJOBS2.SCHEDD_EXPRS = LOCAL_DIR_STRING
> >>> DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS2
> >>> :
> >>> (same for schedd3)
> >>> :
> >>> DC_DAEMON_LIST = + SCHEDDJOBS2 SCHEDDJOBS3
> >>>
> >>>
> >>> This works in 7.6.6 and prior, just not in 7.7.5 and 7.8.
> >>>
> >>> Any ideas?
> >>>
> >>> John Weigand
> >>
> >> First thought: somehow all the schedds are using the same spool. When
> >> you restart them they should log something like "About to rotate
> >> ClassAd log /var/lib/condor/spool/job_queue.log". Make sure they're
> >> each processing a different job_queue.log.
> >>
> >> Do you happen to have a wallaby dump of your configuration to share?
> >>
> >> Best,
> >>
> >>
> >> matt
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
