
Re: [HTCondor-users] Changes in configuration: 8.6.13 to 8.8.12



Hello,

Did you manage to collect any information about these settings? We have upgraded from 8.5.8 to 8.8.5. A strange issue is being reported with one schedd: a job remains in the idle state for a long time. We have flocking configured, but despite available resources the job is not getting scheduled.

# grep '2238959.0' /var/log/condor/SchedLog
08/11/21 01:43:19 (pid:9386) job_transforms for 2238959.0: 1 considered, 1 applied (SetTeam)
08/11/21 01:43:26 (pid:9386) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxtest.com<xx.xx.86.63:9618?addrs=xx.xx.86.63-9618&noUDP&sock=9535_cb9e_3> for testuser1 2238959.0
08/11/21 01:43:26 (pid:9386) Match record (slot1@xxxxxxxxxxxxxxxxtest.com<xx.xx.86.63:9618?addrs=xx.xx.86.63-9618&noUDP&sock=9535_cb9e_3> for testuser1, 2238959.0) deleted
08/11/21 02:03:44 (pid:9386) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxtest.com<xx.xx.84.136:9618?addrs=xx.xx.84.136-9618&noUDP&sock=515534_05ae_3> for testuser1 2238959.0
08/11/21 02:03:44 (pid:9386) Match record (slot1@xxxxxxxxxxxxxxxxtest.com<xx.xx.84.136:9618?addrs=xx.xx.84.136-9618&noUDP&sock=515534_05ae_3> for testuser1, 2238959.0) deleted
08/11/21 02:03:44 (pid:9386) Starting add_shadow_birthdate(2238959.0)
08/11/21 02:03:44 (pid:9386) Started shadow for job 2238959.0 on slot1@xxxxxxxxxxxxxxxxtest.com<xx.xx.87.240:9618?addrs=xx.xx.87.240-9618&noUDP&sock=57028_0b07_3> for testuser1, (shadow pid = 4127832)
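
To dig further we could ask the schedd why the job stays idle for so long, along these lines (just a sketch; the job id is the one from the SchedLog above, and the -pool host is only a placeholder for the central manager of the pool we flock to):

# matchmaking analysis for the idle job
condor_q -better-analyze 2238959.0
# presumably the same against the flocked pool; 'flocked-cm.example.com' is a placeholder
condor_q -better-analyze -pool flocked-cm.example.com 2238959.0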

While going through the release notes I found this bug:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7754 which looks related to the settings mentioned in this thread. Would disabling them help us avoid the bug?
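
If it helps, reverting them on the central manager would be something along these lines (old values taken from the 8.6.13 side of the diff quoted below; the config.d path is only an example), followed by a condor_reconfig so the negotiator picks it up:

# e.g. in a file under /etc/condor/config.d/ on the central manager
NEGOTIATOR_PREFETCH_REQUESTS = false
NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME = 120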

The following link also doesn't provide much information.

https://github.com/htcondor/htcondor/blob/master/src/condor_utils/param_info.in

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jul 13, 2021 at 3:17 PM <jcaballero.hep@xxxxxxxxx> wrote:
Hi again,

I have been checking many of those variables in the documentation.
Focusing only on the ones that have changed value (not the ones that
have been removed or added), I saw that several of them are not for the
Central Managers, so I guess I can ignore them.
Everything is narrowing down to a very short set of variables:
* NEGOTIATOR_MAX_TIME_PER_SCHEDD
* NEGOTIATOR_RESOURCE_REQUEST_LIST_SIZE
* NEGOTIATOR_PREFETCH_REQUESTS
* NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME

My problem now is with the last two, as I am not able to find any
documentation about them. I am not sure what they do, so I have no idea
of the consequences of their value change.
What are they about?
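
(I can at least see where the current values are coming from with
something like

  condor_config_val -verbose NEGOTIATOR_PREFETCH_REQUESTS
  condor_config_val -verbose NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME

but that only tells me where a value is defined, not what it controls.)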

Cheers,
Jose

On Wed, Jul 7, 2021 at 3:32 PM Jose Caballero
(<jcaballero.hep@xxxxxxxxx>) wrote:
>
> Hi,
>
>
> We are in the process of upgrading our Central Managers from version
> 8.6.13 to 8.8.12.
> As expected, there are a few changes in the configuration. In
> particular in the default settings. Some of the previous ones are
> gone, some new ones have appeared.
>
> I have already filtered a bunch of variables that don't apply to a
> Linux-based Grid site.
> But I still have a list of variables to inspect.
>
> While I check the RELEASE NOTES, I was wondering if the developers, or
> someone who has done this upgrade recently has any advice on which
> variables I should focus on, which new values should be reverted to the
> previous ones in order to keep existing functionality, etc.
> I am just concerned that some of the new defaults may drastically
> change the behaviour of our pool.
>
> Here [1] is the reduced diff, and here [2] is a link to a temporary
> Google Doc with the same content in an easier-to-read format :)
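>
> (For anyone who wants to make a similar comparison: a dump of the
> effective configuration can be obtained on each version with something
> like
>
>   condor_config_val -dump | sort > config-dump.txt
>
> and the two dumps then compared with diff.)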
>
>
> Thanks a lot in advance.
> Cheers,
> Jose
>
>
> =======================================================
> [1]
>
> diff 8.6.13 8.8.12
>
> > C_GAHP_CONTACT_SCHEDD_DELAY = 5
> > C_GAHP_MAX_FILE_REQUESTS = 10
> > CGAHP_SCHEDD_INTERACTION_TIME = 5
> > COLLECTOR_HOST_FOR_NEGOTIATOR = $(FULL_HOSTNAME)
> > COLLECTOR_QUERY_MAX_WORKTIME = 0
> > COLLECTOR_QUERY_WORKERS_PENDING = 50
> > COLLECTOR_QUERY_WORKERS_RESERVE_FOR_HIGH_PRIO = 1
> > CONDOR_Q_SHOW_OLD_SUMMARY = false
>
> < CURB_MATCHMAKING = false
> > CURB_MATCHMAKING = RecentDaemonCoreDutyCycle > 0.98
>
> > DAGMAN_AGGRESSIVE_SUBMIT = false
> > DAGMAN_ALLOW_ANY_NODE_NAME_CHARACTERS = false
>
> < DAGMAN_ALLOW_LOG_ERROR = false
>
> < DAGMAN_MAX_SUBMITS_PER_INTERVAL = 5
> > DAGMAN_MAX_SUBMITS_PER_INTERVAL = 100
>
> > DAGMAN_QUEUE_UPDATE_INTERVAL = 300
> > DAGMAN_REPORT_GRAPH_METRICS = false
>
> < DATABASE_PURGE_INTERVAL =
> < DATABASE_REINDEX_INTERVAL =
> < DBMSD = $(SBIN)/condor_dbmsd
> < DBMSMANAGER_NAME =
>
> < DEFAULT_JOB_MAX_RETRIES = 10
> > DEFAULT_JOB_MAX_RETRIES = 2
>
> > DEFRAG_DRAINING_START_EXPR = FALSE
> > DELEGATE_FULL_JOB_GSI_CREDENTIALS = false
>
> < EMAIL_NOTIFICATION_CC =
> < ENABLE_ADDRESS_REWRITING = true
>
> > ENABLE_HTTP_PUBLIC_FILES = false
> > ENABLE_MULTIFILE_TRANSFER_PLUGINS = false
>
> < ENABLE_WEB_SERVER = false
>
> < FILE_TRANSFER_DISK_LOAD_THROTTLE =
> > FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0
>
> > FILE_TRANSFER_STATS_LOG = $(LOG)/transfer_history
> > GAHP_SSL_CADIR =
> > GAHP_SSL_CAFILE =
>
> < HISTORY_HELPER_MAX_CONCURRENCY = 2
> > HISTORY_HELPER_MAX_CONCURRENCY = 50
>
> > HTTP_PUBLIC_FILES_ADDRESS = 127.0.0.1:80
> > HTTP_PUBLIC_FILES_ROOT_DIR = /usr/share/nginx/html
> > HTTP_PUBLIC_FILES_STALE_AGE = 604800
>
> < IS_OWNER = (START =?= False)
> > IS_OWNER = False
>
> > JOB_DEFAULT_LEASE_DURATION = 2400
>
> < JOB_PROXY_OVERRIDE_FILE =
>
> > JOB_ROUTER_ROUND_ROBIN_SELECTION = false
> > KEYRING_SESSION_CREATION_TIMEOUT = 20
>
> < MASTER_SQLLOG =
>
> > MAX_ACCEPTS_PER_CYCLE = 8
> > MAX_CONCURRENT_DOWNLOADS = 100
> > MAX_CONCURRENT_UPLOADS = 100
> > MAX_DRAINING_ACTIVATION_DELAY = 20
> > MAX_PENDING_STARTD_CONTACTS = 0
> > MAX_REAPS_PER_CYCLE = 0
> > MAX_REMAP_RECURSIONS = 128
>
> < MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER =
> > MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER = 200
>
> > MAX_TIMER_EVENTS_PER_CYCLE = 3
> > MAX_UDP_MSGS_PER_CYCLE = 1
>
> < MAX_XML_LOG = 1900000000
>
> < MOUNT_UNDER_SCRATCH =
> > MOUNT_UNDER_SCRATCH = "/tmp,/var/tmp"
>
> > NEGOTIATOR_DEPTH_FIRST = false
> > NEGOTIATOR_JOB_CONSTRAINT =
> > NEGOTIATOR_MAX_TIME_PER_CYCLE = 1200
>
> < NEGOTIATOR_MAX_TIME_PER_SCHEDD = 31536000
> > NEGOTIATOR_MAX_TIME_PER_SCHEDD = 120
>
> < NEGOTIATOR_PREFETCH_REQUESTS = false
> < NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME = 120
> > NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME = 60
> > NEGOTIATOR_PREFETCH_REQUESTS = true
>
> < NEGOTIATOR_RESOURCE_REQUEST_LIST_SIZE = 20
> > NEGOTIATOR_RESOURCE_REQUEST_LIST_SIZE = 200
>
> > NEGOTIATOR_SOCKET_CACHE_SIZE = 500
>
> < OBITUARY_LOG_LENGTH = 20
> > OBITUARY_LOG_LENGTH = 200
>
> < PREEN_MAX_SCHEDD_CONNECTION_TIME = 20
>
> > SCHEDD_ALLOW_LATE_MATERIALIZE = true
> > SEC_CREDENTIAL_REFRESH_INTERVAL = -1
>
> < SHARED_PORT_ADDRESS_REWRITING = false
> < STARTD_COMPUTE_AVAIL_STATS = false
> < STARTD_MAX_AVAIL_PERIOD_SAMPLES = 100
>
> < START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 200
> > START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 500
>
> > SUBMIT_DEFAULT_SHOULD_TRANSFER_FILES =
>
> < SUBMIT_SKIP_FILECHECKS =
> > SUBMIT_SKIP_FILECHECKS = true
>
> < SYSTEM_VALID_SPOOL_FILES = job_queue.log, job_queue.log.tmp, history, Accountant.log, Accountantnew.log, local_univ_execute, .quillwritepassword, .pgpass, .schedd_address, .schedd_address.super, .schedd_classad, OfflineLog
> > SYSTEM_VALID_SPOOL_FILES = job_queue.log, job_queue.log.tmp, history, Accountant.log, Accountantnew.log, local_univ_execute, .pgpass, .schedd_address, .schedd_address.super, .schedd_classad, OfflineLog
>
> > TRUST_LOCAL_UID_DOMAIN = true
>
> < WARN_ON_UNUSED_SUBMIT_FILE_MACROS =
> < WEB_ROOT_DIR =
>
> > WARN_ON_UNUSED_SUBMIT_FILE_MACROS = true
>
>
> [2]
> https://docs.google.com/document/d/1x43ksFjfDGLozJsVbMABJmtZeGR4i2XFgzGYqlXr5YA/edit?usp=sharing

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/