Re: [HTCondor-users] Possible delays for starting a shadow?



Hi,

The current settings for FILE_DESCRIPTORS are in fact significantly lower than your suggestion, so I will try raising them at the next opportunity.

The user has moved on for now and is using the other sched, so that may not be before next year.

In the meantime, I have held and released the job, to get some more info from the higher debug level.

On rematching the job, there are a few extra lines containing the job id, but they don't really seem helpful:

12/30/21 12:36:55 Sent job 55891514.0 (autocluster=2611 resources_requested=1) to the negotiator
12/30/21 12:36:55 Partitionable slot ... adjusted for job 55891514.0: cpus = 1, memory = 1024, disk = 22080399
12/30/21 12:36:55 Job 55891514.0: is runnable
12/30/21 12:36:55 record for job 55891514.0 skipped until PrioRec rebuild (already matched)
12/30/21 12:36:55 Scheduler::start_std - job=55891514.0 on ...
12/30/21 12:36:55 Cleared dirty attributes for job 55891514.0
12/30/21 12:36:55 Queueing job 55891514.0 in runnable job queue
12/30/21 12:36:55 Match (...)- running 55891514.0
12/30/21 13:38:44 Job prep for 55891514.0 will not block, calling aboutToSpawnJobHandler() directly
12/30/21 13:38:44 aboutToSpawnJobHandler() completed for job 55891514.0, attempting to spawn job handler
12/30/21 13:38:44 Starting add_shadow_birthdate(55891514.0)

The delay was "only" an hour here. Before I held the job, it had been waiting to actually start for more than two hours, even though it was in the Running state.

From the first line at 13:38:44 I would guess that "Job prep" was blocking until then, but I am not sure what that means.
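
To pin down where the time goes, one could measure the pause between consecutive log lines for a job. The following is a minimal Python sketch, assuming the timestamp layout seen in the excerpt above; `largest_gap` is a hypothetical helper name, not an HTCondor tool, and the sample lines are copied from the excerpt:

```python
from datetime import datetime

# Timestamp layout seen in the log excerpt: "12/30/21 12:36:55 ..."
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # len("12/30/21 12:36:55")


def largest_gap(lines, job_id):
    """Return (seconds, earlier_line, later_line) for the longest pause
    between consecutive log lines that mention job_id, or None if there
    are fewer than two such lines."""
    previous = None
    best = None
    for line in lines:
        # Plain substring match; assumes the id is specific enough.
        if job_id not in line:
            continue
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best


sample = [
    "12/30/21 12:36:55 Queueing job 55891514.0 in runnable job queue",
    "12/30/21 12:36:55 Match (...)- running 55891514.0",
    "12/30/21 13:38:44 Job prep for 55891514.0 will not block, calling aboutToSpawnJobHandler() directly",
]

gap, before, after = largest_gap(sample, "55891514.0")
print(f"{gap:.0f} s between:\n  {before}\n  {after}")
```

For the three sample lines this reports a pause of 3709 seconds between the "Match" line and the "Job prep" line, which matches the roughly one-hour delay described above.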

Best
  Kruno


From: ervikrant06@xxxxxxxxx
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 30 December, 2021 12:46:14
Subject: Re: [HTCondor-users] Possible delays for starting a shadow?
Hello, 
If I understand your question correctly, this could be related to the following settings (set to higher values than the defaults).

MAX_FILE_DESCRIPTORS = 102400
SCHEDD_MAX_FILE_DESCRIPTORS = 102400
SHARED_PORT_MAX_FILE_DESCRIPTORS = 102400

https://stackoverflow.com/questions/56650579/why-should-i-close-all-file-descriptors-after-calling-fork-and-prior-to-callin
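
Before changing these, it may help to check which limit a running daemon actually has. The following is a minimal Python sketch, assuming a Linux host where `/proc/<pid>/limits` is readable; `max_open_files` is a hypothetical helper, and the sample text below is made up to stand in for the real file:

```python
def max_open_files(limits_text):
    """Extract the (soft, hard) 'Max open files' values from the text of
    /proc/<pid>/limits. Values are returned as strings because the
    kernel may report 'unlimited'."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


# Illustrative stand-in for the file contents (not taken from a real host).
sample_limits = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             63450                63450                processes\n"
    "Max open files            1024                 4096                 files\n"
)

print(max_open_files(sample_limits))

# On a real host one would read the file for the daemon's pid instead, e.g.:
#   with open(f"/proc/{pid}/limits") as handle:   # pid: the schedd's process id
#       print(max_open_files(handle.read()))
```

Comparing the reported soft limit with the configured value shows whether a changed setting has actually reached the running process.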

Thanks & Regards,
Vikrant Aggarwal


On Thu, Dec 30, 2021 at 4:35 PM Sever, Krunoslav <krunoslav.sever@xxxxxxx> wrote:
Hi,

at the moment I am investigating a case in which a scheduler delays starting a shadow for unknown reasons.

The job is minimal: no file transfers, it just executes some echo statements. As soon as the job is actually started, it is done immediately.

When submitted to another sched (should have same config), there is consistently no delay.

From the logs, which are currently still at the default level, I see:

* job is submitted to and transformed on sched
* negotiator matches the job to a worker a short time later

In the working case the shadow is started on the sched without any delay.

Not so on the sched I am looking at:

* the job is in running state according to condor_q
* the job ad has no mention of the matched worker node (yet)
* on the worker I find nothing about the job id in the logs

I actually have a job right now that is waiting for its shadow to start, with the log level increased to D_FULLDEBUG, but there is no output for the job yet. The increase happened after matching, though, so I might have missed the interesting part.

Do you have any ideas what could cause this behavior?

Best
  Kruno

--
------------------------------------------------------------------------
Krunoslav Sever            Deutsches Elektronen-Synchrotron (IT-Systems)
                        Ein Forschungszentrum der Helmholtz-Gemeinschaft
                                                            Notkestr. 85
phone:  +49-40-8998-1648                                   22607 Hamburg
e-mail: krunoslav.sever@xxxxxxx                                  Germany
------------------------------------------------------------------------
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/