
Re: [HTCondor-users] Held jobs: unable to establish standard (output|error) stream



Hmmm, 112 is not really helpful, as it just means 'the job should be put on hold' according to [1].
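
Side note: the 28672 in the DaemonCore lines below is the same number, just the
raw wait() status with the exit code in the high byte:

# echo $((28672 >> 8))
112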

I guess there is not much more in the Starter log on your execute node/slot `slot1_3@xxxxxxxxxxxxxxxxxxx` than there is in the shadow log.
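
If the execute side has anything at all, it would be in the per-slot StarterLog,
e.g. (the path is the usual packaging default and may differ on your install):

# grep -B2 -A2 347480 /var/log/condor/StarterLog.slot1_3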

Since it is affecting jobs across all your execute nodes, I would concur that it might be a screwed-up job - has the user tried to run the job interactively (if it can be run manually at all, given it is a DAG)?
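
Something along the lines of

# condor_submit -interactive <node>.sub

(the submit file name is just a placeholder) would request a slot with the same
requirements and transfers but drop into a shell instead of running the
executable, which should show quickly whether the payload itself is broken.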

Just guessing: is the job failing (and not producing the expected output files), so that Condor stumbles over the files in TransferOutput not being transferable?
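
The hold reason code/subcode on the held jobs should tell whether it is the
stdout/stderr stream or an output-file transfer that fails; roughly:

# condor_q -constraint 'JobStatus == 5' -af:jh HoldReason HoldReasonCode HoldReasonSubCode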

Cheers,
  Thomas

[1]
https://research.cs.wisc.edu/htcondor/manual/v8.9.1/CondorshadowExitCodes.html

On 07/01/2021 11.51, Steffen Grunewald wrote:
On Thu, 2021-01-07 at 09:59:57 +0100, Thomas Hartmann wrote:
Hi Steffen,

No good idea here, but have you checked for anything suspicious on the
scheduler? The `Shadow exception!` sounds suspicious to me - could it be not a
job fault, but a problem on the sched with the job shadows?

Is there maybe something in the sched logs? (Or has disk space run out, or have
the available file handles been used up?)

Disk space is being monitored, there's lots of space.
Here's something I found in the ScheddLog:

21-01-07_09:00:57  (pid:2049729) Calling Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> (7)
21-01-07_09:00:57  (pid:2049729) Return from Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> 0.000031s
21-01-07_09:00:57  (pid:2049729) Calling Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> (8)
21-01-07_09:00:57  (pid:2049729) Return from Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> 0.000019s
21-01-07_09:00:57  (pid:2049729) Calling Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> (9)
21-01-07_09:00:57  (pid:2049729) Return from Handler <TransferQueueManager::HandleReport TRANSFER_QUEUE_REQUEST> 0.000015s
21-01-07_09:00:57  (pid:2049729) DaemonCore: pid 2316452 exited with status 28672, invoking reaper 2 <child_exit>
21-01-07_09:00:57  (pid:2049729) Shadow pid 2316452 for job 347480.0 exited with status 112
21-01-07_09:00:57  (pid:2049729) Putting job 347480.0 on hold
21-01-07_09:00:57  (pid:2049729) WriteUserLog checking for event log rotation, but no lock
21-01-07_09:00:57  (pid:2049729) DaemonCore: return from reaper for pid 2316452
21-01-07_09:00:57  (pid:2049729) DaemonCore: pid 2316456 exited with status 28672, invoking reaper 2 <child_exit>
21-01-07_09:00:57  (pid:2049729) Shadow pid 2316456 for job 347481.0 exited with status 112
21-01-07_09:00:57  (pid:2049729) Putting job 347481.0 on hold
21-01-07_09:00:57  (pid:2049729) WriteUserLog checking for event log rotation, but no lock
21-01-07_09:00:57  (pid:2049729) DaemonCore: return from reaper for pid 2316456
21-01-07_09:00:57  (pid:2049729) DaemonCore: pid 2316493 exited with status 28672, invoking reaper 2 <child_exit>
21-01-07_09:00:57  (pid:2049729) Shadow pid 2316493 for job 347482.0 exited with status 112
21-01-07_09:00:57  (pid:2049729) Putting job 347482.0 on hold
21-01-07_09:00:57  (pid:2049729) WriteUserLog checking for event log rotation, but no lock

Status 112 probably rings a bell with you. (Where does one look those up?)

I'm also worried about those TRANSFER_QUEUE_REQUESTs. I suspect - although there's
a shared filesystem for everything - that he's using the file transfer mechanism
to write output back to his home on the schedd node, not the worker one. I haven't
found the submit file yet, but a long condor_q listing of one of the jobs involved
looks very suspicious (they have been spawned by DAGMan, so they're almost identical):

# condor_q -l 347480.0 | egrep -i trans
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)
ShouldTransferFiles = "YES"
TransferExecutable = false
TransferIn = false
TransferInput = "hwinjcbc_1249852249.xml,lalinf_touch_output"
TransferInputSizeMB = 0
TransferOutput = "lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07.resume,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5V1-PSD.dat,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5L1-PSD.dat,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5H1-PSD.dat,lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5_snr.txt"
TransferOutputRemaps = "lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.01.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.02.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.03.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.04.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.05.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.06.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07.resume=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5.07.resume;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5V1-PSD.dat=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5V1-PSD.dat;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5L1-PSD.dat=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5L1-PSD.dat;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5H1-PSD.dat=/.auto/home
/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5H1-PSD.dat;lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5_snr.txt=/.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/engine/lalinferencemcmc-0-V1L1H1-1249852256.9904504-1.hdf5_snr.txt"
WhenToTransferOutput = "ON_EXIT_OR_EVICT"

(/home is automounted via a symlink to /.auto/home; this has worked for everyone
for almost 20 years now.)
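
I still have to check the stdout/stderr side - given the hold message, the
relevant attributes for the same job would be roughly these:

# condor_q -l 347480.0 | egrep -i '^(Out|Err|Iwd|StreamOut|StreamErr) '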

With that many output remaps (I've counted 20 of them) and 16 jobs trying
to run simultaneously, the "ulimit -n" of 1024 should still be far away, shouldn't it?
In the remap target directory (which is ${Iwd}/engine, exists, and is owned by the
user with mode 0755) there are only files from another run (where the base names end in -15, IIRC).
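
To rule the descriptor limit out completely, counting the shadows' open file
descriptors on the schedd host (as root) should settle it - rough sketch:

# for p in $(pgrep -x condor_shadow); do echo "$p $(ls /proc/$p/fd | wc -l)"; done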

Let's check for that job in ShadowLog:

21-01-07_09:00:55  Initializing a VANILLA shadow for job 347480.0
21-01-07_09:00:55  (347480.0) (2316452): Request to run on slot1_3@xxxxxxxxxxxxxxxxxxx <10.150.1.15:42229?addrs=10.150.1.15-42229> was ACCEPTED
21-01-07_09:00:55  (347480.0) (2316452): File transfer completed successfully.
21-01-07_09:00:55  (347480.0) (2316452): WriteUserLog checking for event log rotation, but no lock
21-01-07_09:00:55  (347480.0) (2316452): Job 347480.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:57  (347480.0) (2316452): **** condor_shadow (condor_SHADOW) pid 2316452 EXITING WITH STATUS 112

There's nothing here we didn't already know, is there?
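
Maybe cranking up the shadow's verbosity would at least say which stream it
fails to open, e.g. by adding

SHADOW_DEBUG = D_FULLDEBUG

to the schedd's local config (a file under /etc/condor/config.d/ on a typical
install) and running condor_reconfig.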

Checking for all those occurrences since the last "condor_release":

21-01-07_09:00:55  (347482.0) (2316493): Job 347482.0 going into Hold state (code 9,0): Error from slot1_7@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:55  (347481.0) (2316456): Job 347481.0 going into Hold state (code 9,0): Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:55  (347480.0) (2316452): Job 347480.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:57  (347484.0) (2316572): Job 347484.0 going into Hold state (code 9,0): Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:57  (347483.0) (2316563): Job 347483.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:57  (347486.0) (2316622): Job 347486.0 going into Hold state (code 9,0): Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (347487.0) (2316694): Job 347487.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348225.0) (2316731): Job 348225.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348226.0) (2316768): Job 348226.0 going into Hold state (code 9,0): Error from slot1_2@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348227.0) (2316784): Job 348227.0 going into Hold state (code 9,0): Error from slot1_2@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (347485.0) (2316807): Job 347485.0 going into Hold state (code 9,0): Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348228.0) (2316886): Job 348228.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348230.0) (2316935): Job 348230.0 going into Hold state (code 9,0): Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348229.0) (2316894): Job 348229.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348231.0) (2316981): Job 348231.0 going into Hold state (code 9,0): Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
21-01-07_09:00:58  (348232.0) (2317021): Job 348232.0 going into Hold state (code 9,0): Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream

OK, so this is NOT a single black-hole node among the workers.
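
For a quicker overview, should this happen again, something like

# condor_q -constraint 'JobStatus == 5' -af LastRemoteHost | sort | uniq -c

would tally where all the held jobs last ran.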

Time to restart the schedd node?!?
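
Or, less drastically, just the schedd daemon:

# condor_restart -schedd

(or, on a typical systemd install, `systemctl restart condor`).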

Running out of feasible ideas,

- Steffen

