
Re: [HTCondor-users] Held jobs: unable to establish standard (output|error) stream



Hi Steffen,

I don't have a good idea here, but have you checked for anything suspicious on the scheduler? The `Shadow exception!` looks suspicious to me: it may not be a fault of the job itself but a problem on the scheduler with the job shadows.

Is there anything in the scheduler logs? Or maybe the disk simply ran out of space, or the available file handles were used up?
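
As a rough sketch of those two checks, assuming a Linux scheduler host: the snippet below reports the free fraction of a filesystem and the system-wide file handle usage from /proc/sys/fs/file-nr. The spool path is a hypothetical placeholder (the real one can be looked up with `condor_config_val SPOOL`), and the function names are made up for illustration.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def parse_file_nr(text):
    """Parse Linux /proc/sys/fs/file-nr ('allocated unused max') into (allocated, max)."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


if __name__ == "__main__":
    # Hypothetical path; substitute the directory reported by `condor_config_val SPOOL`.
    spool = "/var/lib/condor/spool"
    print(f"free space under {spool}: {free_fraction(spool):.1%}")

    # System-wide file handle usage (Linux only).
    with open("/proc/sys/fs/file-nr") as fh:
        allocated, maximum = parse_file_nr(fh.read())
    print(f"file handles: {allocated} of {maximum} allocated system-wide")
```

This only covers system-wide limits; a per-process limit on open files for the daemons would have to be checked separately.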

Cheers,
  Thomas

On 07/01/2021 09.09, Steffen Grunewald wrote:
Good morning,

in my pool, there's a couple of jobs going into Hold state, with the HoldReason(s)
shown in the subject.
The last log lines look like this:

...
012 (347486.000.000) 01/07 09:00:58 Job was held.
         Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
         Code 9 Subcode 0
...
012 (347487.000.000) 01/07 09:00:58 Job was held.
         Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
         Code 9 Subcode 0
...
007 (347485.000.000) 01/07 09:00:58 Shadow exception!
         Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
         0  -  Run Bytes Sent By Job
         2469644  -  Run Bytes Received By Job
...
012 (347485.000.000) 01/07 09:00:58 Job was held.
         Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
         Code 9 Subcode 0
...

I ran "condor_q -l $jobid | egrep '^(UserLog|Out|Err)'" and checked the existence of the
files on all pool nodes (including the head nodes) - nothing suspicious.
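
A minimal sketch of how that check could be scripted, for anyone wanting to repeat it: it parses saved `condor_q -l` output for the same attributes plus Iwd, resolves relative paths against Iwd (an assumption about how the job was submitted), and reports whether each file exists and whether its parent directory is writable by the account running the script. The function names are hypothetical, not part of HTCondor.

```python
import os
import re

# Attributes of interest in `condor_q -l` output (the egrep set above, plus Iwd).
ATTR_RE = re.compile(r'^(UserLog|Out|Err|Iwd)\s*=\s*"(.*)"\s*$')


def parse_paths(classad_text):
    """Return {attribute: path} for the attributes matched by ATTR_RE."""
    found = {}
    for line in classad_text.splitlines():
        match = ATTR_RE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


def check_paths(attrs):
    """Resolve relative paths against Iwd and report existence and writability."""
    iwd = attrs.get("Iwd", "")
    report = {}
    for name in ("UserLog", "Out", "Err"):
        if name not in attrs:
            continue
        path = attrs[name]
        if not os.path.isabs(path):
            path = os.path.join(iwd, path)
        parent = os.path.dirname(path) or "."
        report[name] = {
            "path": path,
            "exists": os.path.exists(path),
            "parent_writable": os.access(parent, os.W_OK),
        }
    return report
```

Feeding it the text saved from `condor_q -l $jobid` and printing the result of `check_paths(parse_paths(text))` gives one entry per attribute. Existence alone may not be enough; a missing write permission on the parent directory would also be worth ruling out.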

How can I debug this further? Do I have a gaping black hole in the pool (one that only
affects this particular user), or is there something in the submit file (which I haven't
found yet) that's different from everything else? condor_release doesn't reset the error state...

Any suggestion is appreciated.

condor_version is 8.8.3 on the HN, and 8.8.x (x >= 3) anywhere else (due to scattered
reinstalls, a full update currently isn't possible)

Thanks,
  S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~