
[HTCondor-users] Held jobs: unable to establish standard (output|error) stream



Good morning,

in my pool, a couple of jobs are going into the Hold state, with the HoldReason(s)
shown in the subject.
The last lines of the job log look like this:

...
012 (347486.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
012 (347487.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
007 (347485.000.000) 01/07 09:00:58 Shadow exception!
        Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        0  -  Run Bytes Sent By Job
        2469644  -  Run Bytes Received By Job
...
012 (347485.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
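
To see at a glance which jobs and slots are affected, the hold events can be pulled out of
the job log with a few lines of Python. This is only a minimal sketch: it assumes the log
has exactly the layout of the excerpt above (a "012 ... Job was held." header followed by
an indented "Error from <slot>: <reason>" line). The names held_events and SAMPLE are made
up for illustration and are not part of HTCondor.

```python
import re
from collections import Counter

# Event header: 3-digit event code, (cluster.proc.subproc), date, time, text.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")
# First detail line of a hold event, as it appears in the excerpt.
ERROR_RE = re.compile(r"^\s+Error from (\S+): (.*)$")


def held_events(log_text):
    """Yield (job_id, slot, reason) for every 012 'Job was held' event."""
    current = None  # job id of the 012 event being read, otherwise None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code, cluster, proc = header.group(1), header.group(2), header.group(3)
            # Only hold events (code 012) are of interest; anything else resets state.
            current = f"{int(cluster)}.{int(proc)}" if code == "012" else None
            continue
        detail = ERROR_RE.match(line)
        if detail and current is not None:
            yield current, detail.group(1), detail.group(2)
            current = None


# Shortened copy of the excerpt above; the host name is the redacted placeholder.
SAMPLE = """\
...
012 (347486.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
012 (347487.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
007 (347485.000.000) 01/07 09:00:58 Shadow exception!
        Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        0  -  Run Bytes Sent By Job
        2469644  -  Run Bytes Received By Job
...
012 (347485.000.000) 01/07 09:00:58 Job was held.
        Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
        Code 9 Subcode 0
...
"""

if __name__ == "__main__":
    for job_id, slot, reason in held_events(SAMPLE):
        print(job_id, slot, reason)
    # Number of hold events per slot.
    print(Counter(slot for _, slot, _ in held_events(SAMPLE)))
```

For the excerpt this gives three hold events, one each on slot1_5, slot1_3 and slot1_4; the
007 shadow exception is skipped because only 012 events are collected.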

I ran "condor_q -l $jobid | egrep '^(UserLog|Out|Err)'" and checked that the files exist
on all pool nodes (including the head nodes) - nothing suspicious.
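
The existence check above could be scripted roughly as follows. This is a sketch under
assumptions, not a definitive test: it expects the text printed by "condor_q -l $jobid" as
input, assumes that relative Out/Err/UserLog values are meant relative to the job's Iwd
attribute, and only reports what the account running the script can see on the node where
it runs, which may differ from what the job itself sees. The helper names
parse_classad_strings and check_job_paths are invented for this example.

```python
import os
import re

# Matches string-valued ClassAd lines such as:  Out = "job.out"
ATTR_RE = re.compile(r'^(\w+)\s*=\s*"(.*)"\s*$')


def parse_classad_strings(text):
    """Collect the string-valued attributes from 'condor_q -l' style output."""
    ad = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line.strip())
        if match:
            ad[match.group(1)] = match.group(2)
    return ad


def check_job_paths(ad, names=("Out", "Err", "UserLog")):
    """For each path attribute, report whether the file exists and whether
    its parent directory exists and is writable by the current user."""
    report = {}
    iwd = ad.get("Iwd", "")
    for name in names:
        value = ad.get(name)
        if not value:
            continue
        # Assumption: relative paths are interpreted relative to Iwd.
        path = value if os.path.isabs(value) else os.path.join(iwd, value)
        parent = os.path.dirname(path) or "."
        report[name] = {
            "path": path,
            "exists": os.path.exists(path),
            "parent_writable": os.path.isdir(parent) and os.access(parent, os.W_OK),
        }
    return report


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): condor_q -l $jobid | python check_paths.py
    for name, info in check_job_paths(parse_classad_strings(sys.stdin.read())).items():
        print(name, info)
```

Reporting the parent directory separately is deliberate: an output file that does not exist
yet is harmless, while a missing or unwritable directory is not.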

How can I debug this further? Do I have a gaping black hole in the pool (one that only
affects this particular user), or is there something in the submit file (which I haven't
found yet) that's different from everything else? condor_release doesn't reset the error
state...

Any suggestion is appreciated.

condor_version is 8.8.3 on the HN, and 8.8.x (x >= 3) everywhere else (due to scattered
reinstalls, a full update currently isn't possible).

Thanks,
 S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~