
[HTCondor-users] ULOG_RD_ERROR while parsing log with JobEventLog



Dear experts,
we in CMS CRAB have been parsing hundreds of thousands of user log files
over the last few years without a single glitch (thanks!!!) using
htcondor.JobEventLog() from the HTCondor Python bindings.

Today, for the first time, I hit one instance of an unrecoverable error.
I am concerned because it happened on the first scheduler that we have just
moved to HTCondor 23, so we need to know whether something new is going on.
I sent 3 test DAGs: 2 were OK, one had this error.

The error can easily be reproduced with the simple script below, but
the exception carries no details. Can you help me figure out what happened?

I tried to look at the file around the event which raises the exception,
but did not see anything suspicious.
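
For anyone who wants to look at the same spot, here is a minimal sketch of how the raw text around a given event can be pulled out of the file without the bindings. It assumes the default (non-XML) user log format, where every event ends with a line containing only "...". The names iter_raw_events and raw_events_around are hypothetical helpers, not part of htcondor, and if the file is malformed the numbering may not line up exactly with the count reported by JobEventLog.

```python
def iter_raw_events(lines):
    """Yield (index, text) for each block of lines ending with a '...' line.

    Indices start at 1 so they can be compared with a running event count.
    """
    block = []
    index = 0
    for line in lines:
        if line.rstrip("\r\n") == "...":
            index += 1
            yield index, "".join(block)
            block = []
        else:
            block.append(line)
    # A trailing block without a terminator may be a truncated event.
    if block:
        index += 1
        yield index, "".join(block)


def raw_events_around(path, n, context=1):
    """Return the raw blocks numbered n-context .. n+context from the file."""
    with open(path, errors="replace") as f:
        return [(i, text) for i, text in iter_raw_events(f)
                if abs(i - n) <= context]
```

With the output shown further down (10631 events read before the exception), something like raw_events_around('job_log', 10632) should return the last good event, the suspect one, and the one after it.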

Of course there are three questions here:
1. What exactly is wrong in the log?
2. Why did it happen?
3. Is there a way to make log parsing skip the bad event and go on?

Chances are that the malformed entry (assuming that this is the case)
is an irrelevant event for our needs. We parse the log to create
a summary of the DAG combined with some job details every few minutes
to avoid running too many condor_q queries.
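
To make concrete what kind of summary is meant, here is a rough sketch that tracks the latest event code per job straight from the raw text and simply counts blocks it cannot interpret. This is only an illustration, not what CRAB runs and not a replacement for JobEventLog. It assumes the default text format, where each event ends with a "..." line and starts with a header like "007 (385.000.000) 2024-02-23 11:04:47 ...". The function name last_event_per_job is hypothetical.

```python
import re

# Header line of an event: 3-digit event code, then (cluster.proc.subproc).
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")


def last_event_per_job(lines):
    """Return ({(cluster, proc): last_event_code}, number_of_skipped_blocks)."""
    last = {}
    skipped = 0
    at_block_start = True
    for line in lines:
        if line.rstrip("\r\n") == "...":
            # End of an event; the next line should be a header.
            at_block_start = True
            continue
        if at_block_start:
            at_block_start = False
            m = HEADER.match(line)
            if m:
                code, cluster, proc = (int(m.group(i)) for i in (1, 2, 3))
                last[(cluster, proc)] = code
            else:
                # Unrecognized header: skip this block and carry on.
                skipped += 1
    return last, skipped
```

It would be called as last_event_per_job(open('job_log', errors='replace')). A non-zero skipped count points at blocks worth inspecting by hand.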


Thanks for your help, please find below how to reproduce the problem.

Stefano


The log itself is a bit large (204K lines, 7.8MB); you can download it from
https://belforte.web.cern.ch/belforte/misc/job_log

Here's the script:

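import htcondor

print(f"HTCondor version {htcondor.version()}")
jel = htcondor.JobEventLog('job_log')
count = 0
event = None  # stays None if the very first read fails
try:
    # stop_after=0: read what is already in the file, never wait for new events
    for event in jel.events(0):
        count += 1
        #print(count)
except Exception as e:
    print(f"got exception {e}")
    print(f"after event n. {count}")
    # 'event' is the last event that parsed successfully, not the failing one
    print(f"last event:\n{event}")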


And this is the output:
bash-4.2$ python3 tc.py
HTCondor version $CondorVersion: 23.0.2 2023-11-20 BuildID: 690948 PackageID: 23.0.2-1 $
got exception ULOG_RD_ERROR
after event n. 10631
last event:
007 (385.000.000) 2024-02-23 11:04:47 Shadow exception!
Error from slot1_2@glidein_112_406363936@b9g20p5446.cern.ch: The job wrapper failed to execute the job: Wrapper script /pool/condor/dir_3970574/glide_73IqZw/condor_job_wrapper.sh failed (1): ERROR If you get this error when you did not specify required OS, your VO does not support any valid default Singularity image
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job

bash-4.2$