
Re: [HTCondor-users] Failed to open '.update.ad'



as a side note, i have this same error plastered through my logs, for
both successful and failed jobs.  we don't use file transfer.

lack of time means i haven't tracked it down to anything, but i
suspect this is something more fundamental in the condor config (which
i've likely done too) than something specific to the job he's running.
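
for anyone who wants to check the same thing on their own execute node,
here is a minimal python sketch (not an htcondor tool, just an assumption
about the line format based on the StarterLog excerpt quoted below). it
groups lines by the starter pid and records the job exit status, how often
the '.update.ad' message shows up, and any files the starter failed to
stat. the function name summarize and the dict keys are made up for
illustration, and pids can be reused over time, so treat the grouping as
approximate.

```python
import re
from collections import defaultdict

# Prefix format assumed from the quoted excerpt: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d \(pid:(\d+)\) (.*)$")
EXIT_RE = re.compile(r"Process exited, pid=\d+, status=(\d+)")
STAT_RE = re.compile(r"Failed to stat file '([^']+)'")
UPDATE_AD = "Failed to open '.update.ad'"


def summarize(lines):
    """Group StarterLog lines by starter pid and collect a few facts per run."""
    runs = defaultdict(
        lambda: {"exit_status": None, "update_ad_errors": 0, "missing_files": []}
    )
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip continuation lines or anything not in the assumed format
        pid, msg = int(m.group(1)), m.group(2)
        run = runs[pid]
        exited = EXIT_RE.search(msg)
        if exited:
            run["exit_status"] = int(exited.group(1))
        if UPDATE_AD in msg:
            run["update_ad_errors"] += 1
        missing = STAT_RE.search(msg)
        if missing:
            run["missing_files"].append(missing.group(1))
    return dict(runs)
```

to use it, open the log (for example /var/log/condor/StarterLog.slot1, the
path from the original post) and pass the file object to summarize(). if
the '.update.ad' count is non-zero for runs with exit status 0 as well,
that would support the idea that the message is unrelated to the job
failure.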



On Tue, Dec 8, 2020 at 9:52 AM Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
>
> Hello,
>
> I don't think this is related to the .update.ad file. Here are a
> couple things that look suspicious. First, the following two lines:
>
> 12/07/20 10:36:07 (pid:1802827) Create_Process succeeded, pid=1802841
> 12/07/20 10:36:09 (pid:1802827) Process exited, pid=1802841, status=1
>
> These indicate that the worker did in fact run your job, but the
> executable exited two seconds later with error status 1. If you
> haven't already, try setting the error and output files for this job
> and look there for any information.
>
> Also, the following two lines seem to indicate you're trying to
> transfer a file that doesn't exist:
>
> 12/07/20 10:36:09 (pid:1802827) ReliSock::put_file_with_permissions():
> Failed to stat file
> '/var/lib/condor/execute/dir_1802827/run_mtdna_mito-13-JX-B_L4_1.log':
> No such file or directory (errno: 2, si_error: 1)
> 12/07/20 10:36:09 (pid:1802827) DoUpload: (Condor error code 13,
> subcode 2) STARTER at 172.17.23.227 failed to send file(s) to
> <172.17.23.141:9618>: error reading from
> /var/lib/condor/execute/dir_1802827/run_mtdna_mito-13-JX-B_L4_1.log:
> (errno 2) No such file or directory;
>
> Is that .log file something you were expecting to be generated by the job?
>
> Mark
>
> On Sun, Dec 6, 2020 at 9:06 PM åä <kan.wu@xxxxxxxxxxxxx> wrote:
> >
> > the condor worker failed to run:
> >
> > the log content in /var/log/condor/StarterLog.slot1 is below:
> >
> >
> > 12/07/20 10:36:07 (pid:1802827) Output file: /var/lib/condor/execute/dir_1802827/_condor_stdout
> > 12/07/20 10:36:07 (pid:1802827) Error file: /var/lib/condor/execute/dir_1802827/_condor_stderr
> > 12/07/20 10:36:07 (pid:1802827) Renice expr "0" evaluated to 0
> > 12/07/20 10:36:07 (pid:1802827) Running job as user gtx
> > 12/07/20 10:36:07 (pid:1802827) About to exec /var/lib/condor/execute/dir_1802827/condor_exec.exe
> > 12/07/20 10:36:07 (pid:1802827) Create_Process succeeded, pid=1802841
> > 12/07/20 10:36:09 (pid:1802827) Process exited, pid=1802841, status=1
> > 12/07/20 10:36:09 (pid:1802827) Failed to open '.update.ad' to read update ad: No such file or directory (2).
> > 12/07/20 10:36:09 (pid:1802827) ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_1802827/run_mtdna_mito-13-JX-B_L4_1.log': No such file or directory (errno: 2, si_error: 1)
> > 12/07/20 10:36:09 (pid:1802827) DoUpload: (Condor error code 13, subcode 2) STARTER at 172.17.23.227 failed to send file(s) to <172.17.23.141:9618>: error reading from /var/lib/condor/execute/dir_1802827/run_mtdna_mito-13-JX-B_L4_1.log: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <172.17.23.227:32543>
> > 12/07/20 10:36:09 (pid:1802827) JICShadow::notifyJobTermination(): Sending mock terminate event.
> > 12/07/20 10:36:09 (pid:1802827) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
> > 12/07/20 10:36:09 (pid:1802827) Returning from CStarter::JobReaper()
> > 12/07/20 10:36:09 (pid:1802827) Got SIGQUIT.  Performing fast shutdown.
> > 12/07/20 10:36:09 (pid:1802827) ShutdownFast all jobs.
> > 12/07/20 10:36:09 (pid:1802827) Failed to open '.update.ad' to read update ad: No such file or directory (2).
> > 12/07/20 10:36:09 (pid:1802827) condor_read(): Socket closed abnormally when trying to read 21 bytes from <172.17.23.141:58247>, errno=104 Connection reset by peer
> > 12/07/20 10:36:09 (pid:1802827) Lost connection to shadow, waiting 2400 secs for reconnect
> > 12/07/20 10:36:09 (pid:1802827) Failed to open '.update.ad' to read update ad: No such file or directory (2).
> > 12/07/20 10:36:09 (pid:1802827) Failed to send job exit status to shadow
> > 12/07/20 10:36:09 (pid:1802827) All jobs have exited... starter exiting
> > 12/07/20 10:36:09 (pid:1802827) **** condor_starter (condor_STARTER) pid 1802827 EXITING WITH STATUS 0
> >
> > What might the problem be?
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison
>