
Re: [HTCondor-users] Retaining executable for debugging



On Tue, Jan 2, 2018 at 12:47 AM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> On Sat, Dec 30, 2017 at 2:52 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> On Sat, Dec 30, 2017 at 2:47 PM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
>>> On 2017-12-29 19:42, Larry Martell wrote:
>>>> ...Any other suggestions on how to debug this?
>>>
>>> Make your script pprint sys.path to stderr and make sure you're saving
>>> condor's error file?
>>
>> Thanks for the pointers. I did find in the StarterLog.slot1 log the
>> full command line, and also this 'Running job as user nobody' -
>> perhaps that is causing a permission issue? I googled that and found
>> this thread: https://www-auth.cs.wisc.edu/lists/htcondor-users/2014-March/msg00013.shtml
>> - I want to try that, but we are having an NFS issue now and our sys
>> admin is not available to fix, so I am stuck for a while.
>
> So I did add a print to stderr and it does not appear in the err file.
> This makes me feel that the script that condor is running is not the
> version I think it is running.
>
> I am submitting the job from machine A using the python API to the
> condor host and that is in turn running the script on an execute
> host. The python script being run is referenced from an NFS-mounted
> dir that is the same on all 3 hosts (I checked and it is).
>
> In the error file I see it's running this (where the number after dir_
> is different each time):
>
> /var/lib/condor/execute/dir_169123/condor_exec.exe
>
> Is there a log that shows what is copied from where to that file? Is
> there a way to keep that dir around after the job terminates?

By copying quickly I was able to grab the execute dir before it was
removed, and I eventually figured out my import issue. But my print to
stderr still did not show up in the err file.
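
For anyone hitting the same thing, here is a minimal sketch of what I
would put at the top of the job script. It is not an HTCondor feature.
The helper names (dump_job_context, snapshot_scratch_dir) and the
destination directory are made up for illustration. The only HTCondor
piece it relies on is the _CONDOR_SCRATCH_DIR environment variable,
which I am assuming the starter sets for the job. The first helper
writes the script path, cwd, uid and sys.path to stderr and flushes
explicitly; the second copies the scratch dir somewhere that outlives
the job, instead of racing the cleanup by hand.

```python
import os
import shutil
import sys
import tempfile


def dump_job_context(stream=sys.stderr):
    """Write which script is running, as whom, and with what sys.path."""
    lines = [
        "script: %s" % os.path.abspath(sys.argv[0]),
        "cwd: %s" % os.getcwd(),
        "uid: %d" % os.getuid(),
        # Assumption: HTCondor exports this variable in the job environment.
        "scratch: %s" % os.environ.get("_CONDOR_SCRATCH_DIR", "(not set)"),
    ]
    lines.extend("sys.path: %s" % p for p in sys.path)
    stream.write("\n".join(lines) + "\n")
    # Flush explicitly so the text is not lost if the job dies early.
    stream.flush()
    return lines


def snapshot_scratch_dir(dest_root):
    """Copy the scratch dir under dest_root; return the copy's path or None.

    dest_root is a hypothetical location. It must be outside the scratch
    dir and writable by the user the job runs as (possibly 'nobody').
    """
    src = os.environ.get("_CONDOR_SCRATCH_DIR")
    if not src or not os.path.isdir(src):
        return None
    # A fresh parent dir per call, so repeated runs never collide.
    dest = tempfile.mkdtemp(prefix="condor_scratch_", dir=dest_root)
    target = os.path.join(dest, os.path.basename(src.rstrip(os.sep)))
    shutil.copytree(src, target, symlinks=True)
    return target
```

Calling dump_job_context() first thing in the script, before any of the
imports that might fail, should at least show whether the copy condor
runs is the one you edited. If nothing shows up even then, the next
thing I would check is where the error file named in the submit
description is actually being written.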