
Re: [HTCondor-users] job does not run



I should add that ROCm currently has this issue: OpenCL programs run
only as root, because the OpenCL ICD driver is recognized only when
logged in as root.
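
For anyone who wants to compare what root and a non-root account can see,
here is a rough sketch. It assumes the standard Linux ICD registry
directory /etc/OpenCL/vendors, which is where the ICD loader looks for
vendor entries; list_icd_entries is a hypothetical helper name, not part of
HTCondor or ROCm. Error -1001 is CL_PLATFORM_NOT_FOUND_KHR, so an empty or
unreadable registry would be consistent with "0 platforms".

```python
import os
from pathlib import Path

# Assumption: the standard Linux ICD registry directory.
DEFAULT_ICD_DIR = "/etc/OpenCL/vendors"


def list_icd_entries(icd_dir=DEFAULT_ICD_DIR):
    """Return (icd_file, library_name, readable) for each *.icd file found.

    An empty list means there is nothing for the loader to enumerate,
    which is consistent with error -1001 and 0 platforms.
    """
    directory = Path(icd_dir)
    if not directory.is_dir():
        return []
    entries = []
    for icd_file in sorted(directory.glob("*.icd")):
        readable = os.access(icd_file, os.R_OK)
        library = ""
        if readable:
            # Each .icd file holds the name or path of a vendor library.
            library = icd_file.read_text().strip()
        entries.append((icd_file.name, library, readable))
    return entries


if __name__ == "__main__":
    # Run once as root and once as the account that runs the job,
    # then compare the two listings.
    print(f"uid={os.getuid()}")
    for name, library, readable in list_icd_entries():
        print(f"{name}: library={library!r} readable={readable}")
```

If the two listings differ, file permissions on the registry or on the
vendor library itself would be the first thing to check.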


On Mon, 2019-06-17 at 20:08 +0200, Valerio Bellizzomi wrote:
> Hi,
> apart from the other issues, I ran a test on the execute node. I think
> the job remains idle because of an error. I ran
> condor_startd by hand on machine compute02 and got this error:
> 
> ocl.getPlatformIDs returned error=-1001 and 0 platforms
> 
> That means the OpenCL ICD is not found, but this is anomalous: I can
> run the job locally on the execute node, and OpenCL is installed correctly.
> The only reason I can think of is that the process does not have
> privileges to access the OpenCL platform, but I am running condor_startd
> as root.
> 
> 
> 
> -------- Forwarded Message --------
> From: Valerio Bellizzomi <valerio@xxxxxxxxxx>
> Reply-to: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] job does not run
> Date: Mon, 17 Jun 2019 17:31:46 +0200
> 
> On Mon, 2019-06-17 at 11:52 +0000, Bockelman, Brian wrote:
> > 
> > > On Jun 17, 2019, at 2:28 AM, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
> > > 
> > > Hi,
> > > 
> > > On Sun, 2019-06-16 at 16:10:00 +0200, Valerio Bellizzomi wrote:
> > >> Greetings,
> > >> after submitting a job, the job is in idle state. Diagnostics with
> > >> condor_q -analyze show "no match found".
> > >> 
> > >> In the submit file I have:
> > >> 
> > >> RANK = (Machine == "compute02")
> > > 
> > > Please verify (using e.g. condor_status -l compute02) that the machine
> > > name is correct (is there no domain part?)
> > > 
> > >> 1) is this sufficient to select the target machine ?
> > > 
> > > With the correct string, IMHO yes
> > 
> > Do note that you used "RANK" and not "REQUIREMENTS" -- the job will show a preference for "compute02" if there are multiple available compute hosts.  However, it will still be allowed to run on any host.
> > 
> > It might be useful to post the output of "condor_q -better-analyze".  Another thing that could be going wrong is that the Machine attribute is using a FQDN ("compute02.example.com") whereas you are only querying the host ("compute02").
> 
> Hi,
> I have verified that the compute02 node has a problem: the ps
> command shows condor_procd running but not condor_startd. Master and
> Startd are listed in the configuration, but condor_startd does not start
> at first.
> 
> A second problem, which I found and corrected: Schedd was not running on
> the central manager machine. I was using the DAEMON_LIST generated by the
> condor_configure --type=manager command, and schedd was not in the list.
> 
> > Brian
> > 
> > > 
> > >> 2) where is the htcondor log file for the job ?
> > > 
> > > Did you specify a path in your submit file?
> > > 
> > > - S
> > > _______________________________________________
> > > HTCondor-users mailing list
> > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > > subject: Unsubscribe
> > > You can also unsubscribe by visiting
> > > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > > 
> > > The archives can be found at:
> > > https://lists.cs.wisc.edu/archive/htcondor-users/
> > 
> > 
> 