
Re: [HTCondor-users] job does not run



Hi,
to try to get rid of the -1001 error, I have set up the following on the execute node:

SLOT_TYPE_1 = cpus=1, gpus=1
NUM_SLOTS_TYPE_1 = 2

and

SLOT1_USER = root
SLOT2_USER = root

but the error is still present.
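A slot layout like the one above can only hand out GPUs that the startd has actually discovered. The following is a minimal sketch of the execute-node configuration. It assumes HTCondor 8.6 or later, where the "use feature : GPUs" metaknob is available. As far as I know, that metaknob makes the startd run condor_gpu_discovery to inventory the GPUs. The slot values are the ones from this message. If GPU discovery itself fails with -1001, no slot definition will fix that, because the slots cannot contain GPUs that discovery did not find.

```
# Sketch of an execute-node config (assumes HTCondor 8.6 or later).
# Enable GPU discovery so the startd advertises GPU resources.
use feature : GPUs

# Two static slots, each with one CPU and one GPU (values from this thread).
SLOT_TYPE_1 = cpus=1, gpus=1
NUM_SLOTS_TYPE_1 = 2
```

After changing the slot layout, restart the startd on compute02. If discovery succeeded, condor_status -l compute02 should then show GPU-related attributes.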


On Mon, 2019-06-17 at 20:08 +0200, Valerio Bellizzomi wrote:
> Hi,
> apart from the other issues, I ran a test on the execute node, and I think
> the job remains idle because of an error. I ran condor_startd by hand on
> machine compute02 and got this error:
> 
> ocl.getPlatformIDs returned error=-1001 and 0 platforms
> 
> Error -1001 (CL_PLATFORM_NOT_FOUND_KHR) means the OpenCL ICD loader found
> no platform. This is anomalous: I can run the job locally on the execute
> node, so OpenCL is installed correctly. The only reason I can see for
> this is that the process lacks the privileges to access the OpenCL
> platform, but I am running condor_startd as root.
> 
> 
> 
> -------- Forwarded Message --------
> From: Valerio Bellizzomi <valerio@xxxxxxxxxx>
> Reply-to: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] job does not run
> Date: Mon, 17 Jun 2019 17:31:46 +0200
> 
> On Mon, 2019-06-17 at 11:52 +0000, Bockelman, Brian wrote:
> > 
> > > On Jun 17, 2019, at 2:28 AM, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
> > > 
> > > Hi,
> > > 
> > > On Sun, 2019-06-16 at 16:10:00 +0200, Valerio Bellizzomi wrote:
> > >> Greetings,
> > >> after I submit a job, it stays in the idle state. Diagnosing it with
> > >> condor_q -analyze shows "no match found".
> > >> 
> > >> In the submit file I have:
> > >> 
> > >> RANK = (Machine == "compute02")
> > > 
> > > Please verify (using e.g. condor_status -l compute02) that the machine
> > > name is correct (does it include a domain part?)
> > > 
> > >> 1) Is this sufficient to select the target machine?
> > > 
> > > With the correct string, IMHO yes
> > 
> > Do note that you used "RANK" and not "REQUIREMENTS" -- the job will show a preference for "compute02" if there are multiple available compute hosts.  However, it will still be allowed to run on any host.
> > 
> > It might be useful to post the output of "condor_q -better-analyze".  Another thing that could be going wrong is that the Machine attribute contains an FQDN ("compute02.example.com"), whereas your expression only compares against the short host name ("compute02").
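To make compute02 a hard constraint rather than a preference, the machine test belongs in REQUIREMENTS. The following is a minimal sketch of the submit description. The executable name is hypothetical, and "compute02.example.com" is the placeholder FQDN from Brian's message. Substitute the exact Machine value that condor_status reports.

```
# Hypothetical submit description: executable name and host are placeholders.
executable   = my_opencl_job
# Hard constraint: the job may only match compute02.
requirements = (Machine == "compute02.example.com")
queue
```

With this REQUIREMENTS expression, a RANK on the same test becomes redundant, since only one machine can match.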
> 
> Hi,
> I have verified that the compute02 node has a problem: the ps command
> shows condor_procd running, but not condor_startd. Master and Startd are
> listed in the configuration, but condor_startd does not start at
> first.
> 
> A second problem, which I found and corrected: the schedd was not
> running on the central manager machine. I was using the DAEMON_LIST
> generated by the condor_configure --type=manager command, and the schedd
> was not in the list.
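Because condor_configure --type=manager did not include the schedd, a central manager that should also accept submissions needs it added by hand. The following sketch shows the two DAEMON_LIST settings involved. It assumes that one machine acts as both central manager and submit host and that compute02 only executes jobs. Each line belongs in that machine's own configuration.

```
# On the central manager, which is also the submit host:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

# On the execute node (compute02), in its own configuration file:
DAEMON_LIST = MASTER, STARTD
```

After the change, restarting HTCondor on that machine is the simplest way to make sure the master starts the newly listed daemon.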
> 
> > Brian
> > 
> > > 
> > >> 2) Where is the HTCondor log file for the job?
> > > 
> > > Did you specify a path in your submit file?
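HTCondor writes a job event log only when the submit description names one with the log command. The output and error commands capture the job's standard output and standard error. Below is a minimal sketch with hypothetical file names. Relative paths are resolved against the directory condor_submit is run from, unless initialdir is set.

```
# Hypothetical file names; relative to the directory condor_submit runs in.
executable = my_opencl_job
log        = my_opencl_job.log
output     = my_opencl_job.out
error      = my_opencl_job.err
queue
```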
> > > 
> > > - S
> > > _______________________________________________
> > > HTCondor-users mailing list
> > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > > subject: Unsubscribe
> > > You can also unsubscribe by visiting
> > > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > > 
> > > The archives can be found at:
> > > https://lists.cs.wisc.edu/archive/htcondor-users/