
Re: [HTCondor-users] job does not run



Hi,
apart from the other issues, I ran a test on the execute node, and I
think the job remains idle because of an error. I ran condor_startd by
hand on machine compute02 and got this error:

ocl.getPlatformIDs returned error=-1001 and 0 platforms

That means the OpenCL ICD is not found, which is odd, because I can run
the job locally on the execute node and OpenCL is installed correctly.
The only explanation I can think of is that the process lacks the
privileges to access the OpenCL platform, but I am running condor_startd
as root.



-------- Forwarded Message --------
From: Valerio Bellizzomi <valerio@xxxxxxxxxx>
Reply-to: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] job does not run
Date: Mon, 17 Jun 2019 17:31:46 +0200

On Mon, 2019-06-17 at 11:52 +0000, Bockelman, Brian wrote:
> 
> > On Jun 17, 2019, at 2:28 AM, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
> > 
> > Hi,
> > 
> > On Sun, 2019-06-16 at 16:10:00 +0200, Valerio Bellizzomi wrote:
> >> Greetings,
> >> after submitting a job, the job is in idle state. Diagnostics with
> >> condor_q -analyze show "no match found".
> >> 
> >> In the submit file I have:
> >> 
> >> RANK = (Machine == "compute02")
> > 
> > Please verify (using e.g. condor_status -l compute02) that the machine
> > name is correct (is there no domain part?)
> > 
> >> 1) is this sufficient to select the target machine ?
> > 
> > With the correct string, IMHO yes
> 
> Do note that you used "RANK" and not "REQUIREMENTS" -- the job will show a preference for "compute02" if there are multiple available compute hosts.  However, it will still be allowed to run on any host.
> 
> It might be useful to post the output of "condor_q -better-analyze".  Another thing that could be going wrong is that the Machine attribute is using a FQDN ("compute02.example.com") whereas you are only querying the host ("compute02").
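
The distinction between RANK and REQUIREMENTS described above could look
like this in a submit file. This is only a sketch: the executable name is
hypothetical, and "compute02.example.com" is the placeholder FQDN from the
reply, so the real value should be taken from the Machine attribute shown
by condor_status -l.

```
# Sketch of a submit description file; my_opencl_job is a hypothetical name.
executable   = my_opencl_job

# requirements is a hard constraint: the job can only match this machine.
# The FQDN is a placeholder, not a verified host name.
requirements = (Machine == "compute02.example.com")

# rank only expresses a preference among machines that already match.
rank         = (Machine == "compute02.example.com")

queue
```

With only the rank line, the job may still run on any other matching host.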

Hi,
I have verified that the compute02 node has a problem: the ps command
shows condor_procd running but not condor_startd. Master and Startd are
both listed in the configuration, but condor_startd does not start at
first.

The second problem, which I found and corrected: Schedd was not running
on the central manager machine. I was using the DAEMON_LIST generated by
the condor_configure --type=manager command, and SCHEDD was not in that
list.
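
For illustration, the daemon lists involved might look roughly like the
following. The role layout is an assumption (a central manager that also
accepts submissions, and compute02 as a pure execute node), and each
setting belongs in the local configuration of its own machine.

```
# On the central manager, assuming it should also accept job submissions:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

# On an execute node such as compute02:
DAEMON_LIST = MASTER, STARTD
```

Running condor_config_val DAEMON_LIST on each machine shows the value
that is actually in effect.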






> Brian
> 
> > 
> >> 2) where is the htcondor log file for the job ?
> > 
> > Did you specify a path in your submit file?
> > 
> > - S
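
On question 2, the job event log is only written if the submit file asks
for it. A minimal sketch, with hypothetical file names that are resolved
relative to the directory the job is submitted from:

```
# Hypothetical file names.
log    = job.log    # job event log written by HTCondor
output = job.out    # the job's standard output
error  = job.err    # the job's standard error
```
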
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 


