
Re: [HTCondor-users] Issues with Condor on Rocks cluster



I think I figured it out just a few minutes ago. I joined the compute nodes to the domain and now I don't get the error. It doesn't work the way I expected, though: the job runs as the submitting user and not as the condor service account.

To answer your question, I was getting the correct response from condor_status.

I used su to run the job as the condor account and it completed, which led me to suspect an authentication issue.

Thank you for your reply.

Somehow I have made all of the cores on my head node available for computing. Any idea how to limit that? I have 32 cores on my head node, but I want to reserve half of those for other overhead processes.
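
One possible way to cap this, offered as a sketch rather than a tested recipe: HTCondor's NUM_CPUS configuration macro limits how many cores the condor_startd advertises as slots. Where the setting goes is an assumption (use whichever local configuration file the head node actually reads; the Rocks condor roll may manage or regenerate that file), and 16 is simply half of the 32 cores mentioned above.

```
# Assumed to live in the head node's local HTCondor configuration file.
# Advertise only 16 of the 32 cores as slots.
NUM_CPUS = 16
```

NUM_CPUS is read when the condor_startd starts, so the startd has to be restarted; a reconfig alone is not enough. Afterwards, condor_status should list 16 slots for the head node.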

On Jun 19, 2014 6:38 PM, "Philip Papadopoulos" <philip.papadopoulos@xxxxxxxxx> wrote:
When you run condor_status from the frontend, do all of your nodes report in as expected.

e.g.
 x86_64 yes
[root@rocks-76 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@vm-container LINUX      X86_64 Unclaimed Idle     0.000  2010  64+02:37:53
slot2@vm-container LINUX      X86_64 Unclaimed Idle     0.000  2010  64+02:38:15
slot3@vm-container LINUX      X86_64 Unclaimed Idle     0.000  2010  64+02:38:16
slot4@vm-container LINUX      X86_64 Unclaimed Idle     0.000  2010  64+02:38:17


?


On Thu, Jun 19, 2014 at 12:10 PM, Thomas Erickson <twerickson@xxxxxxxxx> wrote:
Is there any documentation available that is specific to Rocks Cluster for getting Condor working?

I am running RHEL 6.3 with Rocks 6.1 with the condor roll (and other rolls). Condor is version 7.8.5.

When I submit a job it just sits there idle. One error I get is "request has not yet been considered by the matchmaker".

All of the documentation I find covers a regular install, so it is not Rocks-specific. Some older Rocks documentation had some info, but a lot of it was not correct for the version I'm running.

It says the job submission has been accepted, but nothing happens, and a little while later it tries to submit it again. It's as if the head node is submitting the job and the compute node is not receiving it.

Also, the users SSH in from Win7 computers using PuTTY. The users are AD (Active Directory) users and the head node is in AD, but the compute nodes are not. I was under the impression that the "condor" user would handle all of the work on the backend.

I have 2 separate networks, frontend and backend. The frontend has connections to the workstations, the DC, and the head node. The backend is the head node and compute nodes only. The storage is attached to the head node via iSCSI and an SMB share. I have verified that the AD user that is submitting the job can create files in the directory where the job is submitted. I am trying to run the hello.sub/hello.sh test job.

Also, my cluster is in a lab not connected to the internet and I can't post log files. But I will get any information that I can and post it here for anyone willing to assist.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
