
Re: [HTCondor-users] Matching Problem With Custom ClassAdd



Hi Todd,

Thank you for the suggestion and information. Oddly enough, the reverse analysis tells me that the jobs should match the slots:

$ condor_q -analyze -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2544.0


-- Submitter: gcecs.heprc.uvic.ca : <206.12.154.47:8081> : gcecs.heprc.uvic.ca

-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

  ( START ) && ( IsValidCheckpointPlatform ) &&
  ( WithinResourceLimits )

  START is ( Owner == "mleblanc" )

This slot defines the following attributes:

  CheckpointPlatform = "LINUX X86_64 3.10.44-74.cernvm.x86_64 normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
  Cpus = 8
  Disk = 153054548
  Memory = 30161
  IsValidCheckpointPlatform = true
  WithinResourceLimits = false

Job 2544.0 has the following attributes:

  TARGET.Owner = "mleblanc"
  TARGET.JobUniverse = 5
  TARGET.NumCkpts = 0
  TARGET.RequestCpus = 1
  TARGET.RequestDisk = 10000000
  TARGET.RequestMemory = 29500

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  Owner == "mleblanc"
[1]           1  IsValidCheckpointPlatform
[3]           1  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
  1 (100.00 %) match both slot and job requirements.
  1 match the requirements of this slot.
  1 have job requirements that match this slot.
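
As a sanity check on the numbers above, here is a rough Python sketch of the resource comparison. It assumes WithinResourceLimits boils down to "each Request* value fits within the slot's capacity"; the real expression in HTCondor is more involved, and the function name is made up for illustration.

```python
# Simplified stand-in for the slot's WithinResourceLimits check.
# Assumption: each Request* attribute must not exceed the matching
# slot attribute. This is not HTCondor's actual expression.

def within_resource_limits(slot, job):
    """Return True if every requested resource fits in the slot."""
    return (
        job["RequestCpus"] <= slot["Cpus"]
        and job["RequestMemory"] <= slot["Memory"]
        and job["RequestDisk"] <= slot["Disk"]
    )

# Values copied from the analysis output for job 2544.0 and slot1.
slot = {"Cpus": 8, "Memory": 30161, "Disk": 153054548}
job = {"RequestCpus": 1, "RequestMemory": 29500, "RequestDisk": 10000000}

print(within_resource_limits(slot, job))  # True
```

Under that assumption the request fits on all three resources, which agrees with step [3] matching 1 cluster.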


But this did get me to check the worker node out more closely, and there is an odd message in the StartLog:

10/22/14 17:32:10 CCBListener: failed to receive message from CCB server [central.manager]
10/22/14 17:32:10 CCBListener: connection to CCB server [central.manager] failed; will try to reconnect in 60 seconds.
10/22/14 17:33:10 CCBListener: registered with CCB server gcecs.heprc.uvic.ca as ccbid [central.manager]:9618#1947
10/22/14 18:23:14 WARNING: forward resolution of 404 doesn't match [worker ip]!

That led me to investigate the network on the worker, and it turns out some of my contextualization scripts mangled the worker's network configuration. I'll get back to you on whether fixing those scripts resolves the problems I'm seeing.
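
For anyone wanting to check the same thing on their own workers, here is a minimal Python sketch of a forward/reverse consistency test. It reflects my understanding of what the warning means (the name obtained from a reverse lookup should resolve back to the same address); it is not how HTCondor itself performs the check. The function name is hypothetical, and the resolvers are passed in as parameters so the logic can be exercised without a network.

```python
import socket

def forward_matches(ip, reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex):
    """Reverse-resolve ip to a name, forward-resolve that name, and
    report whether the original ip is among the returned addresses.
    Note: gethostbyname_ex only handles IPv4."""
    try:
        hostname, _aliases, _addrs = reverse(ip)
        _name, _aliases, addresses = forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses:
        # a failed lookup in either direction counts as a mismatch.
        return False
    return ip in addresses

# Stub resolvers standing in for a worker with a broken hostname setup.
# The name and addresses are placeholders, not the real worker's.
def fake_reverse(ip):
    return ("worker.example.org", [], [ip])

def fake_forward(name):
    return (name, [], ["192.0.2.99"])

print(forward_matches("192.0.2.10", fake_reverse, fake_forward))  # False
```

On a real node, calling forward_matches with only the worker's address uses the system resolver.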

Cheers,
-Frank


On 22 October 2014 16:09, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 10/22/2014 5:28 PM, Frank Berghaus wrote:

Suggestions:

    Condition                     Machines Matched  Suggestion
    ---------                     ----------------  ----------
1   VMType is "atlas-worker"      0                 MODIFY TO "atlas-worker"
2   Target.Arch == "x86_64"       4
3   ( TARGET.OpSys == "LINUX" )   4
4   ( TARGET.Disk >= 10000000 )   4
5   ( TARGET.Memory >= 29500 )    4
6   ( TARGET.HasFileTransfer )    4

These two strings ("atlas-worker") look the same, yet the match fails.

I think the suggestion from -better-analyze is a bug, and a red herring.

Most likely the real reason your job is not matching is that some condition in the machine requirements (aka the START expression in your condor_config) is not being met. Recall that for a match to happen, both the job requirements and the machine requirements need to evaluate to True. By default, condor_q -analyze only tries to analyze your job's requirements expression. I suggest you do a "condor_status" and pick an unclaimed slot that you think your job should match with. Let's say it is called 'slot1@xxxxxxx'. And let's say your job is job id 50.0. Try entering the following command:

 condor_q -analyze -reverse -machine slot1@xxxxxxx 50.0

Doing the above may identify a clause in the requirements of the machine that is causing the machine to dislike your job.

Doing a 'condor_q -analyze -reverse -machine xxxx' has solved many matching mysteries for me.
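
To make the two-sided nature of matching concrete, here is a toy Python model. It is only a sketch: real ClassAd expressions are evaluated by HTCondor, not by Python lambdas, and the owner name and START policy below are hypothetical.

```python
# Toy model of symmetric matchmaking: each ad carries a Requirements
# function that is evaluated against the other ad.

def matches(job, slot):
    """A match needs the job's Requirements to accept the slot AND
    the slot's Requirements (its START policy) to accept the job."""
    job_ok = job["Requirements"](job, slot)
    slot_ok = slot["Requirements"](slot, job)
    return job_ok and slot_ok

job = {
    "Owner": "alice",  # hypothetical user
    "RequestMemory": 29500,
    "Requirements": lambda my, target: target["Memory"] >= my["RequestMemory"],
}
slot = {
    "Memory": 30161,
    # Hypothetical START policy: only run jobs owned by alice.
    "Requirements": lambda my, target: target["Owner"] == "alice",
}

print(matches(job, slot))    # True: both sides accept

other = dict(job, Owner="bob")
print(matches(other, slot))  # False: job side passes, slot side rejects
```

The second case is the situation that plain -analyze can miss: the job's own requirements are satisfied, yet the machine's policy turns the job away.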

Hope the above helps,
Todd




--
----------
Frank Berghaus
University of Victoria
Research Associate
Physics & Astronomy
UVic Phone: +1 (250) 721-7741
UVic Office: Elliot 212