Re: [HTCondor-users] How to use condor_chirp?



Hi Andrew,

yes, any actual matching of cgroups and slots/jobs ClassAds is something
I have put off so far ;)

Aggregating ClassAd information and cgroup information on the node and
sending a combined measurement is probably the most convenient way in
the end (we will have to see how well everything scales...)
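
A minimal sketch of that node-side aggregation, in Python purely for
illustration. It assumes the slot name is the trailing slotN[_M]@host part
of the cgroup label (as in the example label quoted below) and that a
mapping from slot name to job attributes has already been collected from
HTCondor by some other means. The names slot_from_cgroup, tag_measurement
and slot_ads are hypothetical, not part of HTCondor or cAdvisor.

```python
import re

# Assumption: the cgroup label ends in the slot name, e.g.
# "condor_pool_condor_slot1_3@host" -> "slot1_3@host".
SLOT_RE = re.compile(r"(slot\d+(?:_\d+)?@\S+)$")


def slot_from_cgroup(cgroup_name):
    """Return the trailing slot name of a cgroup label, or None."""
    match = SLOT_RE.search(cgroup_name)
    return match.group(1) if match else None


def tag_measurement(cgroup_name, stats, slot_ads):
    """Combine cgroup stats with job attributes looked up by slot name.

    slot_ads is a hypothetical dict {slot name: {attribute: value}} that
    the caller has filled from HTCondor; stats is a dict of numbers taken
    from the cgroup monitor. Returns None if the slot has no known job.
    """
    slot = slot_from_cgroup(cgroup_name)
    ad = slot_ads.get(slot) if slot is not None else None
    if ad is None:
        return None
    tags = {
        "slot": slot,
        "GlobalJobId": ad.get("GlobalJobId"),
        "Owner": ad.get("Owner"),
    }
    return {"tags": tags, "fields": dict(stats)}
```

Cgroups that do not belong to a slot, or slots without a known job, yield
None and can simply be skipped by the caller.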

Cheers,
  Thomas

On 2016-03-15 15:50, andrew.lahiff@xxxxxxxxxx wrote:
> Hi Thomas,
> 
> Note that you'll probably find the labels you get in cAdvisor or an alternative won't match directly to HTCondor job ids, e.g. they'll look like:
> 
> condor_pool_condor_slot1_3@xxxxxxxxxxxxxxxxxxxxxxx
> 
> (i.e. how the cgroups are named), which makes the Grafana plots a little hard to use. One way of getting around this could be to have a cron job on each worker node which queries both cAdvisor (using its REST API) and HTCondor, takes all the stats from cAdvisor, and labels them in a more appropriate way, e.g. GlobalJobId, owner, etc. It can then send the appropriately tagged data to InfluxDB, rather than getting cAdvisor to do it directly. I've done this for a different cluster manager and it seems to work well, but I haven't tried it yet with HTCondor. However, there may be better ways of getting the same result that I haven't thought of :-)
> 
> Regards,
> Andrew.
> 
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Thomas Hartmann [thomas.hartmann@xxxxxxx]
> Sent: Tuesday, March 15, 2016 2:06 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] How to use condor_chirp?
> 
> Hi Brian,
> 
> thanks for the warning.
> 
> Our idea is to update a job's ClassAd with information on the cgroups
> context its slot is using.
> In the end, we would like to be able to relate a job to its cgroup
> statistics, i.e., to monitor the local cgroups and send statistics into
> an InfluxDB and, in parallel, accumulate basic job information from Condor
> (frequency, scaling, only accumulation? etc. remain to be seen...).
> 
> It would also be nice to be able to plot basic job statistics
> from InfluxDB with Grafana for at least a subset of jobs (individual
> users; large-scale/institutional users may not be necessary) for a
> range of a few days (with regular pruning of data points/measurements in
> InfluxDB; how well does that scale?).
> 
> One open question for me is whether such Condor job statistics are better
> accumulated and sent from the schedd or (somehow?) from the startds.
> For cgroup statistics I would try to send them directly from the nodes
> (cAdvisor or similar approach).
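
A small illustration of what "sending statistics into InfluxDB" could look
like on the node side: formatting one tagged measurement in the InfluxDB
line protocol. This is only a sketch in Python; the function name
to_line_protocol and the measurement, tag and field names in the usage
note are made up, and actually writing the line to a server is left out.

```python
def _escape_key(text):
    """Escape commas, equals signs and spaces in tag keys/values and field keys."""
    return str(text).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value):
    """Render one field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value,... field=value,... timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"
```

For example, to_line_protocol("cgroup_memory", {"Owner": "alice"},
{"rss_bytes": 1024}, 1458050000000000000) gives
cgroup_memory,Owner=alice rss_bytes=1024i 1458050000000000000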
> 
> Cheers,
>   Thomas
> 
> On 2016-03-15 12:23, Brian Bockelman wrote:
>> Hi Thomas,
>>
>> Two things to note:
>> 1) “condor_chirp set_job_attr” requires +WantIOProxy=true in the ClassAd.  This updates the ClassAd immediately in the schedd - which can be a scalability concern.
>> 2) “condor_chirp set_job_attr_delayed” works by default.  However, it only sends the attribute in the next scheduled ClassAd update (for CPU and memory usage); there’s less scalability concern.  Additionally, attributes set with this command must start with the prefix “Chirp” (case-sensitive).
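
To make the second point concrete, here is a hedged Python sketch of a
helper that a job-side script might use to prepare a delayed update. It
only builds the argument list for condor_chirp set_job_attr_delayed and
checks the "Chirp" prefix described above; the helper name
chirp_delayed_argv and the attribute name in the usage note are
hypothetical, and quoting strings as ClassAd string literals is this
sketch's own choice.

```python
def chirp_delayed_argv(name, value):
    """Build the argv for a delayed job-attribute update via condor_chirp.

    The attribute name must start with "Chirp" (case-sensitive), as
    required for set_job_attr_delayed. The value is rendered as a ClassAd
    literal: booleans as true/false, numbers as-is, strings double-quoted.
    """
    if not name.startswith("Chirp"):
        raise ValueError("delayed attributes must start with 'Chirp': " + name)
    if isinstance(value, bool):  # must be checked before int
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = repr(value)
    else:
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        literal = '"' + escaped + '"'
    return ["condor_chirp", "set_job_attr_delayed", name, literal]
```

Inside a running job, the returned list could be handed to
subprocess.run(), e.g. for chirp_delayed_argv("ChirpCgroupPath",
"/some/cgroup"); nothing is executed by the helper itself.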
>>
>> Can you give some background on what you’re trying to accomplish?
>>
>> Brian
>>
>>> On Mar 14, 2016, at 12:44 PM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>>>
>>> Hi all,
>>>
>>> I would like to inject some system information into jobs' ClassAds.
>>> As I understand condor_chirp, I cannot inject/manipulate ClassAds on a
>>> worker for a running job, but only during submission. (I assume, that
>>> ~/.job.ad is where a job keeps its ClassAds [1] -- but can I inject it
>>> from outside the job 'properly'?)
>>>
>>> We currently get our grid jobs via an ARC-CE, so I suppose the best place
>>> to enable the communication token would be there (where exactly?), by appending
>>> +WantIOProxy = TRUE
>>> to the job submission scripts?
>>>
>>> I am not sure where to actually call condor_chirp on the worker.
>>> My best guess so far is to find the template for the job wrappers, i.e.,
>>> /var/lib/condor/execute/dir_*/condor_exec.exe
>>> and chirp from it, probably starting a separate thread that updates the
>>> changing ClassAd attributes.
>>> Would that be reasonable, or is there a better way?
>>> Where would I find the wrapper for condor_exec.exe?
>>>
>>> Cheers and thanks,
>>>  Thomas
>>>
>>>
>>>
>>> [1]
>>>> cat /proc/`ps axf | grep "/bin/bash -l /var/lib/condor/execute/dir" |
>>> tail -n1 | cut -d " " -f 2`/environ  | tr '\0' '\n' | grep "_CONDOR_JOB_AD"
>>> _CONDOR_JOB_AD=/var/lib/condor/execute/dir_38259/.job.ad
>>>
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
