
[HTCondor-users] curl file transfer problem



Hi everyone,

On one 64-core worker machine (CentOS 6, condor-8.6.13-1.el6.x86_64) I have:

> # ps -AF | grep condor
> condor      3920       1  0 14180  3744  18 Mar14 ?        00:00:31 condor_master -pidfile /var/run/condor/condor_master.pid
> root        4096    3920  0  8202 10592   9 Mar14 ?        03:36:19 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 501
> condor      4097    3920  0 14149  3792   9 Mar14 ?        00:01:06 condor_shared_port -f
> condor      4105    3920  0 16260 12624   8 Mar14 ?        05:40:03 condor_startd -f
> condor      4130    3920  0 19245  5568  14 Mar14 ?        00:01:06 condor_schedd -f
> condor    651153    4105  0 16517  4188  15 Apr03 ?        00:00:07 condor_starter -f -a slot1_2 exocet.bmrb.wisc.edu
> bbee      651156  651153  0 16515  1840  32 Apr03 ?        00:00:00 condor_starter -f -a slot1_2 exocet.bmrb.wisc.edu
> bbee      651157  651156  0 21411  2280   0 Apr03 ?        00:00:33 /usr/libexec/condor/curl_plugin http://proxy.chtc.wisc.edu/SQUID/bmrb/3.8/combined.tgz.enc /var/lib/condor/execute/dir_651153/combined.tgz.enc
...
> condor    661937    4105  0 16517  4188   0 Apr04 ?        00:00:06 condor_starter -f -a slot1_64 exocet.bmrb.wisc.edu
> bbee      661940  661937  0 16515  1844  16 Apr04 ?        00:00:00 condor_starter -f -a slot1_64 exocet.bmrb.wisc.edu
> bbee      661941  661940  0 21411  2276  42 Apr04 ?        00:00:36 /usr/libexec/condor/curl_plugin http://proxy.chtc.wisc.edu/SQUID/bmrb/3.8/combined.tgz.enc /var/lib/condor/execute/dir_661937/combined.tgz.enc

I.e. all 64 slots have been hung for a couple of weeks, each waiting
for curl_plugin to fetch a file.

First question: is there an easy way to see what state condor thinks a
job is in, given its PID? (In this case looking it up by execute host
would do, since they're all stuck.)
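
For what it's worth, getting from a PID to a slot name can be done by hand
from the ps listing above: walk up the parent chain until you hit a
condor_starter and read its "-a" argument. A rough sketch, assuming the
11-column `ps -AF` layout shown above; parse_ps and slot_for_pid are names
made up for illustration, not anything condor ships. It only yields the
slot name and says nothing about what state condor thinks the job is in,
which is what I'm actually after.

```python
import re

# Rows copied from the ps -AF listing above (UID PID PPID C SZ RSS PSR STIME TTY TIME CMD).
PS_SAMPLE = """\
condor    651153    4105  0 16517  4188  15 Apr03 ?        00:00:07 condor_starter -f -a slot1_2 exocet.bmrb.wisc.edu
bbee      651156  651153  0 16515  1840  32 Apr03 ?        00:00:00 condor_starter -f -a slot1_2 exocet.bmrb.wisc.edu
bbee      651157  651156  0 21411  2280   0 Apr03 ?        00:00:33 /usr/libexec/condor/curl_plugin http://proxy.chtc.wisc.edu/SQUID/bmrb/3.8/combined.tgz.enc /var/lib/condor/execute/dir_651153/combined.tgz.enc
"""


def parse_ps(text):
    """Return {pid: (ppid, cmd)} parsed from `ps -AF` rows (hypothetical helper)."""
    procs = {}
    for line in text.splitlines():
        line = line.lstrip("> ")          # tolerate mail-quoted rows
        fields = line.split(None, 10)     # 11 columns; CMD keeps its spaces
        if len(fields) < 11 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue                      # header row or something unexpected
        procs[int(fields[1])] = (int(fields[2]), fields[10])
    return procs


def slot_for_pid(procs, pid):
    """Walk up the parent chain to the nearest condor_starter and return its -a argument."""
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid, cmd = procs[pid]
        match = re.match(r"condor_starter\b.*\s-a\s+(\S+)", cmd)
        if match:
            return match.group(1)
        pid = ppid
    return None                           # not under a starter, or PID not in the listing


procs = parse_ps(PS_SAMPLE)
print(slot_for_pid(procs, 651157))        # slot1_2
```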

Second question: is there a way to set a timeout on curl_plugin
transfers, as distinct from the overall periodic_remove? Also, is this
behaviour plugin-specific? We haven't changed FILE_TRANSFER_QUEUE_AGE or
anything like it for these jobs, and 2 weeks is way more than the default.

TIA
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
