
Re: [HTCondor-users] File xfer error, setting mismatch



Dear Mark,

Yes, there was a difference in default values on the two hosts.

My CE has $CondorVersion: 8.8.9 May 06 2020 BuildID: 503068 PackageID: 8.8.9-1 $

There was no default setting according to "condor_config_val -v -dump SEC_PASSWORD"

Also, the CollectorLog had this line:

06/12/20 16:03:54 error fetching pool password; SEC_PASSWORD_FILE not defined


On the other hand, my WN has $CondorVersion: 8.9.7 May 19 2020 BuildID: 504263 PackageID: 8.9.7-1 $

Here the default setting does have "passwords.d".

# Configuration from machine: simclu-wn01.mydomain

# Parameters with names that match SEC_PASSWORD:
SEC_PASSWORD_DIRECTORY = /etc/condor/passwords.d
 # at: <Default>
 # expanded: /etc/condor/passwords.d
 # default: /etc/condor/passwords.d
SEC_PASSWORD_DOMAIN =
 # at: <Default>
 # expanded:
SEC_PASSWORD_FILE = $(SEC_PASSWORD_DIRECTORY)/POOL
 # at: <Default>
 # expanded: /etc/condor/passwords.d/POOL
 # default: $(SEC_PASSWORD_DIRECTORY)/POOL


I have identical yum repo files for HTCondor on both hosts. Even so, yum fetched different versions of condor because of slight differences in the OS (one is SL7.x and the other is CentOS 7.8).
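For anyone comparing two hosts the same way, here is a minimal Python sketch that diffs two saved "condor_config_val -v -dump" outputs. It assumes the layout shown above ("NAME = value" lines followed by "# at:", "# expanded:", "# default:" annotation lines). The function names parse_config_dump and diff_configs are hypothetical helpers, not part of HTCondor, and the sample text in the demo is illustrative.

```python
def parse_config_dump(text):
    """Map parameter name -> raw value from `condor_config_val -v -dump` style text."""
    params = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the "# at:", "# expanded:", "# default:" annotations.
        if not stripped or stripped.startswith("#"):
            continue
        name, sep, value = stripped.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params


def diff_configs(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ.

    None marks a parameter that is absent on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Illustrative inputs only: one host with no matching parameters,
    # one host with the defaults quoted in this thread.
    host_a_dump = ""
    host_b_dump = """\
# Parameters with names that match SEC_PASSWORD:
SEC_PASSWORD_DIRECTORY = /etc/condor/passwords.d
 # at: <Default>
SEC_PASSWORD_FILE = $(SEC_PASSWORD_DIRECTORY)/POOL
 # at: <Default>
"""
    changes = diff_configs(parse_config_dump(host_a_dump),
                           parse_config_dump(host_b_dump))
    for name, (a_value, b_value) in changes.items():
        print(f"{name}: host A={a_value!r} host B={b_value!r}")
```

Saving the dump from each host to a file and feeding both files through this would list every parameter whose value differs, which is a quick way to spot version-dependent defaults.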

Thanks for helping me out of this!

Regards,

Nagaraj


On 6/12/20 3:46 AM, Mark Coatsworth wrote:


Hi Nagaraj,

This could be related to some security configuration changes we made recently. We now expect the passwords directory to be "passwords.d".

Can you try updating this line in your security configuration file to:

SEC_PASSWORD_FILE = /etc/condor/passwords.d/POOL

Also, can you make sure the /etc/condor/passwords.d folder exists with 700 permissions?
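The directory check can be sketched in Python as below. This is only an illustration: ensure_private_dir is a hypothetical helper, not an HTCondor tool, and the demo runs against a scratch directory. Applying it to /etc/condor/passwords.d would need root and does not address ownership or the POOL file itself.

```python
import os
import stat
import tempfile


def ensure_private_dir(path):
    """Create path if it is missing, set mode 0700, and return the resulting mode."""
    os.makedirs(path, exist_ok=True)
    # chmod explicitly, since the mode passed to makedirs is subject to the umask.
    os.chmod(path, 0o700)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demo on a temporary directory rather than the real /etc/condor path.
    with tempfile.TemporaryDirectory() as scratch:
        demo_dir = os.path.join(scratch, "passwords.d")
        print(oct(ensure_private_dir(demo_dir)))
```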

Please give this a try and let us know what happens,

Mark

On Thu, Jun 11, 2020 at 4:22 PM <pn@xxxxxxxxxxx> wrote:

Hi,

I have set up HTCondor on a Linux cluster. I installed it from the yum repo on CentOS 7.8. The CM is dual-NIC and all exec nodes are on a private LAN. I plan to use the file transfer method rather than a shared filesystem. I submit jobs and slots on the exec node are allotted, but the jobs fail because of a file transfer failure. Below is a clipping from the job log:

007 (024.009.000) 06/12 01:40:08 Shadow exception!
        Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

Secondly, I notice an anomaly with SEC_PASSWORD_FILE. The security config file contains the following line:

SEC_PASSWORD_FILE = /etc/condor/password.d/POOL

However, in the StarterLog of the particular slot on the exec node, the directory is "passwords.d". I am unable to figure out where the directory is set to "passwords.d" instead of "password.d". I grepped through the config files but could not find it.

Below are more lines from the StarterLog of the slot (on the exec node):

06/12/20 02:43:29 (pid:39209) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime ads to /var/lib/condor/execute/dir_39209/.job.ad and /var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/box_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/gdrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER: "/usr/libexec/condor/onedrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed (status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer files" at line 2533 in file /var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket closed when trying to write 222 bytes to <192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write(): condor_write() failed

Where could it be picking up a different setting from the one in the file in config.d? Or is there some other error?

Thanks for helping out!

Nagaraj


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
