
Re: [HTCondor-users] File xfer error, setting mismatch



Hi Nagaraj,

This could be related to some security configuration changes we made recently. We now expect the passwords directory to be "passwords.d".

Can you try updating this line in your security configuration file to:

SEC_PASSWORD_FILE = /etc/condor/passwords.d/POOL

Also, can you make sure the /etc/condor/passwords.d directory exists with 700 permissions?
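
If it helps, here is a minimal Python sketch of that directory check. It is only an illustration, not an HTCondor tool: the helper names ensure_private_dir and is_private_dir are made up for this example, and the default path is taken from the config line above. It only looks at permission bits, not ownership, and creating anything under /etc/condor would need to be done as root.

```python
import os
import stat
import sys


def ensure_private_dir(path):
    """Create `path` if it is missing and restrict it to owner-only access (0700).

    Returns the resulting permission bits so the caller can verify them.
    """
    os.makedirs(path, exist_ok=True)
    # Explicit chmod: the mode passed to makedirs would be filtered by the umask.
    os.chmod(path, 0o700)
    return stat.S_IMODE(os.stat(path).st_mode)


def is_private_dir(path):
    """Return True if `path` is a directory whose permission bits are exactly 0700."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return stat.S_ISDIR(st.st_mode) and stat.S_IMODE(st.st_mode) == 0o700


if __name__ == "__main__":
    # Default path is an assumption based on the SEC_PASSWORD_FILE line above.
    target = sys.argv[1] if len(sys.argv) > 1 else "/etc/condor/passwords.d"
    state = "ok" if is_private_dir(target) else "missing or wrong permissions"
    print(f"{target}: {state}")
```

Running the script with no arguments only reports on the directory; ensure_private_dir is there in case you would rather create and fix it from the same script.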

Please give this a try and let us know what happens,

Mark

On Thu, Jun 11, 2020 at 4:22 PM <pn@xxxxxxxxxxx> wrote:

Hi,

I have set up HTCondor on a Linux cluster. I installed it from the yum repo on CentOS 7.8. The CM has dual NICs and all exec nodes are on a private LAN. I plan to use the file transfer method rather than a shared filesystem. When I submit jobs, slots on the exec node are allotted, but the jobs fail because of a file transfer failure. Below is a clipping from the job log:

007 (024.009.000) 06/12 01:40:08 Shadow exception!
        Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

Secondly, I noticed an anomaly with SEC_PASSWORD_FILE. In the security config file, the line reads:

SEC_PASSWORD_FILE = /etc/condor/password.d/POOL

However, in the StarterLog of that particular slot on the exec node, the directory is "passwords.d". I am unable to figure out where the directory is set to "passwords.d" instead of "password.d". I grepped through the config files but could not find it.

Below are more lines from the StarterLog of the slot (on the exec node):

06/12/20 02:43:29 (pid:39209) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime ads to /var/lib/condor/execute/dir_39209/.job.ad and /var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/box_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/gdrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER: "/usr/libexec/condor/onedrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed (status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer files" at line 2533 in file /var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket closed when trying to write 222 bytes to <192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write(): condor_write() failed

Where could it be picking up a setting different from what is in the file in config.d? Or is there some other error?

Thanks for helping out!

Nagaraj



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison