
[HTCondor-users] File xfer error, setting mismatch



Hi,

I have set up HTCondor on a Linux cluster. I installed it from the yum repo on CentOS 7.8. The CM is dual-NIC and all exec nodes are on a private LAN. I plan to use the file transfer method rather than a shared filesystem. When I submit jobs, slots on the exec node are allotted, but the jobs fail because of a file transfer failure. Below is a clipping from the job log:

007 (024.009.000) 06/12 01:40:08 Shadow exception!
    Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...

Secondly, I notice an anomaly with SEC_PASSWORD_FILE. In the security config file, the line reads:

SEC_PASSWORD_FILE = /etc/condor/password.d/POOL

However, in the StarterLog of the particular slot on the exec node, the directory is "passwords.d". I am unable to figure out where the directory is set as "passwords.d" instead of "password.d". I grepped through the config files but could not find it.
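A search of this kind can be sketched as follows. This is only a minimal illustration, not part of HTCondor: the function name find_in_config is hypothetical, and /etc/condor as the configuration root is an assumption based on the paths above. A scan like this only sees values written in files, so it would not show a value that comes from a built-in default.

```python
import os

def find_in_config(root, needle):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, lineno, line.rstrip("\n")))
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it.
                continue
    return hits

if __name__ == "__main__":
    # Assumed config root; adjust for the local installation.
    for path, lineno, line in find_in_config("/etc/condor", "passwords.d"):
        print(f"{path}:{lineno}: {line}")
```

An empty result here is consistent with the string not being set explicitly in any file under that directory.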

Below are more lines from the StarterLog of the slot (on the exec node):

06/12/20 02:43:29 (pid:39209) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime ads to /var/lib/condor/execute/dir_39209/.job.ad and /var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/box_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/gdrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER: "/usr/libexec/condor/onedrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed (status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer files" at line 2533 in file /var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket closed when trying to write 222 bytes to <192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write(): condor_write() failed

Where could it be picking up a different setting from the one in the file in config.d? Or is there some other error?

Thanks for helping out!

Nagaraj