
Re: [HTCondor-users] Gahp server (failure issues ) exited with status 1 unexpectedly



Hi Jaime,

I don't see any GridmanagerLog.hbaig file under the /var/log/condor/ directory. All I see there are the following files:

-rw-r--r-- 1 root   root       2813 Dec  8 11:32 KernelTuning.log
-rw-r--r-- 1 condor condor    25079 Jan 12 12:32 MasterLog
-rw-r--r-- 1 root   root     289589 Jan 12 13:29 ProcLog
-rw-r--r-- 1 root   root    1000095 Jan 12 08:22 ProcLog.old
-rw-r--r-- 1 condor condor  3921735 Jan 12 13:30 SharedPortLog
-rw-r--r-- 1 condor condor 10485973 Dec  9 03:37 SharedPortLog.old


Here is the complete output I get when I try to add the cluster. I run the following command (on a web server) to add the cluster on the Bosco compute pools server:

[user@servername ~]$ bosco_cluster --add user@hostname slurm
Enter the password to copy the ssh keys to user@hostname:
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ControlSocket /tmp/bosco_ssh_control.user@hostname:22 already exists, disabling multiplexing
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
Downloading for user@xxxxxxxxxxxxxxx
Unpacking.
Sending libraries to user@hostname.
Creating BOSCO for the WN's..........................................
Installing on cluster user@xxxxxxxxxxx
Installation complete
The cluster user@hostname has been added to BOSCO
It is available to run jobs submitted with the following values:
> universe = grid
> grid_resource = batch slurm user@hostname


Looking forward to hearing back soon.

Thanks 

Regards
Hasan

On Jan 11, 2021, at 6:05 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

Can you send the portion of GridmanagerLog.hbaig file in the HTCondor log directory from time around one of these jobs going to held status?

The "ImportError: No module named site" is suspicious, and odd that it's not printed when you run remote_gahp on the command line.

The RESOURCE USAGE POLICY banner could also be the cause. Such banners are usually suppressed when ssh is given a command to run, and the output of remote_gahp is interpreted by the HTCondor gridmanager daemon, which isn't expecting the banner.
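
One way to see whether anything precedes the GAHP's version line is to capture the output of remote_gahp and filter it. The following is only a sketch: the function name lines_before_gahp_version is made up for illustration, and it assumes the "$GahpVersion:" line is the first thing the caller expects to read.

```shell
# Hypothetical helper (not part of Bosco or HTCondor): read captured output on
# stdin and print every line that appears before the "$GahpVersion:" line.
# Assumption: the version string is the first line the caller expects.
# Exit status is 2 if no "$GahpVersion:" line is found at all.
lines_before_gahp_version() {
    awk '
        /^\$GahpVersion:/ { found = 1; exit }
        { print }
        END { if (!found) exit 2 }
    '
}

# Example with sample text taken from this thread; only the banner line
# is printed, because it comes before the version string.
sample='         !!!        RESOURCE USAGE POLICY       !!!
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $'
printf '%s\n' "$sample" | lines_before_gahp_version
```

If this prints nothing for real output, the banner is probably not reaching the stream being read, and the ImportError would be the more likely lead.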

 - Jaime

On Jan 7, 2021, at 5:10 PM, hasanbaigg@xxxxxxxxx wrote:

Hi Again,

I also checked the status of the submitted job. The result is given below, in case it helps you figure out what is going on:

$ condor_q -hold

-- Schedd: <hostname> : <127.0.0.1:11000?... @ 01/07/21 18:01:30
 ID      OWNER          HELD_SINCE  HOLD_REASON
  46.0   hbaig         1/6  13:34  Failed to start GAHP: Agent pid 3832\nImportError: No module named site\nAgent pid 3832 killed\n
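
The hold reason above packs several lines into one field, separated by literal \n sequences. A minimal sketch for reading it line by line, assuming the reason text has been copied into a shell string (expand_hold_reason is a made-up name, not an HTCondor tool):

```shell
# Hypothetical helper: expand literal "\n" sequences in a hold reason
# so that each part appears on its own line.
expand_hold_reason() {
    # %b tells printf to interpret backslash escapes in the argument.
    printf '%b\n' "$1"
}

# The hold reason from the condor_q output above, quoted so the shell
# leaves the backslashes alone.
expand_hold_reason 'Failed to start GAHP: Agent pid 3832\nImportError: No module named site\nAgent pid 3832 killed\n'
```

Split this way, the ImportError line stands out from the two "Agent pid" lines around it.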

Thanks for any help. 

regards
Hasan

On Jan 7, 2021, at 4:37 PM, Hasan Baig <hasanbaigg@xxxxxxxxx> wrote:

Hello,

Thanks for the response. I ran the command you suggested and got the following output:

Agent pid 14621
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink: /home/FCAM/hbaig/bosco/glite/bin/../lib/condor/libglobus_common.so.0: no version information available (required by /home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink)
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $

I am able to connect to the remote server where Bosco is installed, so I don't understand how it could be an SSH issue.

Sorry for asking naive questions, but I am a complete beginner and do not understand how to proceed. Thanks for your help and responses.

regards
Hasan

On Jan 7, 2021, at 2:29 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Jan 7, 2021, at 9:32 AM, hasanbaigg@xxxxxxxxx wrote:

I am working on a web-based tool which takes jobs from a user and submits them to Bosco resources (compute nodes). I am using a Bosco version (condor 8.8.12) on Linux CentOS 7. The web interface allows a user to add a Bosco pool, which the user can then use to submit jobs. However, when I try to submit a job, it fails. I also tried to test the pool by using the following command:


bosco_cluster --test 


It gives me the following GAHP error:


This is probably an ssh failure (network, authentication, or authorization). Bosco runs the following command to access the remote cluster submit host:

<sbin>/remote_gahp <user>@<hostname> batch_gahp

You can run it on the command line to get more details about what's going wrong. remote_gahp is a bash script, so you can dig in further, if necessary.
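
When running it by hand, it can help to keep stdout and stderr apart, since they may carry different clues. A rough sketch, where probe_command is a made-up helper name and closing stdin is an assumption about how to keep the command from waiting for input:

```shell
# Hypothetical helper: run a command with stdin closed and save its
# stdout and stderr to separate files under the given directory.
probe_command() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    # Assumption: with stdin at end-of-file the command exits on its own.
    "$@" </dev/null >"$outdir/stdout.txt" 2>"$outdir/stderr.txt"
    status=$?
    echo "exit status: $status"
    echo "stdout saved to $outdir/stdout.txt"
    echo "stderr saved to $outdir/stderr.txt"
    return "$status"
}

# Usage, with the placeholders from the command above filled in for your site:
#   probe_command /tmp/gahp_probe <sbin>/remote_gahp <user>@<hostname> batch_gahp
```

Anything in stdout.txt ahead of the GAHP version line, or any Python error in stderr.txt, would be worth posting back to the list.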

 - Jaime

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

