
[HTCondor-users] Fwd: MPI jobs are not writing.






From: "Malathi Deenadayalan" <malathi@xxxxxxxx>
To: "htcondor-admin" <htcondor-admin@xxxxxxxxxxx>
Sent: Friday, May 25, 2018 9:55:48 AM
Subject: MPI jobs are not writing.

Hi,

I am using the parallel universe to submit MPI jobs, and I am benchmarking I/O performance.
If I submit with 10 or 100 cores and use an absolute output path, the program runs fine.

Problem is:

1) When I don't use an absolute path, the program does not write into my /home (see the sketch after this list).

2) If I increase the core count to 600, the job keeps cycling between the idle and running states; it loops like that but never actually runs.

3) What could be the reason?
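
For point 1, my guess is that with a relative file name the files end up in HTCondor's per-job scratch directory on the execute node instead of /home. Below is a minimal sketch of how the file name in Program 1 could be made absolute, assuming getenv = true makes HOME visible to the job; this is a hypothetical variant, not what the attached program currently does:

    /* Hypothetical replacement for the sprintf()/fopen() lines in Program 1 */
    char path[1024];
    const char *home = getenv("HOME");   /* passed into the job via getenv = true */
    if (home == NULL)
        home = ".";                      /* fall back to the current (scratch) directory */
    snprintf(path, sizeof(path), "%s/%s_%d_created_file.dat", home, host_name, rank);
    fd = fopen(path, "wb");              /* absolute path, independent of the job's cwd */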

The program and submit file are attached below.

Program 1.

#include <mpi.h>
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#define MB 1048576

char * get_current_time()
/* Return the current time as a string (ctime's static buffer, newline stripped) */
{
    time_t current_time;
    char *ct;

    current_time = time(NULL);
    ct = ctime(&current_time);
    ct[strlen(ct) - 1] = '\0';   /* drop the trailing '\n' that ctime appends */
    return ct;
}

int main(int argc, char** argv)
/* Each MPI rank writes one buffer of buffer_size bytes and reports timings */
{
    int size, rank, name_len;                 /* MPI size, rank, host name length */
    char host_name[255], cwd[500], file_names[500];
    char *buffer;
    FILE *fd;
    long buffer_size = MB;                    /* default 1 MB, scaled by argv[1] */
    long written_rec_len, elapsed_ticks;
    clock_t begin, end;

    begin = clock();

    if (argc == 2)
    {
        buffer_size *= atoi(argv[1]);         /* buffer size in MB from the command line */
    }
    /* Initialize the MPI environment */
    MPI_Init(NULL, NULL);
    /* Get the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Get the rank of this process */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Get the host name; it is used mainly to label the output files */
    MPI_Get_processor_name(host_name, &name_len);
    if (getcwd(cwd, sizeof(cwd)) != NULL)
        printf("%s: Current working dir: %s\n", get_current_time(), cwd);
    printf("%s:[rank=%d] Buffer size in MB = %ld\n", get_current_time(), rank, buffer_size / MB);
    buffer = (char*)malloc(buffer_size);
    if (buffer == NULL)
    {
        printf("%s:[rank=%d] **** MALLOC FAILED ****\n", get_current_time(), rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(buffer, 'a', buffer_size);
    /* A relative file name, so it is resolved against the current working directory */
    sprintf(file_names, "%s_%d_created_file.dat", host_name, rank);
    fd = fopen(file_names, "wb");
    if (fd == NULL)
    {
        printf("%s: %s, **** FILE NOT OPENED ****\n", get_current_time(), file_names);
    }
    else
    {
        printf("%s: %s is created\n", get_current_time(), file_names);
        written_rec_len = fwrite(buffer, buffer_size, 1, fd);
        if (written_rec_len != 1)
        {
            printf("%s:[rank=%d] **** WRITE FAILED ****\n", get_current_time(), rank);
        }
        else
        {
            printf("%s:[rank=%d] wrote %ld record(s) of size %ld bytes\n", get_current_time(), rank, written_rec_len, buffer_size);
        }
        fclose(fd);
    }
    /* Shut down the MPI environment */
    MPI_Finalize();
    end = clock();
    elapsed_ticks = (long)(end - begin);
    printf("%s:%s, rank %d out of %d processes, elapsed CPU time in clock ticks = %ld\n", get_current_time(), host_name, rank, size, elapsed_ticks);
    free(buffer);
    return 0;
}
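
A side note on the timing: clock() measures CPU time, so it may not account for time the process spends blocked in I/O. For the benchmark numbers I am considering a wall-clock timer instead; a minimal sketch using MPI_Wtime(), which must be called between MPI_Init() and MPI_Finalize() (not what the program above does yet):

    double t0, t1;
    t0 = MPI_Wtime();                 /* wall-clock seconds since an arbitrary origin */
    /* ... fopen(), fwrite(), fclose() ... */
    t1 = MPI_Wtime();
    printf("[rank=%d] wall-clock write time = %.3f s\n", rank, t1 - t0);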

=======================================================================================================================
submit file:
########################################
## Submit file for an Open MPI job, run through openmpiscript
#######################################
JOBNAME = data_write
universe = parallel
machine_count = 50
buffer_size = 124
executable = ~/parallel_IO/openmpiscript
arguments = ~/parallel_IO/mpi_code/data_write $(buffer_size)
#transfer_input_files = test_prog,condor_ssh,sshd.sh
#request_cpus = 1
getenv = true
#should_transfer_files = IF_NEEDED
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = $(JOBNAME).out
error  = $(JOBNAME).err
log    = $(JOBNAME).log

queue

==============================================================================

I have 23 nodes with 32 cores (2 processors) each, and this is the (truncated) output of:

[malathi.d@cn101:] $ condor_status -af:h Name DedicatedScheduler
Name         DedicatedScheduler     
slot1@cn101  DedicatedScheduler@cn101
slot2@cn101  DedicatedScheduler@cn101
slot3@cn101  DedicatedScheduler@cn101
slot4@cn101  DedicatedScheduler@cn101
slot5@cn101  DedicatedScheduler@cn101

slot1@cn102  DedicatedScheduler@cn101
slot2@cn102  DedicatedScheduler@cn101
slot3@cn102  DedicatedScheduler@cn101
slot4@cn102  DedicatedScheduler@cn101
slot5@cn102  DedicatedScheduler@cn101
slot6@cn102  DedicatedScheduler@cn101
slot7@cn102  DedicatedScheduler@cn101
slot8@cn102  DedicatedScheduler@cn101

slot1@cn103  DedicatedScheduler@cn101
slot2@cn103  DedicatedScheduler@cn101
slot3@cn103  DedicatedScheduler@cn101
slot4@cn103  DedicatedScheduler@cn101
slot5@cn103  DedicatedScheduler@cn101
slot6@cn103  DedicatedScheduler@cn101
slot7@cn103  DedicatedScheduler@cn101

slot1@cn104  DedicatedScheduler@cn101
slot2@cn104  DedicatedScheduler@cn101
slot3@cn104  DedicatedScheduler@cn101
slot4@cn104  DedicatedScheduler@cn101
slot5@cn104  DedicatedScheduler@cn101

slot32@cn104 DedicatedScheduler@cn101
slot1@cn105  DedicatedScheduler@cn101
slot2@cn105  DedicatedScheduler@cn101
slot3@cn105  DedicatedScheduler@cn101
slot4@cn105  DedicatedScheduler@cn101
slot5@cn105  DedicatedScheduler@cn101


Can you help me?

I also want to do performance testing for NFS and GPFS; please kindly advise.

Regards,
Malathi.