
Re: [HTCondor-users] Why does my job fail under HTCondor?



Hi Kent,

Out of curiosity:
Why did the job fail when it had a *higher* file descriptor limit?

Cheers,
Max

> On 16.02.2017 at 21:59, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> 
> "I have a job that runs on the command line; but it crashes when run
> under HTCondor." -- probably many of us have faced a problem like
> this.
> 
> We recently worked with a user who had a job that exhibited this
> behavior (it segfaulted when run under HTCondor).  It took us a while
> to figure out the cause -- the environment variables under HTCondor
> differed only trivially from those on the command line (the job was
> using "getenv = true"), and the command-line arguments were exactly
> the same.
> 
> We eventually figured out that the job was crashing because the file
> descriptor limit when run under HTCondor was higher than when it was
> run from the command line(!).  This was a bit of a surprise, and it
> clearly indicates a bug in the program itself; but it also highlights
> an important, and somewhat non-obvious, way in which running a job
> under HTCondor differs from running it on the command line.
> 
> (HTCondor jobs inherit their limits from the HTCondor daemon that
> spawns them.  In the case of the file descriptor limit, some HTCondor
> daemons need higher limits than most user jobs typically need.
> We are considering changing this in the future, but this is the
> current situation.)
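> 
> One way to see this for yourself on a Linux execute machine (with a
> reasonably recent kernel) is to compare the daemon's limits with those
> of your login shell.  This is just a quick check, and it assumes that
> pgrep finds a single condor_startd on the machine:
> 
>  # Limits of the condor_startd -- roughly what a job there will inherit
>  cat /proc/`pgrep -o condor_startd`/limits
>  # Limits of your interactive shell, for comparison
>  cat /proc/$$/limits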
> 
> At any rate, system limits are something to keep in mind when debugging
> this type of problem.
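> 
> A quick way to test whether a raised limit is really the culprit is a
> small wrapper script that lowers the limit back to a typical
> interactive value before running the real program.  This is just a
> sketch -- the program name and the value 1024 are placeholders; point
> the submit file's "executable" at the wrapper:
> 
>  #! /bin/csh
>  # Lower the soft file descriptor limit, then replace this shell with
>  # the real program so it inherits the lowered limit.
>  limit descriptors 1024
>  exec ./my_real_program $argv:q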
> 
> Another thing that is likely to differ between running on the command
> line and running under HTCondor is the umask setting (which controls
> the permissions of files created by the job).  This is one more thing
> to check if your jobs are not working correctly under HTCondor.
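> 
> The same kind of quick check works for the umask; a few lines like
> these in a test job script (just a sketch, in the same style as the
> example below) will show what the job actually gets and, if necessary,
> set it explicitly:
> 
>  #! /bin/csh
>  # Print the umask the job starts with
>  umask
>  # Set it explicitly so files created by the job get predictable
>  # permissions, then print it again to confirm
>  umask 022
>  umask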
> 
> Here's an example of a job that prints out the limits, changes the
> stack size limit, and prints out the limits again.
> 
> # File: change_limits.csh
>  #! /bin/csh
>  limit
>  echo ""
>  echo "Changing stacksize"
>  limit stacksize 4096
>  echo ""
>  limit
> 
> # File: change_limits.sub
>  universe = vanilla
>  executable = change_limits.csh
>  output = change_limits.out
>  queue
> 
> # File: change_limits.out
>  cputime      unlimited
>  filesize     unlimited
>  datasize     unlimited
>  stacksize    unlimited
>  coredumpsize unlimited
>  memoryuse    unlimited
>  vmemoryuse   unlimited
>  descriptors  1024
>  memorylocked 64 kbytes
>  maxproc      1024
> 
>  Changing stacksize
> 
>  cputime      unlimited
>  filesize     unlimited
>  datasize     unlimited
>  stacksize    4096 kbytes
>  coredumpsize unlimited
>  memoryuse    unlimited
>  vmemoryuse   unlimited
>  descriptors  1024
>  memorylocked 64 kbytes
>  maxproc      1024
> 
> Note that the limits on your process under HTCondor will depend on your
> HTCondor configuration.  Also, the limits may vary according to which
> universe your job runs under.
> 
> This information is also posted on the HTCondor wiki for future
> reference:
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=JobFailsUnderCondor
> 
> --
> R. Kent Wenger (wenger@xxxxxxxxxxx, 608-262-6627,
> http://www.cs.wisc.edu/~wenger/)
> Computer Sciences Department
> University of Wisconsin-Madison
