R

Description

UBELIX no longer provides the R version from EPEL, as that version is updated automatically and therefore results are not reproducible. R is now provided by an environment module and must be loaded explicitly:

module load R/3.4.4-foss-2018a-X11-20180131

-bash-4.1$ R --version
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

The Vital-IT project also provides several versions. The following commands list the available versions:

module load vital-it
module avail 2>&1 | grep " R\/"
  R/3.4.2
  R/latest

To use one of these versions, you have to load the respective module, which then masks the system's version:

module load vital-it
module load R/3.4.2

Do not forget to put these two lines into your job script as well, so that the same version is used later within the job on a compute node!

Basic Topics

Customizing the Workspace

At startup, unless --no-init-file or --vanilla was given, R searches for a user profile in the current directory (from where R was started) or in the user's home directory (in that order). A different path for the user profile can be specified via the R_PROFILE_USER environment variable. The user profile found is then sourced into the workspace. You can use this file to customize your workspace, e.g., to set specific options, define functions, load libraries, and so on. Consider the following example:

.Rprofile

# Set some options
options(stringsAsFactors=FALSE)
options(max.print=100)
options(scipen=10)

# Load the class library
library(class)

# Don't save workspace by default
q <- function (save="no", ...) {
  quit(save=save, ...)
}

# User-defined function for setting standard seed
mySeed <- function() set.seed(5450)


# User-defined function for calculating the L1/L2 norm;
# returns the Euclidean distance (L2 norm) by default
myDistance <- function(x, y, type=c("Euclidean", "L2", "Manhattan", "L1")) {
  type <- match.arg(type)
  if (type %in% c("Manhattan", "L1")) {
    d <- sum( abs(x - y) )
  } else {
    d <- sqrt( sum( (x - y) ^ 2) )
  }
  return(d)
}
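
For illustration, here is how myDistance behaves; the function is repeated here (with the conventional "Euclidean" spelling) so the snippet runs on its own:

```r
# myDistance as defined in the .Rprofile above, repeated so the
# snippet is self-contained
myDistance <- function(x, y, type=c("Euclidean", "L2", "Manhattan", "L1")) {
  type <- match.arg(type)   # partial matching against the allowed types
  if (type %in% c("Manhattan", "L1")) sum(abs(x - y)) else sqrt(sum((x - y)^2))
}

myDistance(c(0, 0), c(3, 4))               # Euclidean distance: 5
myDistance(c(0, 0), c(3, 4), "Manhattan")  # L1 distance: 7
```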

Installing Packages

Run R interactively. To install additional R packages call the install.packages() function with the name of the package as argument. Upon installing the first package, you will receive a warning that you do not have sufficient permissions to write to “/usr/lib64/R/library”. Type “y” to use a personal library instead:

> install.packages("doParallel")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("doParallel") :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n)

Next, type “y” to create your personal library at the default location within your HOME directory:

Would you like to create a personal library
~/R/x86_64-redhat-linux-gnu-library/3.4
to install packages into? (y/n)

Next, select a CRAN mirror to download from. The mirror list changes constantly, so yours will not be identical to the one below, but it will look similar.

Pick a country nearby, e.g. Switzerland. If HTTPS causes problems, pick “(HTTP mirrors)” and then select a nearby mirror, as shown below:

--- Please select a CRAN mirror for use in this session ---
Error in download.file(url, destfile = f, quiet = TRUE) :
  unsupported URL scheme
HTTPS CRAN mirror
 1: 0-Cloud [https]                2: Austria [https]
 3: Chile [https]                  4: China (Beijing 4) [https]
 5: Colombia (Cali) [https]        6: France (Lyon 2) [https]
 7: France (Paris 2) [https]       8: Germany (Münster) [https]
 9: Iceland [https]               10: Mexico (Mexico City) [https]
11: Russia (Moscow) [https]       12: Spain (A Coruña) [https]
13: Switzerland [https]           14: UK (Bristol) [https]
15: UK (Cambridge) [https]        16: USA (CA 1) [https]
17: USA (KS) [https]              18: USA (MI 1) [https]
19: USA (TN) [https]              20: USA (TX) [https]
21: USA (WA) [https]              22: (HTTP mirrors)

Selection: 22
HTTP CRAN mirror
 1: 0-Cloud                       2: Algeria
 3: Argentina (La Plata)          4: Australia (Canberra)
 5: Australia (Melbourne)         6: Austria
 7: Belgium (Antwerp)             8: Belgium (Ghent)
(...)
65: Slovakia                     66: South Africa (Cape Town)
67: South Africa (Johannesburg)  68: Spain (A Coruña)
69: Spain (Madrid)               70: Sweden
71: Switzerland                  72: Taiwan (Chungli)
73: Taiwan (Taipei)              74: Thailand
75: Turkey (Denizli)             76: Turkey (Mersin)
(...)
93: USA (OH 2)                   94: USA (OR)
95: USA (PA 2)                   96: USA (TN)
97: USA (TX)                     98: USA (WA)
99: Venezuela
Selection: 71

Finally, the package is installed. Afterwards, you can close the interactive session by typing q().

Do not forget to load the corresponding library (for each R session) before using functions provided by the package:

> library(doParallel)
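
If you prefer to skip the mirror menu altogether, you can set a default repository yourself (a sketch; any CRAN mirror URL works, cloud.r-project.org simply redirects to a mirror near you):

```r
# Set a default CRAN mirror once (e.g. in your ~/.Rprofile) so that
# install.packages() never prompts for a mirror selection.
options(repos = c(CRAN = "https://cloud.r-project.org"))
# install.packages("doParallel")   # would now install without prompting
```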

Batch Execution of R

The syntax for running R non-interactively with input read from infile and output sent to outfile is:

R CMD BATCH [options] infile [outfile]

Suppose you placed your R code in a file called foo.R:

foo.R

set.seed(3000)
valx <- seq(-2, 2, 0.01)
valy <- 2*valx + rnorm(length(valx), 0, 4)
# Save plot to PDF
pdf('histplot.pdf')
hist(valy, prob=TRUE, breaks=20, main="Histogram and PDF", xlab="y", ylim=c(0, 0.15))
curve(dnorm(x, mean(valy), sd(valy)), add=TRUE, col="red")
dev.off()

To execute foo.R on the cluster, add the R call to your job script…

Rbatch.sh

#!/bin/bash
#SBATCH --mail-user=<put your valid email address here!>
#SBATCH --mail-type=end,fail
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G

# Put your code below this line
module load vital-it
module load R/3.4.2
R CMD BATCH --no-save --no-restore foo.R

…and submit your job script to the cluster:

sbatch Rbatch.sh

Advanced Topics

Parallel R

By default, R will not make use of multiple cores available on compute nodes to parallelize computations. Parallel processing functionality is provided by add-on packages. Consider the following contrived example to get you started. To follow the example, you need the following packages installed, and the corresponding libraries loaded:

> library(doParallel)
> library(foreach)

The foreach package provides a looping construct for executing R statements repeatedly, either sequentially (similar to a for loop) or in parallel. While the binary operator %do% is used for executing the statements sequentially, the %dopar% operator is used to execute code in parallel using the currently registered backend. The getDoParWorkers() function returns the number of execution workers (cores) available in the currently registered doPar backend, by default this corresponds to one worker:

> getDoParWorkers()
[1] 1

Hence, the following R code will execute on a single core (even with the %dopar% operator):

> start.time <- Sys.time()
> foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% {
+ Sys.sleep(3*i)
+ i
+ }
[1] 4 3 2 1
> end.time <- Sys.time()
> exec.time <- end.time - start.time

Let’s measure the runtime of the sequential execution:

> start.time <- Sys.time(); foreach(i=4:1, .combine='c', .inorder=TRUE) %dopar% { Sys.sleep(3*i); i }; end.time <- Sys.time(); exec.time <- end.time - start.time; exec.time
[1] 4 3 2 1
Time difference of 30.04088 secs

Now we will register a parallel backend to allow the %dopar% operator to execute in parallel. The doParallel package provides such a backend. Let’s find out the number of cores available on the current node:

> detectCores()
[1] 24

To register the doParallel backend, call the function registerDoParallel(). With no arguments provided, the number of cores assigned to the backend matches the value of options("cores") or, if that is not set, half of the cores detected by the parallel package.

> registerDoParallel()
> getDoParWorkers()
[1] 12

To assign 4 cores to the parallel backend:

> registerDoParallel(cores=4)
> getDoParWorkers()
[1] 4

Request the correct number of slots

Because it is crucial to request the correct number of slots for a parallel job, we recommend setting the number of cores for the doParallel backend to the number of CPUs allocated to your job. Note that Sys.getenv() returns a character string, so convert it explicitly: registerDoParallel(cores=as.integer(Sys.getenv("SLURM_CPUS_PER_TASK")))
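
A minimal sketch of this, with a fallback of 1 core (an assumption for interactive use outside a job allocation):

```r
# Slurm sets SLURM_CPUS_PER_TASK inside a job allocation; Sys.getenv()
# returns a character string, so convert it before registering the backend.
# The fallback "1" keeps the snippet usable outside a job.
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
# registerDoParallel(cores = ncores)   # from the doParallel package
```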

Now, run the example again:

> foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% {
+ Sys.sleep(3*i)
+ i
+ }
[1] 4 3 2 1

Well, the output is basically the same (the results are combined in the same order!), but this time the iterations were executed concurrently on the 4 registered cores.
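
With 4 workers registered, the sleeps of 12, 9, 6 and 3 seconds run concurrently, so repeating the timed one-liner should take roughly as long as the longest single task, about 12 seconds instead of 30 (a sketch, assuming foreach and doParallel are installed as above):

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)   # 4 workers, as registered above

start.time <- Sys.time()
res <- foreach(i=4:1, .combine='c', .inorder=TRUE) %dopar% { Sys.sleep(3*i); i }
exec.time <- Sys.time() - start.time
res        # [1] 4 3 2 1
exec.time  # roughly 12 seconds: limited by the longest single sleep
```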

The binary operator %do% will always execute a foreach loop sequentially, even if registerDoParallel() was called before! To run a foreach loop in parallel, two conditions must be met:

  • registerDoParallel() must have been called to register a parallel backend
  • The %dopar% operator must be used in the foreach loop to have it run in parallel!

Installing DESeq2 from Bioconductor packages

DESeq2, installed from Bioconductor, has many dependencies. Two odd facts hinder a successful build of DESeq2 in the first place:

  • data.table is needed by Hmisc, which in turn is needed by DESeq2. While Hmisc is automatically installed prior to DESeq2, data.table is not and has to be installed manually first.