Scientific Computing & Data Analysis Section

Software Notes

Here are short notes for software offered by the Scientific Computing section. Some software needs special care, or requires you to do something as a user before you can use it. If you're unsure about running a specific software package, take a look here and see if there are any notes about it.

Augustus and BUSCO

Assess genome assembly and annotation completeness.

Augustus and BUSCO have an issue where BUSCO creates a new model for Augustus to use, and then needs to save it where Augustus can find it. Unfortunately, Augustus defaults to its installation directory, which is not writable by users (and you couldn't have multiple users overwrite each other's models anyhow).

To make this work, you need to copy the Augustus configuration folder to a location you can write to, then tell Augustus to use that copy instead.

Create a new folder:

$ mkdir $HOME/augustus

Copy the config folder from our installation:

$ cp -r /apps/free81/augustus/3.3.3/config $HOME/augustus/

In your script, set AUGUSTUS_CONFIG_PATH to that folder:

export AUGUSTUS_CONFIG_PATH="$HOME/augustus/config"
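
If you run BUSCO from several scripts, it may be convenient to wrap these steps so the copy is only made once. The following is a minimal sketch, not an official tool: the function name setup_augustus_config is hypothetical, and it assumes the same installation path and $HOME/augustus folder used above.

```shell
# Sketch: copy the Augustus config folder once, then point Augustus at the copy.
# setup_augustus_config is a hypothetical helper, not part of Augustus or BUSCO.
setup_augustus_config() {
    local src="$1"     # config folder in the installation
    local dest="$2"    # your own writable folder
    if [ ! -d "$src" ]; then
        echo "No config folder at $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    # Copy only on the first run, so models that BUSCO saved earlier are kept
    if [ ! -d "$dest/config" ]; then
        cp -r "$src" "$dest/config"
    fi
    export AUGUSTUS_CONFIG_PATH="$dest/config"
}

# Example call, using the paths from this note:
# setup_augustus_config /apps/free81/augustus/3.3.3/config "$HOME/augustus"
```

Skipping the copy when the folder already exists means a second run does not overwrite models that BUSCO has already written there.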

Jupyter

Browser-accessible Python notebook environment for creating documents that contain live code, equations, visualizations and explanatory text.

Jupyter runs as a web server on a compute node, and you access it from your local web browser. First, run Jupyter itself as an interactive job. Below we load Jupyter for Python 3.7.3, then run it for 4 hours with 16GB memory and 8 cores:

$ module load jupyter.py/3.7
$ srun -t 0-4 --mem=16G -c 8 --pty jupyter notebook

Once it runs, it prints some messages that end with something like:

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://sango[something].oist.jp:8888/?token=[long string of characters]

Copy the entire URL beginning with http:// and paste it into your browser's address bar. You should get the top page of the Jupyter environment.
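
If the startup messages are long, picking out the URL by eye can be fiddly. As a rough sketch, assuming you have captured Jupyter's messages in a file (the file name jupyter.log and the function name jupyter_url are hypothetical), you could extract the first token URL with grep:

```shell
# Sketch: print the first URL containing a login token from text on stdin.
# jupyter_url is a hypothetical helper name.
jupyter_url() {
    grep -oE 'https?://[^[:space:]]+token=[^[:space:]]+' | head -n 1
}

# Example call, assuming the messages were saved to jupyter.log:
# jupyter_url < jupyter.log
```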

Materials Studio

Solve key materials and chemical research problems with an integrated, multi-scale modeling environment that delivers a complete range of simulation methods.

Materials Studio has issues running on Deigo at the moment. We have temporarily set it up on Saion:

Set server gateway to "saion-login2"
Set port to "18888"
Use the "intel" partition. You can use up to 160 cores.
The "knl" partition is also available but will be very slow.

By default, your job will get an 8 hour limit. To extend the running time, you need to set it as an extra parameter in the Server Console:

Select Tools->Server Console.
Right-click on server gateways->saion-login2.oist.jp and choose "properties".
Select the "queuing" tab.
In the "extra parameters" field, set the Slurm time parameter. For 3 days and 12 hours, set it to "--time=3-12"
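
The value after "--time=" uses Slurm's days-hours format. As a small illustration, the following sketch converts a running time given in hours into that form; the function name slurm_time_param is hypothetical and only whole hours are handled.

```shell
# Sketch: turn a number of hours into a Slurm "days-hours" time parameter.
# slurm_time_param is a hypothetical helper name.
slurm_time_param() {
    local hours="$1"
    local days=$(( hours / 24 ))   # whole days
    local rest=$(( hours % 24 ))   # remaining hours
    echo "--time=${days}-${rest}"
}

# Example call: 84 hours is 3 days and 12 hours
# slurm_time_param 84
```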

MOLGEN

Molecular structure generator of the molecular graphs that correspond to a given chemical formula and prescribed and forbidden substructures.

MOLGEN comes with a couple of preset forbidden substructure files. Once you have loaded the module file, you can use them via the MOLGEN_ROOT_DIR environment variable, like this:

$ mgen -badlist $MOLGEN_ROOT_DIR/share/badlist.sdf

The MOLGEN documentation is available in $MOLGEN_ROOT_DIR/docs/manual_molgen50.pdf

NAMD (not installed on Deigo)

Parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.

NAMD uses the IB network directly for communication, but uses openmpi for managing processes. Below is an example Slurm script for running a NAMD simulation for one hour across two nodes, with 20 tasks per node:


#!/usr/bin/bash
#SBATCH --partition=compute
#SBATCH --ntasks-per-node=20
#SBATCH -N 2 
#SBATCH -t 0-1 
#SBATCH --mem=120G 

module load namd 
charmrun +p $SLURM_NTASKS ++mpiexec $NAMD/namd2 +setcpuaffinity +isomalloc_sync stmv.namd

NWChem (not installed on Deigo)

An ab initio computational chemistry software package which also includes quantum chemical and molecular dynamics functionality.

NWChem needs a configuration file in your home directory. We have a default configuration in the installation directory, so either copy it or create a symbolic link to it:

$ ln -s /apps/free72/nwchem/[version]/data/default.nwchemrc ${HOME}/.nwchemrc

Replace [version] with the version you want to run. If you are using "old" Sango, link from "/apps/free/nwchem" instead.
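
If you already have a .nwchemrc file, the ln command above will fail rather than replace it. A minimal sketch of a more careful version follows; the function name link_nwchemrc is hypothetical, and you pass in the path to the default configuration for the version you want.

```shell
# Sketch: link the default NWChem configuration unless one already exists.
# link_nwchemrc is a hypothetical helper name.
link_nwchemrc() {
    local default_rc="$1"                  # default.nwchemrc in the installation
    local target="${2:-$HOME/.nwchemrc}"   # where NWChem looks for it
    if [ -e "$target" ] || [ -L "$target" ]; then
        echo "$target already exists, leaving it alone" >&2
        return 0
    fi
    if [ ! -f "$default_rc" ]; then
        echo "No default configuration at $default_rc" >&2
        return 1
    fi
    ln -s "$default_rc" "$target"
}

# Example call (replace [version] as described above):
# link_nwchemrc /apps/free72/nwchem/[version]/data/default.nwchemrc
```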

OpenFOAM (not installed on Deigo)

An open-source CFD software package.

When you use OpenFOAM version 4 or newer, you first need to source the configuration file:

$ source /apps/free72/OpenFOAM/4.1/OpenFOAM-4.1/etc/bashrc

Picard

A set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

Picard is a Java application, and the online instructions tell you to run it explicitly with Java:

$ java -jar $PICARD

While you can do so, we have also created a shortcut so you can instead run it simply like:

$ picard

Python

A widely used general-purpose, high-level dynamic programming language. It has extensive support for data analysis, numerical computation, graphics and system management.

We strongly recommend Python 3. Python 2 is still available, but is no longer supported. Please avoid it if you can. Also, while we have a version of Python installed system-wide, we recommend that you use one of the versions supplied as a module.

The module versions come with an extensive set of Python packages. If you need to install your own, you can install them into your own home directory:

$ pip3 install --user [package]

If you need a newer version of a package that's already installed, you can tell pip to ignore ('-I') existing versions, and explicitly tell it which version you want:

$ pip3 install -I --user [package]==[version]

R

A popular software environment for statistical computing and graphics.

A lot of the functionality in R is implemented through packages, not embedded in the core language. There are a very large number of packages, all implemented independently. Unlike most other frameworks, R packages also do not try to stay backwards or forwards compatible to any great degree.

This means that a package may only be available for a specific range of R versions. And two packages may be mutually incompatible; if you install one, you can't install the other. This implies that:

We can not provide a consistent set of installed packages across different versions of R;
We can not install all packages users may request into the R installation itself.

We have a (large) set of common packages installed into each version of R. If you require any packages beyond those, you need to install them yourself into your own home directory. This is quite straightforward.

First, create the local directory where you want to install R packages, and then tell R where to find it:

mkdir -p ~/R/library
echo 'R_LIBS_USER="~/R/library"' >> $HOME/.Renviron

Then install a package using the "install.packages" function. Run it interactively in an R session, or as a one-line script:

Rscript -e 'install.packages("acp", repos="https://cran.ism.ac.jp/")'
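
If you already keep other settings in .Renviron, you may want to add the library path without touching them, and without adding it twice. The following is a sketch under that assumption; the function name set_r_libs_user is hypothetical.

```shell
# Sketch: create the library folder and register it in .Renviron once.
# set_r_libs_user is a hypothetical helper name.
set_r_libs_user() {
    local lib_dir="$1"                      # where R packages will be installed
    local renviron="${2:-$HOME/.Renviron}"  # R's per-user environment file
    mkdir -p "$lib_dir"
    # Add the setting only if the file does not define R_LIBS_USER yet
    if ! grep -q '^R_LIBS_USER=' "$renviron" 2>/dev/null; then
        echo "R_LIBS_USER=\"$lib_dir\"" >> "$renviron"
    fi
}

# Example call, using the folder from this note:
# set_r_libs_user "$HOME/R/library"
```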

Relion

Empirical Bayesian approach to refinement of 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).

Graphical interface

You run "relion" on the login node, in the directory where you stored your data. The relion program itself is just a front-end to the actual data processing applications, so it is safe to run on the login nodes.

You can submit data processing steps as compute jobs directly from Relion itself. Once you've set up the parameters, go to the right-most "Running" tab and set the job parameters as described below. You can also get the command line for a script you submit manually, or point Relion to your script.

MPI jobs

Many processing steps use multiple cores in parallel to speed up the calculation. You select the number of MPI processes with "Number of MPI procs", and the number of cores per process with "Number of threads". The number of threads cannot exceed the number of cores on a single node (128 for Deigo), and many tools work best with just a few cores, or a single core, per process.

Note: The memory setting is _per core._ That is, it sets the Slurm "--mem-per-cpu" parameter. If you use, say, 8 cores, the actual memory you allocate will be 8 times the amount you enter here. Be careful not to ask for more memory than is available on the nodes.
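
To check a setting before you submit, you can multiply it out. The sketch below assumes the job is allocated one core per thread for each MPI process, so the total is processes times threads times memory per core; the function name relion_total_mem_gb and the 4 GB figure in the example are illustrative only.

```shell
# Sketch: total memory a job requests when memory is given per core.
# relion_total_mem_gb is a hypothetical helper name.
relion_total_mem_gb() {
    local mpi_procs="$1"        # "Number of MPI procs"
    local threads="$2"          # "Number of threads" (cores per process)
    local mem_per_core_gb="$3"  # memory per core, in GB
    echo $(( mpi_procs * threads * mem_per_core_gb ))
}

# Example call: 2 processes with 4 threads each is 8 cores; at 4 GB per core
# the job asks for 32 GB in total
# relion_total_mem_gb 2 4 4
```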

GPU jobs

A few tools need a GPU to run. You need access to the GPU partition on Saion in order to run them. For a GPU job you set the "Queue name" in the Running tab to "gpu", and make sure the MPI process and core parameters are set to "1".

Issues

Relion does not reliably detect when a job is finished. You need to keep an eye on the output and on the job on Sango (using 'squeue'), then mark the job as finished in the middle-left dropdown menu.

SnpEff

A genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

SnpEff is a set of Java applications. The online instructions tell you to run it with Java explicitly:

$ java -jar /apps/free72/snpeff/[version]/bin/snpEff.jar
$ java -jar /apps/free72/snpeff/[version]/bin/snpSift.jar

We have created shortcuts so you can instead run it like:

$ snpEff
$ snpSift

Trimmomatic

A flexible trimmer for Illumina sequence data.

Trimmomatic is a Java application. The online instructions tell you to run it with Java explicitly:

$ java -jar /apps/free81/Trimmomatic/[version]/lib/trimmomatic-[version].jar

That is not very convenient, so we have created a shortcut so you can instead run it like:

$ trimmomatic

Or, for backwards compatibility:

$ Trimmomatic.sh

Varscan

A platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.

Varscan is a Java application. The online instructions tell you to run it with Java explicitly:

$ java -jar /apps/free72/varscan/[version]/bin/varscan.jar

We have created a shortcut so you can instead run it like:

$ varscan