Bioinformatics

About this page

I first entered the world of bioinformatics in August of 2020, with zero background in computational work or statistics. The learning curve to get started in computational biology is steep and resources are scattered and not always well documented. Since I got started, I have been involved in over 10 projects at various stages of maturity, including published works, and written new methods to analyse data types I deal with routinely.

I still consider myself to be a very junior bioinformatician and am learning more every day. However, as I go through this process, I thought it might be useful to compile a list of resources that I have found most useful thus far.

Here, you will find a list of documents, packages, software and tutorials that I have used and/or routinely use (links included). I have included these resources because I have found them to be well documented and their outputs biologically useful and meaningful. Importantly, there are thousands of resources available that I have yet to explore, and this list will continue to grow. Hopefully it will be useful, from one junior bioinformatician to another! Please feel free to get in touch with suggestions or clarifications.

Getting started with R / Python

Udemy R/Python A-Z for Data Science by Kirill Eremenko
- Much more affordable compared to equivalent courses available
- Short and engaging
- Provides the bare minimum to get started with R / Python syntax

Managing Conda and Jupyter: Quick tutorial here, documentation for Conda environments here.

Bulk RNA sequencing

Differential expression analysis

DESeq2
- Staple for DE analysis in RNAseq
- My personal 'go-to' for bulk RNAseq analysis

edgeR
- Alternative method for DE analysis, uses TMM normalisation

limma
- Useful for microarray DE analysis, uses quantile normalisation

For a detailed read on the differences between these methods, see Dillies et al., Briefings in Bioinformatics, 2013.
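To make the quantile normalisation mentioned above a little more concrete, here is a minimal pure-Python sketch. It is only an illustration of the idea (every sample is forced to share the same distribution of values), not limma's implementation: the function name and the toy matrix are made up for this example, and ties are broken naively by order of appearance.

```python
def quantile_normalise(matrix):
    """Quantile-normalise a genes x samples matrix (list of rows).

    Illustrative sketch only: ties are broken by gene order rather than averaged.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    # Pull out each sample (column) as its own list.
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    # Mean of the values at each rank, taken across samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(sorted_cols[s][r] for s in range(n_samples)) / n_samples
        for r in range(n_genes)
    ]
    # Replace every value by the mean for its rank within its own sample.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result


# Toy data: 3 genes (rows) x 2 samples (columns), hypothetical values.
toy = [[2, 10], [4, 30], [6, 20]]
normalised = quantile_normalise(toy)
```

After normalisation, both samples contain exactly the same set of values, just assigned to genes according to each gene's rank within that sample.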

Enrichment analysis

Gene set enrichment analysis (GSEA)
- Important to understand principles, read original paper here.
- GSEA website: software and curated molecular signatures.
- Quick implementation (in R) with fgsea and msigdbr.
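To give a feel for the principle behind GSEA, the sketch below computes the running-sum enrichment score for one gene set on a ranked gene list, in plain Python. It is a simplified illustration and not the fgsea or GSEA software implementation: the function name, the toy genes and the scores are all invented for the example, and the permutation step used to assess significance is left out entirely.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked_genes: genes ordered from most up- to most down-regulated.
    scores: ranking metric for each gene, in the same order.
    p: exponent applied to the ranking metric (p=0 gives the unweighted version).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    hit_weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total = sum(hit_weights)
    if total == 0:
        raise ValueError("all genes in the set have a ranking metric of zero")
    running = 0.0
    best = 0.0
    for weight, hit in zip(hit_weights, in_set):
        # Step up at genes in the set, step down at genes outside it.
        running += weight / total if hit else -1.0 / n_miss
        # Keep the largest deviation from zero, with its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (hypothetical gene names and scores).
genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, metric, {"A", "B"})
es_bottom = enrichment_score(genes, metric, {"E", "F"})
```

A gene set concentrated at the top of the ranking gets a positive score, and one concentrated at the bottom gets a negative score.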

Single-sample GSEA
- GSEA without contrasts, e.g. if you want to analyse each sample as its own independent variable.
- Read more here.
- Web tool to implement ssGSEA available on GenePattern.

Gene ontology (GO)-based hypergeometric test
- topGO
- Note that this type of functional enrichment analysis can be misleading when implemented incorrectly. Read more here.
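For reference, the underlying hypergeometric test is small enough to write out by hand. The sketch below is a plain one-sided over-representation test in pure Python, not topGO's API (topGO also offers algorithms that account for the GO hierarchy). The function name, argument names and numbers are hypothetical. Note that the result depends heavily on what you choose as the gene universe (background), which is a commonly discussed pitfall of this kind of analysis.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) under the hypergeometric distribution.

    n_universe: number of genes in the background.
    n_annotated: background genes annotated to the GO term.
    n_selected: genes in the list of interest (e.g. DE genes).
    n_overlap: genes of interest annotated to the GO term.
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated to the term,
# 3 genes of interest, 2 of which carry the annotation.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

In a real analysis this would be repeated for every GO term and followed by multiple-testing correction.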

Single-cell RNA sequencing

Analysis

Scanpy
- Basic toolkit for analysing scRNAseq data in Python.

Seurat
- Basic toolkit for analysing scRNAseq data in R.

Integration

scVI
- Probabilistic models for scRNAseq analysis
- Also very useful for reference mapping and label transfer

Others
- BBKNN
- Scanorama
- Harmony

Which tool should I use? - Depends on the data and computational setup available!
Read Luecken et al. Nature Methods (2022) for benchmarking of different scRNAseq integration methods.

Useful tools

Trajectory analysis
- Palantir
- Slingshot

Cell-cell interaction analysis
- CellphoneDB

Curated database of published single-cell transcriptomic datasets
Svensson, Beltrame and Pachter, Database 2020. https://doi.org/10.1093/database/baaa073

Interfacing single-cell and bulk sequencing

Pseudobulk scRNAseq analysis
- Increases robustness and reduces false-discovery rate in single-cell differential expression.
- Read more about DE testing in scRNAseq here.
- Method by Marioni lab: pseudoBulkDGE.
- My personal method: CLpseudobulk.
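The core idea of pseudobulking is simply to sum raw counts over all cells that share a sample and a cell type, so that each sample (rather than each cell) becomes a replicate for DE testing. The sketch below shows that aggregation step in plain Python. It is not the pseudoBulkDGE or CLpseudobulk implementation: the function name, the data layout and the toy counts are assumptions made for illustration.

```python
from collections import defaultdict


def pseudobulk(counts, cell_sample, cell_type):
    """Sum raw counts per (sample, cell type) pair.

    counts: dict mapping cell id -> dict of gene -> raw count.
    cell_sample: dict mapping cell id -> sample label.
    cell_type: dict mapping cell id -> cell type label.
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        key = (cell_sample[cell], cell_type[cell])
        for gene, count in gene_counts.items():
            profiles[key][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in profiles.items()}


# Toy data: three T cells from two samples (hypothetical ids and counts).
counts = {
    "c1": {"GeneA": 2, "GeneB": 0},
    "c2": {"GeneA": 3, "GeneB": 1},
    "c3": {"GeneA": 1, "GeneB": 4},
}
cell_sample = {"c1": "s1", "c2": "s1", "c3": "s2"}
cell_type = {"c1": "T", "c2": "T", "c3": "T"}
bulk = pseudobulk(counts, cell_sample, cell_type)
```

The resulting per-sample profiles can then be passed to a bulk DE method such as DESeq2 or edgeR.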

Deconvolution
- CIBERSORTx: very common and well accepted deconvolution tool
- MuSiC: One of my favourite tools for single-cell directed bulk deconvolution
- DWLS: Another useful tool for single-cell directed bulk deconvolution.
Benchmarking deconvolution pipelines: Cobos et al. Nature Communications, 2020.

Spatial transcriptomics

Analysis
- Scanpy and Seurat built-in functions (see scRNAseq) are sufficient for most analyses.
- Squidpy provides additional tools and is built on Scanpy.

Deconvolution and spatial mapping of cell types
- Cell2location
- Tangram

Statistics

stats in R - staple for all types of statistical analyses.

glmnet for generalised linear models.

Data visualisation

ggplot2 in R - a staple! The R course on Udemy gives a good crash course on ggplot (see above).

ktplots: Useful for visualising single-cell RNAseq data and cell-cell interactions.

dittoSeq: Visualising single-cell and bulk RNAseq, with colour-blind accessibility considerations.

ggpubr: ggplot2 theme for publication-ready plots.

EnhancedVolcano: Publication-ready volcano plots.

A twitter thread on useful considerations when visualising data here.