Connect to Roar Collab

Or, connect via terminal using:

ssh [psuid]@submit.hpc.psu.edu

Request 1 node with 8 cores, 128 GB memory, enter:

salloc --partition=sla-prio --ntasks=1 --cpus-per-task=8 --mem=128G

Conda Environment Setup

You only need to do this once. If you already have these programs installed, you can skip this step.

Create a conda environment named kraken2 and install the required

tools: fastp, bowtie2, kraken2, and bracken.

conda create -n kraken2 -c bioconda -c conda-forge fastp bowtie2 kraken2 bracken

Get the Data

Activate kraken2 Conda Environment

Activate the kraken2 conda environment in your work directory.

cd $HOME/work
module load anaconda
conda activate kraken2

Run Fastp for Read QC

Go to your scratch directory. Create output directory for fastp and run

cd $HOME/scratch/DAWG_MGS_1
mkdir fastp_qc
fastp --in1 dawg_mgs1_R1.fastq.gz --in2 dawg_mgs1_R2.fastq.gz --out1 fastp_qc/dawg_mgs1_R1.fastq.gz --out2 fastp_qc/dawg_mgs1_R2.fastq.gz --trim_poly_g --html fastp_qc/dawg_mgs1_report.html --json fastp_qc/dawg_mgs1_report.json --thread 8

 --in1 Input FASTQ file 
 --in2 Input FASTQ file
 --out1 Output file for the cleaned forward reads
 --out2 Output file for the cleaned reverse reads
 --trim_poly_g Trims poly-G tails, which are artifacts
 --html Generate a quality control report HTML format
 --json Generates a quality control report JSON format
 --thread Use multiple threads for faster processing

Remove host reads using Bowtie2

You only want the microbial reads. You can use Bowtie2 and the human genome to remove human reads to decontaminate your samples.

unzip GRCh38_noalt_as.zip
mkdir bowtie2_qc

bowtie2 -p 8 -x GRCh38_noalt_as/GRCh38_noalt_as -1 fastp_qc/dawg_mgs1_R1.fastq.gz -2 fastp_qc/dawg_mgs1_R2.fastq.gz --very-sensitive-local --un-conc-gz dawg_mgs1_host_removed > bowtie2_qc/dawg_mgs1_mapped_and_unmapped.sam

mv *host_removed* ./bowtie2_qc

cd bowtie2_qc
  
mv dawg_mgs1_host_removed.1 dawg_mgs1_host_removed_R1.fastq.gz
mv dawg_mgs1_host_removed.2 dawg_mgs1_host_removed_R2.fastq.gz

Run Kraken2 for Taxonomic Classification

Create output directory and run kraken2 on trimmed and decontaminatedbfiles. Replace SAMPLE_ID with your sequence name.

cd $HOME/scratch/DAWG_MGS_1

mkdir kraken2

kraken2 --db k2_standard_20250714 --threads 8 --report-minimizer-data --report kraken2/dawg_mgs1.report --output kraken2/dawg_mgs1.kraken2 --use-names bowtie2_qc/dawg_mgs1_host_removed_R1.fastq.gz bowtie2_qc/dawg_mgs1_host_removed_R2.fastq.gz

kraken 2 Run the Kraken2 taxonomic classifier
 --db Use the specified Kraken2 database for classification
 --threads Use multiple threads for faster processing; $NTHREADS is a variable
 --quick Stop classification at the first match for each read to speed up processing (less accurate)
 --report-minimizer-data Include minimizer data in the report for more detailed analysis
 --report  Save a summary classification report for the sample
 --output Save full classification results for each read
 --use-names Include taxonomic names in the output instead of just IDs
 input file forward reads after quality control
 input file reverse reads after quality control

Run Bracken for Species Abundance Estimation

bracken -d k2_standard_20250714 -i kraken2/dawg_mgs1.report -l F -r 150 -t 8 -o kraken2/dawg_mgs1.species_abundances -w kraken2/dawg_mgs1.bracken
  
bracken Run Bracken to estimate species abundance from Kraken2 classification results
  -d Path to the Kraken2 database used for classification
 -i Input file: Kraken2 report for the sample
  -l S Taxonomic level to estimate abundance at (F = Family)
  -r 150 Read length used in the sequencing data (important for accurate abundance estimation)
  -t Number of threads to use for faster processing
  -o Output file with estimated species abundances
  -w Output file with Bracken's internal abundance estimation data

Plot Relative Abundances in R

Start an R session on Roar Collab. Open the OnDemand interactive desktop

icon and select RStudio Server. Launch with default options.

Set working directory to where your input files are located.

# Read the data
setwd("~/Desktop/") # change to where your file is located
data <- read.delim("dawg_mgs1.species_abundances", sep="\t")

# Sort data by relative abundance
data <- data %>% arrange(fraction_total_reads)

# Plot horizontal bar plot
p1 <- ggplot(data, aes(x = reorder(name, fraction_total_reads), y = fraction_total_reads)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Family Relative Abundance",
       x = "Families",
       y = "Relative Abundance (Fraction of Total Reads)") +
  theme_minimal()

ggsave("horizontal_bar.png", plot = p1, width = 10, height = 6, dpi = 300)


# Exclude human reads
data <- data %>% filter(name != "Hominidae") 
# Removes rows where the family name is "Hominidae" to focus on microbial taxa

# Normalize to ensure total sums to 1
data <- data %>%
  mutate(fraction_total_reads = fraction_total_reads / sum(fraction_total_reads))  
# Recalculates relative abundances so they sum to 1, useful if filtering changed the total

# Add a dummy column for stacking
data$Sample <- "Microorganisms"  
# Adds a new column called 'Sample' with the same value for all rows, useful for faceted or stacked plots

# Plot stacked bar plot
p2 <- ggplot(data, aes(x = Sample, y = fraction_total_reads, fill = name)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(title = "Stacked Bar Plot of Relative Abundances",
       x = NULL,
       y = "Relative Abundance (Fraction of Total Reads)",
       fill = "Family") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "right")


ggsave("stacked_bar.png", plot = p2, width = 10, height = 6, dpi = 300)