Connect to Roar Collab
Go to the following link and enter “RC Shell Access”
Or, connect via terminal using:
ssh [psuid]@submit.hpc.psu.edu
Request 1 node with 8 cores, 128 GB memory, enter:
salloc --partition=sla-prio --ntasks=1 --cpus-per-task=8 --mem=128G
Conda Environment Setup
You only need to do this once. If you already have these programs
installed, you can skip this step.
Create a conda environment named kraken2 and install
the required
Activate kraken2 Conda Environment
Activate the kraken2 conda environment in your work directory.
cd $HOME/work
module load anaconda
conda activate kraken2
Run Fastp for Read QC
Go to your scratch directory. Create output directory for fastp and
run
cd $HOME/scratch/DAWG_MGS_1
mkdir fastp_qc
fastp --in1 dawg_mgs1_R1.fastq.gz --in2 dawg_mgs1_R2.fastq.gz --out1 fastp_qc/dawg_mgs1_R1.fastq.gz --out2 fastp_qc/dawg_mgs1_R2.fastq.gz --trim_poly_g --html fastp_qc/dawg_mgs1_report.html --json fastp_qc/dawg_mgs1_report.json --thread 8
--in1 Input FASTQ file
--in2 Input FASTQ file
--out1 Output file for the cleaned forward reads
--out2 Output file for the cleaned reverse reads
--trim_poly_g Trims poly-G tails, which are artifacts
--html Generate a quality control report HTML format
--json Generates a quality control report JSON format
--thread Use multiple threads for faster processing
Remove host reads using Bowtie2
You only want the microbial reads. You can use Bowtie2 and the human
genome to remove human reads to decontaminate your samples.
unzip GRCh38_noalt_as.zip
mkdir bowtie2_qc
bowtie2 -p 8 -x GRCh38_noalt_as/GRCh38_noalt_as -1 fastp_qc/dawg_mgs1_R1.fastq.gz -2 fastp_qc/dawg_mgs1_R2.fastq.gz --very-sensitive-local --un-conc-gz dawg_mgs1_host_removed > bowtie2_qc/dawg_mgs1_mapped_and_unmapped.sam
mv *host_removed* ./bowtie2_qc
cd bowtie2_qc
mv dawg_mgs1_host_removed.1 dawg_mgs1_host_removed_R1.fastq.gz
mv dawg_mgs1_host_removed.2 dawg_mgs1_host_removed_R2.fastq.gz
Run Kraken2 for Taxonomic Classification
Create output directory and run kraken2 on trimmed and
decontaminatedbfiles. Replace SAMPLE_ID with your sequence name.
cd $HOME/scratch/DAWG_MGS_1
mkdir kraken2
kraken2 --db k2_standard_20250714 --threads 8 --report-minimizer-data --report kraken2/dawg_mgs1.report --output kraken2/dawg_mgs1.kraken2 --use-names bowtie2_qc/dawg_mgs1_host_removed_R1.fastq.gz bowtie2_qc/dawg_mgs1_host_removed_R2.fastq.gz
kraken 2 Run the Kraken2 taxonomic classifier
--db Use the specified Kraken2 database for classification
--threads Use multiple threads for faster processing; $NTHREADS is a variable
--quick Stop classification at the first match for each read to speed up processing (less accurate)
--report-minimizer-data Include minimizer data in the report for more detailed analysis
--report Save a summary classification report for the sample
--output Save full classification results for each read
--use-names Include taxonomic names in the output instead of just IDs
input file forward reads after quality control
input file reverse reads after quality control
Run Bracken for Species Abundance Estimation
bracken -d k2_standard_20250714 -i kraken2/dawg_mgs1.report -l F -r 150 -t 8 -o kraken2/dawg_mgs1.species_abundances -w kraken2/dawg_mgs1.bracken
bracken Run Bracken to estimate species abundance from Kraken2 classification results
-d Path to the Kraken2 database used for classification
-i Input file: Kraken2 report for the sample
-l S Taxonomic level to estimate abundance at (F = Family)
-r 150 Read length used in the sequencing data (important for accurate abundance estimation)
-t Number of threads to use for faster processing
-o Output file with estimated species abundances
-w Output file with Bracken's internal abundance estimation data
Plot Relative Abundances in R
Start an R session on Roar Collab. Open the OnDemand interactive
desktop
icon and select RStudio Server. Launch with default options.