MetaG and metaT approaches both have limitations. In metaG, we know who is there and what they could do, but we don’t know which genes are actively expressed and to what extent. In metaT, we know which genes are actively expressed and to what extent, but we do not necessarily know who is doing the work. Thus, it is advantageous to combine the two approaches. Genome- (or MAG) resolved metaT tells us both which genes are actively transcribed/to what extent and in which organisms. So, we are able to have a sense for who is doing what in the community.
NOTE: the approach in this workshop is very basic. Please consider the specific parameters that would make sense to use for your individual project.
Enter the Roar dashboard and start a persistent terminal with 16 threads and 64 GB RAM
Download all the files and data:
cd $HOME/scratch
cp -r /scratch/azv5523/DAWG_wkshop/ ./
Load anaconda with module load anaconda
Make the needed conda environment and install the packages needed:
conda create -n DAWG_S26_2
conda activate DAWG_S26_2
mamba install -c bioconda -c conda-forge fastp bowtie2 samtools bakta kallisto -y
conda install -c bioconda/label/cf201901 metabat2 -y
mkdir fastp_qc
cd fastp_qc
mkdir metaT
mkdir metaG
cd $HOME/scratch/DAWG_wkshop
fastp –in1 reads/metaG/metaG_1.fastq.gz –in2 reads/metaG/metaG_2.fastq.gz –out1 fastp_qc/metaG/metaG_1_clean.fastq.gz –out2 fastp_qc/metaG/metaG_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16
fastp –in1 reads/metaT/metaT_1.fastq.gz –in2 reads/metaT/metaT_2.fastq.gz –out1 fastp_qc/metaT/metaT_1_clean.fastq.gz –out2 fastp_qc/metaT/metaT_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16
Take a look at the html outputs from fastp.
Don’t do this during the workshop. It takes hours. Use the ‘contigs.fa’ file generated from this process in the following steps.
mkdir megahit
megahit -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -o ./megahit
bowtie2-build megahit/contigs.fa contigs_index This will
take ~15 mins. Run this only if there is plenty of time, and want to see
the bowtie2 index format. No need to run this if not, as we already
provide the SAM file.
mkdir alignments
bowtie2 -x contigs_index -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -S alignments/sample1.sam -p 12##DO
NOT RUN THIS (takes hours, use output_depth.txt file we provided for
step 5).
samtools view -bS alignments/sample1.sam | samtools sort -o alignments/sample1.bam –threads 12
samtools index alignments/sample1.bam
jgi_summarize_bam_contig_depths –outputDepth output_depth.txt alignments/sample1.bam
mkdir -p metabat2_bins
metabat2 -i megahit/contigs.fa -o metabat2_bins/bin -a output_depth.txt –numThreads 16 –seed 42
mkdir bakta_output
Run the bakta.sh bash script to annotate all of the MAGs
with bakta. This will take hours, so use the .ffn files provided
for the next steps.
These files are really handy because they contain both the names of the indentified genes/CDSs and their DNA sequence for each MAG in fasta format. We will use these files to make indexes for mapping metaT reads.
cd bakta_output
mkdir ffn_files
mv *.ffn ffn_files
cd $HOME/scratch/DAWG_wkshop/
mkdir kallisto_indexes
Run the kallisto_indexes.sh bash script for making an
index for each .ffn file.
mkdir kallisto_quant
Run the kallisto_quant.sh bash script for mapping metaT
reads back to the indexes to quantify transcript abundance.
This will take several hours, so we have provided you an example output for ‘bin.1’.
conda deactivate
Take a look at the output for bin.1 from the
kallisto_quant.sh script:
cd kallisto_quant/bin.1
less abundances.tsv
Press ‘Q’ to exit the file view.
The ‘target IDs’ are from the bin.1.ffn file. So, to get the conversions to gene names we can understand, we can make a conversion file with the following commands:
cd $HOME/scratch/DAWG_wkshop/bakta_output/ffn_files/
cat bin.1.ffn | grep '>' > conv_intermediate.txt
sed 's/>//g' conv_intermediate.txt > conversions.txt
less conversions.txt
Now we know what the target ID’s mean.
Compare samples from 2 environments/treatments with differential gene expression analysis, resolved by bin. This tells you which genes are differentially expressed and which organisms are responsible. You can use DESeq2 or Sleuth.
Use KEGG or GO tools to visualize/identify expressed pathways in your sample.