SNPs_pipeline

VCF pipeline for downstream analyses: filtering, cleaning, analyzing SNPs, finally infer a phylogeny.

View the Project on GitHub DianaCarolinaVergara/SNPs_pipeline

Phylogeny Construction with SNPs

Better explanaitions in our wiki: Wiki

Or the R Markdown with all the commands >file<

Or a full population genetics R pipeline in GrunLab

Pipeline

We received a vcf file where SNPsaurus converted genomic DNA into nextRAD genotyping-by-sequencing libraries (SNPsaurus, LLC). The nextRAD libraries were sequenced on a HiSeq 4000 on two lanes of 150bp reads.

With this pipeline you would be able to:

Modified from population genetics http://grunwaldlab.github.io/Population_Genetics_in_R/qc.html

genepop, parallel, poppr, dartR, devtools, phytools, seqinr, phylotools, adegenet, pegas, hierfstat

library(vcfR)
library(ggplot2)
library(reshape2)
install_github("whitlock/OutFLANK")
biocLite("qvalue")

library("genepopedit")
library("devtools")
library("pcadapt")
library("qvalue")
library("OutFLANK")
library("ggplot2")
library(genepop)
library(vcfR)
library(ade4)
...

  1. Remove indels (insertions-deletions)
Muricea.vcf_ID_SNPs <- extract.indels(Muricea.vcf_ID,return.indels = FALSE)
Muricea.vcf_ID_SNPs  
Muricea.vcf_ID_Indels <- extract.indels(Muricea.vcf_ID,return.indels = TRUE)
Muricea.vcf_ID_Indels
  1. SNPs upper and lower 20% of depth distribution
   Muricea_dp<- extract.gt(Muricea.vcf_ID_SNPs, element = "DP", as.numeric=TRUE)
  1. Delete:

    3.1 Samples (missingness >70%)

    Muricea.vcf_ID_SNPs_miss <- apply(Muricea.vcf_ID_SNPs_dp, MARGIN = 2, function(x){ sum(is.na(x)) })
    Muricea.vcf_ID_SNPs_miss <- Muricea.vcf_ID_SNPs_miss/nrow(Muricea.vcf_ID_SNPs_dp)*100
    

    3.2 SNPs - variants (>90%) with high degree of missingness information

    Muricea_miss <- apply(Muricea_dp, MARGIN = 1, function(x){ sum( is.na(x) ) } )
    Muricea_miss <- Muricea_miss / ncol(Muricea_dp)*100
    Muricea_DP_filtered <- Muricea.vcf_ID_SNPs[Muricea_miss<90]
    
  2. Rewrite vcf file

   write.vcf(Muricea_DP_filtered,file= "Muricea_DP_90Miss_filtered.vcf")

genind_Muricea <- vcfR2genind(Muricea_filtered_no_clone_DP_Miss90_MAF_vcf_last)
genind_Muricea
Muricea_distances_FINAL.dist<-diss.dist(genind_Muricea, mat=TRUE, percent = TRUE)
write.table(Muricea_distances_FINAL.dist, file="Muricea_distances_FINAL.dist.txt", col.names = TRUE, row.names = TRUE, quote = FALSE)
write.csv(Muricea_distances_FINAL.dist, file="Muricea_distances.dist_FINAL.csv", col.names = TRUE, row.names = TRUE, quote = FALSE)

To verify quality filters

  1. Heatmaps
  heatmap.bp(Muricea_DP_70Miss_MAF01_Bialle_LD150_filtered.vcf_dp[1:1000,], rlabels = FALSE)
  1. Barplots

  2. Violinplots

  dp <- extract.gt(Muricea_DP_90Miss_filtered.vcf,  element = "DP", as.numeric = TRUE)
  class(dp)


  dpf <- melt(dp, varnames = c("Index", "Sample"),
              value.name = "Depth", na.rm = TRUE)
  dpf <- dpf[ dpf$Depth > 0, ]
  p <- ggplot(dpf, aes(x = Sample, y = Depth))
  p <- p + geom_violin(fill = "#C0C0C0", adjust = 1.0,
                       scale = "count", trim = TRUE)
  p <- p + theme_bw()
  p <- p + theme(axis.title.x = element_blank(),
                 axis.text.x = element_text(angle = 60, hjust = 1))
  p <- p + scale_y_continuous(trans = scales::log2_trans(),
                              breaks = c(1, 10, 100, 800),
                              minor_breaks = c(1:10, 2:10 * 10, 2:8 * 100))
  p <- p + theme(panel.grid.major.y = element_line(color = "#A9A9A9", size = 0.6))
  p <- p + theme(panel.grid.minor.y = element_line(color = "#C0C0C0", size = 0.2))
  p <- p + ylab("Depth (DP)")
  p

Suitable for RAxML

Or you can use another tool from CIPRESS Phylogenetic Collection (BEAST2, MRBAYES, RAXML) CyVerse


Need:


References

___________________

.

.

.

Future pipeline:

  1. Learn the SNPs format

  2. SNPs calling

  3. SNPs filtering 3.1. Samples quality 3.2. Variants quality