Automated processing of ChIP-seq and ATAC-seq samples — process

This function performs all necessary steps in the ATAC-seq and ChIP-seq processing pipeline, from quality control to peak calling.

Usage

process_epigenome(
  fastq_files,
  out_name = NULL,
  seq_type = c("ATAC", "CHIP", "CT"),
  type = "SE",
  cores = 8,
  path_fastqc = "FastQC/",
  path_bam = "BAM/",
  path_peaks = "Peaks/",
  path_logs = "Logs/",
  run_fastqc = TRUE,
  index = "/vault/refs/indexes/hg38",
  extra_bowtie2 = "",
  remove = c("chrM", "chrUn", "_random", "_hap", "_gl", "EBVls"),
  blacklist = "/vault/refs/hg38-blacklist.v2.bed",
  type_peak = c("narrow", "broad"),
  shift = c(TRUE, FALSE),
  chunk = 1e+07,
  gen_sizes = "/vault/refs/hg38.chromSizes.txt"
)

Arguments

fastq_files: File names of the samples to be analysed. For single-end data, a character vector with one file per sample; for paired-end data, a list in which each element is a character vector of length 2 (R1 and R2). See Details.
out_name: Character vector, with the same length as fastq_files, indicating the output filenames.
seq_type: Experiment type, one of "ATAC" (default), "CHIP" or "CT".
type: Sequence type, one of "SE" (single end) or "PE" (paired end).
cores: Number of threads to use for the analysis.
path_fastqc: Character indicating the output directory for the FastQC reports.
path_bam: Character indicating the output directory for the bam files.
path_peaks: Character indicating the output directory for the peak files.
path_logs: Character indicating the output directory for the logs.
run_fastqc: Logical indicating whether to run (TRUE) or not (FALSE) FastQC. Default: TRUE.
index: Character indicating the location and basename for the Bowtie2 index.
extra_bowtie2: Character containing additional arguments to be passed to the Bowtie2 alignment call.
remove: Character vector of patterns identifying the chromosomes to filter out. Any chromosome whose name contains a match for one of these patterns will be removed.
blacklist: Character indicating the file containing blacklist regions in bed format. Any reads overlapping these regions will be discarded.
type_peak: Character indicating the type of peak to be called with MACS2, either "narrow" or "broad".
shift: Logical indicating whether the reads should be shifted -100bp and extended to 200bp (TRUE) or not (FALSE, default).
chunk: Size of the chunk to load into memory for ATAC-seq read offset. This argument is necessary only when type="SE".
gen_sizes: Character string indicating the path where the file with chromosome name and sizes can be found. This argument is necessary only when type="SE".
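
As a rough illustration of the matching described for remove, the sketch below filters a vector of chromosome names with base R. The chromosome names are made up for the example, and treating each pattern as a literal substring is an assumption; the function's internal implementation may differ.

```r
# Patterns taken from the default value of `remove`
remove <- c("chrM", "chrUn", "_random", "_hap", "_gl", "EBVls")

# Hypothetical chromosome names, for illustration only
chr_names <- c("chr1", "chr2", "chrX", "chrM",
               "chr1_gl000191_random", "chrUn_gl000211")

# Flag every name that contains at least one pattern (literal substring match)
to_drop <- vapply(
  chr_names,
  function(chr) any(vapply(remove, grepl, logical(1), x = chr, fixed = TRUE)),
  logical(1)
)

kept <- chr_names[!to_drop]
print(kept)
```

With these inputs only chr1, chr2 and chrX are kept.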

Value

Creates the folders path_fastqc, path_bam, path_peaks and path_logs (by default in your working directory), containing the output files from the different analyses.

Details

This function processes ATAC-seq or ChIP-seq data from FastQ files using the following pipeline:

Quality Control (FastQC).
Alignment to reference genome (Bowtie2).
Post-processing (Samtools), including removing duplicates, blacklisted regions and non-reference chromosomes.
(only for ATAC-seq) Offset correction (Samtools).
Peak calling (MACS2).
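
The steps above can be sketched as the command lines that such a pipeline might assemble for one single-end sample. This is a minimal illustration, not the package's actual implementation: the helper build_commands() and the intermediate file names are hypothetical, and the flags are assumptions based on the standard FastQC, Bowtie2, Samtools and MACS2 command-line interfaces. Blacklist filtering, chromosome filtering and the ATAC-seq offset correction are omitted for brevity.

```r
# Hypothetical helper: builds (but does not run) one command per pipeline step.
# Flags are assumptions, not the calls made internally by process_epigenome().
build_commands <- function(fastq, name, cores = 8,
                           index = "/vault/refs/indexes/hg38",
                           path_fastqc = "FastQC/", path_bam = "BAM/",
                           path_peaks = "Peaks/", shift = TRUE) {
  raw_bam <- paste0(path_bam, name, ".raw.bam")
  dedup_bam <- paste0(path_bam, name, ".bam")

  # MACS2 settings mirroring the `shift` argument (-100 bp shift, 200 bp extension)
  shift_args <- if (shift) "--nomodel --shift -100 --extsize 200" else ""

  c(
    # 1. Quality control
    fastqc = paste("fastqc -t", cores, "-o", path_fastqc, fastq),
    # 2. Alignment, piped into a coordinate sort
    align = paste("bowtie2 -p", cores, "-x", index, "-U", fastq,
                  "| samtools sort -@", cores, "-o", raw_bam, "-"),
    # 3. Duplicate removal (markdup expects coordinate-sorted input)
    dedup = paste("samtools markdup -r", raw_bam, dedup_bam),
    # 4. Peak calling
    peaks = trimws(paste("macs2 callpeak -t", dedup_bam, "-f BAM -g hs -n", name,
                         "--outdir", path_peaks, shift_args))
  )
}

cmds <- build_commands("path/to/file.fastq.gz", "sample1")
cat(cmds, sep = "\n")
```

Building the commands as strings first makes it easy to inspect or log them before anything is executed.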

This function can process paired and single end FastQ files:

Single end files. The argument fastq_files should be a character vector with the name of each file.
Paired end files. The argument fastq_files should be a list in which each element is a character vector of length 2: the first entry is the R1 file and the second is the R2 file.
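
To make the two input layouts concrete, the sketch below builds a single-end and a paired-end fastq_files object. The file paths are placeholders, and the paired-end layout assumes that each list element holds the R1 and R2 files of one sample. The process_epigenome() call is wrapped in if (FALSE) because it needs the external tools and reference files to be available.

```r
# Single end: one file per sample, as a character vector
se_files <- c("path/to/sample1.fastq.gz", "path/to/sample2.fastq.gz")

# Paired end: one list element per sample, holding c(R1, R2)
pe_files <- list(
  c("path/to/sample1_R1.fastq.gz", "path/to/sample1_R2.fastq.gz"),
  c("path/to/sample2_R1.fastq.gz", "path/to/sample2_R2.fastq.gz")
)

# One output name per sample in both layouts
out_name <- c("sample1", "sample2")
stopifnot(length(se_files) == length(out_name),
          length(pe_files) == length(out_name),
          all(lengths(pe_files) == 2))

if (FALSE) {
  process_epigenome(fastq_files = pe_files,
                    out_name = out_name,
                    seq_type = "CHIP",
                    type = "PE",
                    type_peak = "narrow",
                    cores = 8)
}
```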

Examples

if (FALSE) {
process_epigenome(fastq_files=c("path/to/file.fastq.gz", "path/to/file2.fastq.gz"),
                  seq_type="ATAC",
                  out_name=c("sample1", "sample2"),
                  type="SE",
                  cores=8)
}