DCS Tools


1. Introduction

1.1 Description

DCS Tools is advanced software leveraging CPU acceleration technology, designed to serve as a drop-in alternative to BWA-MEM, GATK, and large-scale population joint calling in data analysis. It reduces storage costs by shrinking FASTQ.gz file sizes. Its functionalities include germline mutation analysis for whole-genome sequencing (WGS) and whole-exome sequencing (WES), joint calling for population variant detection, and a compression/decompression tool for FASTQ files based on reference genome sequences.

1.2 Advantages and Value

1) Consistent and reliable germline variant detection tool

This tool ensures the core algorithmic logic of modules like SOAPnuke, BWA, and GATK remains unchanged while incorporating excellent parallelization design and restructuring, accelerating computation and reducing unnecessary I/O operations. It not only significantly speeds up processing but also ensures high consistency with the GATK system.

Tool highlights:

  • Uses the same mathematical methods as the BWA-GATK best practices workflow from the Broad Institute but is 16x faster from FASTQ to VCF and 42x faster from BAM to VCF (measured in core·hours).

  • No runtime variability, no downsampling in high-coverage regions.

  • Pure software solution running on CPU-based systems, cross-platform compatible with X86, ARM and cloud environments.

  • Tested on Alibaba Cloud ECS i4g.8xlarge.

    • 30x genome, NA12878 (HG001), completed in 1.82 hours (58 core·hours).

2) Joint Calling: A tool for merging gVCFs and joint calling in population sequencing projects

Tool highlights

  • Fast: Completes joint variant detection for 52,000 whole genomes in 6 days.

  • Accurate: GIAB standard sample SNP F1 score of 99.44%, INDEL F1 score of 98.95%.

  • Scalable: Supports joint calling for millions of samples.

  • User-friendly: Automatically allocates tasks, eliminating the need for users to design parallel strategies.

3) FASTQ format data compression tool

Follows the logic of most tools by independently compressing the ID, sequence, and quality values in FASTQ files. For sequence compression, in addition to entropy encoding, it provides sequence alignment modules based on HASH or BWT to improve the compression ratio for successfully aligned sequences. It also employs a data block design to balance compression ratio and performance, while the introduction of a packaging format ensures data security, independent partial data access, and backward compatibility.

Specialized Optimization

Maintains high compression and decompression speeds while ensuring a high compression ratio.

Reference genome sequences can exceed 4G.

Effective compression for both short and long reads.

Convenience and Flexibility

Supports Linux platforms, allows flexible setting of concurrent threads, and decompression has no license restrictions.

Complete Ecosystem

Seamlessly integrates with other tools in DCS Tools, enabling simultaneous decompression and computation.

Data is encapsulated in basic units, facilitating format extension and ensuring backward compatibility.

1.3 Platform Requirements

DCS Tools is designed to run as an executable program on Linux systems. Recommended Linux distributions include CentOS 7, Debian 8, Ubuntu 14.04, and later versions.

Hardware requirements for DCS Tools:

  • CPU: At least 8 cores.

  • Memory: At least 128 GB.

  • SSDs are recommended for faster read/write speeds.
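A quick pre-flight check against these minimums can be scripted. The thresholds below come from this section; the probing commands (`nproc`, `/proc/meminfo`) are standard Linux facilities, not part of DCS Tools.

```shell
# Pre-flight check against the documented minimums (8 cores, 128 GB RAM).
cores=$(nproc)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
mem_gb=$((mem_kb / 1024 / 1024))
echo "CPU cores: $cores (minimum 8)"
echo "Memory:    ${mem_gb} GB (minimum 128)"
[ "$cores" -ge 8 ]   || echo "WARNING: fewer than 8 CPU cores"
[ "$mem_gb" -ge 128 ] || echo "WARNING: less than 128 GB memory"
```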

1.4 License Verification

To verify the validity of the license during use, you need a node (referred to as the "license node") to handle the verification communication. This node must maintain stable communication with the license server, and it can also double as a compute node (as shown in the example diagram at the end of this section). Additionally, you will need a license file, which can be obtained by contacting the platform (cloud@stomics.tech) for purchase or trial.

1.4.1 Obtaining the Compute Node's IP Address

Ensure that the compute node is not behind HTTP_PROXY or HTTPS_PROXY and can access cloud.stomics.tech.

When applying for a license file, you need to provide the external IP address of the compute node. You can obtain this IP address by running the following command:

curl --noproxy "*" -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

The returned data field will contain the IP address:

{"code":0,"msg":"Success","data":"172.19.29.215"}
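If you want to capture just the IP for the license application, the data field can be pulled out of the reply with sed. The sample reply below is the one shown above, substituted for a live curl call; in practice you would pipe the curl output into the same sed expression.

```shell
# Extract the "data" field (the external IP) from the echo endpoint's JSON
# reply. No dependencies beyond sed; the reply is hard-coded for illustration.
reply='{"code":0,"msg":"Success","data":"172.19.29.215"}'
ip=$(printf '%s' "$reply" | sed -n 's/.*"data":"\([^"]*\)".*/\1/p')
echo "$ip"
```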

If the following error occurs:

curl: (35) error:0A000152:SSL routines::unsafe legacy renegotiation disabled  

Run the following command to obtain the IP address:

echo -e "openssl_conf = openssl_init\n[openssl_init]\nssl_conf = ssl_sect\n[ssl_sect]\nsystem_default = system_default_sect\n[system_default_sect]\nOptions = UnsafeLegacyRenegotiation" | OPENSSL_CONF=/dev/stdin curl --noproxy "*" -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

1.4.2 Obtaining the License File

Contact the platform at cloud@stomics.tech to purchase or request a trial license.

1.4.3 Deploying the License

After obtaining the license file, unzip the package and place the license file in the libexec directory.

1.4.4 Configuring the License Proxy on the Compute Node

On the compute node, run the foreground license proxy using the following command:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic  

Replace /path/to/XXXXXXXXX.lic with the actual path to your license file. The license proxy defaults to listening on 0.0.0.0:8909. If this address is already in use, you will see a warning:

[2025-03-27T06:34:46Z WARN  pingora_core::listeners::l4] 0.0.0.0:8909 is in use, will try again

The system will automatically retry, or you can manually stop the process with Ctrl + C and specify a different address:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic --bind 0.0.0.0:8910  

If no errors occur, the proxy will start successfully:

[2025-03-27T07:15:52Z INFO  pingora_core::server] Bootstrap starting
[2025-03-27T07:15:52Z INFO  pingora_core::server] Bootstrap done
[2025-03-27T07:15:52Z INFO  pingora_core::server] Server starting

Before proceeding, verify that the license proxy can communicate with the platform:

curl http://127.0.0.1:8909/api/licensor/licenses/anon/pool  

The expected response indicates successful communication:

{"code":0,"msg":"Success"}

To run the proxy in the background, use the -d flag:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic -d  

This will run the proxy as a daemon process.

Note: Only one license proxy is needed per node when running in the background. Multiple license proxies are supported for the same Linux user on the same node, but not for multiple Linux users on the same node. To stop the proxy, use kill or delete the file /tmp/pingora.pid.

1.4.5 Running DCS Tools

After completing the above steps, the compute node can communicate with the license node to run DCS Tools. For processing WGS sequencing data, you can use the quick_start script provided by the platform. Refer to the @example demonstration script for details.

Before running, set the PROXY_HOST environment variable to the IP and port of the license proxy. Ensure that the current node can ping the IP. If the compute node and license node are the same machine, use 127.0.0.1 as the IP.

To verify the connection between the compute node and license node:

$DCS_HOME/bin/dcs lickit --query $PROXY_HOST

To compress a FASTQ file using DCS Tools:

export PROXY_HOST=<ip>:<port>
$DCS_HOME/bin/dcs SeqArc compress -i input.fastq.gz -o output.arc
Figure 1.1 DCS Tools License Verification Example Diagram

1.4.6 Steps After Upgrading

After obtaining an upgraded version, follow the instructions provided. Generally, you do not need to reconfigure the license proxy (Section 1.4.4). Simply unzip the new package and update the DCS_HOME path to the new directory.

1.5 Tool List

| Scenario | Tool (Module) Name | Function |
| --- | --- | --- |
| WGS/WES Analysis | aligner | Quality control, alignment, sorting, and duplicate marking |
| | bqsr | Base quality score recalibration |
| | variantCaller | Variant detection (similar to GATK HaplotypeCaller) |
| | genotyper | GVCF -> VCF |
| | report | Output report |
| Joint Calling for Large Cohorts | jointcaller | Joint variant detection; supports millions of samples |
| FASTQ Data Compression/Decompression | SeqArc | High-efficiency lossless FASTQ compression tool |
| License Query, etc. | lickit | Query available threads, test network connection, upload error logs, etc. |
| Utility Tools | extract-vcf | Simplify VCF files to reduce file size and alleviate I/O pressure; primarily for knownSites (e.g., dbSNP) input to bqsr and genotyper |
| | bam-stat | Alignment statistics |
| | vcf-stat | Variant statistics |
| | tbi2lix | Convert TBI index files to custom linear index (LIX) files for joint calling |

2. Typical Usage

2.1 WGS

WGS (Whole Genome Sequencing) is a technology that sequences the entire genome of an organism. The WGS analysis pipeline typically includes steps such as data quality control, sequence alignment, duplicate marking, base quality score recalibration, and variant detection. DCS Tools follows the original algorithms while optimizing performance, resulting in a 16x faster analysis pipeline compared to BWA-GATK.

2.1.1 Preparations

Input files for the WGS analysis pipeline:

  1. Files required for sequence alignment:

| Index File | Description |
| --- | --- |
| species.fa | Reference genome FASTA file |
| species.fa.fai | For rapid location and random access to FASTA file regions |
| species.dict | Index for quick lookup of reference sequences in subsequent processes |
| species.fa.* | Alignment index files |

Build alignment indexes using the aligner tool. Place the index files in the same directory as the FASTA file. The command to build alignment indexes is:

dcs aligner --threads <INT> --build-index <FASTA>
  2. FASTQ sequencing data; the gz format is supported.
  3. Single nucleotide polymorphism (dbSNP) data and known variant sites in VCF format.

The WGS analysis pipeline provided by DCS Tools includes the following steps:

  1. Sequence alignment and duplicate marking: Quality control is performed before alignment. The sequences are aligned to the reference genome recorded in the FASTA file, and the alignment results are written to a BAM file after duplicate marking.
  2. Base quality score recalibration: Adjusts the quality scores assigned to each base in the sequence to eliminate experimental biases introduced by sequencing methods.
  3. Variant detection: Identifies variant sites in the sequencing data relative to the reference genome and calculates genotypes for each sample at these sites.
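The three steps above correspond to the dcs subcommands detailed in Sections 2.1.2 through 2.1.4. As an orientation aid, the following dry-run sketch prints the full command sequence without executing anything; all sample names, paths, and thread counts are placeholders, not values from the official documentation.

```shell
# Dry-run sketch of the single-sample WGS pipeline. run() prints each
# command instead of executing it; remove it once the paths are adjusted.
REF=/data/ref/hg38.fa                 # indexed FASTA (placeholder)
FQ1=sample1_R1.fastq.gz
FQ2=sample1_R2.fastq.gz
run() { echo "+ $*"; }

# 1. Alignment, sorting, duplicate marking (plus FASTQ QC)
run dcs aligner --threads 32 --aln-index "$REF" --fq1 "$FQ1" --fq2 "$FQ2" \
    --read-group '@RG\tID:grp1\tSM:sample1\tPL:ILLUMINA' \
    --bam-out sample1.bam --fastq-qc-dir qc --high-speed-storage /tmp
# 2. Base quality score recalibration
run dcs bqsr --threads 32 --in-bam sample1.bam --reference "$REF" \
    --known-sites dbsnp.vcf.gz --recal-table recal.table
# 3. Variant detection, then single-sample genotyping
run dcs variantCaller -I sample1.bam -O sample1.g.vcf.gz -R "$REF" \
    --recal-table recal.table -t 32
run dcs genotyper -I sample1.g.vcf.gz -O sample1.vcf.gz -s 10.0 -t 6
```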

2.1.2 Sequence Alignment and Duplicate Marking

Align sequences to the reference genome to determine the position of each sequence, providing a foundation for variant detection. Duplicate marking eliminates PCR-amplified duplicate sequences, reducing false positives and improving variant detection accuracy. These steps are completed by the aligner tool. The aligner also integrates data quality control functionality, filtering and processing sequences before alignment to reduce error rates and avoid interference from noisy data in subsequent analyses.

dcs aligner --threads             THREADNUM \
            --aln-index           INDEX \
            --fq1                 FASTQ1 \
            --fq2                 FASTQ2 \
            --read-group          RGROUP \
            --bam-out             BAMFILE \
            --fastq-qc-dir        QCDIR \
            --high-speed-storage  TEMPDIR

The command requires the following inputs:

  1. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.
  2. INDEX: Path to the reference genome FASTA file. Place the alignment index files built by the alignment software in the same directory as the FASTA file.
  3. FASTQ1: Path to the FASTQ1 file for paired-end sequencing.
  4. FASTQ2: Path to the FASTQ2 file for paired-end sequencing.
  5. RGROUP: Format is @RG\tID:GROUP_NAME\tSM:SAMPLE_NAME\tPL:PLATFORM, where GROUP_NAME is the read group identifier, SAMPLE_NAME is the sample name, and PLATFORM is the sequencing platform.
  6. BAMFILE: BAM file after duplicate marking.
  7. QCDIR: Output location for the quality control report.
  8. TEMPDIR: Storage path for temporary files. Placing this path on high-speed storage devices can improve software performance.
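One practical note on the RGROUP value: the backslash-t sequences are typically passed to the aligner literally (as with bwa-style tools), so the string should be single-quoted to keep the shell from altering it. The names below are placeholder values, and the literal-`\t` convention is inferred from the format shown above rather than stated explicitly in the documentation.

```shell
# Single quotes keep the \t sequences intact; unquoted backslashes may be
# stripped or interpreted by some shells. grp1/sample1/ILLUMINA are placeholders.
RGROUP='@RG\tID:grp1\tSM:sample1\tPL:ILLUMINA'
printf '%s\n' "$RGROUP"
```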

2.1.3 Base Quality Score Recalibration

Due to systematic errors, the base quality scores of sequences are not always entirely accurate. Recalibration corrects these biases, generating more precise quality scores to improve variant detection accuracy and reduce false positives and negatives. This step aligns with GATK's best practices and is completed by the bqsr tool.

dcs bqsr --threads          THREADNUM \
         --in-bam           BAMFILE \
         --reference        FASTA \
         --known-sites      KNOWN_SITES \
         --recal-table      RECAL_TABLE

The command requires the following inputs:

  1. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.
  2. BAMFILE: Input file. The BAM file generated after sequence alignment and duplicate marking.
  3. FASTA: Path to the reference genome FASTA file. Ensure it is the same as the one used in the sequence alignment stage.
  4. KNOWN_SITES: Single nucleotide polymorphism database data and known variant sites file.
  5. RECAL_TABLE: Output file. Saves information for recalibrating base quality scores.

2.1.4 Variant Detection

Detects SNP and INDEL variants, generates a GVCF file, and produces BAM statistics. The statistics file is stored in the same directory as the GVCF file, named bamStat.txt. This step is completed by the variantCaller tool.

dcs variantCaller -I INPUT_BAM \
                  -O OUTPUT_GVCF \
                  -R REFERENCE \
                  --recal-table RECAL_TABLE \
                  -t THREADNUM

The command requires the following inputs:

  1. INPUT_BAM: The BAM file output from step 2.1.2 (sorted and duplicate-marked).
  2. OUTPUT_GVCF: Output GVCF file.
  3. REFERENCE: Reference genome FASTA file.
  4. RECAL_TABLE: Input file containing information for base quality score recalibration, generated in step 2.1.3.
  5. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.

For a single sample, if only a VCF file containing variant records is needed, the genotyper tool can be used.

dcs genotyper -I INPUT_GVCF \
              -O OUTPUT_VCF \
              -s QUAL \
              -t THREADNUM

The command requires the following inputs:

  1. INPUT_GVCF: GVCF file from variantCaller.
  2. OUTPUT_VCF: Output VCF file.
  3. QUAL: Variant quality score threshold. Records with quality scores greater than or equal to QUAL will be output.
  4. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.

2.1.5 Report Generation

Consolidates statistics from each sample to generate a visual HTML report.

dcs report --sample SAMPLE_NAME \
           --filter QCDIR \
           --bam_stat BAMStat \
           --vcf_stat VCFStat \
           --output OUTPUT_DIR

The command requires the following inputs:

  1. SAMPLE_NAME: Sample name.
  2. QCDIR: Quality control statistics report folder generated by the aligner.
  3. BAMStat: bamStat.txt in the variantCaller result directory.
  4. VCFStat: VCF statistics results, defaulting to vcfStat.txt in the working directory.
  5. OUTPUT_DIR: Report output directory.

2.2 WES

The steps are largely the same as WGS, with the only difference being in the variantCaller step, which requires an additional interval parameter (-L).

dcs variantCaller -I INPUT_BAM \
                  -O OUTPUT_GVCF \
                  -R REFERENCE \
                  --recal-table RECAL_TABLE \
                  -L INTERVAL_FILE \
                  -t THREADNUM

INTERVAL_FILE: BED file recording WES target intervals. By convention, each line follows the format chr<tab>start<tab>end, where start and end are 0-based and the interval is left-closed, right-open (start is included, end is excluded).
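Because the intervals are left-closed and right-open, an interval's length is simply end minus start. A quick awk check on two made-up lines illustrates the convention:

```shell
# Each line covers end - start bases: chr1:0-100 is the first 100 bases of
# chr1, and chr2:999-2000 covers 1-based positions 1000..2000 (1001 bases).
bed_lengths=$(printf 'chr1\t0\t100\nchr2\t999\t2000\n' \
    | awk -F'\t' '{print $1 "=" $3 - $2}')
echo "$bed_lengths"
```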

2.3 Joint Calling

Joint Calling is a method for detecting population variants, specifically referring to the process of merging variants from multiple samples and recalculating sample genotypes using multiple GVCF files as input. DPGT is a distributed Spark program that efficiently performs joint calling for large-scale samples. dcs jointcaller is a wrapper for the DPGT tool, used to run DPGT in standalone mode.

# Set JAVA_HOME to JDK 8
export JAVA_HOME=/path_to/jdk8
# Add the Spark bin directory to PATH so spark-submit can be invoked
export PATH=/path_to/spark/bin/:$PATH
dcs jointcaller \
    -i GVCF_LIST \
    -r REFERENCE \
    -o OUTDIR \
    -j JOBS \
    --memory MEMORY \
    -l REGION

DPGT depends on Java 8. Set JAVA_HOME=/path_to/jdk8. Spark version 2.4.5 or higher is recommended. Download Spark 2.4.5 from: <https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.6.tgz>

The command requires the following inputs:

GVCF_LIST: Input GVCF list, a text file with one GVCF path per line. Note that GVCFs must have matching TBI index files. The program defaults to searching for TBI index files by appending .tbi to the GVCF path.
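A list file of this shape can be generated with ls and checked for the expected .tbi companions before launching. The directory and file names below are placeholders created on the spot purely for illustration.

```shell
# Build GVCF_LIST from a directory of gVCFs and verify that each entry has
# the .tbi index the program looks for by default.
mkdir -p gvcfs
: > gvcfs/s1.g.vcf.gz; : > gvcfs/s1.g.vcf.gz.tbi
: > gvcfs/s2.g.vcf.gz; : > gvcfs/s2.g.vcf.gz.tbi
ls gvcfs/*.g.vcf.gz > gvcf.list
missing=0
while read -r g; do
    [ -f "$g.tbi" ] || { echo "missing index: $g.tbi"; missing=$((missing+1)); }
done < gvcf.list
echo "entries: $(wc -l < gvcf.list | tr -d ' '), missing indexes: $missing"
```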

REFERENCE: Path to the reference genome FASTA file. Ensure this FASTA file is consistent with the one used to generate the GVCFs. The program also requires matching .dict and .fai files for the FASTA file (e.g., if the FASTA file is hg38.fasta, the program will also read hg38.dict and hg38.fasta.fai).

OUTDIR: Output folder. The resulting VCF path is OUTDIR/result.*.vcf.gz.

JOBS: Number of parallel tasks. The number of threads used is equal to JOBS, an integer ranging from 1 to the number of machine threads.

MEMORY: Maximum Java heap memory size. The unit is "g" or "m". Memory size is related to the JOBS value and sample size. For 1,000 samples, allocate 2g per job; for 2,000-5,000 samples, allocate 4g per job; for 5,000-100,000 samples, allocate 6g per job; for 100,000-500,000 samples, allocate 8g per job.

REGION: Target region. Supports input strings or BED paths. The region string format is consistent with bcftools, e.g., "chr20:1000-2000", "chr20:1000-", "chr20". To specify multiple regions, use multiple -l parameters, e.g., -l chr1 -l chr2:1000-2000.
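The per-job memory tiers above can be wrapped in a small helper so the MEMORY value can be derived from the cohort size. The behavior at tier boundaries (e.g., exactly 5,000 samples) is an assumption on our part, since the documented ranges overlap there.

```shell
# Map sample count to the documented per-job heap size in GB.
# Boundary handling (<=) is an assumption where the stated tiers overlap.
mem_per_job_gb() {
    n=$1
    if   [ "$n" -le 1000 ];   then echo 2
    elif [ "$n" -le 5000 ];   then echo 4
    elif [ "$n" -le 100000 ]; then echo 6
    else                           echo 8
    fi
}
echo "52000 samples -> $(mem_per_job_gb 52000)g per job"
```

With -j JOBS parallel tasks, total heap demand is roughly JOBS times this per-job figure, so size the machine accordingly.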

2.4 FASTQ Compression

FASTQ compression involves replacing standard FASTQ or FASTQ.gz files with smaller compressed files to save storage space, reduce data transfer time, and improve processing efficiency.

The SeqArc tool under DCS Tools is suitable for compressing all FASTQ data, though compression ratios and performance may vary depending on two main factors: whether reference-based or reference-free mode is used, and whether the sequencing technology produces short reads (second-generation) or long reads (third-generation).

Please refer to the following feature descriptions to understand these differences.

2.4.1 Feature Description

2.4.1.1 Reference-Based/Reference-Free Compression Modes

FASTQ files record sequenced sample fragments. Taking genome sequencing as an example, each sequence in a FASTQ file comes from a small segment of the sample genome. Since genomic differences between samples of the same species are minimal, providing a reference genome of the species during compression allows most sequences in the FASTQ file to find highly similar segments in the reference genome, thereby improving the compression ratio (though this search process incurs additional computational costs).

Based on this principle, the tool offers two modes: reference-based and reference-free compression.

  • Reference-based compression uses a reference sequence (in FASTA format) corresponding to the sequenced species during FASTQ file compression and decompression. This increases compression ratio at the cost of additional computation time.
  • Reference-free compression does not use a reference sequence, resulting in faster speeds but lower compression ratios.

Q: How to choose between these two modes? A: We recommend using the reference-based mode whenever the corresponding FASTA file is available. If possible, comparing both modes is ideal.

Q: When might the FASTA file be unavailable? A: Two scenarios:

  1. Multi-species sequencing: Since reference-based compression only supports a single FASTA file as input, data from multi-species scenarios (e.g., metagenome sequencing) must use reference-free mode. An exception is when the data contains a mix of a few species (e.g., human + E. coli), where manually merging their FASTA files into a non-standard FASTA can serve as the reference.
  2. No reference sequence: Many species lack reference FASTA files on NCBI. In such cases, reference-free mode is required. Alternatively, a reference genome from a closely related species may be tried.

Q: How do the two modes differ in usage? Can the FASTA be directly used as a parameter? A:

  • Reference-free mode: Compression and decompression can proceed directly without prerequisites.
  • Reference-based mode: Requires creating an additional index file for the FASTA file beforehand for sequence retrieval. The command dcs SeqArc index -r REF -o OUTDIR generates two indices in the OUTDIR directory: one for short reads and another for long reads. For example, the human genome GRCh38.fa (3.1GB) takes ~30 minutes to index, producing a short-read index (GRCh38.fa.mini_28_10, 14GB) and a long-read index (GRCh38.fa.similar_26_23, 6.1GB). Peak memory usage is ~35GB, scaling proportionally for larger genomes.

During reference-based compression/decompression, the -r parameter specifies the path to the reference sequence (xxx.fa, not the index). The same reference must be used for compressing and decompressing the same FASTQ file.
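To make the symmetry explicit, the following dry-run sketch (the run helper only prints the commands) shows the same -r value, the FASTA path rather than the index, flowing into both compression and decompression; all paths are placeholders.

```shell
# Dry run: the identical -r value must be used in both directions.
REF=/data/ref/GRCh38.fa      # placeholder FASTA path (not the index)
run() { echo "+ $*"; }
run dcs SeqArc compress   -r "$REF" -i sample.fastq.gz -o sample.arc -t 16
run dcs SeqArc decompress -r "$REF" -i sample.arc -o sample.fastq -t 16
```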

Q: Why are there two indices? A: Two major sequencing technologies exist:

  • Second-generation short-read sequencing: Produces fragments of a few hundred bp with high accuracy and throughput.
  • Third-generation long-read sequencing: Yields reads of several KB or longer but with lower accuracy.

These technologies produce data with distinct characteristics, necessitating different logic for reference-based mode. Both indices are pre-built during indexing to avoid manual specification.

2.4.1.2 Species Identification Feature

Reference-based mode requires manual specification of the reference sequence during compression/decompression. For scenarios with large or diverse FASTQ datasets (e.g., data warehouses), this incurs significant manual effort. The species identification feature automates this by requiring only a pre-built index directory containing FASTA files of relevant species. During compression/decompression, the tool automatically matches the FASTQ to the most similar FASTA.

Note: Despite using multiple FASTA files, the feature selects only the closest match for compression/decompression.

Steps to prepare species identification:

  1. Create a multi-species directory (e.g., /path/to/fa) as the repository.
  2. Download the taxonomy file taxdump.tar.gz from NCBI to /path/to/fa.
  3. Extract the file:
tar xzf taxdump.tar.gz
  4. Download and decompress the FASTA files for relevant species to /path/to/fa:
gunzip ./*fa.gz
  5. Create the species identification library index:
dcs SeqArc index -K /path/to/fa
  6. Build alignment indices for each FASTA (e.g., aaa.fa):
dcs SeqArc index -r /path/to/fa/aaa.fa

2.4.2 Workflow

File List for Compression:

| File | Description |
| --- | --- |
| *.fa | Reference sequence file |
| *.fa.mini_28_10 | Short-read alignment index |
| *.fa.similar_26_23 | Long-read alignment index |
| *.dmp | Input files for species identification |
| *.cdb | Species identification library index |
| *.arc | Compressed output file |

Input FASTQ files can be plain text, gzip, mgz, or piped. Decompression outputs text by default (or mgzip with --rawtype if the input was mgz).

2.4.3 Creating Reference Indices

  1. Set the open file descriptor limit:
ulimit -n 65535   # soft limit (temporary)
ulimit -Hn 65535  # hard limit (requires root)
  2. For a single FASTA:
dcs SeqArc index -r REF -o OUTDIR
  3. For species identification:
# i. Create the species library index (outputs clsf*.cdb in REFDIR)
dcs SeqArc index -K REFDIR
# ii. Create alignment indices for each species
dcs SeqArc index -r REF -o REFDIR

Parameters:

  • REF: Path to reference sequence.
  • REFDIR: Directory with multi-species references.
  • OUTDIR: Output directory.

2.4.4 Compression Commands

  1. Known species (reference-based):
dcs SeqArc compress -r REF -i FASTQ -o ARC -t THREADNUM
  2. Unknown species (auto-detection):
dcs SeqArc compress -K REFDIR -i FASTQ -o ARC -t THREADNUM
  3. Reference-free:
dcs SeqArc compress -i FASTQ -o ARC -t THREADNUM
  4. Directory compression:
dcs SeqArc compress -K REFDIR -a FASTQDIR -o ARC -t THREADNUM

Parameters:

  • FASTQ: Input FASTQ path.
  • ARC: Output compressed file path.
  • THREADNUM: Thread count.

2.4.5 Decompression Commands

  1. Known reference:
dcs SeqArc decompress -r REF -i ARC -o FASTQ -t THREADNUM
  2. Species library:
dcs SeqArc decompress -r REFDIR -i ARC -o FASTQ -t THREADNUM
  3. Reference-free:
dcs SeqArc decompress -i ARC -o FASTQ -t THREADNUM

2.4.6 Standalone Decompression Tool

If customers have not purchased DCS Tools but need to decompress ARC-format files, they can use the independent free decompression tool (Seqarc_decompress). The software package can be obtained from sales.

  1. If the reference sequence used for compression is known, specify it directly with the -r parameter:
Seqarc_decompress -r REF \
                  -i ARC \
                  -o FASTQ \
                  -t THREADNUM
  2. If compression was performed using a species identification library, specify the library path during decompression:
Seqarc_decompress -r REFDIR \
                  -i ARC \
                  -o FASTQ \
                  -t THREADNUM
  3. For reference-free compression, decompression is also reference-free:
Seqarc_decompress -i ARC \
                  -o FASTQ \
                  -t THREADNUM

Required Parameters:

  1. REF: Path to the reference sequence.
  2. REFDIR: Directory containing multi-species reference sequences.
  3. FASTQ: Output FASTQ file path.
  4. ARC: Input compressed file path.
  5. THREADNUM: Number of threads to use.

3. Detailed Tool Parameter Descriptions

3.1 aligner

Supported command-line parameters for aligner:

  • --threads: Number of threads. Recommended not to exceed the number of CPU cores.
  • --fq1: Input FASTQ1 file. For multiple files, supports comma separation or file lists.
  • --fq2: Input FASTQ2 file. For multiple files, supports comma separation or file lists.
  • --seqarc-ref: Index file required for decompressing SeqArc format files.
  • --read-group: Format is @RG\tID:GROUP_NAME\tSM:SAMPLE_NAME\tPL:PLATFORM, where GROUP_NAME is the read group identifier, SAMPLE_NAME is the sample name, and PLATFORM is the sequencing platform.
  • --build-index: Build index files required for alignment tools.
  • --aln-index: FASTA path required for sequence alignment. Ensure the accompanying index files are in the same directory as the FASTA file.
  • --mark-split-hits-secondary: Mark shorter split hits as secondary.
  • --use-soft-clipping-for-sup-align: Use soft clipping for supplementary alignment.
  • --no-markdup: Do not perform duplicate marking.
  • --bam-out: Output BAM file.
  • --high-speed-storage: Temporary file storage path. Placing this path on high-speed storage devices can improve performance.
  • --fastq-qc-dir: Output directory for FASTQ quality control reports.
  • --no-filter: Do not filter FASTQ data.
  • --output-clean-fq: Output clean reads.
  • --qualified-quality-threshold: Qualified base quality threshold (default: 12).
  • --low-quality-ratio: Low-quality base ratio threshold (default: 0.5).
  • --mean-qual: Mean quality threshold.
  • --quality-offset: Quality value offset. Typical values are 33 or 64. If set to 0, the value is inferred by the program (default: 33).
  • --adapter1: Adapter sequence for read1, in uppercase.
  • --adapter2: Adapter sequence for read2, in uppercase.
  • --adapter-max-mismatch: Maximum allowed base mismatches during adapter search (default: 2).
  • --adapter-match-ratio: Minimum matching length ratio for adapters (default: 0.5).
  • --trim-adapter: Trim adapters from read sequences instead of discarding reads.
  • --min-read-len: Minimum read length threshold (default: 30).
  • --n-base-ratio: N base ratio threshold (default: 0.1).
  • --trim-ends: Trim a specified number of bases from both ends of the reads, with the format being comma-separated integers.

3.2 bqsr

Supported command-line options for bqsr:

  • --threads: Number of threads. Recommended not to exceed the number of CPU cores.
  • --in-bam: Duplicate-marked BAM file.
  • --out-bam: Path to store the BAM file after base quality score recalibration. If this parameter is omitted, only the recalibration table is generated.
  • --reference: Reference genome FASTA file. Must be the same as the one used in the sequence alignment stage.
  • --recal-table: Output file containing information for base quality score recalibration.
  • --known-sites: Single nucleotide polymorphism database data and known variant sites file. Supports text and gz compressed formats.
  • --interval: BED file.
  • --interval-padding: Number of bases to pad on both sides of the BED interval.

3.3 variantCaller

Detailed parameter descriptions for variantCaller:

  • -I, --input <file1> <file2>...: Input BAM file paths. Multiple BAM paths can be accepted, separated by spaces.
  • -O, --output <file>: Output VCF/GVCF BGZF file. The program automatically creates the corresponding index file.
  • -R, --reference <file>: Reference genome FASTA file path.
  • --recal-table <file>: Input file containing information for base quality score recalibration.
  • -H, --help: Print help information.
  • --version: Print version number.
  • -L, --interval <file>: Target interval BED file path.
  • -P, --interval-padding <int>: Number of bases to extend on both sides of the interval. Must be a non-negative integer (default: 0).
  • -Q, --base-quality-score-threshold <int>: Base quality score threshold. Bases with quality scores below this threshold will be set to the minimum value of 6 (default: 18).
  • -D, --max-reads-depth <int>: Maximum number of reads retained per alignment start position. Reads exceeding this threshold will be downsampled (default: 50).
  • -G, --gvcf-gq-bands <int1> <int2>...: Reference confidence model GQ groups. Input multiple integers separated by spaces (default: [1,2,3,...,59,60,70,80,90,99]).
  • -t, --threads <int>: Number of threads used by the program, ranging from 1 to 128 (default: 30).
  • --pcr-indel-model <str>: PCR indel model. Parameter range: {NONE, HOSTILE, CONSERVATIVE, AGGRESSIVE} (default: CONSERVATIVE).
  • --emit-ref-confidence <str>: Run the reference confidence scoring model. Parameter range: {NONE, BP_RESOLUTION, GVCF}. NONE: Do not run the reference confidence scoring model; BP_RESOLUTION: Run the reference confidence scoring model, outputting a variant record for each site; GVCF: Run the reference confidence scoring model, grouping reference confidence sites (ALT=<NON_REF>) by GQ value according to the -G parameter (default: NONE).
  • --compression-level <int>: Output VCF/GVCF BGZF file compression level, ranging from 0 to 9 (default: 6).

3.4 genotyper

The genotyper processes single-sample GVCF files, outputting VCF files containing only variant records. For joint calling of multiple GVCF files, use the high-performance DPGT software. Detailed parameter descriptions for genotyper:

  • -I, --input <file>: Input GVCF file path.
  • -O, --output <file>: Output VCF BGZF file containing genotypes. The program automatically creates the corresponding index file.
  • -s, --stand-call-conf <float>: Variant quality score threshold. Variants with quality scores greater than or equal to this value will be output (default: 10.0).
  • -D, --dbsnp <file>: dbSNP file.
  • -t, --threads <int>: Number of threads used by the program (default: 6).
  • --version: Print version number.
  • -h, --help: Print help information.

3.5 jointcaller (DPGT)

jointcaller is a wrapper for the DPGT tool, used to conveniently run DPGT in standalone mode. DPGT is a distributed joint calling tool. Its detailed parameter descriptions are as follows:

  • -i, --input <FILE>: Input GVCF file list, a text file with one GVCF path per line.
  • --indices <FILE>: Optional parameter. The program defaults to searching for index files by appending .tbi to the GVCF path. This parameter can specify the index file path, requiring a text file with one index path per line, matching the order of GVCF paths.
  • --use-lix <Boolean>: Use LIX index format. LIX files are simplified index files implemented by DPGT. TBI indexes can be converted to LIX indexes using the tbi2lix program. LIX indexes are approximately 1/80 the size of TBI indexes. Options: true, false (default: false).
  • --header <FILE>: Optional parameter. Use a precomputed VCF header file to avoid repeated calculations.
  • -o, --output <DIR>: Output folder.
  • -r, --reference <FILE>: Reference genome FASTA file.
  • -l, --target-regions <STRING>: Target region string or BED file. Defaults to processing the entire genome.
  • -j, --jobs <INT>: Number of parallel tasks (default: 10).
  • -n, --num-combine-partitions <INT>: Number of partitions during the variant merging phase. The default of -1 means automatic: the square root of the sample count, rounded down.
  • -w, --window <INT>: Region size processed by each merge-genotype calculation (default: 300M).
  • -d, --delete <Boolean>: Whether to delete temporary files. Options: true, false (default: true).
  • -s, --stand-call-conf <FLOAT>: Quality score threshold. Variant records with quality scores greater than or equal to this value will be output (default: 30.0).
  • --dbsnp <FILE>: dbSNP VCF file path, used for annotating dbSNP IDs.
  • --use-old-qual-calculator <Boolean>: Whether to use the old quality score calculator. When set to true, the old calculator is used, matching the results of GATK v4.1.0 with --use-old-qual-calculator true --use-new-qual-calculator false. When set to false, the new calculator is used, matching the results of GATK v4.1.1 and later versions. Options: true, false (default: true).
  • --heterozygosity <FLOAT>: Prior probability of heterozygous SNP variants (default: 0.001).
  • --indel-heterozygosity <FLOAT>: Prior probability of heterozygous INDEL variants (default: 1.25E-4).
  • --heterozygosity-stdev <FLOAT>: Standard deviation of prior probabilities for SNP and INDEL variants (default: 0.01).
  • --max-alternate-alleles <INT>: Maximum number of alternate alleles. For variants exceeding this number, the program retains only the most probable alleles up to this count (default: 6).
  • --ploidy <INT>: Ploidy (default: 2).
  • -h, --help: Print help information and exit.
  • -V, --version: Print version number and exit.

3.6 SeqArc

3.6.1 Index Creation Functionality, Subcommand index

Parameters for creating indexes:

  • -h, --help: Get help information for index creation parameters.
  • -r, --ref <file>: Set the path to the input reference sequence file (FASTA).
  • -K, --clsdb <dir>: Set the directory containing reference sequence files for multiple species. The directory must contain names.dmp and nodes.dmp, which can be downloaded from: http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
  • -H, --hitlimit <int>: Set the minimizer hit count for taxonomy (default: 2).
  • -t, --thread <int>: Set the number of threads (default: 8).
  • -o, --output <dir>: Set the output directory for storing generated index files for the reference sequence.
  • --aligntype <int>: Set the alignment method used when generating index files. Options: 1 (hash), 2 (minimizer), 3 (longseq) (default: 2).

If the species of the FASTQ file is known, use the -r parameter to select the corresponding reference sequence and the -o parameter to set where the generated index file is saved; this index file can then be used for compression and decompression. If the species is unknown, first use the -K parameter to build a species identification library index over all reference sequences in the directory, then use the -r parameter to create an index file for each reference sequence. In this case, the directory specified by -o must be the same as the input directory of -K.
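The two workflows described above can be sketched as follows (directory and file names are illustrative):

```shell
# Known species: build the index directly from the reference FASTA
# and store it in a dedicated output directory.
$DCS_HOME/bin/dcs SeqArc index -r refs/hg38.fa -o hg38_index/

# Unknown species: first build the species identification library over a
# directory of references (it must contain names.dmp and nodes.dmp), then
# index each reference into that same directory (-o must match the -K input).
$DCS_HOME/bin/dcs SeqArc index -K refs/ -o refs/
$DCS_HOME/bin/dcs SeqArc index -r refs/hg38.fa -o refs/
```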

3.6.2 Compression Function, Subcommand compress

The compression parameters are as follows:

  • -h, --help: Get help instructions for compression parameters.
  • -r, --ref <file>: Set the path of the input reference sequence. The corresponding index file must exist in this path.
  • -K, --clsdb <dir>: Set the directory containing reference sequence files for multiple species. This directory must contain the generated species identification library file and the index files for each species.
  • -H, --hitlimit <int>: Set the minimizer hit count for taxonomy, default is 2. Keep this consistent with the creation parameters.
  • -J, --clsmin <float>: Set the minimum hit rate for species identification, default is 0.6.
  • -L, --clsnum <int>: Set the number of reads to be read from the fastq file for species identification, default is 10000.
  • -t, --thread <int>: Set the number of threads, default is 8.
  • -f, --force: Force overwrite existing files with the same name.
  • -i, --input <file>: Set the input fastq file.
  • --pipein: Specify if the input is pipeline data.
  • -o, --output <file>: Set the output file path. The .arc suffix will be automatically added.
  • --blocksz <int>: Set the size of each data block during compression, default is 50.
  • -ccheck: After compression is completed and data is written to disk, perform a decompression check to ensure data integrity.
  • --calcmd5: Calculate the MD5 of the input file during compression and save it to the compressed file.
  • --aligntype <int>: Set the compression alignment method, default is 2.
  • -a, --archive <dir>: Set the directory to be compressed and archived.
  • --slevel <int>: Set the compression level for sequences, default is 10.
  • --verify <int>: Set the verification method for each data block. Options: 0 (no verification), 1 (verify the entire block), or 2 (verify name, seq, and qual separately) (default: 0).

3.6.3 Decompression Function, Subcommand decompress

The decompression parameters are as follows:

  • -h, --help: Get help instructions for decompression parameters.
  • -r, --ref <dir/file>: Set the path of the input reference sequence. The corresponding index file must exist in this path. Alternatively, specify the species identification library directory, which must contain the index files for each species.
  • -t, --thread <int>: Set the number of threads, default is 8.
  • -f, --force: Force overwrite existing files with the same name.
  • -i, --input <file>: Set the path of the input compressed file.
  • -o, --output <file>: Set the path of the output decompressed file.
  • -rawtype: Ensure the decompressed output file format matches the original compressed file format. This only applies to mgz format files.
  • -dcheck: After decompression is completed and data is written to disk, read the decompressed data blocks and verify them to ensure integrity.
  • -x, --extract: Extract one or more files from the compressed archive. The input file must be generated by -a or --archive.
  • -l, --list: Display the file information in the compressed archive. The input file must be generated by -a or --archive.
  • --show: Display the text MD5 value of the compressed file. This requires --calcmd5 to be specified during compression.
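The archive options (-a, -l, -x) are not exercised by the quick-start script; a hedged sketch of how they could combine (all paths are illustrative):

```shell
# Compress and archive an entire directory of FASTQ files.
$DCS_HOME/bin/dcs SeqArc compress -r refs/chr20.fa -a fastq_dir/ -o samples

# Inspect the archive contents, then extract files from it.
$DCS_HOME/bin/dcs SeqArc decompress -l -i samples.arc
$DCS_HOME/bin/dcs SeqArc decompress -x -r refs/chr20.fa -i samples.arc -o extracted/
```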

3.7 license-proxy

Executable file parameters:

  • -h, --help: Print help information.
  • --license <LICENSE_PATH>: Path to the license file.
  • --bind <BIND_ADDR>: Listening address.
  • -d, --daemon: Start in the background.

3.8 report

Organize, visualize, and generate an HTML report for WGS/WES analysis results. Parameters:

  • --sample: Sample name.
  • --filter: Path to the filter directory.
  • --bam_stat: Path to the BAM statistics file.
  • --vcf_stat: Path to the VCF statistics file.
  • --output: Directory for the output report.
  • --is_se: (Optional) Add this flag if single-end sequencing is used.

3.9 lickit

Used to check network connectivity, query the number of available threads for the user, and upload issue logs.

  • --ping <ip:port>: Check network connectivity.
  • --query <ip:port>: Query the number of available threads for the user.
  • --uplog <ip:port logpath [msg]>: Actively upload issue logs.

3.10 Utility tools

3.10.1 extract-vcf

A standard VCF file contains 8 columns. Columns 6, 7, and 8 occupy significant space but are not required by the program. To reduce I/O pressure, this tool extracts only the first 5 columns from the VCF file and replaces the last 3 columns with *. Main parameters:

  • --known-sites <file>: Input VCF file, supporting both text and gz compressed formats.
  • --output-mini-vcf <file>: Output VCF file. If the output filename includes gz, it will be in compressed format.
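For intuition, the column transformation extract-vcf performs can be mimicked on one VCF data line with plain awk (this illustrates the logic only; it is not the tool itself):

```shell
# Keep the first 5 VCF columns (CHROM, POS, ID, REF, ALT) and replace the
# QUAL, FILTER, and INFO columns with "*".
printf 'chr20\t100\trs1\tA\tG\t50.0\tPASS\tDP=30\n' |
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4, $5, "*", "*", "*"}'
```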

3.10.2 bam-stat

Generate alignment statistics based on the BAM file. In the WGS pipeline, this function is integrated into variantCaller but can also be run separately. Parameters:

  • --bam <file>: Path to the input BAM file.
  • --bed <file>: Target regions. For WGS data, specify non-N regions (if unavailable, use the BAM statistics results from variantCaller). For WES data, specify the target region file.
  • --output <file>: Path to the output file.
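When run separately (e.g. for the WES statistics discussed in Section 6.1), an invocation might look like this sketch (file names are illustrative):

```shell
# Sketch: compute alignment statistics over the WES target regions.
$DCS_HOME/bin/dcs bam-stat \
    --bam sample.align.bam \
    --bed wes.target.bed \
    --output bamStat.txt
```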

3.10.3 vcf-stat

Generate variant statistics based on the genotyped VCF file, including SNP counts. Parameters:

  • --vcf <file>: Path to the input VCF file, supporting both text and gz formats. Results are output to standard output by default and can be redirected to a specified file (refer to Step 5 in Section 4.1).

3.10.4 tbi2lix

  • -i, --input FILE: Input TBI index file.
  • -o, --output FILE: Output LIX index file.
  • -m, --min-shift INT: Set the minimum interval size for the LIX linear index to 2^INT (default is 14, corresponding to a 16K interval size).
  • -h, --help: Print help information and exit.
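Paired with jointcaller's --use-lix option (Section 3.5), a conversion might look like this sketch (the .lix output name is our assumption):

```shell
# Sketch: convert a TBI index to the much smaller LIX format.
$DCS_HOME/bin/dcs tbi2lix -i sample.g.vcf.gz.tbi -o sample.g.vcf.gz.lix
```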

4. Quick Start—Using Example Scripts and Data

After deploying the software as described in Section 1.4, download the quick start data package (cloud platform download link), unzip it, and modify the DCS_HOME and PROXY_HOST values in the example script. DCS_HOME is the path where the software package is unzipped, and PROXY_HOST is the IP and port (default 8909) specified during deployment on the networked node. Execute the corresponding example script for different application scenarios.

tar xzvf dcs-quick-start-xxxx.tar.gz
cd dcs-quick-start

4.1 WGS

Reference script: quick_start_wgs.sh

#!/bin/bash

# This script provides a quick start guide for using the dcs tool to analyze whole-genome sequencing data.
# The script demonstrates how to build the index of the reference genome, align the reads, 
# recalibrate the base quality scores, call variants, and filter the variants.
# if you have any questions, please contact us at cloud@stomics.tech

# set the location of the dcs tool
export DCS_HOME=/path/to/dcs

if [ ! -d "$DCS_HOME" ]; then
    echo "--- Error: DCS_HOME directory does not exist: $DCS_HOME"
    exit 1
fi
if [ ! -r "$DCS_HOME" ]; then
    echo "--- Error: No read permission for DCS_HOME: $DCS_HOME"
    exit 1
fi

# Before using the dcs tool, you must obtain a license file and start the authentication proxy program 
# on a node that can connect to the internet.

# Get the local IP address, which is used for license application. 
# The output data field of the following command indicates the IP address.
# curl -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

# start the proxy program
# LICENSE=<path/to/license/file>
# $DCS_HOME/bin/dcs licproxy --license $LICENSE

# Set the following environment variables
export PROXY_HOST=127.0.0.1:8909

# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST

# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST

# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")

# Set the following variables for the dcs tool
FQ1=$SCRIPT_DIR/data/fastq/simu.r1.fq.gz
FQ2=$SCRIPT_DIR/data/fastq/simu.r2.fq.gz
FA=$SCRIPT_DIR/data/ref/chr20.fa
KN1=$SCRIPT_DIR/data/known_sites/1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz
KN2=$SCRIPT_DIR/data/known_sites/dbsnp_151.hg38.chr20.vcf.gz
KN3=$SCRIPT_DIR/data/known_sites/hapmap_3.3.hg38.chr20.vcf.gz
KN4=$SCRIPT_DIR/data/known_sites/Mills_and_1000G_gold_standard.indels.hg38.chr20.vcf.gz

nthreads=8 
sample=simu
group="@RG\\tID:sample\\tLB:sample\\tSM:sample\\tPL:COMPLETE\\tCN:DCS"

# run the wgs pipeline ------------------------------------------------------------------------------------

# check the existence of the index files
for file in $FA.0123 $FA.amb $FA.ann $FA.bwt $FA.index1 $FA.index2 $FA.pac $FA.sa;
do
    if [ ! -f $file ]; then
        echo "File $file does not exist, try to build the index ..."
        $DCS_HOME/bin/dcs aligner --threads 16 --build-index $FA
        break;
    fi
done

# Step 0: Set the working directory
workDir=$SCRIPT_DIR/result_wgs
mkdir -p $workDir
mkdir -p $workDir/${sample}.qcdir 

# Step 1: Alignment
$DCS_HOME/bin/dcs aligner --threads $nthreads --aln-index $FA --fq1 $FQ1 --fq2 $FQ2 --read-group $group --no-filter --fastq-qc-dir $workDir/${sample}.qcdir --bam-out $workDir/${sample}.align.bam

# Step 2: base quality score recalibration
$DCS_HOME/bin/dcs bqsr --threads $nthreads --in-bam $workDir/${sample}.align.bam --reference $FA --known-sites $KN1 --known-sites $KN2 --known-sites $KN3 --known-sites $KN4 --recal-table $workDir/${sample}.recal.table

# Step 3: variant calling
$DCS_HOME/bin/dcs variantCaller --threads $nthreads --input $workDir/${sample}.align.bam --output $workDir/${sample}.variantCaller.g.vcf.gz --reference $FA --recal-table  $workDir/${sample}.recal.table

# Step 4: genotype calling
$DCS_HOME/bin/dcs genotyper --threads $nthreads --input $workDir/${sample}.variantCaller.g.vcf.gz --output $workDir/${sample}.genotyper.vcf.gz --dbsnp $KN2 -s 30.0

# Step 5: generate the final report
$DCS_HOME/bin/dcs vcf-stat --vcf $workDir/${sample}.genotyper.vcf.gz >$workDir/vcfStat.txt

$DCS_HOME/bin/dcs report --sample ${sample} --filter $workDir/${sample}.qcdir --bam_stat $workDir/bamStat.txt --vcf_stat $workDir/vcfStat.txt --output $workDir/report

After modifying the environment variables in the file, simply run it directly.

# Modify the values of DCS_HOME and PROXY_HOST.
vi quick_start_wgs.sh
sh quick_start_wgs.sh
Figure 4.1 WGS Analysis Pipeline

4.2 WES

The steps are largely the same as WGS, with the only difference being the variantCaller step, which requires an additional interval parameter (-L) specifying the WES bed file.

# Step 3: variant calling
$DCS_HOME/bin/dcs variantCaller --threads $nthreads --input $workDir/${sample}.align.bam --output $workDir/${sample}.variantCaller.g.vcf.gz --reference $FA --recal-table  $workDir/${sample}.recal.table -L <wes.target.bed>

4.3 Joint Calling

The reference script is quick_start_jointcalling.sh, similar to WGS. Modify the DCS_HOME and PROXY_HOST environment variables. Additionally, dcs jointcaller requires JAVA_HOME to be set to JDK8 and the spark/bin directory (Spark version 2.4.5 or higher) to be added to the PATH environment variable.

# This script provides a quick start guide for using the dcs jointcaller to perform joint calling on multiple gvcf files
# if you have any questions, please contact us at cloud@stomics.tech


# set the location of the dcs tool
export DCS_HOME=/path/to/dcs


# dcs jointcaller is a wrapper of DPGT, which is a spark java application
# set JAVA_HOME and add spark/bin to PATH
export JAVA_HOME=/path/to/jdk8
export PATH=/path/to/spark/bin:$PATH


# Before using the dcs tool, you must obtain a license file and start the authentication proxy program 
# on a node that can connect to the internet.


# Get the local IP address, which is used for license application. 
# The output data field of the following command indicates the IP address.
# curl -X POST https://uat-cloud.stomics.tech/api/licensor/licenses/anon/echo?cliIp=true


# start the proxy program
# LICENSE=<path/to/license/file>
# DCS_HOST=0.0.0.0:8909
# $DCS_HOME/bin/licproxy --command start --license $LICENSE --bind $DCS_HOST --log-level INFO


# Set the following environment variables to your desired values
export PROXY_HOST=127.0.0.1:8909


# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST


# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST


# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")


# Set the following variables for dcs jointcaller
# gvcf directory
GVCF_DIR=$SCRIPT_DIR/data/gvcf
# target region
REGION=chr20:1000000-2000000
# reference fasta file
FA=$SCRIPT_DIR/data/ref/chr20.fa


# number of threads
nthreads=4
# java heap max memory
memory_size=8g


# Step 0: Set the working directory
workDir=$SCRIPT_DIR/result_jointcalling
mkdir -p $workDir


# Step 1: Set input file and output dir
# dcs jointcaller input file
GVCF_FOFN=$workDir/gvcf.fofn
ls ${GVCF_DIR}/*gz > ${GVCF_FOFN}
# dcs jointcaller output directory
OUT_DIR=$workDir/result


# Step 2: Run dcs jointcaller
$DCS_HOME/bin/dcs jointcaller \
    --input ${GVCF_FOFN} \
    --output ${OUT_DIR} \
    --reference ${FA} \
    --jobs ${nthreads} \
    --memory ${memory_size} \
    -l ${REGION}

4.4 SeqArc

The reference script is quick_start_seqarc.sh, similar to WGS. Modify the DCS_HOME and PROXY_HOST environment variables.

#!/bin/bash


# set the location of the dcs tool
export DCS_HOME=/path/to/dcs


# Set the following environment variables to your desired values
export PROXY_HOST=127.0.0.1:8909


# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST


# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST




# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")


nthreads=2
FQ1=$SCRIPT_DIR/data/fastq/simu.r1.fq.gz
FA=$SCRIPT_DIR/data/ref/chr20.fa
ARCFA=$SCRIPT_DIR/data/arcref/
mkdir -p $ARCFA




workDir=$SCRIPT_DIR/result_seqarc
mkdir -p $workDir


# Step 1: create index
$DCS_HOME/bin/dcs SeqArc index -r $FA -o $ARCFA
 
# Step 2: compress 
$DCS_HOME/bin/dcs SeqArc compress -t $nthreads -r $ARCFA/chr20.fa -i $FQ1 -o $workDir/test 


# step 3 : decompress
$DCS_HOME/bin/dcs SeqArc decompress -t $nthreads -r $ARCFA/chr20.fa -i $workDir/test.arc -o $workDir/simu.r1.fq

5. Release Notes

v1.3.0

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | aligner: 1. Fixed the issue where overlong fasta lines cause errors in fai file generation. 2. Fixed the destructor bug on ARM Euler OS. |
| WGS/WES | Bug Fix | variantCaller: 1. Fixed segmentation faults in certain scenarios. 2. Fixed memory issues for specific species. |
| SeqArc | Feature Dev | Implemented a Windows decompression tool. |
| SeqArc | Feature Dev | Supported direct decompression of arc files to gz format, with pipeline mode support. |
| SeqArc | Feature Dev | Supported compression and decompression of U-bases. |
| Somatic Mutation Detection | Feature Dev | Added this new module. |

v1.2.0

Added support for the ARM instruction set architecture, and introduced sample-count and compressed data-volume statistics to align with the license model based on sample count and data volume.

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | bqsr: Fixed a bug when chromosome length exceeds 2G. variantCaller: Fixed a bug with the -L parameter in certain scenarios and resolved gVCF index file issues. |
| WGS/WES | Feature Development | aligner: Added shared memory functionality (enable with the --useSharedMemIndex parameter). variantCaller: 1. Added 15x coverage statistics. 2. Made compatible with overlapping BED files. report: Updated report version and included 15x coverage statistics. |
| SeqArc | Feature Development | Implemented a standalone decompression tool. |
| SeqArc | Feature Development | Added gene library statistics functionality. |
| JointCalling | Bug Fix | Fixed a partition end bug. |

v1.1.0

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | Resolved a memory leak in aligner. |
| WGS/WES | Bug Fix | Fixed aligner hanging when building indexes for certain ref.fasta files. |
| SeqArc | Feature Development | Implemented cross-platform instruction sets using simde. |
| SeqArc | Feature Development | Added a standalone decompression tool. |
| SeqArc | Feature Development | Implemented statistical functionality for gene libraries. |

v1.0.1

Modified modules: aligner, report

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | Fixed a coredump issue in the aligner under certain scenarios. |
| WGS/WES | Bug Fix | Fixed an issue where the aligner output file was a pure filename without a path. |
| WGS/WES | Bug Fix | Renamed the bam index file generated by the aligner to *.bam.bai instead of *.bai. |
| WGS/WES | Bug Fix | Fixed a potential issue in the cumulative depth frequency plot in the report module. |
| WGS/WES | Bug Fix | Added set -e to the WGS script in the quick_start package to exit immediately on error. |

v1.0.0

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Feature | Added alignment and variant-related statistics functionality. |
| SeqArc | Feature | Restructured the project framework according to the AVS-G standard. |
| SeqArc | Feature | Added compression functionality for long-read data. |
| SeqArc | Feature | Adjusted parameter parsing; added subcommands and long/short parameters. |

6. Notes and Common Issues

6.1 WGS/WES

  1. During alignment, ensure the reference sequence index file is prepared in advance to avoid errors.
  2. The BQSR step outputs only the recal.table by default. To generate a base quality-recalibrated BAM file, set the --out-bam parameter.
  3. For WES alignment statistics, there are two methods:
     – Specify the interval parameter (-L) in variantCaller as the WES bed file.
     – Run the bam-stat tool separately and set the --bed parameter.
     If no bed file is provided, statistics will default to the whole genome, leading to inaccurate coverage and depth results.

6.2 Joint Calling

  1. Input GVCF files must include TBI index files.
  2. The reference genome FASTA file requires FAI and DICT files.
  3. For large sample sizes, allocate sufficient memory to avoid overflow (refer to the runtime examples in Section 7).
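Item 1 can be checked up front over the jointcaller input list (the fofn layout follows Section 4.3; the helper function name is ours):

```shell
# Verify that every GVCF listed in a file-of-file-names has a .tbi index
# next to it; report missing indexes and return non-zero if any are absent.
check_gvcf_indices() {
    local fofn=$1 missing=0
    while IFS= read -r gvcf; do
        if [ ! -f "${gvcf}.tbi" ]; then
            echo "missing index: ${gvcf}.tbi" >&2
            missing=1
        fi
    done < "$fofn"
    return "$missing"
}
```

Run it as `check_gvcf_indices gvcf.fofn` before submitting the joint calling job.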

6.3 SeqArc

  1. For compression, input FASTQ filenames must have the following suffixes: .fq, .fastq, .fq.gz, or .fastq.gz. Otherwise, the file type will be considered invalid.
  2. Input reference sequence files must be text files with the suffixes .fa or .fasta. Otherwise, the file type will be considered invalid.
  3. For decompression, input ARC files must be generated by this program. Otherwise, the file type will be considered invalid.
  4. The reference sequence file used for decompression must match the one used for compression. If they differ, an error will occur: [ref load md5 error].
  5. If the FASTQ file is incomplete, an error will occur: [read fastq error].
  6. If the seq and qual lengths are unequal, an error will occur: [seqlen is not equal quallen].
  7. For compression or decompression, file read/write errors will trigger: [read file err] or [write file err].
  8. During decompression, if a data block fails verification, an error will occur: [block check fail]. This may indicate compression anomalies or incorrect data storage.
  9. Increase the open-file limit (ulimit -n) to avoid runtime crashes, which may generate core dumps.
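For item 9, one portable way to raise the limit in the current shell before launching SeqArc:

```shell
# Raise the soft open-file limit to the hard limit for this shell session,
# then print the effective value.
ulimit -Sn "$(ulimit -Hn)"
ulimit -n
```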

6.4 License Proxy

Tool Issue: DCS Tools or lickit cannot connect to licproxy

If you encounter error messages like:

  • connection refused
  • couldn't connect to server

Please follow these troubleshooting steps:

  1. On the machine running DCS Tools or lickit, check connectivity to the License proxy using:
curl -v http://$PROXY_HOST/api/licensor/licenses/anon/pool

Replace $PROXY_HOST with your licproxy's internal IP address and port (e.g., 172.20.9.149:8909), NOT the public IP address you used when obtaining the License file!

If DCS Tools/lickit and licproxy are on the same machine, you may use 127.0.0.1:8909 for $PROXY_HOST

The expected return should include Success. If it prompts that the connection cannot be established, please proceed to check whether the License proxy process exists by following the steps below.

  2. Check whether the process exists on the machine where licproxy is deployed
ps aux | grep licproxy

If the process does not exist, please follow the steps in Section 1.4 Software Deployment to proceed. If the process exists, contact your IT operations team to ensure that the machines running DCS Tools or lickit can access the machine where licproxy is located as described in Step 1. If the network connection is unavailable, request your IT operations team to enable it.

  3. Check whether the script or command line for submitting tasks correctly sets the variable PROXY_HOST

Please verify whether the script or command line you are using overrides the environment variable PROXY_HOST. For example, if licproxy is deployed at 172.20.9.149:8909, ensure you set the following before running the script or command line:

export PROXY_HOST=172.20.9.149:8909
./path/to/your
# or
PROXY_HOST=172.20.9.149:8909 /path/to/your

Common License Authentication Error Messages:

  • Permission denied (No permission or authentication failed)
  • Size must be greater than 0 (The number of submitted threads must be greater than zero)
  • Pool size exceeded (The number of submitted threads exceeds the current thread limit)
  • License expired (The license has expired)
  • Unknown error (Unknown error)

7. Performance and Runtime

7.1 WGS

On an Alibaba Cloud ECS i4g.8xlarge instance (32 cores, 128GB RAM) with SSD storage, DCS Tools completed the FASTQ-to-VCF process in the following times for different datasets:

| Data | Runtime (h) | Threads | Max Memory (GB) |
| --- | --- | --- | --- |
| HG001 30X NovaSeq | 1.82 | 32 | 101 |
| HG001 30X DNBSEQ-G400 | 1.91 | 32 | 100 |
| HG005 40X DNBSEQ-G400 | 2.45 | 32 | 120 |

7.2 SeqArc

On an Alibaba Cloud ECS c6a.xlarge instance (4 cores, 8GB RAM), compression and decompression tests were performed on NA12878 (30X WGS) data.

Compression Results:

| Input Data | Runtime | Compression Ratio (vs. gz) |
| --- | --- | --- |
| r1.fq.gz (27G) | 27m 24s | 5.17 |
| r2.fq.gz (30G) | 29m 14s | 4.10 |

Decompression Results:

| Input Data | Runtime |
| --- | --- |
| r1.fq.arc | 14m 32s |
| r2.fq.arc | 14m 25s |

7.3 Joint Calling

Performance tests on the 1KGP dataset and internal large-scale population datasets showed the following results for dcs jointcaller (DPGT):

| Data | Samples | Processes (6GB RAM each) | Core Hours | Runtime |
| --- | --- | --- | --- | --- |
| 1KGP | 2504 | 256 | 6963 | 27.2 hours |
| Internal 10K | 9165 | 256 | 21376 | 83.5 hours |
| Internal 52K | 52000 | 4400 | 407760 | ~6 days |
| Internal 105K | 105000 | 3675 | 1026155 | ~19 days |

8. FAQ

1. What are the platform requirements for DCS Tools? DCS Tools is designed as an executable program running on Linux systems.

The recommended Linux distributions are: CentOS 7, Debian 8, Ubuntu 14.04, and later versions.

2. What are the hardware requirements for the system? The CPU should have at least 8 cores, and the memory should be at least 128GB. It is also recommended to use SSDs with faster read/write speeds.

3. Is an internet connection required during the software deployment process? DCS Tools performs license validation at runtime to ensure legality, so one machine with internet access (to communicate with the validation server) is required. Other machines (typically compute nodes) do not need internet access but must be able to communicate with the internet-connected machine. The software can also be deployed entirely on a single internet-connected machine.

4. What files are needed for software deployment? The software package + license file can be obtained through the cloud platform (cloud@stomics.tech) or a sales manager.

We offer a 30-day free trial. Please contact a sales manager for details.

5. After successfully deploying the license proxy, why do license-related errors still occur when running DCS Tools? After deploying the license proxy, you can use the commands provided in the documentation to confirm whether the connection to the cloud platform is successful. Additionally, each license has a corresponding upper limit on the number of threads. If this limit is reached, new tasks will fail to start. You will need to wait for running tasks to complete or expand the thread limit.

6. Does joint calling output a merged gvcf or a vcf? Since the traditional method of generating a multi-sample gvcf file in one go requires significant computational resources and storage space, we have not adopted this strategy to obtain gvcf results for all samples. We recommend that users retain the gvcf files for individual samples. If joint calling is required for specific samples later, it can be run directly based on the single-sample gvcf files.

7. Does WGS support the analysis of non-diploid species? The variant detection workflow of DCS Tools is built based on a diploid detection model. Therefore, WGS currently cannot perform effective variant detection for polyploid samples.

8. Why does the runtime for WGS data significantly exceed the time stated in the documentation? In addition to data volume, the runtime of the WGS workflow is also constrained by the read/write speed of the storage device. In our testing environment, we used high-performance SSDs to minimize the impact of this factor on workflow runtime.

9. What platform's data was used for the compression tests? The compression method used by DCS Tools has already compressed tens of petabytes of data in practice. During the large-scale testing conducted in the development phase, the most representative dataset was the AVS-G reference test set. This dataset was extracted from the NCBI-SRA database and covers 13 of the most representative species, including 39 sets of FASTQ data. It comprehensively covers various scenarios on NCBI in terms of sample attributes, experimental methods, sequencing technologies, sequencing generations, and sequencing platforms.

As a reference, the compression ratio of DCS Tools on this test set was 8.11, while gzip achieved a compression ratio of 4.15. Given that different attributes such as species and sequencing platforms can affect the compression ratio, users are advised to make decisions based on actual test results of DCS Tools and similar tools on their own data.

Detailed information about the AVS-G reference test set can be obtained by contacting a sales manager or reaching out to us through the official account.

10. How does DCS Tools compare to the commonly used pigz when compressing 100GB of FASTQ source files? In a 32-thread environment, tests were conducted on 100GB of human WGS paired-end data. The results are as follows:

| Data | pigz Compression Ratio | pigz Compression Speed | pigz CPU Usage | DCS Tools Compression Ratio | DCS Tools Compression Speed | DCS Tools CPU Usage |
| --- | --- | --- | --- | --- | --- | --- |
| BGI-SEQ500 (ERR2755197) | 3.93 | 450MB/s | 3224% | 5.66 | 937MB/s | 3177% |
| NovaSeq (SRR10965088) | 4.9 | 220MB/s | 3175% | 21.15 | 794MB/s | 3147% |

11. Will the compression ratio vary for different FASTQ files from the same platform? What are the main reasons? FASTQ files primarily consist of sequencing sequences and quality scores, and the differences arise from these two components.

First, for sequencing sequences, DCS Tools recommends using a mode that includes a reference sequence during compression and decompression, which can effectively improve the compression ratio. The reference sequence is the genome sequence of the species. However, this mode is less effective for transcriptome sequencing and difficult to use for metagenomic sequencing. Even for genomic sequencing, the accuracy of sequencing varies across platforms/models. Generally, the more accurate the sequencing, the better the compression ratio.

Next, for quality scores, the number of distinct quality values differs across models; schemes generally fall into two types, with about 40 distinct values or only 4 (binned) values. Fewer distinct values yield a better compression ratio, and this is the main source of compression-ratio differences. Additionally, quality scores themselves are highly random, a property that depends strongly on the sequencer: the more random the data, the worse the compression ratio.
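A quick way to gauge this on your own data is to count the distinct quality characters in a FASTQ file; this is a rough heuristic sketched by us, not a feature of DCS Tools:

```shell
# Count distinct quality characters in a plain FASTQ file; fewer distinct
# values generally compress better (decompress .fq.gz with zcat first).
count_qual_types() {
    awk 'NR % 4 == 0' "$1" |   # keep only the quality lines
        fold -w1 | sort -u | wc -l
}
```

For example, `count_qual_types sample.fq` prints the number of distinct quality symbols in the file.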

Last update: