DCS Tools

1. Introduction

1.1 Description

DCS Tools is an advanced software suite leveraging CPU acceleration technology, designed as an alternative to BWA-MEM and GATK and for large-scale population joint calling. It also reduces storage costs by compressing FASTQ.gz files. Its functionality covers germline variant analysis for whole-genome sequencing (WGS) and whole-exome sequencing (WES), joint calling for population variant detection, and a reference-based compression/decompression tool for FASTQ files.

1.2 Advantages and Value

1) Consistent and reliable germline variant detection tool

This tool preserves the core algorithmic logic of modules such as SOAPnuke, BWA, and GATK while adding a carefully parallelized and restructured implementation that accelerates computation and reduces unnecessary I/O. It not only speeds up processing significantly but also remains highly consistent with GATK results.

Tool highlights:

  • Uses the same mathematical methods as the BWA-GATK best practices workflow from the Broad Institute but is 16x faster from FASTQ to VCF and 42x faster from BAM to VCF (measured in core·hours).

  • No runtime variability, no downsampling in high-coverage regions.

  • Pure software solution running on CPU-based systems, cross-platform compatible with X86 and cloud environments.

  • Tested on Alibaba Cloud ECS i4g.8xlarge.

    • 30x genome, NA12878 (HG001), completed in 1.82 hours (58 core·hours).

2) Joint Calling: a tool for merging gVCFs and performing joint calling in population sequencing projects

Tool highlights:

  • Fast: completes joint variant detection for 52,000 whole genomes in 6 days.
  • Accurate: GIAB standard sample SNP F1 score of 99.44%, INDEL F1 score of 98.95%.
  • Scalable: supports joint calling for millions of samples.
  • User-friendly: allocates tasks automatically, so users do not need to design a parallel strategy.

3) FASTQ format data compression tool

Follows the logic of most tools by independently compressing the ID, sequence, and quality values in FASTQ files. For sequence compression, in addition to entropy encoding, it provides sequence alignment modules based on HASH or BWT to improve the compression ratio for successfully aligned sequences. It also employs a data block design to balance compression ratio and performance, while the introduction of a packaging format ensures data security, independent partial data access, and backward compatibility.

Specialized Optimization

Maintains high compression and decompression speeds while ensuring a high compression ratio.

Supports reference genome sequences larger than 4 GB.

Effective compression for both short and long reads.

Convenience and Flexibility

Supports Linux platforms, allows flexible setting of concurrent threads, and decompression has no license restrictions.

Complete Ecosystem

Seamlessly integrates with other tools in DCS Tools, enabling simultaneous decompression and computation.

Data is encapsulated in basic units, facilitating format extension and ensuring backward compatibility.

1.3 Platform Requirements

DCS Tools is designed to run as an executable program on Linux systems. Recommended Linux distributions include CentOS 7, Debian 8, Ubuntu 14.04, and later versions.

Hardware requirements for DCS Tools:

  • CPU: at least 8 cores.
  • Memory: at least 128 GB.
  • SSDs are recommended for faster read/write speeds.

1.4 License Verification

To verify the license during use, you need a compute node (also referred to as the "license node") to handle the verification communication. This node must maintain stable communication with the license server, and it can also run the analysis itself (as shown in the example diagram at the end of this section). Additionally, you will need a license file, which can be obtained by contacting the platform (cloud@stomics.tech) for purchase or trial.

1.4.1 Obtaining the Compute Node's IP Address

Ensure that the compute node is not behind HTTP_PROXY or HTTPS_PROXY and can access cloud.stomics.tech.

When applying for a license file, you need to provide the external IP address of the compute node. You can obtain this IP address by running the following command:

curl --noproxy "*" -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

The returned data field will contain the IP address:

{"code":0,"msg":"Success","data":"172.19.29.215"}
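If you want just the IP rather than the whole JSON response, it can be extracted with a small sed one-liner. This sketch assumes the response shape shown above; in practice you would assign `response` from the curl call itself.

```shell
# Extract the "data" field (the external IP) from the echo endpoint's
# response. In real use: response=$(curl --noproxy "*" -X POST <echo URL>)
response='{"code":0,"msg":"Success","data":"172.19.29.215"}'
ip=$(printf '%s' "$response" | sed -n 's/.*"data":"\([^"]*\)".*/\1/p')
echo "$ip"
```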

If the following error occurs:

curl: (35) error:0A000152:SSL routines::unsafe legacy renegotiation disabled  

Run the following command to obtain the IP address:

echo -e "openssl_conf = openssl_init\n[openssl_init]\nssl_conf = ssl_sect\n[ssl_sect]\nsystem_default = system_default_sect\n[system_default_sect]\nOptions = UnsafeLegacyRenegotiation" | OPENSSL_CONF=/dev/stdin curl --noproxy "*" -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

1.4.2 Obtaining the License File

Contact the platform at cloud@stomics.tech to purchase or request a trial license.

1.4.3 Deploying the License

After obtaining the license file, unzip the package and place the license file in the libexec directory.

1.4.4 Configuring the License Proxy on the Compute Node

On the compute node, run the foreground license proxy using the following command:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic  

Replace /path/to/XXXXXXXXX.lic with the actual path to your license file. The license proxy defaults to listening on 0.0.0.0:8909. If this address is already in use, you will see a warning:

[2025-03-27T06:34:46Z WARN  pingora_core::listeners::l4] 0.0.0.0:8909 is in use, will try again

The system will automatically retry, or you can manually stop the process with Ctrl + C and specify a different address:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic --bind 0.0.0.0:8910  

If no errors occur, the proxy will start successfully:

[2025-03-27T07:15:52Z INFO  pingora_core::server] Bootstrap starting
[2025-03-27T07:15:52Z INFO  pingora_core::server] Bootstrap done
[2025-03-27T07:15:52Z INFO  pingora_core::server] Server starting

Before proceeding, verify that the license proxy can communicate with the platform:

curl http://127.0.0.1:8909/api/licensor/licenses/anon/pool  

The expected response indicates successful communication:

{"code":0,"msg":"Success"}

To run the proxy in the background, use the -d flag:

./libexec/licproxy --license /path/to/XXXXXXXXX.lic -d  

This will run the proxy as a daemon process.

Note: Only one license proxy is needed per node when running in the background. A single Linux user can run multiple license proxies on the same node, but multiple Linux users cannot each run their own proxy on one node. To stop the proxy, use kill or delete the file /tmp/pingora.pid.

1.4.5 Running DCS Tools

After completing the above steps, the compute node can communicate with the license node to run DCS Tools. For processing WGS sequencing data, you can use the quick_start script provided by the platform. Refer to the @example demonstration script for details.

Before running, set the PROXY_HOST environment variable to the IP and port of the license proxy. Ensure that the current node can ping the IP. If the compute node and license node are the same machine, use 127.0.0.1 as the IP.

To verify the connection between the compute node and license node:

$DCS_HOME/bin/dcs lickit --query $PROXY_HOST

To compress a FASTQ file using DCS Tools:

export PROXY_HOST=<ip>:<port>
$DCS_HOME/bin/dcs SeqArc compress -i input.fastq.gz -o output.arc

Figure 1.1 DCS Tools License Verification Example Diagram

1.4.6 Steps After Upgrading

After obtaining an upgraded version, follow the instructions provided. In general, you do not need to reconfigure the license proxy (section 1.4.4); simply unzip the new package and update the DCS_HOME path to point to the new directory.

1.5 Tool List

| Scenario | Tool (Module) Name | Function |
| --- | --- | --- |
| WGS/WES Analysis | aligner | Quality control, alignment, sorting, and duplicate marking |
| | bqsr | Base quality score recalibration |
| | variantCaller | Variant detection (similar to GATK HaplotypeCaller) |
| | genotyper | GVCF -> VCF |
| | report | Output report |
| Joint Calling for Large Cohorts | jointcaller | Joint variant detection, supports millions of samples |
| FASTQ Data Compression/Decompression | SeqArc | High-efficiency lossless FASTQ compression tool |
| License Query, etc. | lickit | Query available threads, test network connection, upload error logs, etc. |
| Utility Tools | extract-vcf | Simplify VCF files, reduce file size, alleviate I/O pressure; primarily for knownSites (e.g., dbSNP) input to bqsr and genotyper |
| | bam-stat | Alignment statistics |
| | vcf-stat | Variant statistics |
| | tbi2lix | Convert TBI index files to custom linear index files for joint calling |

2. Typical Usage

2.1 WGS

WGS (Whole Genome Sequencing) is a technology that sequences the entire genome of an organism. The WGS analysis pipeline typically includes steps such as data quality control, sequence alignment, duplicate marking, base quality score recalibration, and variant detection. DCS Tools follows the original algorithms while optimizing performance, resulting in a 16x faster analysis pipeline compared to BWA-GATK.

2.1.1 Preparations

Input files for the WGS analysis pipeline:

  1. Files required for sequence alignment:
| Index File | Description |
| --- | --- |
| species.fa | Reference genome FASTA file |
| species.fa.fai | Enables rapid location and random access to FASTA file regions |
| species.dict | Index for quick lookup of reference sequences in subsequent steps |
| species.fa.* | Alignment index files |

Build alignment indexes using the aligner tool. Place the index files in the same directory as the FASTA file. The command to build alignment indexes is:

dcs aligner --threads <INT> --build-index <FASTA>

  2. FASTQ sequencing data; gz format is supported.
  3. Single nucleotide polymorphism (dbSNP) data and known variant sites in VCF format.

The WGS analysis pipeline provided by DCS Tools includes the following steps:

  1. Sequence alignment and duplicate marking: Quality control is performed before alignment. The sequences are aligned to the reference genome recorded in the FASTA file, and the alignment results are written to a BAM file after duplicate marking.
  2. Base quality score recalibration: Adjusts the quality scores assigned to each base in the sequence to eliminate experimental biases introduced by sequencing methods.
  3. Variant detection: Identifies variant sites in the sequencing data relative to the reference genome and calculates genotypes for each sample at these sites.

2.1.2 Sequence Alignment and Duplicate Marking

Align sequences to the reference genome to determine the position of each sequence, providing a foundation for variant detection. Duplicate marking eliminates PCR-amplified duplicate sequences, reducing false positives and improving variant detection accuracy. These steps are completed by the aligner tool. The aligner also integrates data quality control functionality, filtering and processing sequences before alignment to reduce error rates and avoid interference from noisy data in subsequent analyses.

dcs aligner --threads             THREADNUM \
            --aln-index           INDEX \
            --fq1                 FASTQ1 \
            --fq2                 FASTQ2 \
            --read-group          RGROUP \
            --bam-out             BAMFILE \
            --fastq-qc-dir        QCDIR \
            --high-speed-storage  TEMPDIR

The command requires the following inputs:

  1. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.
  2. INDEX: Path to the reference genome FASTA file. Place the alignment index files built by the alignment software in the same directory as the FASTA file.
  3. FASTQ1: Path to the FASTQ1 file for paired-end sequencing.
  4. FASTQ2: Path to the FASTQ2 file for paired-end sequencing.
  5. RGROUP: Format is @RG\tID:GROUP_NAME\tSM:SAMPLE_NAME\tPL:PLATFORM, where GROUP_NAME is the read group identifier, SAMPLE_NAME is the sample name, and PLATFORM is the sequencing platform.
  6. BAMFILE: BAM file after duplicate marking.
  7. QCDIR: Output location for the quality control report.
  8. TEMPDIR: Storage path for temporary files. Placing this path on high-speed storage devices can improve software performance.
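Because the --read-group string contains backslash-t escapes, quoting matters: the escapes must reach the aligner literally rather than being expanded by the shell. A sketch of building the string (the group name, sample name, and platform below are example values, not defaults of the tool):

```shell
# Build the --read-group string. Keeping the string quoted preserves the
# literal "\t" escapes, so the aligner (not the shell) interprets the tabs.
# GROUP_NAME, SAMPLE_NAME, and PLATFORM are example values.
GROUP_NAME=lane1
SAMPLE_NAME=NA12878
PLATFORM=ILLUMINA
RGROUP="@RG\tID:${GROUP_NAME}\tSM:${SAMPLE_NAME}\tPL:${PLATFORM}"
printf '%s\n' "$RGROUP"

# Example invocation (all paths are placeholders):
# dcs aligner --threads 32 --aln-index /ref/species.fa \
#             --fq1 sample_R1.fq.gz --fq2 sample_R2.fq.gz \
#             --read-group "$RGROUP" --bam-out sample.bam \
#             --fastq-qc-dir qc/ --high-speed-storage /ssd/tmp
```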

2.1.3 Base Quality Score Recalibration

Due to systematic errors, the base quality scores of sequences are not always entirely accurate. Recalibration corrects these biases, generating more precise quality scores to improve variant detection accuracy and reduce false positives and negatives. This step aligns with GATK's best practices and is completed by the bqsr tool.

dcs bqsr --threads          THREADNUM \
         --in-bam           BAMFILE \
         --reference        FASTA \
         --known-sites      KNOWN_SITES \
         --recal-table      RECAL_TABLE

The command requires the following inputs:

  1. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.
  2. BAMFILE: Input file. The BAM file generated after sequence alignment and duplicate marking.
  3. FASTA: Path to the reference genome FASTA file. Ensure it is the same as the one used in the sequence alignment stage.
  4. KNOWN_SITES: Single nucleotide polymorphism database data and known variant sites file.
  5. RECAL_TABLE: Output file. Saves information for recalibrating base quality scores.

2.1.4 Variant Detection

Detects SNP and INDEL variants, generates a GVCF file, and produces BAM statistics. The statistics file is stored in the same directory as the GVCF file, named bamStat.txt. This step is completed by the variantCaller tool.

dcs variantCaller -I INPUT_BAM \
                  -O OUTPUT_GVCF \
                  -R REFERENCE \
                  --recal-table RECAL_TABLE \
                  -t THREADNUM

The command requires the following inputs:

  1. INPUT_BAM: The BAM file output from step 2.1.2 (sorted and duplicate-marked).
  2. OUTPUT_GVCF: Output GVCF file.
  3. REFERENCE: Reference genome FASTA file.
  4. RECAL_TABLE: Input file containing information for base quality score recalibration, generated in step 2.1.3.
  5. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.

For a single sample, if only a VCF file containing variant records is needed, the genotyper tool can be used.

dcs genotyper -I INPUT_GVCF \
              -O OUTPUT_VCF \
              -s QUAL \
              -t THREADNUM

The command requires the following inputs:

  1. INPUT_GVCF: GVCF file from variantCaller.
  2. OUTPUT_VCF: Output VCF file.
  3. QUAL: Variant quality score threshold. Records with quality scores greater than or equal to QUAL will be output.
  4. THREADNUM: Number of threads. It is recommended not to exceed the number of CPU cores.
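The QUAL threshold behaves like a simple filter on the VCF quality column: records with a quality score at or above the threshold are kept. A conceptual illustration on toy VCF body lines (not dcs output; column 6 is QUAL):

```shell
# The -s/--stand-call-conf filter keeps records with QUAL >= threshold.
# Toy tab-separated VCF body lines (CHROM POS ID REF ALT QUAL) for illustration.
printf 'chr1\t100\t.\tA\tG\t35.2\nchr1\t200\t.\tC\tT\t8.1\nchr1\t300\t.\tG\tA\t10.0\n' > toy.vcf.body
awk -F'\t' -v q=10 '$6 >= q' toy.vcf.body   # keeps the 35.2 and 10.0 records
```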

2.1.5 Report Generation

Consolidates statistics from each sample to generate a visual HTML report.

dcs report --sample SAMPLE_NAME \
           --filter QCDIR \
           --bam_stat BAMStat \
           --vcf_stat VCFStat \
           --output OUTPUT_DIR

The command requires the following inputs:

  1. SAMPLE_NAME: Sample name.
  2. QCDIR: Quality control statistics report folder generated by the aligner.
  3. BAMStat: bamStat.txt in the variantCaller result directory.
  4. VCFStat: VCF statistics results, defaulting to vcfStat.txt in the working directory.
  5. OUTPUT_DIR: Report output directory.
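The four steps above (sections 2.1.2 through 2.1.5) chain naturally into one script. The sketch below uses placeholder paths and sample names, and guards the dcs calls so the script is a safe dry-run on machines where the tool is not installed; passing --emit-ref-confidence GVCF to variantCaller is an interpretation based on the parameter reference in section 3.3, where the default is NONE.

```shell
# End-to-end WGS sketch chaining sections 2.1.2-2.1.5.
# All paths and the sample name are placeholders.
set -e
SAMPLE=NA12878
REF=/ref/species.fa          # .fai, .dict, and aligner index files alongside
KNOWN=/ref/dbsnp.vcf.gz
THREADS=32

if command -v dcs >/dev/null 2>&1; then
  dcs aligner --threads "$THREADS" --aln-index "$REF" \
      --fq1 ${SAMPLE}_1.fq.gz --fq2 ${SAMPLE}_2.fq.gz \
      --read-group "@RG\tID:lane1\tSM:${SAMPLE}\tPL:ILLUMINA" \
      --bam-out ${SAMPLE}.bam --fastq-qc-dir qc/ --high-speed-storage /ssd/tmp
  dcs bqsr --threads "$THREADS" --in-bam ${SAMPLE}.bam --reference "$REF" \
      --known-sites "$KNOWN" --recal-table ${SAMPLE}.recal.table
  dcs variantCaller -I ${SAMPLE}.bam -O ${SAMPLE}.g.vcf.gz -R "$REF" \
      --recal-table ${SAMPLE}.recal.table -t "$THREADS" \
      --emit-ref-confidence GVCF
  dcs genotyper -I ${SAMPLE}.g.vcf.gz -O ${SAMPLE}.vcf.gz -s 10.0 -t "$THREADS"
  dcs report --sample "$SAMPLE" --filter qc/ --bam_stat bamStat.txt \
      --vcf_stat vcfStat.txt --output report/
fi
```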

2.2 WES

The steps are largely the same as WGS, with the only difference being in the variantCaller step, which requires an additional interval parameter (-L).

dcs variantCaller -I INPUT_BAM \
                  -O OUTPUT_GVCF \
                  -R REFERENCE \
                  --recal-table RECAL_TABLE \
                  -L INTERVAL_FILE \
                  -t THREADNUM

INTERVAL_FILE: BED file recording the WES target intervals. By convention, each line follows the format chr<tab>start<tab>end, where the start and end positions form a 0-based, left-closed, right-open interval.
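The 0-based, half-open convention means an interval's length is simply end minus start. A toy BED file (hypothetical coordinates) makes this concrete:

```shell
# BED intervals are 0-based, left-closed, right-open: length = end - start.
# A record covering the first 100 bases of chr1 is therefore "chr1 0 100".
printf 'chr1\t0\t100\nchr1\t150\t250\n' > targets.bed
# Print each interval in 1-based inclusive form alongside its length.
awk -F'\t' '{print $1 ":" ($2+1) "-" $3, "len=" ($3-$2)}' targets.bed
```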

2.3 Joint Calling

Joint Calling is a method for detecting population variants, specifically referring to the process of merging variants from multiple samples and recalculating sample genotypes using multiple GVCF files as input. DPGT is a distributed Spark program that efficiently performs joint calling for large-scale samples. dcs jointcaller is a wrapper for the DPGT tool, used to run DPGT in standalone mode.

# Set JAVA_HOME to jdk8
export JAVA_HOME=/path_to/jdk8
# Add spark bin to PATH for calling spark-submit
export PATH=/path_to/spark/bin/:$PATH

dcs jointcaller \
    -i GVCF_LIST \
    -r REFERENCE \
    -o OUTDIR \
    -j JOBS \
    --memory MEMORY \
    -l REGION

DPGT depends on Java 8. Set JAVA_HOME=/path_to/jdk8. Spark version 2.4.5 or higher is recommended. Download Spark 2.4.5 from: <https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.6.tgz>

The command requires the following inputs:

GVCF_LIST: Input GVCF list, a text file with one GVCF path per line. Note that GVCFs must have matching TBI index files. The program defaults to searching for TBI index files by appending .tbi to the GVCF path.

REFERENCE: Path to the reference genome FASTA file. Ensure this FASTA file is consistent with the one used to generate the GVCFs. The program also requires matching .dict and .fai files for the FASTA file (e.g., if the FASTA file is hg38.fasta, the program will also read hg38.dict and hg38.fasta.fai).

OUTDIR: Output folder. The resulting VCF path is OUTDIR/result.*.vcf.gz.

JOBS: Number of parallel tasks. The number of threads used is equal to JOBS, an integer ranging from 1 to the number of machine threads.

MEMORY: Maximum Java heap memory size. The unit is "g" or "m". Memory size is related to the JOBS value and sample size. For 1,000 samples, allocate 2g per job; for 2,000-5,000 samples, allocate 4g per job; for 5,000-100,000 samples, allocate 6g per job; for 100,000-500,000 samples, allocate 8g per job.

REGION: Target region. Supports input strings or BED paths. The region string format is consistent with bcftools, e.g., "chr20:1000-2000", "chr20:1000-", "chr20". To specify multiple regions, use multiple -l parameters, e.g., -l chr1 -l chr2:1000-2000.
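The per-job heap tiers above can be turned into a small helper for choosing --memory. This is a sketch under two stated assumptions: tier boundaries are interpreted as inclusive upper bounds, and the total heap is taken as per-job heap times JOBS.

```shell
# Per-job heap size (in GB) by cohort size, following the tiers above.
# Boundary handling between tiers is an interpretation of the text.
per_job_gb() {
  n=$1
  if   [ "$n" -le 1000 ];   then echo 2
  elif [ "$n" -le 5000 ];   then echo 4
  elif [ "$n" -le 100000 ]; then echo 6
  else                           echo 8
  fi
}

# Example: 10 parallel jobs on a 3,000-sample cohort.
JOBS=10
SAMPLES=3000
echo "--memory $(( $(per_job_gb $SAMPLES) * JOBS ))g"
```

For the example values this prints `--memory 40g` (4 GB per job, 10 jobs).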

2.4 FASTQ Compression

SeqArc can perform compression and decompression with or without reference sequences. Compression ratios are generally higher with reference sequences, while compression and decompression speeds are faster without reference sequences.

| File | Description |
| --- | --- |
| *.fa | Reference sequence file used |
| *.fa.mini_28_10 | Short-read alignment index file generated from the reference sequence |
| *.fa.similar_26_23 | Long-read alignment index file generated from the reference sequence |
| *.dmp | Input file for building species identification indexes |
| *.cdb | Species identification library index file generated from the reference sequence |
| *.arc | Compressed result file |

For compression, input FASTQ files can be plain text files, gzip files, mgz files, or piped input.

For decompression, the output is a text file. If no output path is specified, it will output to standard output. If the compressed input is an mgz file, use the --rawtype parameter during decompression to output an mgzip file.

2.4.1 Creating Reference Sequence Index Files

  1. Directly create index files for the reference sequence:

dcs SeqArc index -r REF -o OUTDIR

  2. Alternatively, create the species identification library indexes and the index files for each species in two steps:

# i. Create species identification library indexes. This generates clsf.cdb, clsf_del.cdb, clsf_opt.cdb, and clsf_tax.cdb in REFDIR.
dcs SeqArc index -K REFDIR -H CNT

# ii. Create alignment indexes for each species
dcs SeqArc index -r REF -o REFDIR

The above commands require the following parameters:

  1. REF: Path to the input reference sequence.
  2. REFDIR: Directory containing multi-species reference sequences.
  3. OUTDIR: Output directory path.
  4. CNT: Minimizer hit count.

2.4.2 Using the Compression Command

  1. If the user knows the species of the FASTQ, use the -r parameter to select the corresponding reference sequence for compression:

dcs SeqArc compress -r REF \
                    -i FASTQ \
                    -o ARC \
                    -t THREADNUM

  2. If the user does not know the specific species, use the species identification library for reference-based compression. The program analyzes the species of the input data using the identification library and then uses the corresponding index files for reference-based compression:

dcs SeqArc compress -K REFDIR \
                    -i FASTQ \
                    -o ARC \
                    -t THREADNUM

  3. If the user has no reference sequence, reference-free compression can also be performed:

dcs SeqArc compress -i FASTQ \
                    -o ARC \
                    -t THREADNUM

  4. The user can also package an entire directory:

dcs SeqArc compress -K REFDIR \
                    -a FASTQDIR \
                    -o ARC \
                    -t THREADNUM

The command requires the following parameters:

  1. REF: Path to the input reference sequence.
  2. REFDIR: Directory containing multi-species reference sequences.
  3. FASTQ: Path to the input FASTQ file.
  4. ARC: Path to the output compressed file.
  5. THREADNUM: Number of threads to set.
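A quick way to try reference-free compression is on a tiny synthetic FASTQ, which also shows the four-line record structure (ID, sequence, separator, quality). The dcs call is guarded so the sketch is harmless on machines where the tool is not installed.

```shell
# Make a tiny 2-record FASTQ (4 lines per record: @ID, sequence, '+', quality)
# and try reference-free compression on it. dcs runs only if installed.
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n@read2\nTTGGCCAA\n+\n::::::::\n' > toy.fastq
records=$(( $(wc -l < toy.fastq) / 4 ))
echo "records=$records"
if command -v dcs >/dev/null 2>&1; then
  dcs SeqArc compress -i toy.fastq -o toy.arc -t 4
fi
```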

2.4.3 Using the Decompression Command

  1. If the user knows the specific reference sequence used for compression, use the -r parameter to select the corresponding species' reference sequence for decompression:

dcs SeqArc decompress -r REF \
                      -i ARC \
                      -o FASTQ \
                      -t THREADNUM

  2. If the species identification library was used for compression, specify the identification library path during decompression:

dcs SeqArc decompress -r REFDIR \
                      -i ARC \
                      -o FASTQ \
                      -t THREADNUM

  3. If reference-free compression was performed, decompression is also reference-free:

dcs SeqArc decompress -i ARC \
                      -o FASTQ \
                      -t THREADNUM

The command requires the following parameters:

  1. REF: Path to the input reference sequence.
  2. REFDIR: Directory containing multi-species reference sequences.
  3. FASTQ: Path to the output FASTQ file.
  4. ARC: Path to the input compressed file.
  5. THREADNUM: Number of threads to set.

2.4.4 Decompression Using Extraction Tools

If the customer has not purchased DCS Tools but needs to decompress arc-format files, they can use a standalone free decompression tool to extract the data. The decompression tool software package (Seqarc_decompress) can be obtained from sales.

  1. If the user knows the specific reference sequence used for compression, use the -r parameter to select the corresponding species' reference sequence for decompression:

Seqarc_decompress -r REF \
                  -i ARC \
                  -o FASTQ \
                  -t THREADNUM

  2. If the data was compressed using a species identification library, the path to the identification library must be specified during decompression:

Seqarc_decompress -r REFDIR \
                  -i ARC \
                  -o FASTQ

  3. If reference-free compression was performed, decompression is also reference-free:

Seqarc_decompress -i ARC \
                  -o FASTQ \
                  -t THREADNUM

The command requires the following parameters:

  1. REF: The path to the input reference sequence.
  2. REFDIR: The directory containing multi-species reference sequences.
  3. FASTQ: The output path for the FASTQ file.
  4. ARC: The path to the input compressed file.
  5. THREADNUM: The number of threads to be set.

3. Detailed Tool Parameter Descriptions

3.1 aligner

Supported command-line parameters for aligner:

  • --threads: Number of threads. Recommended not to exceed the number of CPU cores.
  • --fq1: Input FASTQ1 file. For multiple files, supports comma separation or file lists.
  • --fq2: Input FASTQ2 file. For multiple files, supports comma separation or file lists.
  • --seqarc-ref: Index file required for decompressing SeqArc format files.
  • --read-group: Format is @RG\tID:GROUP_NAME\tSM:SAMPLE_NAME\tPL:PLATFORM, where GROUP_NAME is the read group identifier, SAMPLE_NAME is the sample name, and PLATFORM is the sequencing platform.
  • --build-index: Build index files required for alignment tools.
  • --aln-index: FASTA path required for sequence alignment. Ensure the accompanying index files are in the same directory as the FASTA file.
  • --mark-split-hits-secondary: Mark shorter split hits as secondary.
  • --use-soft-clipping-for-sup-align: Use soft clipping for supplementary alignment.
  • --no-markdup: Do not perform duplicate marking.
  • --bam-out: Output BAM file.
  • --high-speed-storage: Temporary file storage path. Placing this path on high-speed storage devices can improve performance.
  • --fastq-qc-dir: Output directory for FASTQ quality control reports.
  • --no-filter: Do not filter FASTQ data.
  • --output-clean-fq: Output clean reads.
  • --qualified-quality-threshold: Qualified base quality threshold (default: 12).
  • --low-quality-ratio: Low-quality base ratio threshold (default: 0.5).
  • --mean-qual: Mean quality threshold.
  • --quality-offset: Quality value offset. Typical values are 33 or 64. If set to 0, the value is inferred by the program (default: 33).
  • --adapter1: Adapter sequence for read1, in uppercase.
  • --adapter2: Adapter sequence for read2, in uppercase.
  • --adapter-max-mismatch: Maximum allowed base mismatches during adapter search (default: 2).
  • --adapter-match-ratio: Minimum matching length ratio for adapters (default: 0.5).
  • --trim-adapter: Trim adapters from read sequences instead of discarding reads.
  • --min-read-len: Minimum read length threshold (default: 30).
  • --n-base-ratio: N base ratio threshold (default: 0.1).
  • --trim-ends: Trim a specified number of bases from both ends of the reads, with the format being comma-separated integers.

3.2 bqsr

Supported command-line options for bqsr:

  • --threads: Number of threads. Recommended not to exceed the number of CPU cores.
  • --in-bam: Duplicate-marked BAM file.
  • --out-bam: Path to store the BAM file after base quality score recalibration. If this parameter is omitted, only the recalibration table is generated.
  • --reference: Reference genome FASTA file. Must be the same as the one used in the sequence alignment stage.
  • --recal-table: Output file containing information for base quality score recalibration.
  • --known-sites: Single nucleotide polymorphism database data and known variant sites file. Supports text and gz compressed formats.
  • --interval: BED file.
  • --interval-padding: Number of bases to pad on both sides of the BED interval.

3.3 variantCaller

Detailed parameter descriptions for variantCaller:

  • -I, --input <file1> <file2>...: Input BAM file paths. Multiple BAM paths can be accepted, separated by spaces.
  • -O, --output <file>: Output VCF/GVCF BGZF file. The program automatically creates the corresponding index file.
  • -R, --reference <file>: Reference genome FASTA file path.
  • --recal-table <file>: Input file containing information for base quality score recalibration.
  • -H, --help: Print help information.
  • --version: Print version number.
  • -L, --interval <file>: Target interval BED file path.
  • -P, --interval-padding <int>: Number of bases to extend on both sides of the interval. Must be a non-negative integer (default: 0).
  • -Q, --base-quality-score-threshold <int>: Base quality score threshold. Bases with quality scores below this threshold will be set to the minimum value of 6 (default: 18).
  • -D, --max-reads-depth <int>: Maximum number of reads retained per alignment start position. Reads exceeding this threshold will be downsampled (default: 50).
  • -G, --gvcf-gq-bands <int1> <int2>...: Reference confidence model GQ groups. Input multiple integers separated by spaces (default: [1,2,3,...,59,60,70,80,90,99]).
  • -t, --threads <int>: Number of threads used by the program, ranging from 1 to 128 (default: 30).
  • --pcr-indel-model <str>: PCR indel model. Parameter range: {NONE, HOSTILE, CONSERVATIVE, AGGRESSIVE} (default: CONSERVATIVE).
  • --emit-ref-confidence <str>: Run the reference confidence scoring model. Parameter range: {NONE, BP_RESOLUTION, GVCF}. NONE: Do not run the reference confidence scoring model; BP_RESOLUTION: Run the reference confidence scoring model, outputting a variant record for each site; GVCF: Run the reference confidence scoring model, grouping reference confidence sites (ALT=<NON_REF>) by GQ value according to the -G parameter (default: NONE).
  • --compression-level <int>: Output VCF/GVCF BGZF file compression level, ranging from 0 to 9 (default: 6).

3.4 genotyper

The genotyper processes single-sample GVCF files, outputting VCF files containing only variant records. For joint calling of multiple GVCF files, use the high-performance DPGT software. Detailed parameter descriptions for genotyper:

  • -I, --input <file>: Input GVCF file path.
  • -O, --output <file>: Output VCF BGZF file containing genotypes. The program automatically creates the corresponding index file.
  • -s, --stand-call-conf <float>: Variant quality score threshold. Variants with quality scores greater than or equal to this value will be output (default: 10.0).
  • -D, --dbsnp <file>: dbSNP file.
  • -t, --threads <int>: Number of threads used by the program (default: 6).
  • --version: Print version number.
  • -h, --help: Print help information.

3.5 jointcaller (DPGT)

jointcaller is a wrapper for the DPGT tool, used to conveniently run DPGT in standalone mode. DPGT is a distributed joint calling tool. Its detailed parameter descriptions are as follows:

  • -i, --input <FILE>: Input GVCF file list, a text file with one GVCF path per line.
  • --indices <FILE>: Optional parameter. The program defaults to searching for index files by appending .tbi to the GVCF path. This parameter can specify the index file path, requiring a text file with one index path per line, matching the order of GVCF paths.
  • --use-lix <Boolean>: Use LIX index format. LIX files are simplified index files implemented by DPGT. TBI indexes can be converted to LIX indexes using the tbi2lix program. LIX indexes are approximately 1/80 the size of TBI indexes. Options: true, false (default: false).
  • --header <FILE>: Optional parameter. Use a precomputed VCF header file to avoid repeated calculations.
  • -o, --output <DIR>: Output folder.
  • -r, --reference <FILE>: Reference genome FASTA file.
  • -l, --target-regions <STRING>: Target region string or BED file. Defaults to processing the entire genome.
  • -j, --jobs <INT>: Number of parallel tasks (default: 10).
  • -n, --num-combine-partitions <INT>: Number of partitions during the variant merging phase. Defaults to the square root of the sample count, rounded down (default: -1).
  • -w, --window <INT>: Region size processed by each merge-genotype calculation (default: 300M).
  • -d, --delete <Boolean>: Whether to delete temporary files. Options: true, false (default: true).
  • -s, --stand-call-conf <FLOAT>: Quality score threshold. Variant records with quality scores greater than or equal to this value will be output (default: 30.0).
  • --dbsnp <FILE>: dbSNP VCF file path, used for annotating dbSNP IDs.
  • --use-old-qual-calculator <Boolean>: Whether to use the old quality score calculator. When set to true, the old calculator is used, matching the results of GATK v4.1.0 with --use-old-qual-calculator true --use-new-qual-calculator false. When set to false, the new calculator is used, matching the results of GATK v4.1.1 and later versions. Options: true, false (default: true).
  • --heterozygosity <FLOAT>: Prior probability of heterozygous SNP variants (default: 0.001).
  • --indel-heterozygosity <FLOAT>: Prior probability of heterozygous INDEL variants (default: 1.25E-4).
  • --heterozygosity-stdev <FLOAT>: Standard deviation of prior probabilities for SNP and INDEL variants (default: 0.01).
  • --max-alternate-alleles <INT>: Maximum number of alternate alleles. For variants exceeding this number, the program retains only the most probable alleles up to this count (default: 6).
  • --ploidy <INT>: Ploidy (default: 2).
  • -h, --help: Print help information and exit.
  • -V, --version: Print version number and exit.
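
As a sketch, a minimal standalone invocation using the parameters above might look like the following. All paths are placeholders, and the command is composed and printed rather than executed, since a deployed DCS_HOME and a running license proxy are required:

```shell
# Placeholder inputs -- adjust to your own data layout.
GVCF_FOFN=gvcf.fofn      # text file, one GVCF path per line; each GVCF needs a .tbi index
OUT_DIR=joint_out        # output folder
FA=ref/genome.fa         # reference FASTA with matching .fai and .dict files

# Compose the command; swap 'echo' for direct execution on a deployed system.
CMD="$DCS_HOME/bin/dcs jointcaller -i $GVCF_FOFN -o $OUT_DIR -r $FA -j 16 -l chr20 -s 30.0"
echo "$CMD"
```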

3.6 SeqArc

3.6.1 Index Creation Functionality, Subcommand index

Parameters for creating indexes:

  • -h, --help: Get help information for index creation parameters.
  • -r, --ref <file>: Set the path to the input reference sequence file (FASTA).
  • -K, --clsdb <dir>: Set the directory containing reference sequence files for multiple species. The directory must contain names.dmp and nodes.dmp, which can be downloaded from: http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
  • -H, --hitlimit <int>: Set the minimizer hit count for taxonomy (default: 2).
  • -t, --thread <int>: Set the number of threads (default: 8).
  • -o, --output <dir>: Set the output directory for storing generated index files for the reference sequence.
  • --aligntype <int>: Set the alignment method for generating index files. Options: 1 (hash), 2 (minimizer), 3 (longseq). Default: 2.

If the species of the FASTQ file is known, use the -r parameter to select the corresponding reference sequence and the -o parameter to set where the generated index file is saved; this index file can then be used for compression and decompression. If the species is unknown, first use the -K parameter to build a species identification library index over all reference sequences in a directory, then use the -r parameter to create an index file for each reference sequence. In this case, the directory specified by -o must be the same as the directory given to -K.
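
The two workflows above can be sketched as follows. Directory and file names are placeholders, and the commands are composed and printed rather than executed:

```shell
# (a) Species known: index one reference directly into an output directory.
CMD_KNOWN="$DCS_HOME/bin/dcs SeqArc index -r refs/hg38.fa -o arc_index"

# (b) Species unknown: build the classification library over a directory of
#     references, then index each reference into that SAME directory.
CMD_CLSDB="$DCS_HOME/bin/dcs SeqArc index -K refs -o refs"
CMD_EACH="$DCS_HOME/bin/dcs SeqArc index -r refs/speciesA.fa -o refs"

printf '%s\n' "$CMD_KNOWN" "$CMD_CLSDB" "$CMD_EACH"
```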

3.6.2 Compression Function, Subcommand compress

The compression parameters are as follows:

  • -h, --help: Get help instructions for compression parameters.
  • -r, --ref <file>: Set the path of the input reference sequence. The corresponding index file must exist in this path.
  • -K, --clsdb <dir>: Set the directory containing reference sequence files for multiple species. This directory must contain the generated species identification library file and the index files for each species.
  • -H, --hitlimit <int>: Set the minimizer hit count for taxonomy, default is 2. Keep this consistent with the creation parameters.
  • -J, --clsmin <float>: Set the minimum hit rate for species identification, default is 0.6.
  • -L, --clsnum <int>: Set the number of reads to be read from the fastq file for species identification, default is 10000.
  • -t, --thread <int>: Set the number of threads, default is 8.
  • -f, --force: Force overwrite existing files with the same name.
  • -i, --input <file>: Set the input fastq file.
  • --pipein: Specify if the input is pipeline data.
  • -o, --output <file>: Set the output file path. The .arc suffix will be automatically added.
  • --blocksz <int>: Set the size of each data block during compression, default is 50.
  • -ccheck: After compression is completed and data is written to disk, perform a decompression check to ensure data integrity.
  • --calcmd5: Calculate the MD5 of the input file during compression and save it to the compressed file.
  • --aligntype <int>: Set the compression alignment method, default is 2.
  • -a, --archive <dir>: Set the directory to be compressed and archived.
  • --slevel <int>: Set the compression level for sequences, default is 10.
  • --verify <int>: Set the verification method for each data block: 0 (no verification, default), 1 (verify the entire block), or 2 (verify name, seq, and qual separately).
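
A hedged compression sketch combining the integrity options described above (-ccheck and --calcmd5). Paths are placeholders, and the command is printed rather than executed:

```shell
# The .arc suffix is appended automatically, so the output becomes sample.arc.
CMD_COMPRESS="$DCS_HOME/bin/dcs SeqArc compress -t 8 -r arc_index/hg38.fa -i sample.fq.gz -o sample -ccheck --calcmd5"
echo "$CMD_COMPRESS"
```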

3.6.3 Decompression Function, Subcommand decompress

The decompression parameters are as follows:

  • -h, --help: Get help instructions for decompression parameters.
  • -r, --ref <dir/file>: Set the path of the input reference sequence. The corresponding index file must exist in this path. Alternatively, specify the species identification library directory, which must contain the index files for each species.
  • -t, --thread <int>: Set the number of threads, default is 8.
  • -f, --force: Force overwrite existing files with the same name.
  • -i, --input <file>: Set the path of the input compressed file.
  • -o, --output <file>: Set the path of the output decompressed file.
  • -rawtype: Ensure the decompressed output file format matches the original compressed file format. This only applies to mgz format files.
  • -dcheck: After decompression is completed and data is written to disk, read the decompressed data blocks and verify them to ensure integrity.
  • -x, --extract: Extract one or more files from the compressed archive. The input file must be generated by -a or --archive.
  • -l, --list: Display the file information in the compressed archive. The input file must be generated by -a or --archive.
  • --show: Display the text MD5 value of the compressed file. This requires --calcmd5 to be specified during compression.
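
The archive-related options (-l, -x, --show) can be sketched as follows. File names are placeholders, and the commands are printed rather than executed:

```shell
# List the contents of an archive produced with -a/--archive.
CMD_LIST="$DCS_HOME/bin/dcs SeqArc decompress -l -i batch.arc"
# Extract files from such an archive.
CMD_EXTRACT="$DCS_HOME/bin/dcs SeqArc decompress -x -i batch.arc -o out_dir"
# Show the stored MD5 (requires --calcmd5 at compression time).
CMD_MD5="$DCS_HOME/bin/dcs SeqArc decompress --show -i sample.arc"

printf '%s\n' "$CMD_LIST" "$CMD_EXTRACT" "$CMD_MD5"
```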

3.7 license-proxy

Executable file parameters:

  • -h, --help: Print help information.
  • --license <LICENSE_PATH>: Path to the license file.
  • --bind <BIND_ADDR>: Listening address.
  • -d, --daemon: Start in the background.

3.8 report

Organize, visualize, and generate an HTML report for WGS/WES analysis results. Parameters:

  • --sample: Sample name.
  • --filter: Path to the filter directory.
  • --bam_stat: Path to the BAM statistics file.
  • --vcf_stat: Path to the VCF statistics file.
  • --output: Directory for the output report.
  • --is_se: (Optional) Add this flag if single-end sequencing is used.

3.9 lickit

Used to check network connectivity, query the number of available threads for the user, and upload issue logs.

  • --ping <ip:port>: Check network connectivity.
  • --query <ip:port>: Query the number of available threads for the user.
  • --uplog <ip:port logpath [msg]>: Actively upload issue logs.

3.10 Utility tools

3.10.1 extract-vcf

A standard VCF record contains 8 fixed columns. Columns 6, 7, and 8 (QUAL, FILTER, and INFO) occupy significant space but are not required by the program. To reduce I/O pressure, this tool extracts only the first 5 columns from the VCF file and replaces the last 3 columns with *. Main parameters:

  • --known-sites <file>: Input VCF file, supporting both text and gz compressed formats.
  • --output-mini-vcf <file>: Output VCF file. If the output filename includes gz, it will be in compressed format.
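
The column transformation itself can be illustrated with a plain awk one-liner on a single VCF record. This is a behavior sketch only, not the tool's implementation:

```shell
# One tab-separated VCF body line: CHROM POS ID REF ALT QUAL FILTER INFO.
line=$(printf 'chr20\t100\trs1\tA\tG\t50\tPASS\tDP=30')

# Keep the first 5 columns, replace the last 3 with "*".
mini=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$4,$5,"*","*","*"}')

echo "$mini"   # chr20 100 rs1 A G * * *  (tab-separated)
```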

3.10.2 bam-stat

Generate alignment statistics based on the BAM file. In the WGS pipeline, this function is integrated into variantCaller but can also be run separately. Parameters:

  • --bam <file>: Path to the input BAM file.
  • --bed <file>: Target regions. For WGS data, specify non-N regions (if unavailable, use the BAM statistics results from variantCaller). For WES data, specify the target region file.
  • --output <file>: Path to the output file.

3.10.3 vcf-stat

Generate variant statistics based on the genotyped VCF file, including SNP counts. Parameters:

  • --vcf <file>: Path to the input VCF file, supporting both text and gz formats. Results are output to standard output by default and can be redirected to a specified file (refer to Step 5 in Section 4.1).

3.10.4 tbi2lix

  • -i, --input FILE: Input TBI index file.
  • -o, --output FILE: Output LIX index file.
  • -m, --min-shift INT: Set the minimum interval size for the LIX linear index to 2^INT (default is 14, corresponding to a 16K interval size).
  • -h, --help: Print help information and exit.
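
The --min-shift parameter sets the linear-index interval to 2^INT bases; a quick arithmetic check confirms that the default of 14 corresponds to the documented 16K interval:

```shell
# Interval size = 2^MIN_SHIFT bases.
MIN_SHIFT=14
INTERVAL=$((1 << MIN_SHIFT))
echo "$INTERVAL"   # 16384
```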

4. Quick Start—Using Example Scripts and Data

After deploying the software as described in Section 1.4, download the quick start data package (cloud platform download link), unzip it, and modify the DCS_HOME and PROXY_HOST values in the example script. DCS_HOME is the path where the software package is unzipped, and PROXY_HOST is the IP and port (default 8909) specified during deployment on the networked node. Execute the corresponding example script for different application scenarios.

tar xzvf dcs-quick-start-xxxx.tar.gz
cd dcs-quick-start

4.1 WGS

Reference script: quick_start_wgs.sh

#!/bin/bash

# This script provides a quick start guide for using the dcs tool to analyze whole-genome sequencing data.
# The script demonstrates how to build the index of the reference genome, align the reads, 
# recalibrate the base quality scores, call variants, and filter the variants.
# if you have any questions, please contact us at cloud@stomics.tech

# set the location of the dcs tool
export DCS_HOME=/path/to/dcs

if [ ! -d "$DCS_HOME" ]; then
    echo "--- Error: DCS_HOME directory does not exist: $DCS_HOME"
    exit 1
fi
if [ ! -r "$DCS_HOME" ]; then
    echo "--- Error: No read permission for DCS_HOME: $DCS_HOME"
    exit 1
fi

# Before using the dcs tool, you must obtain a license file and start the authentication proxy program 
# on a node that can connect to the internet.

# Get the local IP address, which is used for license application. 
# The output data field of the following command indicates the IP address.
# curl -X POST https://cloud.stomics.tech/api/licensor/licenses/anon/echo

# start the proxy program
# LICENSE=<path/to/license/file>
# $DCS_HOME/bin/dcs licproxy --license $LICENSE

# Set the following environment variables
export PROXY_HOST=127.0.0.1:8909

# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST

# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST

# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")

# Set the following variables for the dcs tool
FQ1=$SCRIPT_DIR/data/fastq/simu.r1.fq.gz
FQ2=$SCRIPT_DIR/data/fastq/simu.r2.fq.gz
FA=$SCRIPT_DIR/data/ref/chr20.fa
KN1=$SCRIPT_DIR/data/known_sites/1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz
KN2=$SCRIPT_DIR/data/known_sites/dbsnp_151.hg38.chr20.vcf.gz
KN3=$SCRIPT_DIR/data/known_sites/hapmap_3.3.hg38.chr20.vcf.gz
KN4=$SCRIPT_DIR/data/known_sites/Mills_and_1000G_gold_standard.indels.hg38.chr20.vcf.gz

nthreads=8 
sample=simu
group="@RG\\tID:sample\\tLB:sample\\tSM:sample\\tPL:COMPLETE\\tCN:DCS"

# run the wgs pipeline ------------------------------------------------------------------------------------

# check the existence of the index files
for file in $FA.0123 $FA.amb $FA.ann $FA.bwt $FA.index1 $FA.index2 $FA.pac $FA.sa;
do
    if [ ! -f $file ]; then
        echo "File $file does not exist, try to build the index ..."
        $DCS_HOME/bin/dcs aligner --threads 16 --build-index $FA
        break;
    fi
done

# Step 0: Set the working directory
workDir=$SCRIPT_DIR/result_wgs
mkdir -p $workDir
mkdir -p $workDir/${sample}.qcdir 

# Step 1: Alignment
$DCS_HOME/bin/dcs aligner --threads $nthreads --aln-index $FA --fq1 $FQ1 --fq2 $FQ2 --read-group $group --no-filter --fastq-qc-dir $workDir/${sample}.qcdir --bam-out $workDir/${sample}.align.bam

# Step 2: base quality score recalibration
$DCS_HOME/bin/dcs bqsr --threads $nthreads --in-bam $workDir/${sample}.align.bam --reference $FA --known-sites $KN1 --known-sites $KN2 --known-sites $KN3 --known-sites $KN4 --recal-table $workDir/${sample}.recal.table

# Step 3: variant calling
$DCS_HOME/bin/dcs variantCaller --threads $nthreads --input $workDir/${sample}.align.bam --output $workDir/${sample}.variantCaller.g.vcf.gz --reference $FA --recal-table  $workDir/${sample}.recal.table

# Step 4: genotype calling
$DCS_HOME/bin/dcs genotyper --threads $nthreads --input $workDir/${sample}.variantCaller.g.vcf.gz --output $workDir/${sample}.genotyper.vcf.gz --dbsnp $KN2 -s 30.0

# Step 5: generate the final report
$DCS_HOME/bin/dcs vcf-stat --vcf $workDir/${sample}.genotyper.vcf.gz >$workDir/vcfStat.txt

$DCS_HOME/bin/dcs report --sample ${sample} --filter $workDir/${sample}.qcdir --bam_stat $workDir/bamStat.txt --vcf_stat $workDir/vcfStat.txt --output $workDir/report

After modifying the environment variables in the file, simply run it directly.

# Modify the values of DCS_HOME and PROXY_HOST.
vi quick_start_wgs.sh
sh quick_start_wgs.sh
Figure 4.1 WGS Analysis Pipeline

4.2 WES

The steps are largely the same as WGS, with the only difference being the variantCaller step, which requires an additional interval parameter (-L) specifying the WES bed file.

# Step 3: variant calling
$DCS_HOME/bin/dcs variantCaller --threads $nthreads --input $workDir/${sample}.align.bam --output $workDir/${sample}.variantCaller.g.vcf.gz --reference $FA --recal-table  $workDir/${sample}.recal.table -L <wes.target.bed>

4.3 Joint Calling

The reference script is quick_start_jointcalling.sh; as with WGS, modify the DCS_HOME and PROXY_HOST environment variables. Additionally, dcs jointcaller requires JAVA_HOME to point to a JDK 8 installation and the spark/bin directory (Spark 2.4.5 or later) to be added to the PATH environment variable.

#!/bin/bash

# This script provides a quick start guide for using the dcs jointcaller to perform joint calling on multiple gvcf files
# if you have any questions, please contact us at cloud@stomics.tech


# set the location of the dcs tool
export DCS_HOME=/path/to/dcs


# dcs jointcaller is a wrapper of DPGT, which is a spark java application
# set JAVA_HOME and add spark/bin to PATH
export JAVA_HOME=/path/to/jdk8
export PATH=/path/to/spark/bin:$PATH


# Before using the dcs tool, you must obtain a license file and start the authentication proxy program 
# on a node that can connect to the internet.


# Get the local IP address, which is used for license application. 
# The output data field of the following command indicates the IP address.
# curl -X POST https://uat-cloud.stomics.tech/api/licensor/licenses/anon/echo?cliIp=true


# start the proxy program
# LICENSE=<path/to/license/file>
# DCS_HOST=0.0.0.0:8909
# $DCS_HOME/bin/licproxy --command start --license $LICENSE --bind $DCS_HOST --log-level INFO


# Set the following environment variables to your desired values
export PROXY_HOST=127.0.0.1:8909


# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST


# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST


# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")


# Set the following variables for dcs jointcaller
# gvcf directory
GVCF_DIR=$SCRIPT_DIR/data/gvcf
# target region
REGION=chr20:1000000-2000000
# reference fasta file
FA=$SCRIPT_DIR/data/ref/chr20.fa


# number of threads
nthreads=4
# java heap max memory
memory_size=8g


# Step 0: Set the working directory
workDir=$SCRIPT_DIR/result_jointcalling
mkdir -p $workDir


# Step 1: Set input file and output dir
# dcs jointcaller input file
GVCF_FOFN=$workDir/gvcf.fofn
ls ${GVCF_DIR}/*gz > ${GVCF_FOFN}
# dcs jointcaller output directory
OUT_DIR=$workDir/result


# Step 2: Run dcs jointcaller
$DCS_HOME/bin/dcs jointcaller \
    --input ${GVCF_FOFN} \
    --output ${OUT_DIR} \
    --reference ${FA} \
    --jobs ${nthreads} \
    --memory ${memory_size} \
    -l ${REGION}

4.4 SeqArc

The reference script is quick_start_seqarc.sh, similar to WGS. Modify the DCS_HOME and PROXY_HOST environment variables.

#!/bin/bash


# set the location of the dcs tool
export DCS_HOME=/path/to/dcs


# Set the following environment variables to your desired values
export PROXY_HOST=127.0.0.1:8909


# check the status of the proxy program
$DCS_HOME/bin/dcs lickit --ping $PROXY_HOST


# query the available resources provided by the certification
$DCS_HOME/bin/dcs lickit --query $PROXY_HOST




# Get the directory where the script is located
SCRIPT_PATH=$(realpath "$0" 2>/dev/null || readlink -f "$0" 2>/dev/null || echo "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")


nthreads=2
FQ1=$SCRIPT_DIR/data/fastq/simu.r1.fq.gz
FA=$SCRIPT_DIR/data/ref/chr20.fa
ARCFA=$SCRIPT_DIR/data/arcref/
mkdir -p $ARCFA




workDir=$SCRIPT_DIR/result_seqarc
mkdir -p $workDir


# Step 1: create index
$DCS_HOME/bin/dcs SeqArc index -r $FA -o $ARCFA
 
# Step 2: compress 
$DCS_HOME/bin/dcs SeqArc compress -t $nthreads -r $ARCFA/chr20.fa -i $FQ1 -o $workDir/test 


# step 3 : decompress
$DCS_HOME/bin/dcs SeqArc decompress -t $nthreads -r $ARCFA/chr20.fa -i $workDir/test.arc -o $workDir/simu.r1.fq

5. Release Notes

v1.1.0

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | Resolved memory leak issue in aligner |
| WGS/WES | Bug Fix | Fixed aligner hanging when building indexes for certain ref.fasta files |
| SeqArc | Feature Development | Implemented cross-platform instruction sets using simde |
| SeqArc | Feature Development | Added standalone decompression tool |
| SeqArc | Feature Development | Implemented statistical functionality for gene libraries |

v1.0.1

Modified modules: aligner, report

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Bug Fix | Fixed a coredump issue in the aligner under certain scenarios |
| WGS/WES | Bug Fix | Fixed an issue where the aligner output file was a pure filename without a path |
| WGS/WES | Bug Fix | Renamed the BAM index file generated by the aligner to *.bam.bai instead of *.bai |
| WGS/WES | Bug Fix | Fixed a potential issue in the cumulative depth frequency plot in the report module |
| WGS/WES | Bug Fix | Added set -e to the WGS script in the quick_start package to exit immediately on error |

v1.0.0

| Module | Type | Description |
| --- | --- | --- |
| WGS/WES | Feature | Added alignment and variant-related statistics functionality |
| SeqArc | Feature | Restructured the project framework according to the AVS-G standard |
| SeqArc | Feature | Added compression functionality for long-read data |
| SeqArc | Feature | Adjusted parameter parsing, added subcommands, and long/short parameters |

6. Notes and Common Issues

6.1 WGS/WES

  1. During alignment, ensure the reference sequence index file is prepared in advance to avoid errors.
  2. The BQSR step outputs only the recal.table by default. To generate a base quality-recalibrated BAM file, set the --out-bam parameter.
  3. For WES alignment statistics, there are two methods: specify the interval parameter (-L) in variantCaller as the WES BED file, or run the bam-stat tool separately with the --bed parameter set. If no BED file is provided, statistics default to the whole genome, leading to inaccurate coverage and depth results.
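
The standalone bam-stat method can be sketched as follows. File names are placeholders, and the command is printed rather than executed:

```shell
# bam-stat with a WES target BED, using the parameters from Section 3.10.2.
CMD_BAMSTAT="$DCS_HOME/bin/dcs bam-stat --bam sample.align.bam --bed wes.target.bed --output bamStat.txt"
echo "$CMD_BAMSTAT"
```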

6.2 Joint Calling

  1. Input GVCF files must include TBI index files.
  2. The reference genome FASTA file requires FAI and DICT files.
  3. For large sample sizes, allocate sufficient memory to avoid overflow (refer to the runtime examples in Section 7).

6.3 SeqArc

  1. For compression, input FASTQ filenames must have the following suffixes: .fq, .fastq, .fq.gz, or .fastq.gz. Otherwise, the file type will be considered invalid.

  2. Input reference sequence files must be text files with the suffixes .fa or .fasta. Otherwise, the file type will be considered invalid.

  3. For decompression, input ARC files must be generated by this program. Otherwise, the file type will be considered invalid.

  4. The reference sequence file used for decompression must match the one used for compression. If they differ, an error will occur: [ref load md5 error].

  5. If the FASTQ file is incomplete, an error will occur: [read fastq error].

  6. If the seq and qual lengths are unequal, an error will occur: [seqlen is not equal quallen].

  7. For compression or decompression, file read/write errors will trigger: [read file err] or [write file err].

  8. During decompression, if a data block fails verification, an error will occur: [block check fail]. This may indicate compression anomalies or incorrect data storage.

  9. Set ulimit -n to avoid runtime crashes, which may generate core dumps.
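
A minimal sketch for checking and raising the open-file limit in the current shell before running SeqArc (65535 is only an example value; the appropriate limit depends on your data and system):

```shell
# Show the current soft limit on open file descriptors.
CUR=$(ulimit -n)
echo "current open-file soft limit: $CUR"

# Attempt to raise it for this shell session; this can fail if the
# hard limit is lower, which may require root or PAM configuration.
ulimit -n 65535 2>/dev/null || echo "raising the limit may require root or a higher hard limit"
```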

6.4 License Proxy

Tool Issue: DCS Tools or lickit cannot connect to licproxy

If you encounter error messages like:

  • connection refused
  • couldn't connect to server

Please follow these troubleshooting steps:

  1. On the machine running DCS Tools or lickit, check connectivity to the License proxy using:
curl -v http://$PROXY_HOST/api/licensor/licenses/anon/pool

Replace $PROXY_HOST with your licproxy's internal IP address and port (e.g., 172.20.9.149:8909), NOT the public IP address you used when obtaining the License file!

If DCS Tools/lickit and licproxy are on the same machine, you may use 127.0.0.1:8909 for $PROXY_HOST.

The expected return should include Success. If it prompts that the connection cannot be established, please proceed to check whether the License proxy process exists by following the steps below.

  2. Check whether the process exists on the machine where licproxy is deployed
ps aux | grep licproxy

If the process does not exist, follow the steps in Section 1.4 Software Deployment. If the process exists, ask your IT operations team to confirm that the machines running DCS Tools or lickit can reach the machine hosting licproxy, as described in Step 1, and to open the network connection if it is unavailable.

  3. Check whether the script or command line for submitting tasks correctly sets the variable PROXY_HOST

Please verify whether the script or command line you are using overrides the environment variable PROXY_HOST. For example, if licproxy is deployed at 172.20.9.149:8909, ensure you set the following before running the script or command line:

export PROXY_HOST=172.20.9.149:8909
./path/to/your
# or
PROXY_HOST=172.20.9.149:8909 /path/to/your

Common License Authentication Error Messages:

  • Permission denied (No permission or authentication failed)
  • Size must be greater than 0 (The number of submitted threads must be greater than zero)
  • Pool size exceeded (The number of submitted threads exceeds the current thread limit)
  • License expired (The license has expired)
  • Unknown error (Unknown error)

7. Performance and Runtime

7.1 WGS

On an Alibaba Cloud ECS i4g.8xlarge instance (32 cores, 128 GB RAM) with SSD storage, DCS Tools completed the FASTQ-to-VCF process in the following times for different datasets:

| Data | Runtime (h) | Threads | Max Memory (GB) |
| --- | --- | --- | --- |
| HG001 30X NovaSeq | 1.82 | 32 | 101 |
| HG001 30X DNBSEQ-G400 | 1.91 | 32 | 100 |
| HG005 40X DNBSEQ-G400 | 2.45 | 32 | 120 |

7.2 SeqArc

On an Alibaba Cloud ECS c6a.xlarge instance (4 cores, 8GB RAM), compression and decompression tests were performed on NA12878 (30X WGS) data.

Compression Results:

| Input Data | Runtime | Compression Ratio (vs. gz) |
| --- | --- | --- |
| r1.fq.gz (27G) | 27m 24s | 5.17 |
| r2.fq.gz (30G) | 29m 14s | 4.10 |

Decompression Results:

| Input Data | Runtime |
| --- | --- |
| r1.fq.arc | 14m 32s |
| r2.fq.arc | 14m 25s |

7.3 Joint Calling

Performance tests on the 1KGP dataset and internal large-scale population datasets showed the following results for dcs jointcaller (DPGT):

| Data | Samples | Processes (6GB RAM each) | Core Hours | Runtime |
| --- | --- | --- | --- | --- |
| 1KGP | 2504 | 256 | 6963 | 27.2 hours |
| Internal 10K | 9165 | 256 | 21376 | 83.5 hours |
| Internal 52K | 52000 | 4400 | 407760 | ~6 days |
| Internal 105K | 105000 | 3675 | 1026155 | ~19 days |