What Exactly Is a "GC Content - Sequencing Depth" Heatmap? How to Interpret It? And How to Plot It?
1. In genomic sequencing data analysis, whether GC content affects sequencing depth is a crucial and unavoidable question: high-GC regions often suffer from insufficient depth due to low amplification efficiency, while low-GC regions may produce false positives due to over-amplification. Such biases directly affect the accuracy of variant detection and copy number analysis.
In the heatmap above, the x-axis shows the target regions binned by GC content, and the y-axis shows the normalized depth of the regions in each bin (i.e., regional depth / average depth). Yellow areas mark the highest density of regions at a given depth, while red and then blue areas mark progressively lower density.
2. The target regions in each Panel typically include both high-GC and low-GC regions, so depth fluctuations in those regions affect the uniformity of the entire Panel. In hybrid capture, both the hybridization temperature and the elution temperature have a significant impact on sequencing depth. Customers can use the heatmap to examine how depth changes across regions with different GC content and quickly identify anomalies in the experimental procedure.
From a practical bioinformatics perspective, this article walks step by step through producing the "GC Content - Sequencing Depth Distribution Plot". Reusable R code is attached at the end of the article, so even beginners can reproduce the plot.
First, clarify the analysis goal: What problem are we trying to solve?
This analysis focuses on the "probe capture sequencing" scenario (applicable to exome sequencing, Panel sequencing, etc.). The core goal is to verify whether differences in GC content cause abnormal read depth in the target regions captured by the probes.
To achieve this goal, we first need to obtain two key datasets:
GC content of each probe region (reflecting the base composition characteristics of the region);
Normalized sequencing depth of each probe region (eliminating overall sample differences and focusing on depth comparison between regions).
All subsequent operations will revolve around obtaining these two types of data.
Complete the Analysis in 3 Steps: From Data Acquisition to Plot Generation
Step 1: Calculate the GC Content of Each Probe Region
First, clarify the calculation logic of GC content: only count the proportion of G and C among non-N bases (A/T/G/C), to avoid interference from unknown bases (N) on the results.
Calculation formula: GC Content = (Number of G Bases + Number of C Bases) ÷ Total Valid Base Count (A+T+G+C) × 100%
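To make the formula concrete, here is a minimal Python sketch that applies it to a single sequence string. It is only an illustration of the arithmetic, not the bedtools command used in this workflow, and the function name gc_content is our own placeholder.

```python
def gc_content(seq: str) -> float:
    """Return GC content in percent, counted over valid bases (A/T/G/C) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    valid = gc + seq.count("A") + seq.count("T")
    if valid == 0:
        raise ValueError("sequence contains no valid A/T/G/C bases")
    return gc / valid * 100.0


# N bases are left out of both the numerator and the denominator.
print(gc_content("GGCCAATTNN"))
```

Because N is excluded from the denominator, a region with many unknown bases is not pushed toward an artificially low GC value.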
No manual counting is needed: this can be done with a single bedtools command, which maximizes efficiency.
After running the command, the gc_content.txt file will contain key data such as the position information of probe regions and GC content, which can be directly used in subsequent steps.
Step 2: Calculate the Normalized Depth of Each Probe Region
To calculate the normalized depth of each probe region, we first compute the raw average depth of each probe region, then divide it by the overall average depth of the sample. This removes the effect of differences in total sequencing volume between samples, so that depths from different samples become comparable.
We use the samtools depth tool to calculate the raw depth, then complete normalization through simple calculations.
Finally, we will obtain a core data table with 5 columns, namely: Chromosome, Probe Start Position, Probe End Position, GC Content, and Normalized Depth.
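The calculation in this step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the exact commands of this workflow: regions are assumed to be BED-style (0-based, half-open) and already annotated with GC content, per-base depth is assumed to be given as (chromosome, 1-based position, depth) records, positions missing from that input are treated as depth 0, and the "overall average depth" is taken as the mean depth over all target bases. The function name normalized_depth_table is hypothetical.

```python
def normalized_depth_table(regions, per_base_depth):
    """Build rows of (chrom, start, end, gc, normalized_depth).

    regions: list of (chrom, start, end, gc), BED-style 0-based half-open intervals.
    per_base_depth: iterable of (chrom, pos, depth) with 1-based pos;
        positions that are absent are treated as depth 0.
    """
    depth_at = {(chrom, pos): depth for chrom, pos, depth in per_base_depth}

    raw_rows = []
    total_depth = 0
    total_bases = 0
    for chrom, start, end, gc in regions:
        # A 0-based half-open interval covers 1-based positions start+1 .. end.
        region_sum = sum(depth_at.get((chrom, p), 0) for p in range(start + 1, end + 1))
        length = end - start
        raw_rows.append((chrom, start, end, gc, region_sum / length))
        total_depth += region_sum
        total_bases += length

    # Assumption: "overall average depth" = mean depth over all target bases.
    overall_mean = total_depth / total_bases
    return [(c, s, e, gc, raw / overall_mean) for c, s, e, gc, raw in raw_rows]


# Toy input: one region at 30x and one at 10x, so the overall mean is 20x.
regions = [("chr1", 0, 4, 0.45), ("chr1", 10, 14, 0.72)]
per_base_depth = [("chr1", p, 30) for p in range(1, 5)] + [("chr1", p, 10) for p in range(11, 15)]
table = normalized_depth_table(regions, per_base_depth)
for row in table:
    print(*row, sep="\t")
```

In the toy input the two regions end up with normalized depths of 1.5 and 0.5, that is, above and below the sample-wide average of 1.0.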
Step 3: Plot a GC-Depth Scatter Plot Using R (with Complete Code)
With the core data in hand, the next step is to visualize the relationship between GC content and normalized depth. We use the LSD package to draw a heat scatter plot, in which the color of each point reflects the local data density and makes clustering trends easier to see, and we add reference lines to help judge whether the depth is normal.
Complete R code (can be directly copied and used)
After the code is run, the following chart will be generated (the color of each point reflects how many probes share that GC content-depth combination):
From the chart, we can intuitively observe the following:
When the GC content is between 0.4 and 0.6 (i.e., 40%-60%), the normalized depth is mostly concentrated around 1.0 (the normal range);
When the GC content is less than 0.4 or greater than 0.6, the depth tends to deviate from 1.0 (either lower or higher). This matches the expectation that sequencing depth is less stable in regions with extreme GC content.
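The same trend can also be checked numerically rather than by eye. The sketch below is our own illustration and not part of the R plotting code: it groups probes into 10% GC bins and reports the median normalized depth per bin. The toy values are invented for the example, and GC content is assumed to be given as a fraction between 0 and 1.

```python
from statistics import median


def median_depth_by_gc_bin(rows):
    """rows: iterable of (gc_fraction, normalized_depth).

    Returns a dict mapping a 10%-wide GC bin label to the median depth in that bin.
    """
    bins = {}
    for gc, depth in rows:
        # The small epsilon guards against float error at bin edges (e.g. 0.3 * 10);
        # gc == 1.0 is clamped into the last bin.
        idx = min(int(gc * 10 + 1e-9), 9)
        bins.setdefault(idx, []).append(depth)
    return {f"{i * 10}-{i * 10 + 10}%": median(v) for i, v in sorted(bins.items())}


# Invented toy data: (GC fraction, normalized depth) per probe.
rows = [(0.45, 1.0), (0.48, 1.5), (0.55, 1.0), (0.75, 0.25), (0.72, 0.75), (0.25, 0.5)]
summary = median_depth_by_gc_bin(rows)
for label, value in summary.items():
    print(label, value)
```

Bins whose median sits far from 1.0 correspond to the GC ranges where the points in the chart drift away from the reference line.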
Key Considerations
Adequate data preprocessing is essential:
When calculating GC content, it is critical to exclude probe regions with a high proportion of N (we recommend removing regions where N accounts for more than 10% of the bases, to prevent distorted results);
Before counting depth, ensure that the reads were adapter-trimmed and quality-filtered before alignment and that the BAM files have been deduplicated (otherwise, depth statistics will be inflated).
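As an illustration of the N filter described above, the following Python sketch keeps only regions whose N fraction does not exceed 10%. The helper names and the toy sequences are hypothetical; in practice the N counts would come from whatever tool produced the GC table.

```python
def n_fraction(seq: str) -> float:
    """Fraction of bases in seq that are N (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    return seq.upper().count("N") / len(seq)


def filter_regions(named_seqs, max_n_fraction=0.10):
    """Keep (name, seq) pairs whose N fraction is at most max_n_fraction."""
    return [(name, seq) for name, seq in named_seqs if n_fraction(seq) <= max_n_fraction]


# Toy input: probe_b is 20% N and is removed; probe_a (5% N) is kept.
probes = [("probe_a", "ACGTACGTACGTACGTACGN"), ("probe_b", "ACGTNNACGT")]
kept = filter_regions(probes)
print([name for name, _ in kept])
```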
Significance of normalized depth:
If "raw depth" is used directly for plotting, it will be affected by the total sequencing volume of each sample (for example, if the total sequencing volume of Sample A is twice that of Sample B, the raw depth will also roughly double). After normalization, the depths of different samples can be compared directly, which makes the comparison analytically meaningful.
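A small worked example makes this point concrete. The numbers below are invented, and for simplicity the sketch assumes equal-length regions so that the sample mean is just the average of the per-region depths.

```python
def normalize(raw_depths):
    """Divide each regional depth by the sample's mean depth."""
    mean_depth = sum(raw_depths) / len(raw_depths)
    return [d / mean_depth for d in raw_depths]


# Sample A was sequenced twice as deeply as Sample B (invented values).
sample_b = [50, 100, 150]
sample_a = [2 * d for d in sample_b]

norm_a = normalize(sample_a)
norm_b = normalize(sample_b)
print(norm_a)
print(norm_b)
```

The raw depths differ by a factor of two, but both samples produce the same normalized profile, so any remaining difference between samples reflects regional behavior rather than sequencing volume.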
Logic for chart interpretation:
Focus on depth points that deviate from 1.0. If the points in a certain GC interval (e.g., GC > 0.7) are largely concentrated below 0.5, this indicates low probe capture efficiency in that interval, and the experimental conditions should then be optimized (for example, by adjusting the PCR annealing temperature or switching to a polymerase suited to high-GC templates).
Through these 3 steps, we can not only verify the impact of GC content on sequencing depth but also pinpoint probe regions with abnormal capture efficiency, providing a direct basis for subsequent experimental optimization and data analysis calibration.
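The check described above can be expressed as a short Python sketch. The thresholds 0.7 and 0.5 are the example values from the text, while the function name and the toy data are hypothetical.

```python
def low_depth_fraction(rows, gc_min=0.7, depth_max=0.5):
    """Fraction of probes with GC > gc_min whose normalized depth is below depth_max.

    rows: iterable of (gc_fraction, normalized_depth).
    Returns None when no probe falls in the GC interval.
    """
    depths = [depth for gc, depth in rows if gc > gc_min]
    if not depths:
        return None
    return sum(1 for d in depths if d < depth_max) / len(depths)


# Invented toy data: three high-GC probes, two of them poorly covered.
rows = [(0.75, 0.3), (0.80, 0.4), (0.72, 0.9), (0.50, 1.0)]
fraction = low_depth_fraction(rows)
print(fraction)
```

A fraction close to 1 for a given GC interval suggests that the problem is systematic for that interval rather than limited to a few individual probes.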