Details
1. Command Line Interface
❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...
mlQTL: Machine Learning for QTL Analysis
Options:
--help Show this message and exit.
Commands:
gff2range Convert GFF3 file to plink gene range format
gtf2range Convert GTF file to plink gene range format
importance Calculate feature importance and plot bar chart
run Run mlQTL analysis
The command line interface provides four subcommands:
- gff2range: Convert GFF3 file to plink gene range format
- gtf2range: Convert GTF file to plink gene range format
- importance: Calculate feature importance and plot bar chart
- run: Run mlQTL analysis
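To make the conversion target concrete, the sketch below turns GFF3 `mRNA` records into the five-column range layout expected by `--range` (chromosome, start, end, transcript name, gene name). It is an illustration only, not the `gff2range` implementation: the function name `gff3_to_range_rows`, the choice of `mRNA` features, the use of the `ID` and `Parent` attributes, and the toy identifiers are all assumptions.

```python
def gff3_to_range_rows(gff3_text, feature_type="mRNA"):
    """Return (chrom, start, end, transcript, gene) tuples from GFF3 text.

    Hypothetical helper: assumes transcripts are `mRNA` features whose
    attributes carry `ID=<transcript>` and `Parent=<gene>`.
    """
    rows = []
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue  # keep only well-formed records of the chosen type
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        rows.append(
            (fields[0], int(fields[3]), int(fields[4]),
             attrs.get("ID", ""), attrs.get("Parent", ""))
        )
    return rows


# Toy GFF3 input with made-up identifiers.
example = (
    "##gff-version 3\n"
    "chr01\tsrc\tgene\t2983\t10815\t.\t+\t.\tID=geneA\n"
    "chr01\tsrc\tmRNA\t2983\t10815\t.\t+\t.\tID=geneA.t1;Parent=geneA\n"
)
for row in gff3_to_range_rows(example):
    print("\t".join(map(str, row)))
```

Each printed line has the tab-separated, header-free shape that the range file uses.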
The main command is run, which performs the analysis. To view its help information:
❯ mlqtl run --help
Usage: mlqtl run [OPTIONS]
Run mlQTL analysis
Options:
-g, --geno TEXT Path to genotype file (plink binary format) [required]
-p, --pheno PATH Path to phenotype file [required]
-r, --range PATH Path to plink gene range file [required]
-o, --out PATH Path to output directory [required]
-j, --jobs INTEGER Number of processes to use [default: 1]
-m, --model TEXT Model to use [default: DecisionTreeRegressor,RandomForestRegressor,SVR]
-c, --chrom TEXT Chromosome to analyze
--trait TEXT Trait to analyze
--onehot Use one-hot encoding for categorical features
--padj Use adjusted p-value for significance threshold (default: True)
--center-window-kb INTEGER Window radius in kilobases (kb) for symmetric neighborhood (e.g., 400 for ±400kb) [default:400]
--center-step-genes INTEGER Step size in number of genes for center-based window [default: 10]
--q FLOAT Quantile used as window score(e.g. 0.9 = 90% quantile) [default: 0.9]
--top-prop FLOAT Top proportion of windows genome-wide selected as QTL for --newmethod (e.g. 0.10 = top 10%) [default: 0.1]
--help Show this message and exit.
Required Parameters:
- -g, --geno: Path to the genotype file in plink binary format.
- -p, --pheno: Path to the phenotype file in TSV format. The first column holds sample names, followed by phenotype values. The file must include a header, and the first column header must be "sample".
- -r, --range: Path to the plink gene range file in TSV format. Columns: Chromosome, Start, End, Transcript Name, Gene Name. No header required.
- -o, --out: Path to the output directory.
Optional Parameters:
- -j, --jobs: Number of processes (default: 1).
- -m, --model: Regression model(s). Options: DecisionTreeRegressor, RandomForestRegressor, SVR (default: all three).
- -c, --chrom: Chromosome to analyze.
- --trait: Trait to analyze.
- --padj: Use the adjusted p-value for the significance threshold (default: True).
- --center-window-kb: The window radius in kilobases (kb). The window extends this distance both upstream and downstream from the center, and gene signals are calculated within that region. The default value is 400 kb, i.e. ±400 kb around the center, for a total span of 800 kb.
- --center-step-genes: The step size for window movement, measured in number of genes. This is how many genes the center moves forward at each sliding window iteration. The default value is 10 genes.
- --q: The quantile used to calculate the window score. For example, a value of 0.9 means the 90th percentile of gene scores within the window is taken as the window's overall score, which highlights strong signals. The default value is 0.9.
- --top-prop: The proportion of windows selected as candidate QTLs. For instance, 0.1 means that the top 10% of windows with the highest scores genome-wide are selected as QTL regions. The default value is 0.1.
- --onehot: Enable one-hot encoding for SNPs (default: False).
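The interplay of --center-window-kb, --center-step-genes, --q and --top-prop can be pictured with a small pure-Python sketch. It is not the mlQTL implementation: the function names, the nearest-rank quantile definition, and the toy gene positions and scores are assumptions made for illustration. Every `step` genes a gene becomes a window center, genes within `radius_kb` on either side are collected, the window is scored by the `q` quantile of their scores, and the top `top_prop` of windows are kept.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def score_windows(genes, radius_kb=400, step=10, q=0.9):
    """Score center-based windows on one chromosome.

    genes: list of (position_bp, score) tuples sorted by position.
    Returns a list of (center_bp, window_score) tuples.
    """
    radius_bp = radius_kb * 1000
    windows = []
    for i in range(0, len(genes), step):  # move the center `step` genes at a time
        center = genes[i][0]
        in_window = [s for pos, s in genes if abs(pos - center) <= radius_bp]
        windows.append((center, quantile(in_window, q)))
    return windows


def top_windows(windows, top_prop=0.1):
    """Keep the highest-scoring fraction of windows (at least one)."""
    keep = max(1, math.ceil(top_prop * len(windows)))
    return sorted(windows, key=lambda w: w[1], reverse=True)[:keep]


# Toy data: 40 genes spaced 100 kb apart, with a score peak near 2 Mb.
genes = [(i * 100_000, 5.0 if 18 <= i <= 22 else 1.0) for i in range(40)]
windows = score_windows(genes, radius_kb=400, step=10, q=0.9)
print(top_windows(windows, top_prop=0.25))  # [(2000000, 5.0)]
```

Using a high quantile rather than the mean lets a few strong genes lift the window score, which is the behaviour the --q description points at.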
Example Run:
❯ mlqtl run -g imputed_base_filtered_v0.7 -p grain_length.txt -r gene_location_range.txt -j 64 -o result
Output files in result/{trait}/:
- sliding_window.png: Visualization of results.
- candidate_genes.tsv: List of candidate genes.
- train_res.tsv: Full processed training results.
- qtl_regions.tsv: List of QTL regions.
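After a run, a quick check that all four files exist can save a confusing error further downstream. The helper below is a hypothetical convenience, not part of mlQTL. It assumes only the result/{trait}/ layout and the file names given here, plus a trait called grain_length for the example call.

```python
from pathlib import Path

# File names taken from the output listing; nothing else is assumed about them.
EXPECTED_FILES = (
    "sliding_window.png",
    "candidate_genes.tsv",
    "train_res.tsv",
    "qtl_regions.tsv",
)


def missing_outputs(out_dir, trait):
    """Return the expected output files that are absent from out_dir/trait."""
    trait_dir = Path(out_dir) / trait
    return [name for name in EXPECTED_FILES if not (trait_dir / name).is_file()]


# Example: report what is missing for the run above (trait name assumed).
print(missing_outputs("result", "grain_length"))
```

An empty list means every expected file is present.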
Feature importance subcommand:
❯ mlqtl importance --help
Usage: mlqtl importance [OPTIONS]
Calculate feature importance and plot bar chart
Options:
-g, --geno TEXT Path to genotype file (plink binary format) [required]
-p, --pheno PATH Path to phenotype file [required]
-r, --range PATH Path to plink gene range file [required]
-o, --out PATH Output directory [required]
--gene TEXT Gene name (only one gene) [required]
-m, --model TEXT Model to use [default: DecisionTreeRegressor,RandomForestRegressor,SVR]
--trait TEXT Trait name (only one trait) [required]
--onehot Use one-hot encoding for categorical features
--help Show this message and exit.
Example for feature importance:
❯ mlqtl importance --geno imputed_base_filtered_v0.7 --pheno grain_length.txt --range gene_location_range.txt -o ./feature_importance --gene Os03g0407400 -m RandomForestRegressor --trait grain_length
Output files in ./feature_importance/Os03g0407400/:
- Os03g0407400_grain_length_feature_importance.tsv: Feature importance values for each SNP.
- Os03g0407400_grain_length_RandomForestRegressor.png: Bar plot of feature importance.
2. Python API
Dataset Class
The Dataset class manages data loading.
To instantiate it, provide at least the paths to the genotype file and the gene range file.
from mlqtl.data import Dataset
geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
dataset = Dataset(geno, gene_range)
Retrieve SNP data:
>>> snp_data = dataset.get("Os01g0100100")
>>> snp_data[0]
(0, array([0, 0, 0, ..., 2, 2, 0], shape=(3024,), dtype=int8))
>>> dataset.snp.base(*snp_data[0])
array(['GG', 'GG', 'GG', ..., 'AA', 'AA', 'GG'], dtype='<U2')
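The two outputs above suggest that genotypes are stored as small integer codes and translated to base pairs on demand (here 0 maps to 'GG' and 2 maps to 'AA'). The sketch below mimics that idea in plain Python and also shows what one-hot encoding of such codes looks like. The function names, the meaning of code 1 (heterozygous), and the allele order are assumptions, not the mlqtl API.

```python
def decode_genotypes(codes, allele_a, allele_b):
    """Map 0/1/2 codes to two-letter genotypes.

    Assumed convention: 0 = allele_a homozygote, 1 = heterozygote,
    2 = allele_b homozygote.
    """
    table = {0: allele_a * 2, 1: allele_a + allele_b, 2: allele_b * 2}
    return [table[c] for c in codes]


def one_hot(codes, n_classes=3):
    """Expand each code into an indicator vector of length n_classes."""
    return [[1 if c == k else 0 for k in range(n_classes)] for c in codes]


codes = [0, 0, 2, 1]
print(decode_genotypes(codes, "G", "A"))  # ['GG', 'GG', 'AA', 'GA']
print(one_hot(codes))  # [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

One-hot encoding treats the three genotype classes as unordered categories instead of a 0/1/2 dosage, which is the distinction the --onehot flag refers to.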
Retrieve Haplotype data:
To analyze phenotype data, also pass the path to the phenotype file when instantiating:
geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
pheno = "data/traits.txt"
dataset = Dataset(geno, gene_range, pheno)
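The phenotype file layout required by --pheno (tab-separated, with a header row whose first column is named "sample") is easy to get wrong. The sketch below writes a tiny file in that layout and validates the header with the standard csv module. The helper name `check_pheno_file`, the sample IDs, the two trait columns, and the values are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def check_pheno_file(path):
    """Return the trait column names if the header is valid, else raise."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), None)
    if not header or header[0] != "sample":
        raise ValueError('first column header must be "sample"')
    if len(header) < 2:
        raise ValueError("no phenotype columns found")
    return header[1:]


# Write a toy phenotype file (hypothetical samples and values).
pheno_path = Path(tempfile.mkdtemp()) / "traits.txt"
pheno_path.write_text(
    "sample\tgrain_length\tgrain_width\n"
    "S001\t8.1\t3.2\n"
    "S002\t7.6\t3.0\n"
)
print(check_pheno_file(pheno_path))  # ['grain_length', 'grain_width']
```

Running such a check before constructing the Dataset gives a clearer error message than a failure deep inside training.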
Training Models
The training functions accept standard scikit-learn regressor classes.
from sklearn.tree import DecisionTreeRegressor
from mlqtl.train import train_with_progressbar
models = [DecisionTreeRegressor]
trait = "grain_length"
train_res = train_with_progressbar(trait, models, dataset, workers=4)
Processing Training Results
Convert raw model outputs into a readable format and analyze the sliding window.
from mlqtl.datautils import proc_train_res, sliding_window_newmethod
from mlqtl.plot import plot_graph
# 1. Process results
result = proc_train_res(train_res, models, dataset, padj=True)
# 2. Calculate sliding window (the values 400, 10, 0.9, 0.1 match the CLI defaults)
sw_res, sig_genes, window_threshold = sliding_window_newmethod(result, 400, 10, 0.9, 0.1)
# 3. Plot
plot_graph(sw_res, 10 ** (-window_threshold), "./plot.png", save=True)
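The `10 ** (-window_threshold)` expression in the plotting call suggests that the threshold is kept on a -log10 scale and converted back to a p-value scale for plotting. That reading is an inference from the snippet, not documented behaviour. The conversion itself is simple, and the helper names below are hypothetical:

```python
import math


def to_neg_log10(p_value):
    """Convert a p-value in (0, 1] to its -log10 score."""
    return -math.log10(p_value)


def from_neg_log10(score):
    """Convert a -log10 score back to a p-value."""
    return 10 ** (-score)


print(round(to_neg_log10(0.001), 6))  # 3.0
print(from_neg_log10(3))              # 0.001
```

On the -log10 scale larger values mean stronger signals, so a high quantile or a top-proportion cut picks out the most significant windows.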
Calculating Feature Importance
Identify which SNPs contribute most to the model's predictions.