Details

1. Command Line Interface

❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...

mlQTL: Machine Learning for QTL Analysis

Options:
--help  Show this message and exit.

Commands:
gff2range   Convert GFF3 file to plink gene range format
gtf2range   Convert GTF file to plink gene range format
importance  Calculate feature importance and plot bar chart
run         Run mlQTL analysis

The command line interface provides 5 subcommands:

gff2range: Convert GFF3 file to plink gene range format
gtf2range: Convert GTF file to plink gene range format
importance: Calculate feature importance and plot bar chart
run: Run mlQTL analysis

The main command is run, which is used to perform the analysis. You can view the help information for this command using the following command:

❯ mlqtl run --help
Usage: mlqtl run [OPTIONS]

  Run mlQTL analysis

Options:
  -g, --geno TEXT              Path to genotype file (plink binary format) [required]
  -p, --pheno PATH             Path to phenotype file  [required]
  -r, --range PATH             Path to plink gene range file  [required]
  -o, --out PATH               Path to output directory  [required]
  -j, --jobs INTEGER           Number of processes to use  [default: 1]
  -m, --model TEXT             Model to use  [default: DecisionTreeRegressor,RandomForestRegressor,SVR]
  -c, --chrom TEXT             Chromosome to analyze
  --trait TEXT                 Trait to analyze
  --onehot                     Use one-hot encoding for categorical features
  --padj                       Use adjusted p-value for significance threshold (default: True)
  --center-window-kb INTEGER   Window radius in kilobases (kb) for symmetric neighborhood (e.g., 400 for ±400kb) [default:400]
  --center-step-genes INTEGER  Step size in number of genes for center-based window [default: 10]
  --q FLOAT                    Quantile used as window score(e.g. 0.9 = 90% quantile) [default: 0.9]
  --top-prop FLOAT             Top proportion of windows genome-wide selected as QTL for --newmethod (e.g. 0.10 = top 10%) [default: 0.1]
  --help                       Show this message and exit.

Required Parameters:

-g, --geno: Path to the genotype file in plink binary format.

# plink files are data/genotype.bed, data/genotype.bim, data/genotype.fam
-g data/genotype

-p, --pheno: Path to the phenotype file in TSV format. The first column should be sample names, followed by phenotype values. The file must include a header, and the first column header must be "sample".
```
sample  trait1  trait2
sample1 1.2 3.4
sample2 2.3 4.5
```
-r, --range: Path to the plink gene range file in TSV format. Columns: Chromosome, Start, End, Transcript Name, Gene Name. No header required.
```
1   1983    10815   Os01t0100100-01 Os01g0100100
1   10218   12435   Os01t0100200-01 Os01g010020
```

Optional Parameters:

-o, --out: Path to the output directory.
-j, --jobs: Number of processes (default: 1).
-m, --model: Regression model(s). Options: DecisionTreeRegressor, RandomForestRegressor, SVR.
--center-window-kb: Defines the window radius in kilobases (kb). The window extends this distance both upstream and downstream from the center to calculate gene signals within the region. The default value is 400 kb (covering a total range of 400kb).
--center-step-genes: The step size for window movement, measured in number of genes. This defines how many genes the center moves forward during each sliding window iteration. The default value is 10 genes.
--q: The quantile used to calculate the window score. For example, a value of 0.9 means the 90th percentile of gene scores within the window is taken as the window's overall rating to highlight strong signals. The default value is 0.9.
--top-prop: The proportion of windows to be selected as candidate QTLs. For instance, 0.1 indicates that the top 10% of windows with the highest scores genome-wide will be selected as QTL regions. The default value is 0.1.
--onehot: Enable one-hot encoding for SNPs (default: False).

Example Run:

❯ mlqtl run -g imputed_base_filtered_v0.7 -p grain_length.txt -r gene_location_range.txt -j 64 -o result

Output files in result/{trait}/:

sliding_window.png: Visualization of results.
candidate_genes.tsv: List of candidate genes.
train_res.tsv: Full processed training results.
qtl_regions.tsv: List of QTL.

Feature importance subcommand:

❯ mlqtl importance --help
Usage: mlqtl importance [OPTIONS]

  Calculate feature importance and plot bar chart

Options:
  -g, --geno TEXT   Path to genotype file (plink binary format)  [required]
  -p, --pheno PATH  Path to phenotype file  [required]
  -r, --range PATH  Path to plink gene range file  [required]
  -o, --out PATH   Output directory  [required]
  --gene TEXT      Gene name (only one gene)  [required]
  -m, --model TEXT  Model to use  [default: DecisionTreeRegressor,RandomForestRegressor,SVR]
  --trait TEXT     Trait name (only one trait)  [required]
  --onehot         Use one-hot encoding for categorical features
  --help          Show this message and exit.

Example for feature importance:

❯ mlqtl importance --geno imputed_base_filtered_v0.7 --pheno grain_length.txt --range gene_location_range.txt -o ./feature_importance --gene Os03g0407400 -m RandomForestRegressor --trait grain_length

Output files in ./feature_importance/Os03g0407400/:

Os03g0407400_grain_length_feature_importance.tsv: Feature importance values for each SNP.
Os03g0407400_grain_length_RandomForestRegressor.png: Bar plot of feature importance

2. Python API

Dataset Class

The Dataset class manages data loading.

When instantiating the Dataset class, you need to provide at least the paths to the genotype file and the gene range file.

from mlqtl.data import Dataset

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
dataset = Dataset(geno, gene_range)

Retrieve SNP data:

>>> snp_data = dataset.get("Os01g0100100")
>>> snp_data[0]
(0, array([0, 0, 0, ..., 2, 2, 0], shape=(3024,), dtype=int8))
>>> dataset.snp.base(*snp_data[0])
array(['GG', 'GG', 'GG', ..., 'AA', 'AA', 'GG'], dtype='<U2')

Retrieve Haplotype data:

>>> hap_data = dataset.get_hap("Os01g0100100")

If you want to analyze phenotype data, you also need to provide the path to the phenotype file when instantiating.

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
pheno = "data/traits.txt"
dataset = Dataset(geno, gene_range, pheno)

Training Models

You can utilize standard scikit-learn models through the training functions.

from sklearn.tree import DecisionTreeRegressor
from mlqtl.train import train_with_progressbar

models = [DecisionTreeRegressor]
trait = "grain_length"
train_res = train_with_progressbar(trait, models, dataset, workers=4)

Processing Training Results

Convert raw model outputs into a readable format and analyze the sliding window.

from mlqtl.datautils import proc_train_res, sliding_window_newmethod
from mlqtl.plot import plot_graph

# 1. Process results
result = proc_train_res(train_res, models, dataset, padj=True)

# 2. Calculate sliding window
sw_res, sig_genes, window_threshold = sliding_window_newmethod(result, 400,10,0.9,0.1)

# 3. Plot
plot_graph(sw_res, 10 ** (-window_threshold), "./plot.png", save=True)

Calculating Feature Importance

Identify which SNPs contribute most to the model's predictions.

from mlqtl.train import feature_importance
from mlqtl.plot import plot_feature_importance

importance_df = feature_importance("Os01g0100100", "grain_length", models, dataset)
plot_feature_importance(importance_df, top_N=10)