Details

1. Command Line Interface

❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...

ML-QTL: Machine Learning for QTL Analysis

Options:
--help  Show this message and exit.

Commands:
gff2range   Convert GFF3 file to plink gene range format
gtf2range   Convert GTF file to plink gene range format
importance  Calculate feature importance and plot bar chart
rerun       Re-run sliding window analysis with new parameters
run         Run ML-QTL analysis

The command line interface provides 5 subcommands:

gff2range: Convert GFF3 file to plink gene range format
gtf2range: Convert GTF file to plink gene range format
importance: Calculate feature importance and plot bar chart
rerun: Re-run sliding window analysis with new parameters
run: Run ML-QTL analysis

The main command is run, which is used to perform the analysis. You can view the help information for this command using the following command:

❯ mlqtl run --help
Usage: mlqtl run [OPTIONS]

Run ML-QTL analysis

Options:
-g, --geno TEXT       Path to genotype file (plink binary format)
                        [required]
-p, --pheno PATH      Path to phenotype file  [required]
-r, --range PATH      Path to plink gene range file  [required]
-o, --out PATH        Path to output directory  [required]
-j, --jobs INTEGER    Number of processes to use  [default: 1]
--threshold FLOAT     Significance threshold
-w, --window INTEGER  Sliding window size  [default: 100]
--step INTEGER        Sliding window step size  [default: 10]
-m, --model TEXT      Model to use  [default:
                        DecisionTreeRegressor,RandomForestRegressor,SVR]
-c, --chrom TEXT      Chromosome to analyze
--trait TEXT          Trait to analyze
--onehot              Use one-hot encoding for categorical features
--adaptive_threshold  Use adaptive threshold when cannot find significant
                        genes
--padj                Use adjusted p-value for significance threshold
--help                Show this message and exit.

The required parameters are:

-g, –geno: Path to the genotype file in plink binary format, for example:

# plink files are data/genotype.bed, data/genotype.bim, data/genotype.fam
-g data/genotype

-p, –pheno: Path to the phenotype file in TSV format. The first column should be sample names, followed by phenotype values. The file must include a header, and the first column header must be “sample”, for example:
sample trait1 trait2 sample1 1.2 3.4 sample2 2.3 4.5
-r, –range: Path to the plink gene range file in TSV format. The first column is the chromosome, the second column is the start position, the third column is the end position, the fourth column is the transcript name (not used in analysis, included for compatibility and readability), and the fifth column is the gene name. No header is required. Example:
1 1983 10815 Os01t0100100-01 Os01g0100100 1 10218 12435 Os01t0100200-01 Os01g0100200 1 11372 13284 Os01t0100300-00 Os01g0100300 1 11721 15685 Os01t0100400-01 Os01g0100400

Other parameters can be adjusted as needed:

-o, –out: Path to the output directory where all result files will be saved
-j, –jobs: Number of processes to use, default is 1
–threshold: Significance threshold, default is 2.74e-5
-w, –window: Sliding window size, default is 100
–step: Sliding window step size, default is 10
-m, –model: Regression model to use. Options are “DecisionTreeRegressor” (Decision Tree Regressor), “RandomForestRegressor” (Random Forest Regressor), and “SVR” (Support Vector Regressor). Default is to use all. If using other models, ensure the corresponding library is installed and specify the full path to the model class, for example:
-m sklearn.linear_model.LinearRegression
-c, –chrom: Chromosome to analyze, default is “all”, meaning all chromosomes
–trait: Trait to analyze, default is “all”, meaning all traits
–onehot: Whether to use one-hot encoding for SNPs, default is False
–adaptive_threshold: Whether to use adaptive threshold if no significant genes are found, default is False
–padj: Whether to use adjusted p-value for significance threshold, default is False

When using the command line interface, you can adjust the parameters as needed to perform the analysis. For example, the following command analyzes the grain length trait using the default models, 64 processes:

❯ mlqtl run -g imputed -p grain_length.txt -r gene_location_range.txt --trait grain_length -j 64 --padj --threshold 2.74e-5 -o result

========================================
    ML-QTL Analysis Parameters
========================================
Genotype file:            imputed
Phenotype file:           grain_length.txt
Gene range file:          ../gene_location_range.txt
Output directory:         result
Number of processes:      64
Significance threshold:   2.74e-05
Sliding window size:      100
Sliding window step size: 10
Model(s):                 DecisionTreeRegressor,RandomForestRegressor,SVR
Chromosome:               all chromosomes
Trait:                    grain_length
One-hot encoding:         disabled
Adaptive threshold:       disabled
Use adjusted p-value:     enabled
========================================

==> Starting Analysis ...
==> Genome-wide significance Threshold: 2.74e-05
==> Analyzing Trait: grain_length
==> Training Model ...
grain_length  [####################################]  37828/37828  100%
==> Processing Training Result ...
==> Result Graph [result/grain_length/sliding_window.png]
==> Significant Genes Table [result/grain_length/significant_genes.tsv]
==> Training Result Pkl [result/grain_length/train_res.pkl]
==> Training Result Table [result/grain_length/train_res.tsv]
Analysis completed

The result files will be saved in the specified output directory, including:

result/{trait}/sliding_window.png: Sliding window plot
result/{trait}/significant_genes.tsv: List of significant genes
result/{trait}/train_res.pkl: Raw model training results, type List[List[MatrixFloat64 | None]]
result/{trait}/train_res.tsv: Processed model training results in DataFrame format

The importance command is used to calculate feature importance and plot a bar chart. You can view the help information for this command using the following command:

❯ mlqtl importance --help
Usage: mlqtl importance [OPTIONS]

Calculate feature importance and plot bar chart

Options:
-g, --geno TEXT   Path to genotype file (plink binary format)  [required]
-p, --pheno PATH  Path to phenotype file  [required]
-r, --range PATH  Path to plink gene range file  [required]
-o, --out PATH    Output directory  [required]
--gene TEXT       Gene name ( only one gene )  [required]
-m, --model TEXT  Model to use  [default:
                    DecisionTreeRegressor,RandomForestRegressor,SVR]
--trait TEXT      Trait name ( only one trait )  [required]
--onehot          Use one-hot encoding for categorical features
--help            Show this message and exit.

After analyzing data using the run command, you can use the rerun command to re-run the sliding window analysis with new parameters, avoiding the need to retrain the model. You can view the help information for this command using the following command:

❯ mlqtl rerun --help
Usage: mlqtl rerun [OPTIONS]

Re-run sliding window analysis with new parameters

Options:
-f, --file PATH        Path to the training result file (dataframe)
                        [required]
-w, --window INTEGER   Sliding window size  [default: 100]
-s, --step INTEGER     Sliding window step size  [default: 10]
-t, --threshold FLOAT  Significance threshold  [required]
-o, --out PATH         Output directory  [required]
--help                 Show this message and exit.

-f, –file: Path to the training result file in DataFrame format, typically result/{trait}/train_res.tsv, generated by the run command

2. Python API

If you prefer to use the Python API, you can import and use ML-QTL as follows:

2.1 Dataset Class

The Dataset class is used to load and process data. You can import and use it as follows:

from mlqtl.data import Dataset

When instantiating the Dataset class, you need to provide at least the paths to the genotype file and the gene range file. Example:

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
dataset = Dataset(geno, gene_range)

You can use the get method of the Dataset class to retrieve SNP data for a gene. Example:

>>> snp_data = dataset.get("Os01g0100100")
>>> snp_data[0]
(0, array([0, 0, 0, ..., 2, 2, 0], shape=(3024,), dtype=int8))

The result is a tuple where the first element is the SNP index, and the second element is a NumPy array representing binary SNP data. Use the Dataset.snp.base method to convert it to character SNP data. Example:

>>> dataset.snp.base(*snp_data[0])
array(['GG', 'GG', 'GG', ..., 'AA', 'AA', 'GG'],
    shape=(3024,), dtype='<U2')

If you need to encode SNPs, you can use the Dataset.snp.encode method. Example:

>>> snp_data = dataset.get("Os01g0100100")
>>> dataset.snp.encode(snp_data)
array([[2, 2, 3, ..., 4, 3, 4],
    [2, 2, 3, ..., 4, 3, 4],
    [2, 2, 3, ..., 4, 3, 4],
    ...,
    [1, 2, 3, ..., 4, 2, 2],
    [1, 2, 3, ..., 4, 2, 2],
    [2, 2, 3, ..., 4, 3, 4]], shape=(3024, 71), dtype=int8)

# One-hot encoding
>>> snp_encoded = dataset.snp.encode(snp_data, onehot=True)
>>> snp_encoded.shape
(3024, 71, 4)

>>> snp_encoded[0][0]
array([0., 1., 0., 0.])

Use the get_hap method of the Dataset class to retrieve haplotype data for a gene. Example:

>>> hap_data = dataset.get_hap("Os01g0100100")
>>> hap_data
        hap                                                seq  count
rank
1       Hap1  GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC...   2084
2       Hap2  GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC...    135
3       Hap3  AGCAGTCAGGCAGTCGACACGTTGCGAGGGACGAGCGTGATGGTAC...    134
4       Hap4  AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG...     80
5       Hap5  AGTGAGCTGGTGATCGGTATGTAGCGGGGAGCGGGTACAGTACTTG...     71
...      ...                                                ...    ...
258   Hap258  AGTGATCTGATGGTCGGTACGTWGCGGGGAACGGGTACAGTACTAG...      1
259   Hap259  AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG...      1
260   Hap260  AGTGATCTGATGGTNGGTACGTAGCGGGGAANGGGTACAGTACTAG...      1
261   Hap261  AGTGATCTGATGGTCNGTACGTANCGGGGAANGGGTACAGTACTAG...      1
262   Hap262  AGTGATCTGGTGGTCGGTACGTAGTAGGGAACGGGTACAGTACNAG...      1

[262 rows x 3 columns]

If you want to analyze phenotype data, you also need to provide the path to the phenotype file when instantiating. Example:

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
pheno = "data/merge_15_traits.txt"
dataset = Dataset(geno, gene_range, pheno)

2.2 Training Models

Import the model classes you want to use

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

Import the model training functions

from mlqtl.train import train
from mlqtl.train import train_with_progressbar

The tool provides two training functions, train and train_with_progressbar. The former does not display a progress bar, while the latter does.

Use the training function to train the models

models = [DecisionTreeRegressor, RandomForestRegressor, SVR]
trait = "grain_length"
workers = 4
train_res = train_with_progressbar(trait, models, dataset, workers)

2.3 Processing Training Results

The training results are a list of type List[List[MatrixFloat64 | None]], where each element corresponds to the training results of a model. You can process the training results as follows:

from mlqtl.datautils import proc_train_res
padj = True
result = proc_train_res(train_res, models, dataset, padj)

The result is a pandas DataFrame containing the training results for each gene.

Calculate sliding window

from mlqtl.datautils import sliding_window
window = 100
step = 10
threshold = 2.74e-5
sw_res, sig_genes = sliding_window(result, window, step, threshold)

Plot sliding window graph

import os
from mlqtl.plot import plot_graph

# If you are using a Jupyter Notebook environment, you can use the following code to plot the graph
plot_graph(sw_res, threshold)

# If you want to save the graph to a file, specify the save path
plot_path = os.path.join(".", f"{trait}_sliding_window")
plot_graph(sw_res, threshold, plot_path, save=True)

The generated image will be saved in the current directory with the filename grain_length_sliding_window.png.

2.4 Calculating Feature Importance

If you need to calculate feature importance, you can use the following method:

from mlqtl.train import feature_importance
onehot = False
gene = "Os01g0100100"
trait = "grain_length"
importance_df = feature_importance(gene, trait, models, dataset, onehot)

Visualize feature importance:

from mlqtl.plot import plot_feature_importance
top_N = 10

# Jupyter Notebook environment
plot_feature_importance(importance_df, top_N)

# Save to file
plot_path = os.path.join(".", f"{gene}_{trait}")
plot_feature_importance(importance_df, top_N, True, plot_path)

The above demonstrates how to use the Python API of this tool for analysis. For more details, refer to the comments and documentation in the code.