Details
1. Command Line Interface
❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...
ML-QTL: Machine Learning for QTL Analysis
Options:
--help Show this message and exit.
Commands:
gff2range Convert GFF3 file to plink gene range format
gtf2range Convert GTF file to plink gene range format
importance Calculate feature importance and plot bar chart
rerun Re-run sliding window analysis with new parameters
run Run ML-QTL analysis
The command line interface provides 5 subcommands:
gff2range: Convert GFF3 file to plink gene range format
gtf2range: Convert GTF file to plink gene range format
importance: Calculate feature importance and plot bar chart
rerun: Re-run sliding window analysis with new parameters
run: Run ML-QTL analysis
The main command is run, which is used to perform the analysis. You can view the help information for this command using the following command:
❯ mlqtl run --help
Usage: mlqtl run [OPTIONS]
Run ML-QTL analysis
Options:
-g, --geno TEXT Path to genotype file (plink binary format)
[required]
-p, --pheno PATH Path to phenotype file [required]
-r, --range PATH Path to plink gene range file [required]
-o, --out PATH Path to output directory [required]
-j, --jobs INTEGER Number of processes to use [default: 1]
--threshold FLOAT Significance threshold
-w, --window INTEGER Sliding window size [default: 100]
--step INTEGER Sliding window step size [default: 10]
-m, --model TEXT Model to use [default:
DecisionTreeRegressor,RandomForestRegressor,SVR]
-c, --chrom TEXT Chromosome to analyze
--trait TEXT Trait to analyze
--onehot Use one-hot encoding for categorical features
--adaptive_threshold Use adaptive threshold when cannot find significant
genes
--padj Use adjusted p-value for significance threshold
--help Show this message and exit.
The required parameters are:
-g, –geno: Path to the genotype file in plink binary format, for example:
# plink files are data/genotype.bed, data/genotype.bim, data/genotype.fam -g data/genotype
-p, –pheno: Path to the phenotype file in TSV format. The first column should be sample names, followed by phenotype values. The file must include a header, and the first column header must be “sample”, for example:
sample trait1 trait2 sample1 1.2 3.4 sample2 2.3 4.5
-r, –range: Path to the plink gene range file in TSV format. The first column is the chromosome, the second column is the start position, the third column is the end position, the fourth column is the transcript name (not used in analysis, included for compatibility and readability), and the fifth column is the gene name. No header is required. Example:
1 1983 10815 Os01t0100100-01 Os01g0100100 1 10218 12435 Os01t0100200-01 Os01g0100200 1 11372 13284 Os01t0100300-00 Os01g0100300 1 11721 15685 Os01t0100400-01 Os01g0100400
Other parameters can be adjusted as needed:
-o, –out: Path to the output directory where all result files will be saved
-j, –jobs: Number of processes to use, default is 1
–threshold: Significance threshold, default is 2.74e-5
-w, –window: Sliding window size, default is 100
–step: Sliding window step size, default is 10
-m, –model: Regression model to use. Options are “DecisionTreeRegressor” (Decision Tree Regressor), “RandomForestRegressor” (Random Forest Regressor), and “SVR” (Support Vector Regressor). Default is to use all. If using other models, ensure the corresponding library is installed and specify the full path to the model class, for example:
-m sklearn.linear_model.LinearRegression
-c, –chrom: Chromosome to analyze, default is “all”, meaning all chromosomes
–trait: Trait to analyze, default is “all”, meaning all traits
–onehot: Whether to use one-hot encoding for SNPs, default is False
–adaptive_threshold: Whether to use adaptive threshold if no significant genes are found, default is False
–padj: Whether to use adjusted p-value for significance threshold, default is False
When using the command line interface, you can adjust the parameters as needed to perform the analysis. For example, the following command analyzes the grain length trait using the default models, 64 processes:
❯ mlqtl run -g imputed -p grain_length.txt -r gene_location_range.txt --trait grain_length -j 64 --padj --threshold 2.74e-5 -o result
========================================
ML-QTL Analysis Parameters
========================================
Genotype file: imputed
Phenotype file: grain_length.txt
Gene range file: ../gene_location_range.txt
Output directory: result
Number of processes: 64
Significance threshold: 2.74e-05
Sliding window size: 100
Sliding window step size: 10
Model(s): DecisionTreeRegressor,RandomForestRegressor,SVR
Chromosome: all chromosomes
Trait: grain_length
One-hot encoding: disabled
Adaptive threshold: disabled
Use adjusted p-value: enabled
========================================
==> Starting Analysis ...
==> Genome-wide significance Threshold: 2.74e-05
==> Analyzing Trait: grain_length
==> Training Model ...
grain_length [####################################] 37828/37828 100%
==> Processing Training Result ...
==> Result Graph [result/grain_length/sliding_window.png]
==> Significant Genes Table [result/grain_length/significant_genes.tsv]
==> Training Result Pkl [result/grain_length/train_res.pkl]
==> Training Result Table [result/grain_length/train_res.tsv]
Analysis completed
The result files will be saved in the specified output directory, including:
result/{trait}/sliding_window.png: Sliding window plot
result/{trait}/significant_genes.tsv: List of significant genes
result/{trait}/train_res.pkl: Raw model training results, type List[List[MatrixFloat64 | None]]
result/{trait}/train_res.tsv: Processed model training results in DataFrame format
The importance command is used to calculate feature importance and plot a bar chart. You can view the help information for this command using the following command:
❯ mlqtl importance --help
Usage: mlqtl importance [OPTIONS]
Calculate feature importance and plot bar chart
Options:
-g, --geno TEXT Path to genotype file (plink binary format) [required]
-p, --pheno PATH Path to phenotype file [required]
-r, --range PATH Path to plink gene range file [required]
-o, --out PATH Output directory [required]
--gene TEXT Gene name ( only one gene ) [required]
-m, --model TEXT Model to use [default:
DecisionTreeRegressor,RandomForestRegressor,SVR]
--trait TEXT Trait name ( only one trait ) [required]
--onehot Use one-hot encoding for categorical features
--help Show this message and exit.
After analyzing data using the run command, you can use the rerun command to re-run the sliding window analysis with new parameters, avoiding the need to retrain the model. You can view the help information for this command using the following command:
❯ mlqtl rerun --help
Usage: mlqtl rerun [OPTIONS]
Re-run sliding window analysis with new parameters
Options:
-f, --file PATH Path to the training result file (dataframe)
[required]
-w, --window INTEGER Sliding window size [default: 100]
-s, --step INTEGER Sliding window step size [default: 10]
-t, --threshold FLOAT Significance threshold [required]
-o, --out PATH Output directory [required]
--help Show this message and exit.
-f, –file: Path to the training result file in DataFrame format, typically result/{trait}/train_res.tsv, generated by the run command
2. Python API
If you prefer to use the Python API, you can import and use ML-QTL as follows:
2.1 Dataset Class
The Dataset
class is used to load and process data. You can import and use it as follows:
from mlqtl.data import Dataset
When instantiating the Dataset
class, you need to provide at least the paths to the genotype file and the gene range file. Example:
geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
dataset = Dataset(geno, gene_range)
You can use the get
method of the Dataset
class to retrieve SNP data for a gene. Example:
>>> snp_data = dataset.get("Os01g0100100")
>>> snp_data[0]
(0, array([0, 0, 0, ..., 2, 2, 0], shape=(3024,), dtype=int8))
The result is a tuple where the first element is the SNP index, and the second element is a NumPy array representing binary SNP data. Use the Dataset.snp.base
method to convert it to character SNP data. Example:
>>> dataset.snp.base(*snp_data[0])
array(['GG', 'GG', 'GG', ..., 'AA', 'AA', 'GG'],
shape=(3024,), dtype='<U2')
If you need to encode SNPs, you can use the Dataset.snp.encode
method. Example:
>>> snp_data = dataset.get("Os01g0100100")
>>> dataset.snp.encode(snp_data)
array([[2, 2, 3, ..., 4, 3, 4],
[2, 2, 3, ..., 4, 3, 4],
[2, 2, 3, ..., 4, 3, 4],
...,
[1, 2, 3, ..., 4, 2, 2],
[1, 2, 3, ..., 4, 2, 2],
[2, 2, 3, ..., 4, 3, 4]], shape=(3024, 71), dtype=int8)
# One-hot encoding
>>> snp_encoded = dataset.snp.encode(snp_data, onehot=True)
>>> snp_encoded.shape
(3024, 71, 4)
>>> snp_encoded[0][0]
array([0., 1., 0., 0.])
Use the get_hap
method of the Dataset
class to retrieve haplotype data for a gene. Example:
>>> hap_data = dataset.get_hap("Os01g0100100")
>>> hap_data
hap seq count
rank
1 Hap1 GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC... 2084
2 Hap2 GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC... 135
3 Hap3 AGCAGTCAGGCAGTCGACACGTTGCGAGGGACGAGCGTGATGGTAC... 134
4 Hap4 AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG... 80
5 Hap5 AGTGAGCTGGTGATCGGTATGTAGCGGGGAGCGGGTACAGTACTTG... 71
... ... ... ...
258 Hap258 AGTGATCTGATGGTCGGTACGTWGCGGGGAACGGGTACAGTACTAG... 1
259 Hap259 AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG... 1
260 Hap260 AGTGATCTGATGGTNGGTACGTAGCGGGGAANGGGTACAGTACTAG... 1
261 Hap261 AGTGATCTGATGGTCNGTACGTANCGGGGAANGGGTACAGTACTAG... 1
262 Hap262 AGTGATCTGGTGGTCGGTACGTAGTAGGGAACGGGTACAGTACNAG... 1
[262 rows x 3 columns]
If you want to analyze phenotype data, you also need to provide the path to the phenotype file when instantiating. Example:
geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
pheno = "data/merge_15_traits.txt"
dataset = Dataset(geno, gene_range, pheno)
2.2 Training Models
Import the model classes you want to use
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
Import the model training functions
from mlqtl.train import train
from mlqtl.train import train_with_progressbar
The tool provides two training functions, train
and train_with_progressbar
. The former does not display a progress bar, while the latter does.
Use the training function to train the models
models = [DecisionTreeRegressor, RandomForestRegressor, SVR]
trait = "grain_length"
workers = 4
train_res = train_with_progressbar(trait, models, dataset, workers)
2.3 Processing Training Results
The training results are a list of type List[List[MatrixFloat64 | None]]
, where each element corresponds to the training results of a model. You can process the training results as follows:
from mlqtl.datautils import proc_train_res
padj = True
result = proc_train_res(train_res, models, dataset, padj)
The result is a pandas DataFrame containing the training results for each gene.
Calculate sliding window
from mlqtl.datautils import sliding_window
window = 100
step = 10
threshold = 2.74e-5
sw_res, sig_genes = sliding_window(result, window, step, threshold)
Plot sliding window graph
import os
from mlqtl.plot import plot_graph
# If you are using a Jupyter Notebook environment, you can use the following code to plot the graph
plot_graph(sw_res, threshold)
# If you want to save the graph to a file, specify the save path
plot_path = os.path.join(".", f"{trait}_sliding_window")
plot_graph(sw_res, threshold, plot_path, save=True)
The generated image will be saved in the current directory with the filename grain_length_sliding_window.png.
2.4 Calculating Feature Importance
If you need to calculate feature importance, you can use the following method:
from mlqtl.train import feature_importance
onehot = False
gene = "Os01g0100100"
trait = "grain_length"
importance_df = feature_importance(gene, trait, models, dataset, onehot)
Visualize feature importance:
from mlqtl.plot import plot_feature_importance
top_N = 10
# Jupyter Notebook environment
plot_feature_importance(importance_df, top_N)
# Save to file
plot_path = os.path.join(".", f"{gene}_{trait}")
plot_feature_importance(importance_df, top_N, True, plot_path)
The above demonstrates how to use the Python API of this tool for analysis. For more details, refer to the comments and documentation in the code.