详细

1. 命令行界面

❯ mlqtl --help
Usage: mlqtl [OPTIONS] COMMAND [ARGS]...

ML-QTL: Machine Learning for QTL Analysis

Options:
--help  Show this message and exit.

Commands:
gff2range   Convert GFF3 file to plink gene range format
gtf2range   Convert GTF file to plink gene range format
importance  Calculate feature importance and plot bar chart
rerun       Re-run sliding window analysis with new parameters
run         Run ML-QTL analysis

命令行有5个子命令：

gff2range: 将GFF3文件转换为plink基因范围格式
gtf2range: 将GTF文件转换为plink基因范围格式
importance: 计算特征重要性并绘制条形图
rerun: 使用新的参数重新运行滑动窗口分析
run: 运行ML-QTL分析

其中主要的命令是 run，它用于执行分析。你可以通过以下命令查看该命令的帮助信息：

❯ mlqtl run --help
Usage: mlqtl run [OPTIONS]

Run ML-QTL analysis

Options:
-g, --geno TEXT       Path to genotype file (plink binary format)
                        [required]
-p, --pheno PATH      Path to phenotype file  [required]
-r, --range PATH      Path to plink gene range file  [required]
-o, --out PATH        Path to output directory  [required]
-j, --jobs INTEGER    Number of processes to use  [default: 1]
--threshold FLOAT     Significance threshold
-w, --window INTEGER  Sliding window size  [default: 100]
--step INTEGER        Sliding window step size  [default: 10]
-m, --model TEXT      Model to use  [default:
                        DecisionTreeRegressor,RandomForestRegressor,SVR]
-c, --chrom TEXT      Chromosome to analyze
--trait TEXT          Trait to analyze
--onehot              Use one-hot encoding for categorical features
--adaptive_threshold  Use adaptive threshold when cannot find significant
                        genes
--padj                Use adjusted p-value for significance threshold
--help                Show this message and exit.

其中必须的参数有：

-g, --geno: plink二进制格式的基因型文件的路径，例如：

# plink文件为 data/genotype.bed , data/genotype.bim, data/genotype.fam
-g data/genotype

-p, --pheno: 表型文件的路径，格式为tsv，第一列为样本名，后续列为表型值，需要包含表头，第一列的列名需要为"sample", 例如：
sample trait1 trait2 sample1 1.2 3.4 sample2 2.3 4.5
-r, --range: plink基因范围文件的路径，格式为tsv，第一列为染色体，第二列为起始位置，第三列为结束位置，第四列为转录本名（此列内容不会实际用到，存在的原因是为了兼容plink命令行和增加可读性），第五列为基因名，不需要表头，格式如下：
1 1983 10815 Os01t0100100-01 Os01g0100100 1 10218 12435 Os01t0100200-01 Os01g0100200 1 11372 13284 Os01t0100300-00 Os01g0100300 1 11721 15685 Os01t0100400-01 Os01g0100400

其他参数可以根据需要调整：

-o, --out: 输出目录的路径，所有结果文件将保存在此目录下
-j, --jobs: 使用的进程数，默认为1
--threshold: 显著性阈值，默认为2.74e-5
-w, --window: 滑动窗口大小，默认为100
--step: 滑动窗口步长，默认为10
-m, --model: 使用的回归模型, 可选值为 "DecisionTreeRegressor"（决策树回归）、"RandomForestRegressor"（随机森林回归）、"SVR"（支持向量回归），默认全部使用。如果使用其他的模型，请确保已安装相应的库，并使用完整的路径来指定模型类，例如：
-m sklearn.linear_model.LinearRegression
-c, --chrom: 要分析的染色体，默认为"all"，表示分析所有染色体
--trait: 要分析的性状，默认为"all"，表示分析所有性状
--onehot: 是否对snp使用独热编码，默认为False
--adaptive_threshold: 如果无法找到显著基因，是否使用自适应阈值，默认为False
--padj: 是否使用校正后的P值，默认为False

使用命令行界面时，你可以根据需要调整参数来执行分析。例如，以下命令将使用默认模型分析粒长性状，64个进程

❯ mlqtl run -g imputed -p grain_length.txt -r gene_location_range.txt --trait grain_length -j 64 --padj --threshold 2.74e-5 -o result

========================================
    ML-QTL Analysis Parameters
========================================
Genotype file:            imputed
Phenotype file:           grain_length.txt
Gene range file:          ../gene_location_range.txt
Output directory:         result
Number of processes:      64
Significance threshold:   2.74e-05
Sliding window size:      100
Sliding window step size: 10
Model(s):                 DecisionTreeRegressor,RandomForestRegressor,SVR
Chromosome:               all chromosomes
Trait:                    grain_length
One-hot encoding:         disabled
Adaptive threshold:       disabled
Use adjusted p-value:     enabled
========================================

==> Starting Analysis ...
==> Genome-wide significance Threshold: 2.74e-05
==> Analyzing Trait: grain_length
==> Training Model ...
grain_length  [####################################]  37828/37828  100%
==> Processing Training Result ...
==> Result Graph [result/grain_length/sliding_window.png]
==> Significant Genes Table [result/grain_length/significant_genes.tsv]
==> Training Result Pkl [result/grain_length/train_res.pkl]
==> Training Result Table [result/grain_length/train_res.tsv]
Analysis completed

结果文件将保存在指定的输出目录下，主要包括：

result/{trait}/sliding_window.png: 滑动窗口图
result/{trait}/significant_genes.tsv: 显著基因列表
result/{trait}/train_res.pkl: 原始的模型训练结果，类型为List[List[MatrixFloat64 | None]]
result/{trait}/train_res.tsv: 处理过的模型训练结果的DataFrame格式

其次是 importance 命令，用于计算特征重要性并绘制条形图。你可以通过以下命令查看该命令的帮助信息：

❯ mlqtl importance --help
Usage: mlqtl importance [OPTIONS]

Calculate feature importance and plot bar chart

Options:
-g, --geno TEXT   Path to genotype file (plink binary format)  [required]
-p, --pheno PATH  Path to phenotype file  [required]
-r, --range PATH  Path to plink gene range file  [required]
-o, --out PATH    Output directory  [required]
--gene TEXT       Gene name ( only one gene )  [required]
-m, --model TEXT  Model to use  [default:
                    DecisionTreeRegressor,RandomForestRegressor,SVR]
--trait TEXT      Trait name ( only one trait )  [required]
--onehot          Use one-hot encoding for categorical features
--help            Show this message and exit.

当你使用 run 命令分析完数据后, 可以使用 rerun 命令使用新的参数重新运行滑动窗口分析，从而避免重新训练模型。你可以通过以下命令查看该命令的帮助信息：

❯ mlqtl rerun --help
Usage: mlqtl rerun [OPTIONS]

Re-run sliding window analysis with new parameters

Options:
-f, --file PATH        Path to the training result file (dataframe)
                        [required]
-w, --window INTEGER   Sliding window size  [default: 100]
-s, --step INTEGER     Sliding window step size  [default: 10]
-t, --threshold FLOAT  Significance threshold  [required]
-o, --out PATH         Output directory  [required]
--help                 Show this message and exit.

-f, --file: dataframe格式的训练结果文件路径，通常是 result/{trait}/train_res.tsv, 即由 run 命令生成的结果文件

2. Python API

如果你希望使用Python API，可以通过以下方式导入并使用ML-QTL：

2.1 Dataset类

Dataset 类用于加载和处理数据。你可以通过以下方式导入并使用它：

from mlqtl.data import Dataset

实例化 Dataset 类时，需要至少提供基因型文件和基因范围文件的路径。示例：

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
dataset = Dataset(geno, gene_range)

可以通过 Dataset 类的 get 方法来获取一个基因上的snp数据。示例：

>>> snp_data = dataset.get("Os01g0100100")
>>> snp_data[0]
(0, array([0, 0, 0, ..., 2, 2, 0], shape=(3024,), dtype=int8))

获得的结果是一个元组，第一个元素是snp的索引，第二个元素是一个NumPy数组，表示二进制的snp数据，使用 Dataset.snp.base 方法来转换为字符符型snp数据。示例：

>>> dataset.snp.base(*snp_data[0])
array(['GG', 'GG', 'GG', ..., 'AA', 'AA', 'GG'],
    shape=(3024,), dtype='<U2')

如果需要将snp编码，可以使用 Dataset.snp.encode 方法。示例：

>>> snp_data = dataset.get("Os01g0100100")
>>> dataset.snp.encode(snp_data)
array([[2, 2, 3, ..., 4, 3, 4],
    [2, 2, 3, ..., 4, 3, 4],
    [2, 2, 3, ..., 4, 3, 4],
    ...,
    [1, 2, 3, ..., 4, 2, 2],
    [1, 2, 3, ..., 4, 2, 2],
    [2, 2, 3, ..., 4, 3, 4]], shape=(3024, 71), dtype=int8)

# onehot编码
>>> snp_encoded = dataset.snp.encode(snp_data, onehot=True)
>>> snp_encoded.shape
(3024, 71, 4)

>>> snp_encoded[0][0]
array([0., 1., 0., 0.])

使用 Dataset 类的 get_hap 方法可以获取一个基因上的单倍型数据。示例：

>>> hap_data = dataset.get_hap("Os01g0100100")
>>> hap_data
        hap                                                seq  count
rank
1       Hap1  GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC...   2084
2       Hap2  GGCGGTCTAGCGGCTAGCACGCTACGGGGGATTGATGCGGCGGAAC...    135
3       Hap3  AGCAGTCAGGCAGTCGACACGTTGCGAGGGACGAGCGTGATGGTAC...    134
4       Hap4  AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG...     80
5       Hap5  AGTGAGCTGGTGATCGGTATGTAGCGGGGAGCGGGTACAGTACTTG...     71
...      ...                                                ...    ...
258   Hap258  AGTGATCTGATGGTCGGTACGTWGCGGGGAACGGGTACAGTACTAG...      1
259   Hap259  AGTGATCTGATGGTCGGTACGTAGCGGGGAACGGGTACAGTACTAG...      1
260   Hap260  AGTGATCTGATGGTNGGTACGTAGCGGGGAACGGGTACAGTACTAG...      1
261   Hap261  AGTGATCTGATGGTCNGTACGTANCGGGGAANGGGTACAGTACTAG...      1
262   Hap262  AGTGATCTGGTGGTCGGTACGTAGTAGGGAACGGGTACAGTACNAG...      1

[262 rows x 3 columns]

如果要用于分析表型数据，你在实例化时还需要提供表型文件的路径。示例：

geno = "data/imputed"
gene_range = "data/gene_location_range.txt"
pheno = "data/merge_15_traits.txt"
dataset = Dataset(geno, gene_range, pheno)

2.2 训练模型

导入要使用的模型类

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

导入模型训练函数

from mlqtl.train import train
from mlqtl.train import train_with_progressbar

工具提供了两个训练函数，train 和 train_with_progressbar，前者不显示进度条，后者会显示进度条

使用训练函数进行模型训练

models = [DecisionTreeRegressor, RandomForestRegressor, SVR]
trait = "grain_length"
workers = 4
train_res = train_with_progressbar(trait, models, dataset, workers)

2.3 处理训练结果

训练结果是一个列表, 类型为 List[List[MatrixFloat64 | None]]，其中每个元素对应一个模型的训练结果。你可以通过以下方式处理训练结果：

from mlqtl.datautils import proc_train_res
padj = True
result = proc_train_res(train_res, models, dataset, padj)

result是一个pandas DataFrame，包含了每个基因的训练结果

计算滑动窗口

from mlqtl.datautils import sliding_window
window = 100
step = 10
threshold = 2.74e-5
sw_res, sig_genes = sliding_window(result, window, step, threshold)

绘制滑动窗口图

import os
from mlqtl.plot import plot_graph

# 如果你使用的是Jupyter Notebook环境，可以使用以下代码来绘制图形
plot_graph(sw_res, threshold)

# 如果你希望将图形保存到文件，可以指定保存路径
plot_path = os.path.join(".", f"{trait}_sliding_window")
plot_graph(sw_res, threshold, plot_path, save=True)

生成的图像将保存在当前目录下，文件名为 grain_length_sliding_window.png

2.4 计算特征重要性

如果你需要计算特征重要性，可以使用以下方法：

from mlqtl.train import feature_importance
onehot = False
gene = "Os01g0100100"
trait = "grain_length"
importance_df = feature_importance(gene, trait, models, dataset, onehot)

可视化特征重要性：

from mlqtl.plot import plot_feature_importance
top_N = 10

# Jupyter Notebook环境
plot_feature_importance(importance_df, top_N)

# 保存到文件
plot_path = os.path.join(".", f"{gene}_{trait}")
plot_feature_importance(importance_df, top_N, True, plot_path)

以上展示了如何使用此工具的Python API来进行分析，更多细节可以参考代码中的注释和文档。