mlqtl package

Subpackages

Submodules

mlqtl.cli module

mlqtl.datautils module

mlqtl.datautils.cal_padj(group: DataFrame) -> DataFrame

Calculate the adjusted p-values (padj) for a given group of p-values
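The reference does not say which correction cal_padj applies. As a point of orientation only, the sketch below shows the Benjamini-Hochberg adjustment, one common way to turn a group of p-values into padj values. The function name bh_adjust is hypothetical, and the choice of Benjamini-Hochberg is an assumption, not a statement about mlqtl's implementation.

```python
import numpy as np


def bh_adjust(pvalues) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (illustrative helper, not mlqtl code)."""
    p = np.asarray(pvalues, dtype=np.float64)
    n = p.size
    order = np.argsort(p)  # indices that sort the p-values ascending
    # Scale the i-th smallest p-value by n / i (ranks start at 1).
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity by taking the running minimum from the largest rank down.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n, dtype=np.float64)
    # Cap at 1 and restore the original order.
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

In a grouped DataFrame workflow, a helper like this would typically be applied once per group, which matches the group: DataFrame argument in the signature above.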

mlqtl.datautils.cal_sliding_window(met: DataFrame, chrom: str, window_size: int, step: int) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]

Sliding window to calculate the mean of the padj_norm values

Parameters

met : DataFrame

The DataFrame containing the padj_norm values

chrom : str

The chromosome to calculate the mean

window_size : int

The size of the window

step : int

The step size for the sliding window

Returns

MatrixFloat64

The mean of the padj_norm values for each window and the start and end positions
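The following is a minimal sketch of a sliding-window mean, assuming the window and step are measured in base pairs and that each gene contributes one position and one padj_norm value. The function name sliding_window_mean, the (start, end, mean) row layout, and the treatment of empty windows are assumptions made for illustration; the reference above does not specify them.

```python
import numpy as np


def sliding_window_mean(positions, values, window_size, step) -> np.ndarray:
    """Return rows of (start, end, mean) for half-open windows [start, end)."""
    if step <= 0:
        raise ValueError("step must be positive")
    positions = np.asarray(positions, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    rows = []
    start = positions.min()
    last = positions.max()
    while start <= last:
        end = start + window_size
        mask = (positions >= start) & (positions < end)
        # Assumption of this sketch: a window without genes gets a mean of 0.0.
        mean = values[mask].mean() if mask.any() else 0.0
        rows.append((start, end, mean))
        start += step
    return np.array(rows, dtype=np.float64).reshape(-1, 3)
```

With positions [0, 5, 12], values [1.0, 3.0, 5.0], a window of 10 and a step of 5, this yields three overlapping windows whose means are 2.0, 4.0, and 5.0.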

mlqtl.datautils.merge_window(window_mean: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], threshold: float64) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None

Merge genes in the same region

Parameters

window_mean : MatrixFloat64

The mean of the padj_norm values for each window and the start and end positions

threshold : np.float64

The threshold to filter the mean values

Returns

MatrixFloat64

The merged windows with start and end positions and the mean padj_norm value
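One plausible way to merge windows is sketched below: keep windows whose mean reaches the threshold, then fuse neighbours that overlap or touch. The function name merge_windows, the (start, end, mean) row layout, the use of >= for the threshold test, and averaging the window means within a merged region are all assumptions of this sketch rather than documented behaviour.

```python
from typing import Optional

import numpy as np


def merge_windows(windows: np.ndarray, threshold: float) -> Optional[np.ndarray]:
    """Merge overlapping (start, end, mean) rows whose mean is >= threshold."""
    kept = windows[windows[:, 2] >= threshold]
    if kept.shape[0] == 0:
        return None  # nothing passed the threshold
    kept = kept[np.argsort(kept[:, 0])]  # sort by start position
    merged = []
    start, end, means = kept[0, 0], kept[0, 1], [kept[0, 2]]
    for s, e, m in kept[1:]:
        if s <= end:
            # Overlapping or adjacent window: extend the current region.
            end = max(end, e)
            means.append(m)
        else:
            merged.append((start, end, float(np.mean(means))))
            start, end, means = s, e, [m]
    merged.append((start, end, float(np.mean(means))))
    return np.array(merged, dtype=np.float64)
```

Returning None when no window passes the threshold mirrors the `| None` in the return annotation above.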

mlqtl.datautils.proc_train_res(train_res: List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]], models: List[RegressorMixin], dataset: Dataset, padj: bool = False) -> DataFrame

Integrate the results from different chunks and calculate padj

Parameters

train_res : List[List[MatrixFloat64 | None]]

The results from the regression models; each matrix holds one (pcc, pval) row per model

models : List[RegressorMixin]

The list of regression models

dataset : Dataset

The dataset containing the information

Returns

DataFrame

The integrated DataFrame containing the correlation, p-value, padj, and gene information

mlqtl.datautils.significance(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], result: DataFrame, threshold: float64) -> DataFrame

Get the genes in the peak window of the green region

Parameters

sliding_window_result : List[Tuple[str, MatrixFloat64, MatrixFloat64]]

Results of the sliding window calculation

result : DataFrame

Integrated training results

threshold : np.float64

The threshold to filter the candidate genes

Returns

DataFrame

Gene table of the green region in the graph

mlqtl.datautils.skip_padj(group: DataFrame) -> DataFrame

Skip the padj calculation for a given group of p-values

mlqtl.datautils.sliding_window(result: DataFrame, window: int, step: int, threshold: float64) -> Tuple[List[Tuple[str_, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], DataFrame]

Convert the training results to a DataFrame, calculate the sliding-window means, and merge significant regions

Parameters

result : DataFrame

The integrated training results containing the correlation, p-value, and gene information

window : int

The size of the window

step : int

The step size for the sliding window

threshold : np.float64

The threshold to filter the mean values

Returns

sliding_window_result : List[Tuple[np.str_, MatrixFloat64, MatrixFloat64]]

The sliding window results: a list of tuples, each containing the chromosome, the mean values for each window, and the merged windows

significant_genes : DataFrame

The significant genes in the green region of the graph

mlqtl.nda_typing module

mlqtl.nda_typing.MatrixFloat64

2D array (matrix) of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int, int]]

mlqtl.nda_typing.MatrixInt8

2D array (matrix) of 8-bit signed integers.

alias of Annotated[ndarray[tuple[int, …], dtype[int8]], typing.Tuple[int, int]]

mlqtl.nda_typing.TensorFloat64

3D array (tensor) of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int, int, int]]

mlqtl.nda_typing.VectorBool

1D array of NumPy booleans.

alias of Annotated[ndarray[tuple[int, …], dtype[bool]], typing.Tuple[int]]

mlqtl.nda_typing.VectorFloat64

1D array of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int]]

mlqtl.nda_typing.VectorInt8

1D array of 8-bit signed integers.

alias of Annotated[ndarray[tuple[int, …], dtype[int8]], typing.Tuple[int]]

mlqtl.nda_typing.VectorStr

1D array of NumPy Unicode strings.

alias of Annotated[ndarray[tuple[int, …], dtype[str_]], typing.Tuple[int]]
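The aliases above pair an ndarray type with a tuple that hints at the number of dimensions. The sketch below rebuilds one such alias with typing.Annotated and numpy.typing.NDArray to show how the pieces fit together. The name MatrixF64 and the helper column_means are hypothetical, and NDArray[np.float64] is used as a stand-in for the fully expanded ndarray type shown in the reference.

```python
from typing import Annotated, Tuple, get_args

import numpy as np
from numpy.typing import NDArray

# Hypothetical re-creation of a 2D float64 alias; the Tuple[int, int] metadata
# documents the expected rank but is not enforced at runtime.
MatrixF64 = Annotated[NDArray[np.float64], Tuple[int, int]]


def column_means(matrix: MatrixF64) -> np.ndarray:
    """Mean of each column; the annotation is documentation only."""
    return matrix.mean(axis=0)


# Annotated keeps the base type and the metadata, which tools can inspect.
base_type, shape_hint = get_args(MatrixF64)
```

Because the shape metadata is only an annotation, passing a 1D or 3D array to column_means would not raise a type error at runtime; static checkers and readers are the intended audience.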

mlqtl.plot module

mlqtl.plot.plot_feature_importance(feature_importance: DataFrame, topn: int = 10, save: bool = False, filename: str = 'feature_importance') -> None

Plot the feature importance of SNPs in different models

mlqtl.plot.plot_graph(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], threshold: float64, filename: str = 'result', font_size: int = 20, save: bool = False) -> None

Plot the sliding window result

mlqtl.train module

class mlqtl.train.MLMetrics

Bases: object

update(y, y_hat)

mlqtl.train.feature_importance(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) -> DataFrame

Train a single gene and trait and return the feature importance of its SNPs

Parameters

gene : str

The gene to train

trait : str

The trait to train

dataset : Dataset

The dataset to train

models : List[RegressorMixin]

The list of models to train

onehot : bool

Whether the SNP data is one-hot encoded

Returns

DataFrame

A pandas DataFrame containing the feature importance for each model, with the SNP markers as columns and the model names as rows

mlqtl.train.init_worker(dataset: Dataset) -> None

Initialize the worker with the dataset for multiprocessing

mlqtl.train.train(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) -> List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]]

Train models on the given dataset using multiprocessing

Parameters

trait : str

The trait to train

onehot : bool

Whether the SNP data is one-hot encoded

models : List[RegressorMixin]

The list of models to train

dataset : Dataset

The dataset to train

max_workers : int

The number of workers to use for multiprocessing

Returns

List[List[MatrixFloat64 | None]]

A list of lists containing the correlation and p-value for each model for each gene, shape of MatrixFloat64 is (n_models, 2)

mlqtl.train.train_batch(X: Annotated[ndarray[tuple[int, ...], dtype[int8]], Tuple[int, int]] | Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int, int]], y: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int]], onehot: bool, models: List[RegressorMixin], importance: bool = False) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]

Train a batch of models on the given data

Parameters

X : MatrixInt8 | TensorFloat64

The encoded SNP data

y : VectorFloat64

The trait values

onehot : bool

Whether the SNP data is one-hot encoded

models : List[RegressorMixin]

The list of models to train

importance : bool

Whether to calculate feature importance

Returns

MatrixFloat64

  • importance == False: a matrix of shape (n_models, 2) containing the correlation and p-value for each model

  • importance == True: a matrix of shape (n_models, n_features) containing the feature importance for each model
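The X parameter accepts either a 2D int8 matrix or a 3D float64 tensor, which suggests that the one-hot form adds a category axis to the integer-coded genotypes. The sketch below illustrates that relationship under stated assumptions: genotypes are coded 0, 1, 2, the tensor layout is (n_samples, n_snps, n_classes), and the tensor is flattened per sample before being passed to a regressor. The helper names one_hot_genotypes and flatten_samples are hypothetical, and none of this is confirmed by the reference.

```python
import numpy as np


def one_hot_genotypes(X, n_classes: int = 3) -> np.ndarray:
    """Encode integer genotype codes (n_samples, n_snps) as a 3D one-hot tensor."""
    X = np.asarray(X, dtype=np.int8)
    # Indexing an identity matrix with the codes yields one one-hot row per entry.
    return np.eye(n_classes, dtype=np.float64)[X]


def flatten_samples(tensor: np.ndarray) -> np.ndarray:
    """Collapse (n_samples, n_snps, n_classes) to (n_samples, n_snps * n_classes)."""
    return tensor.reshape(tensor.shape[0], -1)
```

Under these assumptions, n_features in the importance == True case would be n_snps for integer-coded input and n_snps * n_classes for flattened one-hot input.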

mlqtl.train.train_single(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]

Train a single gene and trait

Parameters

gene : str

The gene to train

trait : str

The trait to train

dataset : Dataset

The dataset to train

models : List[RegressorMixin]

The list of models to train

onehot : bool

Whether the SNP data is one-hot encoded

Returns

MatrixFloat64

A matrix of shape (n_models, 2) containing the correlation and p-value for each model

mlqtl.train.train_with_progressbar(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) -> List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]]

Train models on the given dataset using multiprocessing, displaying a progress bar

Parameters

trait : str

The trait to train

onehot : bool

Whether the SNP data is one-hot encoded

models : List[RegressorMixin]

The list of models to train

dataset : Dataset

The dataset to train

max_workers : int

The number of workers to use for multiprocessing

Returns

List[List[MatrixFloat64 | None]]

A list of lists containing the correlation and p-value for each model for each gene, shape of MatrixFloat64 is (n_models, 2)

mlqtl.utils module

mlqtl.utils.get_class_from_path(class_path_string: str) -> RegressorMixin

Given a string representing a class path, import the class and return it

Parameters

class_path_string : str

A string representing the class path, e.g. “module.submodule.ClassName”

Returns

RegressorMixin

The imported class object, which should be a subclass of RegressorMixin
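Resolving a dotted path to a class is usually done with importlib. The sketch below shows that general technique; the function name load_class is hypothetical and the code is not taken from mlqtl, which may additionally check that the result is a RegressorMixin subclass.

```python
import importlib


def load_class(class_path: str) -> type:
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {class_path!r}")
    module = importlib.import_module(module_path)
    # getattr raises AttributeError if the module has no such attribute.
    return getattr(module, class_name)
```

A call such as load_class("collections.OrderedDict") returns the class itself, not an instance, so the caller still has to instantiate it.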

mlqtl.utils.gff3_to_range(gff_file: str, region: str = 'CDS') -> DataFrame

Convert GFF3 file to plink gene range format
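For orientation, a PLINK range file lists one region per line as chromosome, start, end, and a set ID. The sketch below derives such rows from GFF3 text using only the standard library. It assumes that features of the requested type are grouped by their Parent attribute (falling back to ID) and collapsed to a single span per group; the function name gff3_lines_to_ranges is hypothetical, and the actual grouping and output columns used by mlqtl are not stated in this reference.

```python
from typing import Dict, Iterable, List, Tuple


def gff3_lines_to_ranges(
    lines: Iterable[str], region: str = "CDS"
) -> List[Tuple[str, int, int, str]]:
    """Collapse GFF3 features of type `region` into (chrom, start, end, name) rows."""
    spans: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != region:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        name = attrs.get("Parent") or attrs.get("ID")
        if name is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        key = (cols[0], name)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return [(chrom, s, e, name) for (chrom, name), (s, e) in spans.items()]
```

Writing each tuple as a whitespace-separated line would give a file in the four-column range layout described above.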

mlqtl.utils.gtf_to_range(gtf_file: str, region: str = 'CDS') -> DataFrame

Convert GTF file to plink gene range format

Run a plink command and return the output.

Module contents