mlqtl package

Submodules

mlqtl.cli module

mlqtl.datautils module

mlqtl.datautils.cal_padj(group: DataFrame) → DataFrame[source]: Calculate the padj for a given group of p-values
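
This page does not say which multiple-testing correction produces padj. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure; `bh_adjust` is a hypothetical helper and not part of mlqtl.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed method, see lead-in)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)  # indices that sort the p-values ascending
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n, dtype=float)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # restore the input order
    return adjusted


padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```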

mlqtl.datautils.cal_sliding_window(met: DataFrame, chrom: str, window_size: int, step: int) → Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]][source]

Sliding window to calculate the mean of the padj_norm values

Parameters

met (DataFrame): The DataFrame containing the padj_norm values
chrom (str): The chromosome to calculate the mean
window_size (int): The size of the window
step (int): The step size for the sliding window

Returns

MatrixFloat64: The mean of the padj_norm values for each window and the start and end positions
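
A sliding-window mean of this kind can be sketched with NumPy alone. The code below is not the mlqtl implementation: `sliding_window_mean`, the half-open window convention, and the (start, end, mean) column order are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_mean(positions, values, window_size, step):
    """Return rows of (start, end, mean) for each non-empty window."""
    positions = np.asarray(positions, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    if positions.size == 0:
        return np.empty((0, 3), dtype=np.float64)
    rows = []
    start = positions.min()
    last = positions.max()
    while start <= last:
        end = start + window_size
        # Half-open window [start, end): an assumption of this sketch.
        mask = (positions >= start) & (positions < end)
        if mask.any():
            rows.append((start, end, values[mask].mean()))
        start += step
    return np.array(rows, dtype=np.float64).reshape(-1, 3)


windows = sliding_window_mean([0, 5, 10, 15], [1, 3, 5, 7], window_size=10, step=5)
```

With `step` smaller than `window_size`, consecutive windows overlap, which smooths the per-gene values along the chromosome.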

mlqtl.datautils.merge_window(window_mean: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], threshold: float64) → Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None[source]

Merge genes in the same region

Parameters

window_mean (MatrixFloat64): The mean of the padj_norm values for each window and the start and end positions
threshold (np.float64): The threshold to filter the mean values

Returns

MatrixFloat64: The merged windows with start and end positions and the mean padj_norm value
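
The merging step can be illustrated as follows. This is a sketch under stated assumptions, not the library's code: `merge_windows` is a hypothetical name, rows are assumed to be (start, end, mean), windows are kept when their mean is at or above the threshold, and overlapping or touching windows are merged.

```python
import numpy as np


def merge_windows(windows, threshold):
    """Merge overlapping windows whose mean passes the threshold."""
    windows = np.asarray(windows, dtype=np.float64).reshape(-1, 3)
    kept = windows[windows[:, 2] >= threshold]  # assumed filter direction
    if kept.shape[0] == 0:
        return None  # nothing passes, mirroring the documented `| None`
    kept = kept[np.argsort(kept[:, 0])]  # sort by start position
    merged = []
    cur_start, cur_end, scores = kept[0, 0], kept[0, 1], [kept[0, 2]]
    for start, end, score in kept[1:]:
        if start <= cur_end:
            # Overlaps or touches the current region: extend it.
            cur_end = max(cur_end, end)
            scores.append(score)
        else:
            merged.append((cur_start, cur_end, np.mean(scores)))
            cur_start, cur_end, scores = start, end, [score]
    merged.append((cur_start, cur_end, np.mean(scores)))
    return np.array(merged, dtype=np.float64)


example = [[0, 10, 2.0], [5, 15, 4.0], [10, 20, 0.5], [30, 40, 3.0]]
regions = merge_windows(example, threshold=1.0)
```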

mlqtl.datautils.proc_train_res(train_res: List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]], models: List[RegressorMixin], dataset: Dataset, padj: bool = False) → DataFrame[source]

Integrate the results from different chunks and calculate padj

Parameters

train_res (List[List[MatrixFloat64 | None]]): The result from the regression models; each matrix row is a (pcc, pval) pair
models (List[RegressorMixin]): The list of regression models
dataset (Dataset): The dataset containing the information

Returns

DataFrame: The integrated DataFrame containing the correlation, p-value, padj, and gene information

mlqtl.datautils.significance(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], result: DataFrame, threshold: float64) → DataFrame[source]

Get the genes in the peak window of the green region

Parameters

sliding_window_result (List[Tuple[str, MatrixFloat64, MatrixFloat64]]): Results of the sliding window calculation
result (DataFrame): Integrated training results
threshold (np.float64): The threshold to filter the candidate genes

Returns

DataFrame: Gene table of the green region in the graph

mlqtl.datautils.skip_padj(group: DataFrame) → DataFrame[source]: Skip the padj calculation for a given group of p-values

mlqtl.datautils.sliding_window(result: DataFrame, window: int, step: int, threshold: float64) → Tuple[List[Tuple[str_, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], DataFrame][source]

Convert the training results to a DataFrame, calculate the sliding window, and merge significant regions

Parameters

result (DataFrame): The integrated training results containing the correlation, p-value, and gene information
window (int): The size of the window
step (int): The step size for the sliding window
threshold (np.float64): The threshold to filter the mean values

Returns

sliding_window_result (List[Tuple[np.str_, MatrixFloat64, MatrixFloat64]]): A list of tuples, each containing the chromosome, the mean values for each window, and the merged windows
significant_genes (DataFrame): The significant genes in the green region of the graph

mlqtl.nda_typing module

mlqtl.nda_typing.MatrixFloat64

2D array (matrix) of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int, int]]

mlqtl.nda_typing.MatrixInt8

2D array (matrix) of 8-bit signed integers.

alias of Annotated[ndarray[tuple[int, …], dtype[int8]], typing.Tuple[int, int]]

mlqtl.nda_typing.TensorFloat64

3D array (tensor) of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int, int, int]]

mlqtl.nda_typing.VectorBool

1D array of NumPy booleans.

alias of Annotated[ndarray[tuple[int, …], dtype[bool]], typing.Tuple[int]]

mlqtl.nda_typing.VectorFloat64

1D array of 64-bit floating-point numbers (double precision).

alias of Annotated[ndarray[tuple[int, …], dtype[float64]], typing.Tuple[int]]

mlqtl.nda_typing.VectorInt8

1D array of 8-bit signed integers.

alias of Annotated[ndarray[tuple[int, …], dtype[int8]], typing.Tuple[int]]

mlqtl.nda_typing.VectorStr

1D array of NumPy Unicode strings.

alias of Annotated[ndarray[tuple[int, …], dtype[str_]], typing.Tuple[int]]
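
Aliases of this shape can be written with `typing.Annotated`, as the alias text above suggests. The sketch below rebuilds one of them under a hypothetical name (`Matrix64Sketch`) and adds a small runtime check, `is_matrix_float64`, which is also hypothetical. The annotation itself is metadata only; nothing enforces it at runtime.

```python
from typing import Annotated, Tuple, get_args

import numpy as np

# Mirrors the alias text shown for MatrixFloat64; the name is hypothetical.
Matrix64Sketch = Annotated[
    np.ndarray[tuple[int, ...], np.dtype[np.float64]],
    Tuple[int, int],  # shape hint carried as Annotated metadata
]


def is_matrix_float64(arr) -> bool:
    """Runtime check that an array is 2D with dtype float64."""
    return isinstance(arr, np.ndarray) and arr.ndim == 2 and arr.dtype == np.float64


# The second argument of Annotated holds the shape hint.
shape_hint = get_args(Matrix64Sketch)[1]
```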

mlqtl.plot module

mlqtl.plot.plot_feature_importance(feature_importance: DataFrame, topn: int = 10, save: bool = False, filename: str = 'feature_importance') → None[source]: Plot the feature importance of SNPs in different models

mlqtl.plot.plot_graph(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], threshold: float64, filename: str = 'result', font_size: int = 20, save: bool = False) → None[source]: Plot the sliding window result

mlqtl.train module

class mlqtl.train.MLMetrics[source]

Bases: object

update(y, y_hat)[source]

mlqtl.train.feature_importance(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) → DataFrame[source]

Train a single gene and trait and calculate the feature importance of its SNPs

Parameters

gene (str): The gene to train
trait (str): The trait to train
dataset (Dataset): The dataset to train
models (List[RegressorMixin]): The list of models to train
onehot (bool): Whether the SNP data is one-hot encoded

Returns

DataFrame: a pandas DataFrame containing the feature importance for each model with the SNP markers as columns and the model names as rows

mlqtl.train.init_worker(dataset: Dataset) → None[source]: Initialize the worker with the dataset for multiprocessing

mlqtl.train.train(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) → List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]][source]

Train models on the given dataset using multiprocessing

Parameters

trait (str): The trait to train
onehot (bool): Whether the SNP data is one-hot encoded
models (List[RegressorMixin]): The list of models to train
dataset (Dataset): The dataset to train
max_workers (int): The number of workers to use for multiprocessing

Returns

List[List[MatrixFloat64 | None]]: A list of lists containing the correlation and p-value for each model for each gene, shape of MatrixFloat64 is (n_models, 2)

mlqtl.train.train_batch(X: Annotated[ndarray[tuple[int, ...], dtype[int8]], Tuple[int, int]] | Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int, int]], y: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int]], onehot: bool, models: List[RegressorMixin], importance: bool = False) → Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]][source]

Train a batch of models on the given data

Parameters

X (MatrixInt8 | TensorFloat64): The encoded SNP data
y (VectorFloat64): The trait values
onehot (bool): Whether the SNP data is one-hot encoded
models (List[RegressorMixin]): The list of models to train
importance (bool): Whether to calculate feature importance

Returns

MatrixFloat64

importance == False: a matrix of shape (n_models, 2) containing the correlation and p-value for each model
importance == True: a matrix of shape (n_models, n_features) containing the feature importance for each model

mlqtl.train.train_single(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) → Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]][source]

Train a single gene and trait

Parameters

gene (str): The gene to train
trait (str): The trait to train
dataset (Dataset): The dataset to train
models (List[RegressorMixin]): The list of models to train
onehot (bool): Whether the SNP data is one-hot encoded

Returns

MatrixFloat64: a matrix of shape (n_models, 2) containing the correlation and p-value for each model

mlqtl.train.train_with_progressbar(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) → List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]][source]

Train models on the given dataset using multiprocessing, displaying a progress bar

Parameters

trait (str): The trait to train
onehot (bool): Whether the SNP data is one-hot encoded
models (List[RegressorMixin]): The list of models to train
dataset (Dataset): The dataset to train
max_workers (int): The number of workers to use for multiprocessing

Returns

List[List[MatrixFloat64 | None]]: A list of lists containing the correlation and p-value for each model for each gene, shape of MatrixFloat64 is (n_models, 2)

mlqtl.utils module

mlqtl.utils.get_class_from_path(class_path_string: str) → RegressorMixin[source]

Given a string representing a class path, import the class and return it

Parameters

class_path_string (str): A string representing the class path, e.g. “module.submodule.ClassName”

Returns

RegressorMixin: The imported class object, which should be a subclass of RegressorMixin
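
Importing a class from a dotted path is commonly done with `importlib`. The sketch below shows that general technique; `load_class` is a hypothetical stand-in, and the actual mlqtl implementation may differ (for example, in whether it checks that the result is a `RegressorMixin` subclass).

```python
import importlib


def load_class(class_path: str):
    """Import and return the object named by 'module.submodule.ClassName'."""
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got {class_path!r}")
    module = importlib.import_module(module_path)
    # Raises AttributeError if the module has no such attribute.
    return getattr(module, class_name)


# A standard-library class is used here so the example has no dependencies.
cls = load_class("collections.OrderedDict")
```

The returned object is the class itself, not an instance, so the caller still has to instantiate it, e.g. `cls()`.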

mlqtl.utils.gff3_to_range(gff_file: str, region: str = 'CDS') → DataFrame[source]: Convert GFF3 file to plink gene range format

mlqtl.utils.gtf_to_range(gtf_file: str, region: str = 'CDS') → DataFrame[source]: Convert GTF file to plink gene range format

mlqtl.utils.run_plink(cmd: str) → str[source]: Run a plink command and return the output.
