mlqtl package
Subpackages
Submodules
mlqtl.cli module
mlqtl.datautils module
- mlqtl.datautils.cal_padj(group: DataFrame) -> DataFrame
Calculate the adjusted p-values (padj) for a given group of p-values
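The reference does not say which adjustment method cal_padj applies. As an illustration only, the sketch below assumes a Benjamini-Hochberg correction, a common choice for this kind of adjustment; the function name benjamini_hochberg and the sample p-values are hypothetical and not part of mlqtl.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: this is NOT confirmed to be the method used by cal_padj.
    """
    p = np.asarray(pvalues, dtype=np.float64)
    n = p.size
    order = np.argsort(p)                          # indices of ascending p-values
    scaled = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # Enforce monotonicity, starting from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n, dtype=np.float64)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)    # restore the input order
    return adjusted

padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

With a DataFrame, a function like this could be applied per group to the p-value column, which would match the group argument in the signature above.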
- mlqtl.datautils.cal_sliding_window(met: DataFrame, chrom: str, window_size: int, step: int) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]
Compute the mean of the padj_norm values in each sliding window
Parameters
- met: DataFrame
The DataFrame containing the padj_norm values
- chrom: str
The chromosome for which to compute the window means
- window_size: int
The size of the window
- step: int
The step size for the sliding window
Returns
- MatrixFloat64
The mean of the padj_norm values for each window, together with the window start and end positions
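A minimal sketch of a sliding-window mean, assuming the values are already ordered along one chromosome and that only full windows are kept. The function name sliding_window_mean, the (start, end, mean) column order, and the index-based positions are illustrative assumptions; the actual layout returned by cal_sliding_window may differ.

```python
import numpy as np

def sliding_window_mean(values, window_size, step):
    """Return a matrix with one row (start, end, mean) per full window."""
    values = np.asarray(values, dtype=np.float64)
    rows = []
    # Advance by `step`; trailing partial windows are dropped (an assumption).
    for start in range(0, len(values) - window_size + 1, step):
        end = start + window_size              # exclusive end index
        rows.append((start, end, values[start:end].mean()))
    # reshape keeps a (0, 3) shape when there are no full windows
    return np.array(rows, dtype=np.float64).reshape(-1, 3)

windows = sliding_window_mean([1.0, 3.0, 5.0, 7.0, 9.0], window_size=3, step=1)
```

Keeping the start and end next to each mean makes it possible to map a high-scoring window back to the genes it covers.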
- mlqtl.datautils.merge_window(window_mean: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], threshold: float64) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None
Merge genes in the same region
Parameters
- window_mean: MatrixFloat64
The mean of the padj_norm values for each window, together with the window start and end positions
- threshold: np.float64
The threshold used to filter the mean values
Returns
- MatrixFloat64 | None
The merged windows with their start and end positions and the mean padj_norm value
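The merging step can be sketched as follows, under stated assumptions: each input row is (start, end, mean), consecutive rows whose mean reaches the threshold are fused into one region, the merged value is the mean of the window means, and None is returned when no window passes. The name merge_windows and these rules are illustrative; merge_window itself may use different criteria.

```python
import numpy as np

def merge_windows(window_mean, threshold):
    """Fuse runs of consecutive windows whose mean is >= threshold."""
    merged = []
    current = None  # [start, end, list of window means] for the open region
    for start, end, mean in window_mean:
        if mean >= threshold:
            if current is None:
                current = [start, end, [mean]]
            else:
                current[1] = end               # extend the open region
                current[2].append(mean)
        elif current is not None:
            merged.append((current[0], current[1], np.mean(current[2])))
            current = None
    if current is not None:                    # close a region that reaches the end
        merged.append((current[0], current[1], np.mean(current[2])))
    if not merged:
        return None                            # mirrors the `| None` return type
    return np.array(merged, dtype=np.float64)

windows = np.array([
    [0, 3, 0.2],
    [1, 4, 0.8],
    [2, 5, 1.0],
    [3, 6, 0.1],
    [4, 7, 0.9],
])
regions = merge_windows(windows, threshold=0.5)
```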
- mlqtl.datautils.proc_train_res(train_res: List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]], models: List[RegressorMixin], dataset: Dataset, padj: bool = False) -> DataFrame
Integrate the results from different chunks and calculate padj
Parameters
- train_res: List[List[MatrixFloat64 | None]]
The results from the regression models; each MatrixFloat64 holds one (pcc, pval) row per model
- models: List[RegressorMixin]
The list of regression models
- dataset: Dataset
The dataset containing the information
Returns
- DataFrame
The integrated DataFrame containing the correlation, p-value, padj, and gene information
- mlqtl.datautils.significance(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], result: DataFrame, threshold: float64) -> DataFrame
Get the genes in the peak window of the green region
Parameters
- sliding_window_result: List[Tuple[str, MatrixFloat64, MatrixFloat64]]
Results of the sliding window calculation
- result: DataFrame
Integrated training results
- threshold: np.float64
The threshold used to filter the candidate genes
Returns
- DataFrame
Gene table of the green region in the graph
- mlqtl.datautils.skip_padj(group: DataFrame) -> DataFrame
Skip the padj calculation for a given group of p-values
- mlqtl.datautils.sliding_window(result: DataFrame, window: int, step: int, threshold: float64) -> Tuple[List[Tuple[str_, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], DataFrame]
Convert the training results to a DataFrame, compute the sliding-window means, and merge significant regions
Parameters
- result: DataFrame
The integrated training results containing the correlation, p-value, and gene information
- window: int
The size of the window
- step: int
The step size for the sliding window
- threshold: np.float64
The threshold used to filter the mean values
Returns
- sliding_window_result: List[Tuple[np.str_, MatrixFloat64, MatrixFloat64]]
A list of tuples, each containing the chromosome, the mean values for each window, and the merged windows
- significant_genes: DataFrame
The significant genes in the green region of the graph
mlqtl.nda_typing module
- mlqtl.nda_typing.MatrixFloat64
2D array (matrix) of 64-bit floating-point numbers (double precision).
alias of Annotated[ndarray[tuple[int, ...], dtype[float64]], typing.Tuple[int, int]]
- mlqtl.nda_typing.MatrixInt8
2D array (matrix) of 8-bit signed integers.
alias of Annotated[ndarray[tuple[int, ...], dtype[int8]], typing.Tuple[int, int]]
- mlqtl.nda_typing.TensorFloat64
3D array (tensor) of 64-bit floating-point numbers (double precision).
alias of Annotated[ndarray[tuple[int, ...], dtype[float64]], typing.Tuple[int, int, int]]
- mlqtl.nda_typing.VectorBool
1D array of NumPy booleans.
alias of Annotated[ndarray[tuple[int, ...], dtype[bool]], typing.Tuple[int]]
- mlqtl.nda_typing.VectorFloat64
1D array of 64-bit floating-point numbers (double precision).
alias of Annotated[ndarray[tuple[int, ...], dtype[float64]], typing.Tuple[int]]
- mlqtl.nda_typing.VectorInt8
1D array of 8-bit signed integers.
alias of Annotated[ndarray[tuple[int, ...], dtype[int8]], typing.Tuple[int]]
- mlqtl.nda_typing.VectorStr
1D array of NumPy Unicode strings.
alias of Annotated[ndarray[tuple[int, ...], dtype[str_]], typing.Tuple[int]]
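Aliases of this form can be built with typing.Annotated and numpy.typing.NDArray. The sketch below is one plausible way to define two of them, not necessarily how mlqtl.nda_typing is written; column_means is a hypothetical helper. The shape annotation is metadata for readers and tools only: nothing checks it at runtime.

```python
from typing import Annotated, Tuple, get_args

import numpy as np
import numpy.typing as npt

# dtype comes from NDArray; the second argument documents the expected rank.
MatrixFloat64 = Annotated[npt.NDArray[np.float64], Tuple[int, int]]
VectorFloat64 = Annotated[npt.NDArray[np.float64], Tuple[int]]

def column_means(matrix: MatrixFloat64) -> VectorFloat64:
    """Hypothetical helper: reduce a 2D matrix to a 1D vector of column means."""
    return matrix.mean(axis=0)

means = column_means(np.array([[1.0, 2.0], [3.0, 4.0]]))

# The shape hint can be recovered from the alias for documentation or validation.
shape_hint = get_args(MatrixFloat64)[1]
```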
mlqtl.plot module
- mlqtl.plot.plot_feature_importance(feature_importance: DataFrame, topn: int = 10, save: bool = False, filename: str = 'feature_importance') -> None
Plot the feature importance of SNPs in different models
- mlqtl.plot.plot_graph(sliding_window_result: List[Tuple[str, Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]], Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]]], threshold: float64, filename: str = 'result', font_size: int = 20, save: bool = False) -> None
Plot the sliding window result
mlqtl.train module
- mlqtl.train.feature_importance(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) -> DataFrame
Train a single gene and trait and return the feature importance for each model
Parameters
- gene: str
The gene to train
- trait: str
The trait to train
- dataset: Dataset
The dataset to train
- models: List[RegressorMixin]
The list of models to train
- onehot: bool
Whether the SNP data is one-hot encoded
Returns
- DataFrame
A pandas DataFrame containing the feature importance for each model, with the SNP markers as columns and the model names as rows
- mlqtl.train.init_worker(dataset: Dataset) -> None
Initialize the worker with the dataset for multiprocessing
- mlqtl.train.train(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) -> List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]]
Train models on the given dataset using multiprocessing
Parameters
- trait: str
The trait to train
- onehot: bool
Whether the SNP data is one-hot encoded
- models: List[RegressorMixin]
The list of models to train
- dataset: Dataset
The dataset to train
- max_workers: int
The number of workers to use for multiprocessing
Returns
- List[List[MatrixFloat64 | None]]
A list of lists containing the correlation and p-value for each model for each gene; each MatrixFloat64 has shape (n_models, 2)
- mlqtl.train.train_batch(X: Annotated[ndarray[tuple[int, ...], dtype[int8]], Tuple[int, int]] | Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int, int]], y: Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int]], onehot: bool, models: List[RegressorMixin], importance: bool = False) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]
Train a batch of models on the given data
Parameters
- X: MatrixInt8 | TensorFloat64
The encoded SNP data
- y: VectorFloat64
The trait values
- onehot: bool
Whether the SNP data is one-hot encoded
- models: List[RegressorMixin]
The list of models to train
- importance: bool
Whether to calculate feature importance
Returns
- MatrixFloat64
If importance is False, a matrix of shape (n_models, 2) containing the correlation and p-value for each model
If importance is True, a matrix of shape (n_models, n_features) containing the feature importance for each model
- mlqtl.train.train_single(gene: str, trait: str, models: List[RegressorMixin], dataset: Dataset, onehot: bool = False) -> Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]]
Train a single gene and trait
Parameters
- gene: str
The gene to train
- trait: str
The trait to train
- dataset: Dataset
The dataset to train
- models: List[RegressorMixin]
The list of models to train
- onehot: bool
Whether the SNP data is one-hot encoded
Returns
- MatrixFloat64
A matrix of shape (n_models, 2) containing the correlation and p-value for each model
- mlqtl.train.train_with_progressbar(trait: str, models: List[RegressorMixin], dataset: Dataset, max_workers: int = 8, onehot: bool = False) -> List[List[Annotated[ndarray[tuple[int, ...], dtype[float64]], Tuple[int, int]] | None]]
Train models on the given dataset using multiprocessing, displaying a progress bar
Parameters
- trait: str
The trait to train
- onehot: bool
Whether the SNP data is one-hot encoded
- models: List[RegressorMixin]
The list of models to train
- dataset: Dataset
The dataset to train
- max_workers: int
The number of workers to use for multiprocessing
Returns
- List[List[MatrixFloat64 | None]]
A list of lists containing the correlation and p-value for each model for each gene; each MatrixFloat64 has shape (n_models, 2)
mlqtl.utils module
- mlqtl.utils.get_class_from_path(class_path_string: str) -> RegressorMixin
Given a string representing a class path, import the class and return it
Parameters
- class_path_string: str
A string representing the class path, e.g. “module.submodule.ClassName”
Returns
- RegressorMixin
The imported class object, which should be a subclass of RegressorMixin
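Resolving a dotted class path is usually done with importlib. The sketch below shows the general technique, not the actual body of get_class_from_path; load_class is a hypothetical name, and a standard-library class stands in for a regressor so the example runs without extra dependencies.

```python
import importlib

def load_class(class_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {class_path!r}")
    module = importlib.import_module(module_path)  # ImportError if the module is missing
    return getattr(module, class_name)             # AttributeError if the class is missing

# Stand-in for a regressor path such as "sklearn.linear_model.LinearRegression".
cls = load_class("collections.OrderedDict")
instance = cls(a=1)
```

This does not verify that the result is a RegressorMixin subclass; a caller that needs that guarantee would have to check it separately.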
- mlqtl.utils.gff3_to_range(gff_file: str, region: str = 'CDS') -> DataFrame
Convert a GFF3 file to PLINK gene range format
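A rough sketch of such a conversion, using only the standard library. It assumes that the range format has the columns chromosome, start, end, and name, that features are grouped by their Parent attribute (falling back to ID), and that each group spans from its smallest start to its largest end. These rules and the names gff3_to_ranges and parse_attributes are illustrative; gff3_to_range may group features differently and returns a DataFrame rather than a list.

```python
def parse_attributes(field):
    """Turn 'ID=cds1;Parent=gene1' into {'ID': 'cds1', 'Parent': 'gene1'}."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}

def gff3_to_ranges(lines, region="CDS"):
    """Collapse matching GFF3 features into (chrom, start, end, name) rows."""
    spans = {}  # (chrom, name) -> [min_start, max_end]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                            # skip blanks, directives, comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != region:
            continue                            # keep only the requested feature type
        attrs = parse_attributes(cols[8])
        name = attrs.get("Parent") or attrs.get("ID")
        if name is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        span = spans.setdefault((cols[0], name), [start, end])
        span[0] = min(span[0], start)
        span[1] = max(span[1], end)
    return [(chrom, start, end, name) for (chrom, name), (start, end) in spans.items()]

sample = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tCDS\t100\t300\t.\t+\t0\tID=cds1;Parent=gene1",
    "chr1\tsrc\tCDS\t500\t900\t.\t+\t0\tID=cds2;Parent=gene1",
    "chr2\tsrc\tCDS\t50\t80\t.\t-\t0\tID=cds3;Parent=gene2",
]
ranges = gff3_to_ranges(sample)
```

In real annotations the Parent of a CDS is often a transcript rather than a gene, so mapping to gene names may need an extra lookup step.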