Auto-generated documentation for fse.models.base_s2v module.
Base class containing common methods for training, using, and evaluating sentence embeddings. A lot of the code is based on Gensim; I have to thank Radim Rehurek and the whole team for the outstanding library, which I used for a lot of my research.
wv : class gensim.models.keyedvectors.KeyedVectors
This object essentially contains the mapping between words and embeddings. After training, it can be used
directly to query those embeddings in various ways. See the module level docstring for examples.
sv : class fse.models.sentencevectors.SentenceVectors
This object contains the sentence vectors inferred from the training data. There will be one such vector
for each unique sentence supplied during training. They may be individually accessed using the index.
prep : class fse.models.base_s2v.BaseSentence2VecPreparer
The prep object is used to transform and initialize the sv.vectors. Additionally, it can be used
to move the vectors to disk for training with memmap.
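Since BaseSentence2VecModel is abstract, these attributes are normally reached through a concrete subclass such as Average. A minimal sketch, assuming Gensim 4.x and a toy corpus (all data below is illustrative):

```python
from gensim.models import Word2Vec
from fse import IndexedList
from fse.models import Average

# Toy corpus: each sentence is a list of tokens.
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
w2v = Word2Vec(sentences, vector_size=20, min_count=1)

model = Average(w2v.wv)              # wv: the wrapped KeyedVectors
model.train(IndexedList(sentences))  # fills model.sv with one vector per sentence

word_vec = model.wv["cat"]           # word embedding, queried through wv
sent_vec = model.sv[0]               # sentence embedding for index 0, queried through sv
print(word_vec.shape, sent_vec.shape)
```

The later sketches in this section reuse `model`, `w2v`, and `sentences` from this setup.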
class fse.models.average.Average
Average sentence model.
class fse.models.sif.SIF
Smooth inverse frequency weighted model.
class fse.models.usif.uSIF
Unsupervised smooth inverse frequency weighted model.
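All three concrete models share this base interface and differ mainly in how word vectors are weighted; a hedged sketch of instantiating them from the same KeyedVectors (the components and length values are illustrative, not recommendations):

```python
from fse.models import Average, SIF, uSIF

# `w2v.wv` as in the sketch above; any Gensim KeyedVectors should work.
avg = Average(w2v.wv)             # plain averaging
sif = SIF(w2v.wv, components=10)  # frequency-weighted average with principal component removal
usif = uSIF(w2v.wv, length=11)    # unsupervised variant with an assumed average sentence length
```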
class BaseSentence2VecModel(SaveLoad):
def __init__(
model: KeyedVectors,
sv_mapfile_path: str = None,
wv_mapfile_path: str = None,
workers: int = 1,
lang_freq: str = None,
fast_version: int = 0,
batch_words: int = 10000,
batch_ngrams: int = 40,
**kwargs,
):
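The constructor arguments are forwarded by the concrete subclasses; a sketch of the ones most commonly set (the path and lang_freq values are illustrative):

```python
from fse.models import SIF

sif_model = SIF(
    w2v.wv,                          # KeyedVectors providing the word embeddings
    sv_mapfile_path="sent_vectors",  # memmap the sentence vectors to disk (illustrative path prefix)
    workers=2,                       # number of training threads
    lang_freq="en",                  # induce word frequencies for pretrained vectors that lack counts
)
```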
def __str__() -> str:
Human readable representation of the model’s state.
str Human readable representation of the model’s state.
def estimate_memory(
max_index: int,
report: dict = None,
**kwargs,
) -> Dict[str, int]:
Estimate the size of the sentence embedding
max_index : int Maximum index found during the initial scan
report : dict Report of subclasses
dict Dictionary of estimated memory sizes
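A sketch of querying the estimate directly; the method is normally called for you during the initial scan, and the max_index value below is illustrative:

```python
# `model` as in the first sketch; returns a dict of estimated sizes in bytes.
report = model.estimate_memory(max_index=1_000_000)
for key, size in report.items():
    print(key, size)
```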
def infer(sentences: List[tuple] = None, use_norm=False) -> ndarray:
Secondary routine to train an embedding. This method is essential for small batches of sentences, which require little computation. Note: This method does not apply post-training transformations, only post-inference calls (such as removing principal components).
sentences : (list, iterable) An iterable consisting of tuple objects
use_norm : bool If True, the sentence vectors will be L2 normalized (unit euclidean length)
ndarray Computed sentence vectors
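A sketch of inferring vectors for unseen sentences with a trained model (the tokens are illustrative); each input tuple pairs a token list with the row index the result should occupy:

```python
# `model` trained as in the first sketch; each tuple is (list_of_tokens, row_index).
new_sentences = [(["dog", "say", "meow"], 0), (["cat", "say", "woof"], 1)]
vectors = model.infer(new_sentences, use_norm=True)  # rows are L2 normalized
print(vectors.shape)  # (2, vector_size)
```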
@classmethod
def load(*args, **kwargs):
Load a previously saved class fse.models.base_s2v.BaseSentence2VecModel.
fname : str Path to the saved file.
class fse.models.base_s2v.BaseSentence2VecModel
Loaded model.
def save(*args, **kwargs):
Save the model.
This saved model can be loaded again using fse.models.base_s2v.BaseSentence2VecModel.load
fname : str Path to the file.
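A round-trip sketch (the file name is illustrative):

```python
# `model` and `Average` as in the first sketch.
model.save("average_model.pkl")              # persists the model, including sv.vectors
restored = Average.load("average_model.pkl")
print(restored.sv[0][:5])
```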
def scan_sentences(
sentences: List[tuple] = None,
progress_per: int = 5,
) -> Dict[str, int]:
Performs an initial scan of the data and reports all corresponding statistics
sentences : (list, iterable) An iterable consisting of tuple objects
progress_per : int Number of seconds to pass before reporting the scan progress
dict Dictionary containing the scan statistics
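The scan runs automatically at the start of train; a sketch of calling it directly (the exact statistics keys may differ between fse versions):

```python
from fse import IndexedList

# `model` and `sentences` as in the first sketch.
stats = model.scan_sentences(IndexedList(sentences))
print(stats)  # e.g. numbers of sentences and words found during the scan
```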
def train(
sentences: List[tuple] = None,
update: bool = False,
queue_factor: int = 2,
report_delay: int = 5,
) -> Tuple[int, int]:
Main routine to train an embedding. This method writes all sentence vectors into sv.vectors and is used for computing embeddings for large chunks of data. This method also handles post-training transformations, such as computing the SVD of the sentence vectors.
sentences : (list, iterable) An iterable consisting of tuple objects
update : bool If True, the sentence vector matrix will be updated in size (even with memmap)
queue_factor : int Multiplier for size of queue -> size = number of workers * queue_factor.
report_delay : int Number of seconds between two consecutive progress report messages in the logger.
int, int Count of effective sentences and words encountered
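A sketch of the main training call (reusing the toy setup from the first sketch):

```python
from fse import IndexedList

# Writes one vector per sentence into model.sv.vectors.
eff_sentences, eff_words = model.train(IndexedList(sentences))
print(eff_sentences, eff_words)  # effective counts of sentences and words seen
```

Passing update=True instead grows the sentence-vector matrix in place, which is useful when additional sentences arrive after the initial pass.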
class BaseSentence2VecPreparer(SaveLoad):
Contains helper functions to prepare the weights for the training of BaseSentence2VecModel.
def prepare_vectors(
sv: SentenceVectors,
total_sentences: int,
update: bool = False,
):
Build tables and model weights based on final vocabulary settings.
def reset_vectors(sv: SentenceVectors, total_sentences: int):
Initialize all sentence vectors to zero and overwrite existing files
def update_vectors(sv: SentenceVectors, total_sentences: int):
Given existing sentence vectors, append new ones
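The preparer is normally driven by train itself; a hedged sketch of invoking it directly through the model's prep attribute (the total_sentences value is illustrative):

```python
# `model` as in the first sketch; allocate the sentence-vector matrix ahead of training.
model.prep.prepare_vectors(sv=model.sv, total_sentences=100, update=False)
print(model.sv.vectors.shape)  # (100, vector_size), initialized to zero
```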