Customizing Topic Extraction¶
We use LDA model for topic extraction.
First you need a trained LDA model using gensim
. You can find a
demo trained model on our the GitHub repository, or you can train one
yourself.
The extraction process consists of two parts:
- Tokenizing (potentially includes noun phrase extraction)
- Predicting topics with trained LDA Model
Customizing Tokenization¶
The tokenizer is implemented in
core.lda_engine.LdaModelWrapper.tokenize()
. It is a method of a
LDA model since different models may need to tokenize differently (For
example, some need noun phrase extraction in addition to tokenizing).
See core.lda_engine.LdaModelWrapper.tokenize()
for more details.
Loading Your Own LDA Model¶
If you are training LDA model with gensim
, you can load your
trained models in FMR by a few lines of configurations. See Installation
for details.
Make sure you also have the following components:
- Gensim’s
.dictionary
file, with which you trained the LDA model. .json
file, which stores the profiles of your pool of scholars.
Implementing LDA Model For Other Libraries¶
The core.lda_engine.LdaModelWrapper
class serves as an
abstraction layer between the rest of the application and the actual LDA
model.
If you have other LDA models implemented by other libraries, or even a
completely different language, you can rewrite the core.lda_engine.LdaModelWrapper
to fit your need.
A minimal working core.lda_engine.LdaModelWrapper
should at least consists of the
following methods:
core.lda_engine.LdaModelWrapper.predict()
: it takes a raw text string and return a NumPy array of topics IDs and their confidence levels.core.lda_engine.LdaModelWrapper.get_author_top_topics()
core.lda_engine.LdaModelWrapper.get_topic_in_string()
- core.lda_engine.LdaModelWrapper.authors_lib : a dictionary that contains the profile of the pool of scholars. It must work in tandem with the matching algorithm. It will be automatically loaded if the configured correctly. See LDA Models for details.