Customizing Topic Extraction

We use an LDA (latent Dirichlet allocation) model for topic extraction.

First, you need an LDA model trained with gensim. You can find a demo trained model in our GitHub repository, or you can train one yourself.

The extraction process consists of two parts:

  1. Tokenizing (optionally including noun phrase extraction)
  2. Predicting topics with the trained LDA model

Customizing Tokenization

The tokenizer is implemented in core.lda_engine.LdaModelWrapper.tokenize(). It is a method of the LDA model wrapper because different models may need different tokenization (for example, some need noun phrase extraction in addition to tokenizing).

See core.lda_engine.LdaModelWrapper.tokenize() for more details.
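
As a rough sketch of what a customized tokenizer might look like, the example below overrides tokenize() in a subclass. The base class here is a hypothetical stand-in, not the real core.lda_engine.LdaModelWrapper, and the method signature (one string in, a list of tokens out) is an assumption. The fixed phrase list is a simple placeholder for a real noun phrase extractor.

```python
import re


class LdaModelWrapper:
    """Hypothetical stand-in for core.lda_engine.LdaModelWrapper."""

    def tokenize(self, text):
        # Plain word tokenization: lowercase alphabetic runs
        return re.findall(r"[a-z]+", text.lower())


class PhraseAwareWrapper(LdaModelWrapper):
    """Variant for a model whose vocabulary contains multi-word phrases."""

    # Hypothetical phrase list; a real model would use a noun phrase extractor
    PHRASES = {("machine", "learning"), ("neural", "network")}

    def tokenize(self, text):
        words = super().tokenize(text)
        tokens = []
        i = 0
        while i < len(words):
            pair = tuple(words[i:i + 2])
            if pair in self.PHRASES:
                # Merge a known two-word phrase into a single token
                tokens.append("_".join(pair))
                i += 2
            else:
                tokens.append(words[i])
                i += 1
        return tokens


sentence = "Machine learning with a neural network"
print(LdaModelWrapper().tokenize(sentence))
print(PhraseAwareWrapper().tokenize(sentence))
```

The second call merges "machine learning" and "neural network" into single tokens, which only helps if the model was trained on tokens produced the same way.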

Loading Your Own LDA Model

If you trained your LDA model with gensim, you can load it in FMR with a few lines of configuration. See Installation for details.

Make sure you also have the following components:

  1. Gensim’s .dictionary file, with which you trained the LDA model.
  2. A .json file, which stores the profiles of your pool of scholars.

Implementing LDA Model For Other Libraries

The core.lda_engine.LdaModelWrapper class serves as an abstraction layer between the rest of the application and the actual LDA model.

If your LDA model is implemented with another library, or even in a completely different language, you can rewrite core.lda_engine.LdaModelWrapper to fit your needs.

A minimal working core.lda_engine.LdaModelWrapper should consist of at least the following methods: