This repository uses biodiversity data from the BioTIME database to classify methods texts using fastText. To run the experiments you need:
- a local copy of BioTIME and the metadata.
- conda (Miniconda or Anaconda)
- a Python fastText binding (more information in the installation section)
This section guides you through setting up this project to run experiments using Snakemake and fastText.
- Clone this repository:
$ git clone https://github.com/komax/BioTIME-fastText-classification
- Create a new environment, e.g., `biotime-fasttext`, and install all dependencies:
$ conda env create --name biotime-fasttext --file environment.yaml
- Activate the conda environment. Either use the Anaconda Navigator or run one of the following commands in your terminal:
$ conda activate biotime-fasttext
or
$ source activate biotime-fasttext
Disclaimer: you can run `pip install fasttext` in your anaconda environment, but those bindings are outdated. Instead, I recommend the following:
0. First, activate your anaconda environment.
1. Check out the GitHub repository of fastText or a stable fork:
$ git clone https://github.com/komax/fastText
2. Install the Python bindings from within the fastText repository:
$ pip install .
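To sanity-check the installation, try importing the bindings in your Python shell (depending on the version of the bindings, the module may be importable as fasttext or as fastText, and the supervised training entry point may differ):
>>> import fasttext
>>> hasattr(fasttext, 'train_supervised')
True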
Create a symlink or copy your BioTIME data into the `biotime` directory.
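For example, assuming your local BioTIME copy lives at /path/to/BioTIME (a placeholder path):
$ ln -s /path/to/BioTIME biotime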
nltk requires additional data to be downloaded before it can tokenize sentences. Run this in your Python shell:
>>> import nltk
>>> nltk.download('punkt')
or run
$ python scripts/download-nltk-punkt.py
All configuration parameters are stored in the `Snakefile`. Change the parameters to suit your purposes.
Adjust `-j <num_cores>` in your snakemake calls to make use of multiple cores in parallel.
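For example, to run the complete workflow on four cores (the core count is only illustrative):
$ snakemake -j 4 all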
Normalize the input data for fastText:
$ snakemake normalize_fasttext
Create the data for cross validation, split the model parameters into blocks, and sort the model parameters by f1 scores on the training data:
$ snakemake sort_f1_scores
Select the best model (from the cross validation) and train it:
$ snakemake train_model
Test the trained model:
$ snakemake test_model
Alternatively, run the entire workflow at once:
$ snakemake
Snakemake can visualize the workflow using `dot`. Run the following to generate a PNG of the workflow:
$ snakemake --dag all | dot -Tpng > dag.png
Check out the `Snakefile` and adjust this section to configure the experimental setup (parameter selection, cross validation, parallelization):
KFOLD = 2
TEST_SIZE = 0.25
CHUNKS = 4
PARAMETER_SPACE = ModelParams(
dim=ParamRange(start=10, stop=100, num=2),
lr=ParamRange(start=0.1, stop=1.0, num=2),
wordNgrams=ParamRange(start=2, stop=5, num=2),
epoch=ParamRange(start=5, stop=50, num=2),
bucket=ParamRange(start=2_000_000, stop=10_000_000, num=2)
)
FIRST_N_SENTENCES = 1
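ModelParams and ParamRange are defined in this repository; as a rough, non-authoritative sketch of what such a configuration expands to, assume each ParamRange(start, stop, num) yields num evenly spaced values (like numpy.linspace) and that every combination of values is one candidate parameterization:
from itertools import product

import numpy as np

# Assumption: a ParamRange behaves like numpy.linspace(start, stop, num).
def expand(start, stop, num):
    return np.linspace(start, stop, num)

dims = expand(10, 100, 2)                    # [10., 100.]
lrs = expand(0.1, 1.0, 2)                    # [0.1, 1.0]
ngrams = expand(2, 5, 2)                     # [2., 5.]
epochs = expand(5, 50, 2)                    # [5., 50.]
buckets = expand(2_000_000, 10_000_000, 2)   # [2e6, 1e7]

# Every combination is one candidate fastText model (2^5 = 32 combinations here).
grid = list(product(dims, lrs, ngrams, epochs, buckets))
print(len(grid))  # 32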
The (sub)directory `data` contains intermediate data from the data transforms/selection, the chunking of the parameter space (`data/blocks`), and the subsampling for cross validation (`data/cv`).
`results` contains the parameterization for the experiments as well as the accuracy scores, measured as f1 scores on precision and recall:
- `results/blocks` contains all chunks (including the validation scores) as csvs,
- `results/params_scores.csv` is the concatenation of all blocks,
- `results/params_scores_sorted.csv` ranks the resulting scores by the `f1_cross_validation_micro` score on the cross validation sets per label. Then, we select the model with the smallest `f1_cross_validation_micro_ptp`, i.e., the smallest point-to-point distance (from the minimum to the maximum value).
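As a concrete illustration of that selection rule, a minimal pandas sketch could look like this (the column names follow results/params_scores_sorted.csv as described above; the exact tie-breaking in this repository may differ):
import pandas as pd

# Rank by micro-averaged cross-validation f1 (higher is better) and break ties
# by the smallest point-to-point spread, i.e., the most stable parameterization.
scores = pd.read_csv("results/params_scores_sorted.csv")
best = scores.sort_values(
    by=["f1_cross_validation_micro", "f1_cross_validation_micro_ptp"],
    ascending=[False, True],
).iloc[0]
print(best)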