2024-12-01

Tangentially, StarSpace trained really quickly: with 28k lists and about 300k stories, it took under 10 minutes total to train and generate embeddings. I'm impressed!


pulled from https://github.com/Archive-WP/WattpadRecommendations

Steps

  1. Clone StarSpace into the project directory and cd into it (git clone https://github.com/facebookresearch/Starspace.git && cd Starspace)
  2. Download and extract Boost (curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz && tar -xzvf boost_1_82_0.tar.gz)
  3. Build StarSpace and embed_doc against the extracted Boost (make -e BOOST_DIR=boost_1_82_0 && make embed_doc -e BOOST_DIR=boost_1_82_0)
  4. Train StarSpace
./starspace train \
    -trainFile ../src/sids.txt \
    -model wpaRecModel \
    -label '' \
    -trainMode 1 \
    -epoch 25 \
    -dim 100

sids.txt has the following format:

6501 7212 17445 20412 36197 ...
23153 38792 47922 73234 91986 ...
84307 87217 89794 105872 ...

Each line represents a List on Wattpad and contains the space-separated IDs of the stories in that List. The file is generated using test.ipynb.
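As a rough illustration of the sids.txt format described above, here is a minimal Python sketch that turns a mapping of Lists to story IDs into one line per List. This is not the code from test.ipynb; the function name `format_sids_lines`, the example data, and the choice to drop repeated IDs and Lists with fewer than two stories are all assumptions made for the example.

```python
def format_sids_lines(lists):
    """Turn {list_id: [story_id, ...]} into sids.txt-style lines.

    Assumption (not from the original notebook): repeated IDs inside a List
    are dropped, and Lists with fewer than two stories are skipped because
    they leave nothing to pair a story with.
    """
    lines = []
    for story_ids in lists.values():
        # dict.fromkeys removes duplicates while keeping first-seen order.
        unique = list(dict.fromkeys(str(s) for s in story_ids))
        if len(unique) < 2:
            continue
        lines.append(" ".join(unique))
    return lines


# Hypothetical example data, not real Wattpad Lists.
example_lists = {
    "list_a": [6501, 7212, 17445],
    "list_b": [23153, 38792, 23153],
    "list_c": [84307],
}

sids_lines = format_sids_lines(example_lists)
for line in sids_lines:
    print(line)
```

Writing `"\n".join(sids_lines)` to ../src/sids.txt would then give a file in the shape shown above.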

  5. Build embed_doc if step 3 did not already produce it (make embed_doc && chmod +x ./embed_doc)
  6. Generate a file with one story ID per line (script in test.ipynb)
6501
7212
17445
...
  7. Pass the file to embed_doc (./embed_doc wpaRecModel ../src/story_ids.txt > embeddings)
  8. Move the generated embeddings file to the src directory
  9. The next step requires a Qdrant instance. Make sure you have one running with the gRPC port forwarded.
  10. Run the parsing script in test.ipynb
  11. The embeddings have been synced to Qdrant!
  12. The next step requires a Redis instance. Make sure you have one running.
  13. You can now serve the embeddings using the Discord bot! (python3 main.py)
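
The story ID file and the parsing step above can be sketched roughly as follows. This is not the script from test.ipynb, and it rests on several assumptions: that the embeddings file holds one line of whitespace-separated floats per input story ID, in the same order as story_ids.txt; that any other line can be skipped; and that the Qdrant collection (hypothetically named "stories" here) already exists with a matching vector size. All function names are made up for the example, and the upload helper needs the qdrant-client package.

```python
def unique_story_ids(sids_lines):
    """Collect each story ID once, in first-seen order, from sids.txt lines."""
    seen = {}
    for line in sids_lines:
        for sid in line.split():
            seen.setdefault(sid, None)
    return list(seen)


def parse_embeddings(text, dim=100):
    """Keep only lines made of exactly `dim` floats (assumed output layout)."""
    vectors = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != dim:
            continue
        try:
            vectors.append([float(p) for p in parts])
        except ValueError:
            continue  # a non-numeric line that happens to have `dim` tokens
    return vectors


def pair_ids_with_vectors(story_ids, vectors):
    """Pair IDs and vectors by position; refuse to guess on a mismatch."""
    if len(story_ids) != len(vectors):
        raise ValueError(f"{len(story_ids)} ids but {len(vectors)} vectors")
    return list(zip(story_ids, vectors))


def upload_to_qdrant(pairs, collection="stories", host="localhost",
                     grpc_port=6334, batch_size=1000):
    """Upsert (id, vector) pairs in batches. Collection name is hypothetical."""
    from qdrant_client import QdrantClient
    from qdrant_client.models import PointStruct

    client = QdrantClient(host=host, grpc_port=grpc_port, prefer_grpc=True)
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        points = [PointStruct(id=int(sid), vector=vec) for sid, vec in batch]
        client.upsert(collection_name=collection, points=points)


# Tiny demo with made-up 3-dimensional vectors and a made-up non-numeric line.
demo_ids = unique_story_ids(["6501 7212", "7212 17445"])
demo_output = "some non-numeric line\n0.1 0.2 0.3\n0.4 0.5 0.6\n0.7 0.8 0.9\n"
demo_pairs = pair_ids_with_vectors(demo_ids, parse_embeddings(demo_output, dim=3))
print(demo_pairs[0])
```

Pairing by position is only safe if the order assumption holds, which is why the sketch raises an error when the counts differ instead of silently truncating.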