Starspace setup

2024-12-01

As an aside, Starspace trained really quickly: with 28k lists and about 300k stories, it took under 10 minutes in total to train and generate embeddings. I'm impressed!

pulled from https://github.com/Archive-WP/WattpadRecommendations

Steps

Clone Starspace to the project directory and cd into it (git clone https://github.com/facebookresearch/Starspace.git && cd Starspace)
Download and extract boost (curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz && tar -xzvf boost_1_82_0.tar.gz)
Build Starspace and embed_doc against the extracted boost (make -e BOOST_DIR=boost_1_82_0 && make embed_doc -e BOOST_DIR=boost_1_82_0)
Train starspace

./starspace train \
    -trainFile ../src/sids.txt \
    -model wpaRecModel \
    -label '' \
    -trainMode 1 \
    -epoch 25 \
    -dim 100

sids.txt is in the following format:

6501 7212 17445 20412 36197 ...
23153 38792 47922 73234 91986 ...
84307 87217 89794 105872 ...

Each line represents a List on Wattpad and contains the space-separated IDs of the stories in that list. Generated using test.ipynb.
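The actual generation code lives in test.ipynb and is not reproduced here. As a rough sketch of the file format only, the snippet below writes one line per list with space-separated story IDs. The function name `write_sids`, the sample data, and the temporary output path are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_sids(lists, path):
    """Write one line per list: the list's story IDs joined by single spaces.

    `lists` is any iterable of iterables of story IDs (hypothetical input
    shape; the real notebook may load this differently).
    """
    lines = []
    for story_ids in lists:
        ids = [str(sid) for sid in story_ids]
        if not ids:
            continue  # an empty list contributes no training signal
        lines.append(" ".join(ids))
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


# Made-up sample: three lists, one of them empty.
sample_lists = [[6501, 7212, 17445], [], [23153, 38792]]
sids_path = Path(tempfile.mkdtemp()) / "sids.txt"
written = write_sids(sample_lists, sids_path)
print(written, "lists written to", sids_path)
```

Skipping empty lists is a choice made for this sketch, not something the original notebook is known to do.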

Build embed_doc (make embed_doc && chmod +x ./embed_doc)
Generate a file with one story id per line (script in test.ipynb)
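The real script for this is in test.ipynb. One possible way to produce such a file, assuming it is derived from sids.txt, is to collect every distinct story ID in first-seen order. The names `write_story_ids` and the sample contents below are hypothetical.

```python
import tempfile
from pathlib import Path


def write_story_ids(sids_path, out_path):
    """Collect the distinct story IDs in sids.txt and write one per line."""
    seen = {}  # dict keys keep insertion order, so IDs stay in first-seen order
    with open(sids_path, encoding="utf-8") as f:
        for line in f:
            for sid in line.split():
                seen.setdefault(sid, None)
    ids = list(seen)
    Path(out_path).write_text("".join(sid + "\n" for sid in ids), encoding="utf-8")
    return ids


# Made-up sample input in the sids.txt format.
work_dir = Path(tempfile.mkdtemp())
example_sids = work_dir / "sids.txt"
example_sids.write_text("6501 7212\n7212 23153\n", encoding="utf-8")

story_ids = write_story_ids(example_sids, work_dir / "story_ids.txt")
print(story_ids)
```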

Pass to embed_doc (./embed_doc wpaRecModel ../src/story_ids.txt > embeddings)
Move generated embeddings file to src directory
The next step requires a Qdrant instance. Please make sure you have one running with the gRPC port forwarded.
Run parsing script in test.ipynb
The embeddings have been synced to Qdrant!
You need a Redis instance for the next step. Please make sure you have one running.
You can now serve the embeddings using the Discord bot! (python3 main.py)
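The parsing script itself is in test.ipynb. As a minimal sketch of the reading half only, the snippet below assumes the embeddings file alternates between a story ID line and a line of space-separated floats, with possible log lines mixed in. That layout is an assumption, so check it against your own embed_doc output. The function name `parse_embeddings` and the sample lines are made up, and the upload to Qdrant is not shown.

```python
def parse_embeddings(lines, story_ids):
    """Map each known story ID to the vector on the line that follows it.

    Lines that are not a known story ID (for example log output) are skipped.
    """
    wanted = set(story_ids)
    vectors = {}
    it = iter(lines)
    for line in it:
        key = line.strip()
        if key not in wanted:
            continue
        vec_line = next(it, "")  # the vector is assumed to be on the next line
        vectors[key] = [float(x) for x in vec_line.split()]
    return vectors


# Made-up sample output with 3-dimensional vectors and one stray log line.
sample_output = [
    "some log line printed before the vectors\n",
    "6501\n",
    "0.1 -0.2 0.3 \n",
    "7212\n",
    "0.0 0.5 -1.0 \n",
]
vectors = parse_embeddings(sample_output, ["6501", "7212"])
print(len(vectors), "vectors parsed")
```

Matching on known IDs rather than on line position keeps the sketch tolerant of extra lines at the start of the file.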
