As an aside, StarSpace trained really quickly: with 28k lists and about 300k stories, it took under 10 minutes total to train and generate embeddings. I'm impressed!
pulled from https://github.com/Archive-WP/WattpadRecommendations
Steps
- Clone StarSpace into the project directory and cd into it:
git clone https://github.com/facebookresearch/Starspace.git
cd Starspace
- Download and extract Boost:
curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz && tar -xzvf boost_1_82_0.tar.gz
- Build StarSpace and embed_doc against the extracted Boost:
make -e BOOST_DIR=boost_1_82_0 && \
make embed_doc -e BOOST_DIR=boost_1_82_0
- Train StarSpace:
./starspace train \
-trainFile ../src/sids.txt \
-model wpaRecModel \
-label '' \
-trainMode 1 \
-epoch 25 \
-dim 100
sids.txt is in the following format:
6501 7212 17445 20412 36197 ...
23153 38792 47922 73234 91986 ...
84307 87217 89794 105872 ...
Each line represents a List on Wattpad and contains the space-separated IDs of the stories in that list. The file is generated using test.ipynb.
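As a rough illustration of the format described above, here is a minimal sketch of how such a file could be produced. It is not the code from test.ipynb: the `lists` mapping (list name to story IDs), the function names, and the choice to skip lists with fewer than two stories are all assumptions made for this example.

```python
from typing import Dict, Iterable, List


def format_sids(lists: Dict[str, Iterable[int]]) -> List[str]:
    """Return one line per list: its story IDs, deduplicated, sorted, space-separated."""
    lines = []
    for story_ids in lists.values():
        unique_ids = sorted(set(story_ids))
        # Assumption: a list with a single story gives the model no
        # co-occurrence signal, so it is skipped here.
        if len(unique_ids) >= 2:
            lines.append(" ".join(str(sid) for sid in unique_ids))
    return lines


def write_sids(lists: Dict[str, Iterable[int]], path: str = "sids.txt") -> None:
    """Write the formatted lines to `path`, one list per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(format_sids(lists)) + "\n")
```

Calling `write_sids(lists, "src/sids.txt")` would then write the training file; the path is only an example.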
- Build embed_doc:
make embed_doc && chmod +x ./embed_doc
- Generate a file with one story ID per line (script in test.ipynb):
6501
7212
17445
...
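One plausible way to build this file is to collect every distinct ID that appears in sids.txt. The sketch below assumes exactly that; the actual script in test.ipynb may work differently, and `unique_story_ids` is a hypothetical helper name.

```python
from typing import Iterable, List


def unique_story_ids(sids_lines: Iterable[str]) -> List[str]:
    """Collect every distinct story ID from sids.txt-style lines, in first-seen order."""
    seen = set()
    ordered = []
    for line in sids_lines:
        for sid in line.split():
            if sid not in seen:
                seen.add(sid)
                ordered.append(sid)
    return ordered
```

Writing `"\n".join(unique_story_ids(open("sids.txt")))` to story_ids.txt would give one ID per line.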
- Pass the file to embed_doc:
./embed_doc wpaRecModel ../src/story_ids.txt > embeddings
- Move the generated embeddings file to the src directory.
- The next step requires a Qdrant instance. Make sure you have one running with the gRPC port forwarded.
- Run the parsing script in test.ipynb.
- The embeddings have been synced to Qdrant!
- The next step requires a Redis instance. Make sure you have one running.
- You can now serve the embeddings using the Discord bot:
python3 main.py
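For the parsing step that precedes the Qdrant sync, the embeddings file has to be turned back into (story ID, vector) pairs. The sketch below is not the script from test.ipynb. It assumes that the embed_doc output echoes each input line (the story ID) and follows it with a line of space-separated floats, and that any other lines, such as startup messages, can be ignored. Check your own embeddings file before relying on this. `parse_embeddings` is a hypothetical name, and `dim` should match the `-dim` value used for training.

```python
from typing import Dict, Iterable, List


def parse_embeddings(lines: Iterable[str], dim: int = 100) -> Dict[int, List[float]]:
    """Map story ID -> vector, assuming each ID line is followed by a line of `dim` floats."""
    vectors: Dict[int, List[float]] = {}
    pending_id = None
    for raw in lines:
        line = raw.strip()
        if pending_id is not None:
            parts = line.split()
            if len(parts) == dim:
                try:
                    vectors[pending_id] = [float(p) for p in parts]
                    pending_id = None
                    continue
                except ValueError:
                    pass  # not a vector line after all
            pending_id = None
        # A line made only of digits is treated as a story ID.
        if line.isdigit():
            pending_id = int(line)
    return vectors
```

The resulting dictionary can then be batched and uploaded with whichever Qdrant client you use, with the story ID as the point ID.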