Tracking Books On Reddit
I rewrote my rust program in python for performance reasons, and so it would be more maintainable
When I was in college, I had a really good idea for an app. I would track everything people said on Twitter, figure out which products they were talking about, create a website that ranked those products over time, and then add affiliate links so I could make a million dollars and not have to work a real job.
There were a few hard problems I ran into back then that stopped me from shipping it:
Consuming Twitter data is really annoying.
Telling which tweets are about products is hard.
I barely even knew how to program.
I still think it's a pretty good idea, so I did a little hacking on it over Christmas break.
All of these are still problems today, but I think I can work around them. Twitter data is still annoying to consume, so I decided to use Reddit data instead, which is slightly easier to scrape. To start, I decided to scrape book-related subreddits. I can use some off-the-shelf machine learning model to answer, "Is this post about a book, and does the person like the book or not?" And as far as being bad at programming goes, I just don't let that stop me anymore.
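I haven't picked a model yet, but as a sketch of what "off-the-shelf" could mean here: a zero-shot classifier from the transformers library can answer both questions with plain-English labels. The model and the candidate labels below are placeholders I made up for illustration, not a final choice.

```python
# A sketch of the classification step, not the real thing. The model name
# and candidate labels are placeholders picked for illustration.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_post(text: str) -> dict:
    # Question 1: is this post actually about a book?
    topic = classifier(text, candidate_labels=["about a book", "not about a book"])
    # Question 2: does the poster like the book or not?
    opinion = classifier(text, candidate_labels=["likes the book", "dislikes the book"])
    return {
        "about_a_book": topic["labels"][0] == "about a book",
        "likes_it": opinion["labels"][0] == "likes the book",
    }

print(classify_post("Just finished Project Hail Mary and I could not put it down."))
```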
Reddit has some APIs to get data, but I don't feel like doing the work to obey rate limits. I just found a tool that already does it, the Bulk Downloader for Reddit (bdfr). I can point it at a subreddit and it will pull a bunch of posts and comments. Perfect.
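For reference, the invocation is something like the line below. I'm going from memory on the exact subcommand and flags, so treat it as a guess and check `bdfr --help`; the archive subcommand is the one that writes posts and comments out as JSON instead of downloading media.

```bash
# Pull recent posts (and their comment trees) from r/books as JSON.
# The subreddit, sort, and limit are just example values.
bdfr archive ./data --subreddit books --sort new --limit 500
```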
After running bdfr, I get a directory full of JSON files, one for each post. I wrote a rust program to parse the files and store the posts and comments in a DuckDB database.
This is where I hit a problem. Some posts with lots and lots of comments (each having replies that are also comments) produced some extremely nested JSON objects. They were so nested that serde (the rust library I was using to parse the JSON) was erroring out after hitting its recursion limit. Also, I was using the DuckDB crate, which compiled all of DuckDB into my program, and that takes forever when building from scratch.
Because of these errors, I did something I never thought I'd ever do. I rewrote my rust program in python for performance and maintainability.
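One way the python version can sidestep the recursion problem is to walk each post's comment tree with an explicit stack instead of recursing through replies. Here is roughly the shape of that loader. The JSON field names ("comments", "replies", and so on) are my guesses at what bdfr writes, not the actual schema, and the tables are simplified.

```python
# Rough shape of the python loader: walk the JSON files bdfr produced,
# flatten each post's comment tree with an explicit stack (so deep reply
# chains can't hit a recursion limit), and store everything in DuckDB.
# The JSON field names are assumptions about bdfr's output, not the real schema.
import json
from pathlib import Path

import duckdb

con = duckdb.connect("books.duckdb")
con.execute("""CREATE TABLE IF NOT EXISTS posts (
    id TEXT, subreddit TEXT, title TEXT, selftext TEXT, score INTEGER)""")
con.execute("""CREATE TABLE IF NOT EXISTS comments (
    id TEXT, post_id TEXT, parent_id TEXT, body TEXT, score INTEGER)""")

def flatten_comments(post: dict) -> list[tuple]:
    """Depth-first walk of the nested comment tree, no recursion."""
    rows = []
    stack = [(c, None) for c in post.get("comments", [])]
    while stack:
        comment, parent_id = stack.pop()
        rows.append((comment.get("id"), post.get("id"), parent_id,
                     comment.get("body"), comment.get("score")))
        # Push replies so they get visited on later loop iterations.
        for reply in comment.get("replies", []):
            stack.append((reply, comment.get("id")))
    return rows

for path in Path("data").rglob("*.json"):
    post = json.loads(path.read_text())
    con.execute("INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
                [post.get("id"), post.get("subreddit"), post.get("title"),
                 post.get("selftext"), post.get("score")])
    comment_rows = flatten_comments(post)
    if comment_rows:
        con.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)", comment_rows)
```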
The last step in collecting the data is just to run the program often. bdfr can only scrape the "new" feed, the "top" feed, or the "hot" feed, so I need to check regularly to get all the posts and comments. I spun up a GitHub Actions workflow to do it for me. I think this will work for at least a week or so, until the data set is too large to even use git lfs. Then I guess I'll have to put the data in S3 or something.
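A sketch of the shape of that workflow, with the schedule, file names, and loader script name all made up (the real one also has to deal with git lfs when committing the data back):

```yaml
# .github/workflows/scrape.yml -- a rough sketch, not the actual workflow.
# The cron schedule, subreddit, and script names are placeholders.
name: scrape-book-subreddits
on:
  schedule:
    - cron: "0 */6 * * *"   # every six hours
  workflow_dispatch:          # allow manual runs too

permissions:
  contents: write             # needed to push the scraped data back

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install bdfr duckdb
      - run: bdfr archive ./data --subreddit books --sort new --limit 500
      - run: python load_posts.py   # hypothetical name for the loader script above
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data books.duckdb
          git commit -m "scheduled scrape" || echo "nothing new to commit"
          git push
```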