Extracting reviews from Goodreads into Markdown pages

Published: Sun 07 May 2023

In Tech.

tags: python

In early 2020, it became really difficult to read books. I'm slowly starting to get past that at last, but while Goodreads was super fun for a few years, right now tracking and counting and rating just makes reading feel like a chore and Not Fun as a hobby.

I still would like to keep a record of what I read, though. And I used to write book reviews on this blog before getting onto Goodreads. It seems like a nice way to keep a record of what I read without it becoming a number thing. I'll probably still write Goodreads reviews for small authors since I know it can help, but I feel less pressed about having everything there.

I also wanted to save the reviews I wrote only on Goodreads here, so I wrote a script to migrate them into Pelican-friendly Markdown pages, since that's what powers this blog now. I decided to keep to a single entry per year rather than one entry per book, since I had a few good reading years in there (and others with only a single review!).

Step 1: CSV export of the books

First, if you go to 'My books' and find the 'Tools' menu at the bottom of the left-hand menu on Goodreads, you'll find a page to export your library. You may have to try a couple of times: my first export only had a handful of books, and the second one looked more comprehensive, although the number of books was off by two, but what can you do.
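If you want to check how far off an export is before going further, a quick count per shelf is enough to compare against the numbers Goodreads shows. This is only a rough sketch: the helper names are made up for this post, and it assumes the export has an 'Exclusive Shelf' column (read, to-read, and so on) and sits under the default file name.

```python
import csv
from collections import Counter


def count_by_shelf(rows):
    """Count exported rows per shelf, to compare with the Goodreads totals."""
    # Assumption: 'Exclusive Shelf' is the column holding read / to-read / etc.
    # Rows where it is missing or empty are counted as 'unknown'.
    return Counter(row.get('Exclusive Shelf') or 'unknown' for row in rows)


def count_export(path='goodreads_library_export.csv'):
    # newline='' lets the csv module handle line breaks inside quoted reviews.
    with open(path, newline='') as csvfile:
        return count_by_shelf(csv.DictReader(csvfile))
```

Calling `print(count_export())` next to the downloaded file gives one number per shelf, which makes a truncated export easy to spot.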

Step 2: Python script to create the Markdown pages

This is the script I wrote to extract only the books I've actually read. I don't really care about DNF (did not finish) and to-read right now. I also hardcoded the years relevant to me. May someone find something helpful in here!

import csv
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Review:
    title: str
    author: str
    date_read: str  # kept as the raw 'Date Read' string from the CSV
    review: str
    rating: int

def get_reviews(year):
    reviews = []

    # newline='' lets the csv module handle line breaks inside quoted reviews
    with open('goodreads_library_export.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)

        for row in reader:
            try:
                r = Review(row['Title'],
                           row['Author'],
                           row['Date Read'],
                           row['My Review'],
                           int(row['My Rating']))
            except ValueError:
                # When the int() cast fails, usually it means the CSV is
                # corrupted for that line. In my case, it was for a few to-read
                # records so I ignore them rather than attempt to fix the
                # original CSV. You can print the row here if you want to check
                # what's failing.
                continue

            # Keep the review if no year was requested or the year matches
            if year is None or year in r.date_read:
                reviews.append(r)

    return reviews

def rating_or_review(review):
    # If I wrote a review, return that
    if review.review:
        return review.review

    # Otherwise, make the rating into words.
    if review.rating >= 4:
        return "I really enjoyed it."
    elif review.rating == 3:
        return "It was fine."
    else:
        return "Wasn't for me."

def format_reviews(reviews, year):
    with open(f'book-reviews-{year}.md', 'w') as f:
        f.write(f"Title: Book reviews: Year {year}\n")
        f.write(f"Date: {datetime.now().isoformat()}\n")
        f.write("tags: book review\n\n")

        for r in reviews:
            # A couple of abandoned books sneaked in with a '0' rating, and I'm
            # not interested in preserving those
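            if r.rating != 0:
                f.write(f"## {r.title} by {r.author}\n\n")
                # Body of the entry: the written review, or the rating in words
                f.write(f"{rating_or_review(r)}\n\n")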

for year in range(2013, 2023):
    reviews = get_reviews(str(year))
    # Chronological order
    reviews = sorted(reviews, key=lambda r: r.date_read)
    format_reviews(reviews, year)

The hardest part was probably deciding what text to convert a rating into, since I didn't want to keep numbers!

Step 3: Checking the output looks right and recalling fond memories

I used the 'Year in Books' pages on Goodreads to compare the results. There was some funkiness sometimes, like a book read in 2011 showing up in 'Year in Books' but without any shelves and with a date read showing as 2020, even though I don't remember messing with it. The review also shows as Jan 2020 on the Goodreads UI despite appearing in the correct 'Year in Books'. 2020 turned out to be date_added (which is definitely wrong) while the date_read field is empty. Maybe some data migration funkiness on the Goodreads side at some point during the last 12 years. Otherwise, there was one duplicate, and a couple of intra-Goodreads links that didn't work.
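Records like that 2011 book can be found without clicking through every year. The sketch below is not part of the script above, just one way to do it: it lists rated books whose 'Date Read' is empty, along with their 'Date Added' for reference. The function name is hypothetical, and it assumes the same column names the script relies on plus a 'Date Added' column.

```python
import csv


def missing_date_read(rows):
    """Return (title, date_added) for rated rows whose 'Date Read' is empty."""
    suspects = []
    for row in rows:
        rating = (row.get('My Rating') or '').strip()
        # Skip unrated (0) rows and rows where the rating is not a number.
        has_rating = rating.isdigit() and int(rating) > 0
        if has_rating and not (row.get('Date Read') or '').strip():
            suspects.append((row.get('Title'), row.get('Date Added')))
    return suspects


def check_export(path='goodreads_library_export.csv'):
    # newline='' lets the csv module handle line breaks inside quoted reviews.
    with open(path, newline='') as csvfile:
        return missing_date_read(csv.DictReader(csvfile))
```

Anything `check_export()` returns is a book the main script silently leaves out, since an empty date never matches a year, so those are the entries to fix by hand.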

I still have to clean up the file for 2022. I was getting annoyed with tracking myself so I didn't write reviews, but if it's for the blog I wouldn't mind adding a few notes. And I need to decide if I want to post my 2023 reviews as I go, or batch them in some way!