2022–2023 · Complete · Python · Scrapy · NLP · LDA

Zhihu College-Topics Scraper

Distributed scraper for 500+ college-related Zhihu topics (1M+ comments and questions), with sentiment analysis and topic modeling. Built under Berkeley's Undergraduate Research Apprentice Program (URAP).

01

Overview

A URAP side-stream investigating how Chinese college-age discourse on Zhihu shifts over time. The research needed a large dataset on a tight timeline, and existing scraping tools couldn't keep up, so I built a custom one.

02

Process

  1. Scraping architecture

    Combined Scrapy (for structured pages), BeautifulSoup (for HTML parsing), and Selenium (for JS-rendered question feeds). Added custom request rotation and cookie management to survive rate limiting.

  2. Performance engineering

    Multi-threaded producers feeding asynchronous processing. This cut wall-clock extraction time by 50% versus the naive sequential version, so the 1000+ hours of scraped content was processed in roughly half the time.

  3. NLP analysis

    Ran sentiment analysis and LDA-based topic modeling on the corpus, clustering by theme and tracking sentiment by topic over time.
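
The fetch side of steps 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `USER_AGENTS`, `RateLimited`, `fetch_with_rotation`, `producer`, and `run_pipeline` are hypothetical names, the `fetch` callable stands in for whatever HTTP client was really used, cookie management is left out, and "asynchronous processing" is approximated by parsing pages from a queue while worker threads are still fetching.

```python
import queue
import random
import threading
import time

# Hypothetical identity pool; the project's real rotation policy is not documented here.
USER_AGENTS = ["agent-a/1.0", "agent-b/1.0", "agent-c/1.0"]
_DONE = object()  # sentinel a producer enqueues when it has no more work


class RateLimited(Exception):
    """Raised by a fetch callable when the server throttles us (e.g. HTTP 429)."""


def fetch_with_rotation(url, fetch, max_attempts=4, base_delay=1.0):
    """Call fetch(url, headers), switching User-Agent and backing off when throttled."""
    for attempt in range(max_attempts):
        headers = {"User-Agent": USER_AGENTS[attempt % len(USER_AGENTS)]}
        try:
            return fetch(url, headers)
        except RateLimited:
            # Exponential backoff plus jitter before the next attempt.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RateLimited(f"gave up on {url} after {max_attempts} attempts")


def producer(urls, fetch, pages, base_delay):
    """Worker thread: drain the URL queue and enqueue (url, raw_page) pairs."""
    try:
        while True:
            try:
                url = urls.get_nowait()
            except queue.Empty:
                return
            try:
                pages.put((url, fetch_with_rotation(url, fetch, base_delay=base_delay)))
            except RateLimited:
                pages.put((url, None))  # record the failure and move on
    finally:
        pages.put(_DONE)  # always signal completion, even after an error


def run_pipeline(url_list, fetch, parse, n_workers=4, base_delay=1.0):
    """Fetch on n_workers threads while the calling thread parses pages as they arrive."""
    urls, pages = queue.Queue(), queue.Queue()
    for url in url_list:
        urls.put(url)
    workers = [
        threading.Thread(target=producer, args=(urls, fetch, pages, base_delay))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    results, finished = {}, 0
    while finished < n_workers:
        item = pages.get()
        if item is _DONE:
            finished += 1
            continue
        url, raw = item
        results[url] = parse(raw) if raw is not None else None
    for worker in workers:
        worker.join()
    return results
```

Calling `run_pipeline(urls, fetch, parse)` returns a dict mapping each URL to its parsed page, or to `None` when every retry was throttled. Keeping failures in the result, rather than letting a worker die silently, is one way to make rate-limit edge cases visible.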

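The "sentiment by topic over time" tracking from step 3 might reduce to an aggregation like the one below. This is only a sketch: it assumes each record already carries a time period, a topic label (for instance the dominant LDA topic), and a sentiment score produced upstream, and `sentiment_by_topic_over_time` is a hypothetical helper rather than the project's code.

```python
from collections import defaultdict
from statistics import mean


def sentiment_by_topic_over_time(records):
    """Group (period, topic, score) records into per-topic time series.

    Returns {topic: [(period, mean_score), ...]} with periods in sorted order.
    """
    buckets = defaultdict(list)
    for period, topic, score in records:
        buckets[(topic, period)].append(score)

    series = defaultdict(list)
    # Sorting the (topic, period) keys yields chronological order within each topic,
    # assuming periods sort correctly as strings (e.g. "2022-01").
    for (topic, period), scores in sorted(buckets.items()):
        series[topic].append((period, mean(scores)))
    return dict(series)
```

Each topic then has one averaged score per period, which is the shape needed to plot or compare trends across topics.
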
03

Result

1M+ comments and questions across 500+ topics, cleanly structured and ready for downstream research. The scaling-and-scheduling problem was more intellectually alive than I expected — concurrency bugs and rate-limit edge cases taught me more systems thinking than any lecture.

By the numbers

500+

Topics covered

1M+

Comments extracted

50%

Cut in wall-clock extraction time
