Zhihu College-Topics Scraper
Distributed scraper covering 500+ college-related Zhihu topics (1M+ comments and questions), with sentiment analysis and topic modeling on the results. Built under Berkeley's Undergraduate Research Apprentice Program (URAP).
Overview
A URAP side project investigating how discourse among college-age Chinese users on Zhihu shifts over time. The study needed a large dataset on a tight timeline, and existing scraping tools couldn't keep up, so I built a custom one.
Process
- 01
Scraping architecture
Combined Scrapy for structured pages, BeautifulSoup for HTML parsing, and Selenium for JavaScript-rendered question feeds. Custom request rotation and cookie management kept the crawl running under rate limiting.
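A minimal sketch of how request rotation with per-identity cookies can work, not the project's actual code. The names `Identity`, `IdentityPool`, `backoff_delay`, and `fetch_with_rotation` are hypothetical, the `fetch` callable stands in for whatever HTTP client is used, and treating 403 and 429 as rate-limit signals is an assumption.

```python
import itertools
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Identity:
    """One browser-like identity: a User-Agent plus its own cookies (hypothetical)."""
    user_agent: str
    cookies: Dict[str, str] = field(default_factory=dict)

    def headers(self) -> Dict[str, str]:
        # Build the request headers this identity would send.
        headers = {"User-Agent": self.user_agent}
        if self.cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
        return headers


class IdentityPool:
    """Hands out identities round-robin so no single one absorbs all traffic."""

    def __init__(self, identities: List[Identity]):
        if not identities:
            raise ValueError("need at least one identity")
        self._cycle = itertools.cycle(identities)

    def next(self) -> Identity:
        return next(self._cycle)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


# fetch(url, headers) -> (status_code, body); injected so any client can be used.
Fetch = Callable[[str, Dict[str, str]], Tuple[int, str]]


def fetch_with_rotation(
    url: str,
    pool: IdentityPool,
    fetch: Fetch,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the URL with a fresh identity each attempt, backing off when throttled."""
    for attempt in range(max_attempts):
        identity = pool.next()
        status, body = fetch(url, identity.headers())
        if status == 200:
            return body
        if status in (403, 429):  # assumed rate-limit signals
            sleep(backoff_delay(attempt))
            continue
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Injecting `fetch` and `sleep` keeps the retry logic testable without network access; in a Scrapy project the same idea would more likely live in a downloader middleware.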
- 02
Performance engineering
Multi-threaded producers feed an asynchronous processing stage. This cut wall-clock extraction time by 50% (a 2× speedup) versus the naive sequential version, halving the time needed to process 1000+ hours of scraped content.
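One way to pair threaded producers with asynchronous processing is sketched below, using only the standard library. It illustrates the pattern and is not the project's pipeline: `fetch_page` and `parse_page` are hypothetical stand-ins for the real fetch and parse steps, and the thread, consumer, and queue sizes are arbitrary.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def fetch_page(url: str) -> str:
    """Hypothetical blocking fetch; stands in for a real HTTP or Selenium call."""
    return f"<html>{url}</html>"


async def parse_page(html: str) -> int:
    """Hypothetical async processing step; here it only measures the page."""
    await asyncio.sleep(0)  # yield to the event loop, as real async I/O would
    return len(html)


async def run_pipeline(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_page,
    n_threads: int = 8,
    n_consumers: int = 4,
    queue_size: int = 100,
) -> List[int]:
    loop = asyncio.get_running_loop()
    # Bounded queue: producers wait when consumers fall behind (backpressure).
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    results: List[int] = []

    async def producer(url: str, pool: ThreadPoolExecutor) -> None:
        # Run the blocking fetch in a worker thread, then hand off the page.
        html = await loop.run_in_executor(pool, fetch, url)
        await queue.put(html)

    async def consumer() -> None:
        while True:
            html = await queue.get()
            try:
                results.append(await parse_page(html))
            finally:
                queue.task_done()

    consumers = [asyncio.create_task(consumer()) for _ in range(n_consumers)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        await asyncio.gather(*(producer(u, pool) for u in urls))
    await queue.join()  # wait until every queued page has been processed
    for task in consumers:
        task.cancel()
    await asyncio.gather(*consumers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_pipeline(urls))` returns one result per URL, in completion order rather than input order. The sketch omits error handling: a consumer that raises stops consuming, so a production version would catch and log failures per page.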
- 03
NLP analysis
Ran sentiment analysis and LDA-based topic modeling on the corpus, clustered the content by theme, and tracked sentiment per topic over time.
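The sentiment-by-topic tracking step can be sketched as a simple aggregation. This assumes each post has already been given a dominant topic id (for example from an LDA model) and a sentiment score; the `ScoredPost` record, the monthly bucketing, and the score range are illustrative assumptions, not details from the project.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, NamedTuple


class ScoredPost(NamedTuple):
    """Hypothetical record: one comment or question after NLP scoring."""
    month: str        # time bucket, e.g. "2021-03"
    topic: int        # dominant topic id assigned by the topic model
    sentiment: float  # assumed to lie in [-1, 1]


def sentiment_by_topic_over_time(
    posts: Iterable[ScoredPost],
) -> Dict[int, Dict[str, float]]:
    """Return {topic: {month: mean sentiment}}, with months in sorted order."""
    buckets = defaultdict(list)
    for post in posts:
        buckets[(post.topic, post.month)].append(post.sentiment)

    series: Dict[int, Dict[str, float]] = defaultdict(dict)
    for topic, month in sorted(buckets):
        series[topic][month] = mean(buckets[(topic, month)])
    return dict(series)
```

Each inner dictionary is a time series that can be plotted directly, one line per topic.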
Result
1M+ comments and questions across 500+ topics, cleanly structured and ready for downstream research. The scaling-and-scheduling problem was more intellectually alive than I expected — concurrency bugs and rate-limit edge cases taught me more systems thinking than any lecture.
By the numbers
500+
Topics covered
1M+
Comments and questions extracted
2×
Speedup