2022 – 2023 · Complete · Python · Scrapy · NLP · LDA

Zhihu College-Topics Scraper

Distributed scraper for 500+ college-related Zhihu topics — 1M+ comments and questions — with sentiment analysis and topic modeling. Built under Berkeley URAP.

Source

Overview

A side project under Berkeley's Undergraduate Research Apprentice Program (URAP), investigating how discourse among college-age Chinese users on Zhihu shifts over time. The research needed a large dataset on a tight timeline, and existing scraping tools couldn't keep up, so I built a custom one.

Process

01
Scraping architecture
Combined Scrapy (for structured pages), BeautifulSoup (for parsing), and Selenium (for JS-rendered question feeds). Custom request rotation + cookie management to survive rate limiting.
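A minimal sketch of what request rotation with per-identity cookie management can look like, using only the standard library. It is an illustration under assumptions, not the project's actual code: the `USER_AGENTS` pool, the `RotatingSessions` class, and the `backoff_delay` helper are all hypothetical names, and the real scraper ran on Scrapy and Selenium rather than `urllib`.

```python
import itertools
import random
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Hypothetical pool of User-Agent strings; a real pool would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/3.0",
]


class RotatingSessions:
    """Round-robin over several identities, each with its own cookie jar."""

    def __init__(self, user_agents):
        self._sessions = []
        for ua in user_agents:
            jar = CookieJar()  # cookies set for this identity stay with it
            opener = build_opener(HTTPCookieProcessor(jar))
            opener.addheaders = [("User-Agent", ua)]
            self._sessions.append((ua, jar, opener))
        self._cycle = itertools.cycle(self._sessions)

    def next_session(self):
        """Return the next (user_agent, cookie_jar, opener) triple."""
        return next(self._cycle)


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return min(cap, base * 2 ** attempt) * rng()
```

In use, each request would take `next_session()` and call `opener.open(url)`. On a rate-limit response it would sleep for `backoff_delay(attempt)` before retrying with a different identity.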
02
Performance engineering
Multi-threaded producers + asynchronous processing. Cut wall-clock extraction time by 50% (a 2× speedup) versus the naive sequential version, so 1000+ hours of scraped content were processed in a fraction of the time.
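The producer/consumer shape described here can be sketched roughly as follows. This is an assumption-laden illustration rather than the original pipeline: `fetch_page` is a hypothetical stand-in that sleeps instead of hitting the network, and the "asynchronous processing" is modeled as a single consumer thread draining a queue while the producers keep fetching.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the stream for the consumer


def fetch_page(url):
    """Hypothetical stand-in for a network request; sleeps to mimic I/O latency."""
    time.sleep(0.05)
    return f"<html>{url}</html>"


def producer(urls, out_q):
    """Fetch each URL and hand the raw page to the processing queue."""
    for url in urls:
        out_q.put((url, fetch_page(url)))


def run_pipeline(urls, n_producers=4):
    """Fetch with several producer threads while one consumer processes results."""
    out_q = queue.Queue(maxsize=100)  # bounded, so producers cannot outrun processing
    results = {}

    def consumer():
        while True:
            item = out_q.get()
            if item is SENTINEL:
                break
            url, html = item
            results[url] = len(html)  # placeholder for real parsing

    # Split the URL list into one interleaved slice per producer.
    chunks = [urls[i::n_producers] for i in range(n_producers)]
    producers = [
        threading.Thread(target=producer, args=(chunk, out_q)) for chunk in chunks
    ]
    consumer_thread = threading.Thread(target=consumer)

    consumer_thread.start()
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    out_q.put(SENTINEL)  # all producers are done; let the consumer exit
    consumer_thread.join()
    return results
```

Because the simulated fetches are I/O-bound, several producers overlap their waiting time, which is where a wall-clock gain over a sequential loop would come from. The actual speedup depends on latency, rate limits, and how expensive the processing step is.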
03
NLP analysis
Ran sentiment analysis and LDA-based topic modeling on the corpus. Clustered by theme, tracked sentiment-by-topic over time.
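As a rough illustration of the "sentiment-by-topic over time" step, the sketch below averages sentiment scores per topic and month. It assumes each document has already been assigned a dominant topic (for example, from an LDA model) and a numeric sentiment score. The field names, topic labels, and sample values are made up for the example and are not project data.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows; "topic" stands in for a dominant LDA topic label
# and "sentiment" for a score from some sentiment model (both assumptions).
records = [
    {"topic": "admissions", "month": "2022-09", "sentiment": 0.5},
    {"topic": "admissions", "month": "2022-09", "sentiment": 0.25},
    {"topic": "admissions", "month": "2022-10", "sentiment": -0.5},
    {"topic": "job-search", "month": "2022-09", "sentiment": 1.0},
    {"topic": "job-search", "month": "2022-09", "sentiment": 0.0},
]


def sentiment_by_topic_over_time(rows):
    """Return {topic: {month: mean sentiment}} with months in sorted order."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["topic"], row["month"])].append(row["sentiment"])

    series = defaultdict(dict)
    for (topic, month), scores in sorted(buckets.items()):
        series[topic][month] = mean(scores)
    return dict(series)


trend = sentiment_by_topic_over_time(records)
```

Each topic then maps to a small time series that can be plotted or compared across topics.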

Result

1M+ comments and questions across 500+ topics, cleanly structured and ready for downstream research. The scaling-and-scheduling problem was more intellectually alive than I expected — concurrency bugs and rate-limit edge cases taught me more systems thinking than any lecture.

By the numbers

500+

Topics covered

1M+

Comments and questions extracted

2×

Speedup

