# Mass Collaboration for Social Research in the Digital Age
**Video Category:** Social Science & Data Science
## 0. Video Metadata
**Video Title:** Mass Collaboration for Social Research in the Digital Age
**YouTube Channel:** Stanford Center for Professional Development
**Publication Date:** January 29, 2016
**Video Duration:** ~1 hour 4 minutes
## 1. Core Summary (TL;DR)
The digital age represents a fundamental structural shift in how data is generated, moving research from an analog paradigm to a digital one. Rather than simply analyzing "found data" or digital exhaust, which acts like archaeological relics left behind by corporate systems, researchers must actively design mass collaboration platforms to solve complex scientific problems. By leveraging human computation, open calls, and distributed data collection, scientists can combine the scale of the internet with the precision of targeted research design to achieve results that are faster, more reproducible, and fundamentally better than traditional methods.
## 2. Core Concepts & Frameworks
* **The "Relics" Metaphor vs. The "Microscope" Metaphor:** -> **Meaning:** Researchers often view digital data (like Twitter or Facebook logs) as a new "microscope" to view society. The video argues it is actually more like archaeological "relics" (pottery shards or footprints on a beach)âdata left behind by accident, not designed to answer specific scientific questions. -> **Application:** Instead of being constrained by the limitations of "found" data, researchers should actively design systems to collect the exact data they need.
* **Mass Collaboration Topology:** -> **Meaning:** A framework dividing mass collaboration into two main branches: Data Analysis and Data Collection. Data Analysis is further split into "Human Computation" (specifying a concrete task) and "Open Calls" (specifying a goal without dictating the method). -> **Application:** Used to determine the correct architecture for a research project based on whether you need a repetitive task scaled up, a novel solution to a hard problem, or data gathered from disparate geographic locations.
* **The Split-Apply-Combine Strategy:** -> **Meaning:** The foundational recipe for human computation projects. A massive problem is broken down into micro-tasks (split), human workers process those tasks independently (apply), and the results are rigorously aggregated using statistical methods (combine). -> **Application:** Used in projects like Galaxy Zoo to process 40 million classifications by having thousands of individuals each classify a small batch of images, which are then aggregated to form consensus labels (see the sketch after this list).
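The split-apply-combine recipe maps directly onto code. The sketch below is a minimal illustration under invented assumptions (the galaxy IDs, batch size of 25, five volunteers per batch, and random labels are all hypothetical, not details from Galaxy Zoo); it shows the three stages as separate functions with a majority-vote combine step.

```python
from collections import Counter, defaultdict
import random

# Hypothetical item IDs and label set, for illustration only.
GALAXIES = [f"galaxy_{i}" for i in range(1000)]
LABELS = ["spiral", "elliptical"]

def split(items, batch_size=25):
    """Split: break the full catalog into micro-task batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def apply_task(batch):
    """Apply: one volunteer labels one batch (simulated here with random labels)."""
    return [(item, random.choice(LABELS)) for item in batch]

def combine(responses):
    """Combine: majority vote across every volunteer who saw each item."""
    votes = defaultdict(list)
    for item, label in responses:
        votes[item].append(label)
    return {item: Counter(lbls).most_common(1)[0][0] for item, lbls in votes.items()}

# Show each batch to several independent volunteers so the combine step has redundancy.
responses = []
for batch in split(GALAXIES):
    for _ in range(5):
        responses.extend(apply_task(batch))

consensus = combine(responses)
```

In a real project the combine step would be more sophisticated than a majority vote (see Rule 2 below), but the three-stage structure stays the same.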
## 3. Evidence & Examples (Hyper-Specific Details)
* **Gartner's Hype Cycle:** The speaker uses this visual framework (Technology Trigger -> Peak of Inflated Expectations -> Trough of Disillusionment -> Slope of Enlightenment -> Plateau of Productivity) to explain that while "big data" currently suffers from fad-like hype, the underlying transition from analog to digital is a permanent structural shift.
* **Galaxy Zoo (Human Computation):** Astronomers needed to classify 50,000 galaxies (spiral vs. elliptical, red vs. blue). Originally, graduate student Kevin Schawinski worked 7 twelve-hour days to classify them manually. This was only 5% of the ~1 million galaxies in the Sloan Digital Sky Survey. By creating a website where volunteers passed a simple 5-minute quiz, the project generated 40 million classifications (4x10^7) over a couple of months. The results matched expert quality but at 10x the scale, leading to new discoveries like "Green Pea" galaxies.
* **The Manifesto Project vs. Crowd-Sourced Text Analysis:** Traditionally, European political scientists hand-coded ~4,000 political manifestos from ~50 countries (e.g., analyzing a 2010 UK Labour Party manifesto). Benoit et al. moved this to a crowd-sourced platform (using CrowdFlower). The crowd-coded results closely matched the expert coding (plotting Economic and Social positions on a scatter plot with high correlation). The crowd approach proved superior because it is reproducible (unlike relying on unavailable experts) and flexible (allowing rapid re-coding for new topics like "immigration" that weren't tracked in the 1980s).
* **The Netflix Prize (Open Call):** Netflix released ~100 million movie ratings and held back ~1.5 million as a test set. They offered $1 million to anyone who could improve their recommendation algorithm's Root Mean Squared Error (RMSE). They received close to 45,000 submissions. Crucially, a software developer backpacking in New Zealand, posting under the name "Sifter, aka Simon Funk," published a blog post on December 11, 2006, detailing a singular value decomposition (SVD) approach that instantly moved him to 4th place. This demonstrated how open calls access ideas from non-traditional experts (a toy RMSE evaluation harness appears after this list).
* **Peer-to-Patent (Open Call):** Created by Beth Noveck to relieve the US Patent Office bottleneck (examiners get only ~20 hours per application, and the examination process is normally closed to outside input). Applicants opted in to open review. The crowd researched prior art, annotated it, and forwarded the "top ten" references to the examiner. In one instance, Steve Pearson (an IBM programmer) found an old Intel manual that served as prior art, causing the examiner to reject an HP application ("User-selectable management alert format," US patent application 20070118658).
* **eBird (Distributed Data Collection):** Transitioned bird watching from notebooks rotting in closets to a global database. Over 200,000 participants submitted >250 million observations. Because observers vary widely in skill and tend to stay near roads (while birds do not), the system uses input filters, regional expert coordinators who verify unusual sightings (like a Snowy Egret in Palo Alto), and algorithms to estimate each observer's skill. The data has been used in over 200 scientific papers on climate change and migration.
* **PhotoCity (Distributed Data Collection / Directed Attention):** Built to create 3D reconstructions of the UW and Cornell campuses. Previous systems (like "Build Rome in a Day") failed because tourists only took photos of popular angles. PhotoCity used a competitive game mechanic that awarded points based on the number of *new* pixels a photo added to the 3D model (see the scoring sketch after this list). Over 2 months, 45 players submitted 100,000 photos, demonstrating how to actively steer crowd behavior to collect missing, high-value data.
* **Malawi Journal Project (Distributed Data Collection):** To study informal conversations about AIDS, 22 citizen "journalists" in Malawi wrote down overheard conversations over 15 years, yielding ~12,000 pages of text. This distributed ethnography revealed that locals' discussions about condom use differed entirely from how public health messaging framed it, insights that Western researchers and standardized surveys completely missed.
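The Netflix Prize works as an open call only because scoring a submission is cheap and objective. The toy harness below (with invented user/movie ratings, not Netflix data) shows the asymmetry: computing RMSE against a held-out test set takes one pass over the data, however hard it was to generate the predictions.

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error: the single number used to rank every entry."""
    errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical held-out (user, movie) -> rating pairs and one contestant's predictions.
held_out   = {("u1", "m1"): 4.0, ("u1", "m2"): 2.0, ("u2", "m1"): 5.0}
submission = {("u1", "m1"): 3.8, ("u1", "m2"): 2.5, ("u2", "m1"): 4.6}

score = rmse(submission, held_out)  # cheap to compute, even across tens of thousands of entries
```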
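PhotoCity's directed-attention idea, rewarding only what the model does not yet have, can be sketched in a few lines. The function and the pixel-ID sets below are hypothetical stand-ins for the real 3D-reconstruction pipeline.

```python
def photo_score(photo_pixels, covered_pixels):
    """Award points only for coverage the 3D model has not seen yet."""
    new_pixels = photo_pixels - covered_pixels
    covered_pixels |= new_pixels   # the model grows with every accepted photo
    return len(new_pixels)

coverage = set()
print(photo_score({"a", "b", "c"}, coverage))  # 3 points: everything is new
print(photo_score({"b", "c", "d"}, coverage))  # 1 point: only "d" adds coverage
```

Because the value of a given photo falls as coverage grows, players are steered toward the unpopular angles that redundant tourist photos never capture.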
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Transition from Observer to Designer** - Do not limit research to what companies happen to collect in their digital exhaust. Actively design custom digital platforms to collect the precise data required to solve your specific scientific problem.
* **Rule 2: Execute the Split-Apply-Combine Workflow** - For human computation, break large datasets into micro-tasks, distribute them to a crowd, and rigorously clean the returned data. Include a "de-biasing" step to correct systematic errors (e.g., the crowd's tendency to label far-away spiral galaxies as ellipticals) before taking a weighted average based on contributor skill (a minimal combine-step sketch follows this list).
* **Rule 3: Require Asymmetric Evaluation for Open Calls** - Only use Open Calls (contests) for problems where checking a submitted solution is far faster and cheaper than generating it. If verification requires manual review (e.g., reading 45,000 essays), the open call will fail.
* **Rule 4: Design for Extreme Heterogeneity** - Build platforms expecting a power-law distribution of effort. Ensure the system can extract value from the vast majority of users who do very little (e.g., classifying 3 galaxies) while providing power tools for the tiny minority who do most of the work (e.g., classifying thousands); the second sketch after this list shows one way to quantify that concentration.
* **Rule 5: Gamify the Collection of Rare Data** - When crowd-sourcing data collection, do not just ask for data generally. Implement dynamic scoring systems that reward users for capturing rare, missing, or difficult-to-obtain data points to avoid redundant submissions.
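Rule 2's combine step can be sharper than a plain average. The sketch below is one plausible implementation, not the speaker's exact method: the per-rater bias and skill terms (which in practice would be estimated from gold-standard items) are invented here for illustration.

```python
import numpy as np

def combine_ratings(ratings, bias, skill):
    """De-bias each rater's judgment, then take a skill-weighted average."""
    corrected = np.array([ratings[r] - bias[r] for r in ratings])
    weights = np.array([skill[r] for r in ratings])
    return float(np.average(corrected, weights=weights))

# Hypothetical judgments for one item, e.g. each rater's probability that a galaxy is spiral.
ratings = {"rater_a": 0.9, "rater_b": 0.7, "rater_c": 0.4}
bias    = {"rater_a": 0.1, "rater_b": 0.0, "rater_c": -0.05}  # learned from gold items
skill   = {"rater_a": 2.0, "rater_b": 1.0, "rater_c": 0.5}    # higher = more reliable

estimate = combine_ratings(ratings, bias, skill)
```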
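For Rule 4, it helps to measure just how concentrated the effort is. The snippet below, using invented per-user counts with a heavy-tailed shape, computes the share of all work done by the most active 1% of contributors.

```python
def contribution_share(counts, top_fraction=0.01):
    """Share of all contributions made by the most active `top_fraction` of users."""
    totals = sorted(counts.values(), reverse=True)
    top_n = max(1, int(len(totals) * top_fraction))
    return sum(totals[:top_n]) / sum(totals)

# Hypothetical per-user classification counts following a rough power law.
counts = {f"user_{i}": max(1, 10000 // (i + 1)) for i in range(5000)}
share = contribution_share(counts)  # a large share typically comes from a tiny minority
```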
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Treating digital exhaust as a neutral "microscope." -> **Why it fails:** Corporate data is designed for business purposes, not scientific inquiry, leading researchers to answer trivial questions simply because the data is available. -> **Warning sign:** You find yourself retrofitting your research question to match the structure of the data you scraped.
* **Pitfall:** Using simple averaging to aggregate crowd data. -> **Why it fails:** Averaging removes random noise but preserves systematic biases (e.g., visual illusions or cultural prejudices shared by the crowd); the simulation after this list makes this concrete. -> **Warning sign:** Your aggregated crowd data consistently deviates from the physical ground truth or expert baselines in a specific direction.
* **Pitfall:** Launching an open call for a subjective problem. -> **Why it fails:** If there is no automated, objective metric (like Root Mean Squared Error) to evaluate submissions, the sheer volume of responses will overwhelm the organizers. -> **Warning sign:** You receive thousands of submissions and realize you need to hire a team of experts just to read and rank them.
* **Pitfall:** Relying exclusively on expert categorization for longitudinal data. -> **Why it fails:** Concepts evolve (e.g., the political relevance of "immigration" changes over decades), and experts are too expensive and slow to repeatedly re-code massive historical datasets. -> **Warning sign:** Your coding taxonomy becomes rigid and obsolete, and you cannot afford to update it to reflect current realities.
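The averaging pitfall is easy to demonstrate numerically. In this invented simulation, 10,000 raters share the same systematic bias on top of independent noise; averaging drives the noise toward zero but leaves the shared bias fully intact.

```python
import numpy as np

rng = np.random.default_rng(0)

truth = 10.0                          # hypothetical physical ground truth
systematic_bias = -1.5                # a shared error, e.g. a visual illusion
noise = rng.normal(0.0, 2.0, 10_000)  # independent per-rater noise

judgments = truth + systematic_bias + noise
print(judgments.mean())  # ~8.5: the noise averages away, the shared bias does not
```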
## 6. Key Quote / Core Insight
"We shouldn't just think of digital data as a microscope we look through. We should think of it as archaeological relics we've found. Instead of just analyzing what was left behind, we need to build mass collaboration systems to actively collect the data we actually want."
## 7. Additional Resources & References
* **Resource:** *Bit by Bit: Social Research in the Digital Age* by Matthew Salganik - **Type:** Book - **Relevance:** The core text upon which this presentation is based, detailing methodologies for social science in the digital era.
* **Resource:** Galaxy Zoo - **Type:** Website/Project - **Relevance:** A primary example of utilizing human computation to solve large-scale visual classification problems in astronomy.
* **Resource:** Peer-to-Patent (Beth Noveck) - **Type:** Project / Paper - **Relevance:** Demonstrates how Open Calls can be applied to legal and bureaucratic domains to source expert knowledge.
* **Resource:** PhotoCity - **Type:** Project/Paper (CHI 2011) - **Relevance:** A case study on using game mechanics to focus human attention for distributed data collection.