Methodology

This page documents the end-to-end construction of the phenotype catalog, from Wikipedia photo sourcing through vision-LLM annotation to per-group aggregation. Every step is reproducible: the full pipeline is open source under Apache 2.0.

1 — Group-level catalog

The starting point is a curated 484-row catalog of ethnic groups, indexed by name with normalized columns for homeland, region, language, language ISO codes, country ISO codes, religion, and a Wikipedia URL where available. The taxonomy follows a standard continent → sub-region → ethnicity hierarchy: Americas, Europe, Africa, Asia, Oceania at the continent level; 23 sub-regions; 484 ethnic groups.

This catalog predates the present pipeline and is curated manually. It is published as the ethnicities config of the public HuggingFace dataset.
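
For illustration, a single catalog row has roughly this shape. The column names are paraphrased from the description above and the values are illustrative; the published ethnicities config is authoritative:

  {
    "name": "Yoruba",
    "homeland": "Southwestern Nigeria and Benin",
    "region": "West Africa",
    "language": "Yoruba",
    "language_iso": "yo / yor",
    "country_iso": "NG; BJ",
    "religion": "Christianity; Islam; Yoruba religion",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Yoruba_people"
  }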

2 — Notable-people scrape

For each group with a Wikipedia URL, we attempt to fetch the corresponding "List of {Ethnicity} people" article. When such an article exists, we extract names, short "known for" descriptors, birth/death years where surfaced, and per-person reference URLs. 291 of 484 groups have at least one such article on Wikipedia; 193 do not (typically small or obscure groups for which no list page has been authored). The scrape produced 13,094 person rows in total, published as the notable_people config.
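
A minimal sketch of the list-page lookup, assuming the public MediaWiki Action API. The function name is hypothetical, and the production scraper additionally extracts descriptors and years from the rendered list sections:

  // Sketch: pull mainspace wikilinks out of a "List of {Ethnicity} people"
  // article via the MediaWiki Action API. A simplification of the real
  // scraper, which also captures "known for" text and birth/death years.
  async function fetchListMembers(ethnicity) {
    const url = new URL("https://en.wikipedia.org/w/api.php");
    url.search = new URLSearchParams({
      action: "parse",
      page: `List of ${ethnicity} people`,
      prop: "links",
      format: "json",
      formatversion: "2",
    }).toString();
    const res = await fetch(url);            // Node 18+ global fetch
    const data = await res.json();
    if (data.error) return [];               // e.g. code === "missingtitle"
    return data.parse.links
      .filter((l) => l.ns === 0 && l.exists) // mainspace articles that exist
      .map((l) => l.title);
  }

Mainspace links over-collect (see-also and navigation targets are mainspace too), so real extraction still has to filter against the list sections themselves.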

3 — Image URL discovery

For each scraped person, we follow the reference URL to that person's individual Wikipedia article and extract a representative image URL — preferring the article's lead infobox image, falling back to the first OpenGraph image. 6,243 of 13,094 persons (47.7%) have a discoverable Wikipedia image; the remaining ~52% have no infobox image (often pre-photography figures or low-prominence subjects with text-only articles).

Wikimedia's upload.wikimedia.org rate-limits aggressive clients per source IP, so the fetcher is capped at 2 requests per second per source IP; faster request rates trigger HTTP 429 responses within batches.
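
A minimal sketch of the throttled discovery step, assuming Node.js 18+ global fetch and a deliberately naive regex for the OpenGraph fallback; the production fetcher's infobox-first extraction is more involved:

  // Sketch: resolve a person's article to one image URL at <= 2 requests/second.
  const MIN_GAP_MS = 500;       // 2 req/s per source IP
  let lastRequestAt = 0;

  async function throttledFetch(url) {
    const wait = lastRequestAt + MIN_GAP_MS - Date.now();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    lastRequestAt = Date.now();
    return fetch(url);
  }

  async function discoverImageUrl(articleUrl) {
    const res = await throttledFetch(articleUrl);
    if (res.status === 429) throw new Error("throttled upstream; back off");
    const html = await res.text();
    // Production prefers the lead infobox image; the OpenGraph tag is the
    // fallback, matched here with a deliberately simple regex.
    const m = html.match(/<meta property="og:image" content="([^"]+)"/);
    return m ? m[1] : null;     // null: text-only article, nothing to annotate
  }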

4 — Vision-LLM annotation

Each image with a discoverable URL is submitted to a vision language model with a fixed structured prompt asking for 14 phenotype fields plus a self-reported confidence score (0.0–1.0) and an image quality bucket (high / medium / low / very_low). The prompt explicitly instructs the model to:

  • use Fitzpatrick I–VI vocabulary for skin tone, accepting ranges where appropriate
  • detect epicanthic-fold presence as a structured eye-shape sub-field
  • treat the photograph as one observation of one person, not an ethnic stereotype prototype
  • return unknown or empty for fields that cannot be assessed
  • surface obscurations (glasses, hat, makeup, partial face) explicitly
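
An illustrative response in that shape. All field names below are hypothetical stand-ins; the actual 14-field schema is fixed by the prompt:

  {
    "skin_tone_fitzpatrick": "III-IV",
    "hair_color": "dark_brown",
    "hair_texture": "wavy",
    "eye_color": "brown",
    "eye_shape": { "epicanthic_fold": "no" },
    "obscurations": ["glasses"],
    "confidence": 0.78,
    "image_quality": "medium"
  }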

The model used is the Anthropic Claude Sonnet 4.6 inference profile on AWS Bedrock (us.anthropic.claude-sonnet-4-6). Pilot runs against higher-tier models showed no measurable accuracy lift on this prompt: the task is short-context structured extraction against a fixed schema, which Sonnet handles without quality loss at lower cost.

Run statistics: 5,668 of the 6,243 submitted images analyzed successfully; 575 failed (load failures, model errors, parse failures, or oversized inputs exceeding Bedrock's 5 MB-per-image cap); concurrency = 4; throughput ~0.42 images/second (network-bound); total cost $44.66 USD at Bedrock list pricing.

The structured response is parsed and stored row-by-row. The original raw JSON is retained in the source database (column ethnic_image_analysis.raw_json) for audit but not redistributed.
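​
A single-image invocation sketch using the AWS SDK for JavaScript v3, with the inference profile ID from above. Prompt text, retries, and the 4-way concurrency pool are elided, and it assumes the model returns bare JSON text; production parsing is more defensive:

  import { BedrockRuntimeClient, InvokeModelCommand }
    from "@aws-sdk/client-bedrock-runtime";

  const client = new BedrockRuntimeClient({ region: "us-east-1" });

  async function analyzeImage(jpegBytes, prompt) {
    const res = await client.send(new InvokeModelCommand({
      modelId: "us.anthropic.claude-sonnet-4-6",
      contentType: "application/json",
      body: JSON.stringify({
        anthropic_version: "bedrock-2023-05-31",
        max_tokens: 1024,
        messages: [{
          role: "user",
          content: [
            { type: "image", source: { type: "base64",
                media_type: "image/jpeg",
                data: Buffer.from(jpegBytes).toString("base64") } },
            { type: "text", text: prompt },
          ],
        }],
      }),
    }));
    const payload = JSON.parse(new TextDecoder().decode(res.body));
    const rawJson = payload.content[0].text;  // retained verbatim for audit
    return { rawJson, fields: JSON.parse(rawJson) };
  }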

5 — Controlled-vocabulary system

On top of the legacy 14-field analysis, the pipeline now defines a comprehensive controlled vocabulary covering 22 anatomical categories with 196 dimensions and 853 vocabulary buckets, grounded in 113+ peer-reviewed references. Each vocabulary file is a JSON document specifying dimensions (typed categorical / ordinal / numeric / structured), scale (with citation), description, valid values (with definitions), and observability metadata (whether the dimension can be assessed from a photograph, and at what minimum visible extent).
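
An abbreviated, hypothetical vocabulary entry in that shape (real files carry full citations, definitions, and many more dimensions):

  {
    "category": "eyes",
    "dimensions": [
      {
        "name": "eye_color",
        "type": "categorical",
        "scale": { "name": "Martin scale (simplified)", "citation": "..." },
        "description": "Dominant iris color visible in the photograph.",
        "values": [
          { "value": "brown", "definition": "..." },
          { "value": "hazel", "definition": "..." }
        ],
        "observability": {
          "photo_assessable": true,
          "minimum_visible": "eyes open and unobscured"
        }
      }
    ]
  }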

A generator script (scripts/generate-from-vocabulary.mjs) reads each vocabulary file and emits four artifacts: a vision-LLM analysis prompt fragment, a Prisma database model, an aggregation function, and a Markdown documentation page. Adding a new dimension means editing one JSON file and re-running the generator — there is no hand-written analysis prompt to maintain in parallel with the schema.
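
A minimal sketch of just the prompt-fragment emitter, assuming the vocabulary shape sketched above; the real generator also renders the Prisma model, aggregation function, and docs page:

  import { readFileSync } from "node:fs";

  // Render one vocabulary file into the analysis-prompt lines for its
  // category; the other three artifacts follow the same read-and-render shape.
  export function promptFragment(vocabPath) {
    const vocab = JSON.parse(readFileSync(vocabPath, "utf8"));
    const lines = [`Category: ${vocab.category}`];
    for (const dim of vocab.dimensions) {
      const allowed = dim.values.map((v) => v.value).join(" | ");
      lines.push(`- ${dim.name} (${dim.type}): ${dim.description}`);
      lines.push(`  Allowed values: ${allowed} | unknown`);
    }
    return lines.join("\n");
  }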

See Atlas to browse the 22 categories, Glossary for all 853 defined terms, or the public vocabulary repo for the source files.

6 — Per-group aggregation

A deterministic SQL/JavaScript aggregator (no LLM in the loop) produces a per-group summary card from each group's image-observation rows; because it is deterministic, it can be re-run at any time. For each group with at least 3 image observations (currently 209 of 484 groups), it produces the following (a counting sketch appears after the list):

  • Sample size and source breakdown
  • Quality split (high / medium / low / very_low percentages)
  • Mean self-reported confidence
  • Fitzpatrick distribution (proportion of rows in each bin I–VI)
  • Hair color, hair texture, eye color distributions
  • Epicanthic-fold proportion (yes / no / partial)
  • Conditional caveat lines: small sample (N<10), modest sample (N<25), low quality, low confidence, "Wikipedia notable people skews male and public-life — not population-representative"
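
A counting sketch for one distribution field, assuming rows shaped like the annotation example in step 4; the production aggregator repeats this per field and layers the caveat logic on top:

  // Proportion of observations per Fitzpatrick bin, skipping unassessed rows.
  function fitzpatrickDistribution(rows) {
    const bins = { I: 0, II: 0, III: 0, IV: 0, V: 0, VI: 0 };
    let counted = 0;
    for (const row of rows) {
      const v = row.skin_tone_fitzpatrick;   // hypothetical column name
      if (!v || !(v in bins)) continue;      // skips "unknown"; ranges like
                                             // "III-IV" are split in production
      bins[v] += 1;
      counted += 1;
    }
    const dist = Object.fromEntries(
      Object.entries(bins).map(([bin, n]) => [bin, counted ? n / counted : 0])
    );
    return { n: counted, dist };
  }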

7 — Coverage score

Every ethnic group page surfaces a 0–100 data depth score with four weighted components:

  • Sample size 0–40 (log-scaled image count, capped once n ≥ 50)
  • Quality 0–30 (% of images at image_quality = "high")
  • Confidence 0–20 (mean self-reported model confidence)
  • Source diversity 0–10 (count of distinct source_types; currently 0 or 1 because every image comes from Wikipedia, so this component sits at its floor and exists for forward compatibility)

The score reflects coverage depth honestly: groups with no photographic coverage on Wikipedia sit at 0, while comprehensively covered groups approach 100. The score lives at ethniclist.CoverageScore and is recomputed whenever the underlying data changes.
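
With the weights above, the computation is small enough to sketch; the per-component formulas here are plausible reconstructions, not the exact production code:

  // 0–100 data depth score from the four weighted components.
  function coverageScore({ imageCount, highQualityShare,
                           meanConfidence, sourceTypeCount }) {
    // Sample size, 0–40: log-scaled, saturating at n = 50.
    const sample = 40 * Math.min(1, Math.log1p(imageCount) / Math.log1p(50));
    // Quality, 0–30: share of images with image_quality = "high".
    const quality = 30 * highQualityShare;
    // Confidence, 0–20: mean self-reported model confidence in [0, 1].
    const confidence = 20 * meanConfidence;
    // Source diversity, 0–10: distinct source_types; saturation point assumed.
    const diversity = 10 * Math.min(1, sourceTypeCount / 5);
    return Math.round(sample + quality + confidence + diversity);
  }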

8 — Reproducibility

All pipeline code is open source under Apache 2.0 at github.com/Agaveis/phenotype-catalog-pipeline. The methodology paper is published at doi.org/10.5281/zenodo.20075617. The dataset is published under CC BY 4.0 at huggingface.co/datasets/EthnicErotic/phenotype-catalog.

Replicating the pipeline requires: AWS credentials with Bedrock access in us-east-1, a SQL Server database with the published schema applied (or a port to the database of your choice), Node.js 18+, the HuggingFace CLI for upload, and ~$45 USD in Bedrock vision-inference budget for the full 5,668-image run.

Read more: About · Ethics · FAQ · Glossary