About the Phenotype Catalog
A structured ethnographic + phenotype reference covering 484 ethnic groups across 22 anatomical categories, with 195 dimensions, 848 cited vocabulary buckets, and 113 references.
What this is
The Phenotype Catalog is a public, citation-backed reference that pairs structured ethnographic metadata (homeland, language family, ISO codes, religion) for 484 ethnic groups with a comprehensive medical / anthropological vocabulary covering 22 anatomical categories. Each category has 4–16 dimensions of variation drawn from peer-reviewed scales — Fitzpatrick I–VI for skin, Halls for breast shape, Regnault for ptosis, Heath-Carter for somatotype, Andre Walker 1A–4C for hair texture, Hamilton-Norwood and Ludwig for hair loss, Manning's 2D:4D for digit ratio, Cavanagh-Rodgers for arch index, Mendieta for buttock shape, and others.
Each row of per-image observation data on the live site traces back to a public-domain Wikipedia photograph of a notable person from the corresponding group, analyzed into structured fields by a vision language model (Anthropic Claude Sonnet 4.6 via AWS Bedrock). The aggregated per-group distributions surface on each /ethnic/[slug] page, with explicit caveats about source bias and sample size.
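The per-image rows described above can be pictured as flat records keyed by group and anatomical dimension. A minimal sketch, assuming hypothetical field names (the dataset's actual schema may differ):

```python
from dataclasses import dataclass, asdict


@dataclass
class Observation:
    group_slug: str       # the /ethnic/[slug] identifier
    image_source: str     # public-domain Wikipedia photograph URL
    fitzpatrick: str      # Fitzpatrick scale, "I".."VI"
    hair_texture: str     # Andre Walker scale, "1A".."4C"

row = Observation(
    group_slug="example-group",
    image_source="https://commons.wikimedia.org/...",
    fitzpatrick="IV",
    hair_texture="3B",
)
record = asdict(row)  # flat dict, ready for per-group aggregation
```

Aggregating such records by `group_slug` yields the per-group distributions shown on each group page.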
Why it exists
Existing fairness benchmarks like FairFace and BUPT-Balancedface aggregate ethnicity into four to seven race buckets — too coarse for fine-grained bias auditing of vision systems and AI-generation tools. Constructing a finer-grained dataset directly is expensive: crowd labeling has its own biases, most candidate image sources lack per-image license clarity, and the academic literature on ethnically-distinct phenotype variation is fragmented across dermatology, plastic surgery, urology, anthropology, and trichology journals.
This catalog solves the licensing problem by sourcing exclusively from public-domain Wikipedia photographs, the labeling-cost problem by using vision-LLM annotation against fixed structured prompts, and the literature-fragmentation problem by consolidating the relevant peer-reviewed scales into a single open vocabulary published on GitHub and HuggingFace. The result is a per-image structured phenotype dataset at unusual ethnographic grain (484 groups, mean ~24 images per represented group) grounded in 113+ peer-reviewed citations.
What you can do with it
- AI image generation, grounded in real distributions. The site uses these per-group phenotype distributions to ground prompts to image generators — yielding within-group diversity instead of mode-collapsed stereotypes. A user selects a group, the system samples from the actual observed distribution, and generation reflects real variance.
- Bias auditing of vision systems. Ground-truth phenotype labels enable measuring per-group accuracy for any vision system under audit.
- Anthropological reference. The 484-group catalog with normalized homeland / language / ISO codes / religion fields supports cross-group queries and educational use.
- AI model fine-tuning. Fitzpatrick-balanced training subsets, multi-ethnic prompt grounding, demographic embeddings.
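The distribution-grounded generation described in the first bullet can be sketched as weighted sampling over a per-group aggregate. The distribution values and prompt template here are invented for illustration:

```python
import random

# Hypothetical aggregated Fitzpatrick distribution for one group.
fitzpatrick_dist = {"III": 0.2, "IV": 0.5, "V": 0.3}


def sample_attr(dist, rng):
    """Draw one attribute value, weighted by its observed frequency."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(42)
skin = sample_attr(fitzpatrick_dist, rng)
prompt = f"portrait photo, Fitzpatrick type {skin} skin"
```

Because each generation call re-samples, repeated prompts for the same group vary in proportion to the observed distribution rather than collapsing to the modal phenotype.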
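The bias-auditing use case reduces to comparing a system's predictions against the catalog's ground-truth labels, grouped by ethnicity. A minimal sketch with toy data:

```python
from collections import defaultdict


def per_group_accuracy(rows):
    """rows: iterable of (group, truth, prediction) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, truth, pred in rows:
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / totals[g] for g in totals}

# Toy audit: two groups, one misclassification in group-a.
rows = [
    ("group-a", "IV", "IV"),
    ("group-a", "III", "IV"),
    ("group-b", "V", "V"),
]
acc = per_group_accuracy(rows)  # {"group-a": 0.5, "group-b": 1.0}
```

Disparities between groups in such an accuracy table are the audit signal; the catalog supplies the group-resolved ground truth that coarse race buckets cannot.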
Open artifacts
- Dataset (CC BY 4.0): huggingface.co/datasets/EthnicErotic/phenotype-catalog
- Methodology paper (DOI): 10.5281/zenodo.20075617
- Pipeline code (Apache 2.0): github.com/Agaveis/phenotype-catalog-pipeline
- 22 vocabulary files: public canonical taxonomy
Caveats
The dominant limitation of the dataset is the Wikipedia source frame: "notable people Wikipedia has a list-of-X-people article for, with a photograph in their individual article." This sample skews male, toward public life, toward English-language coverage, and toward the photographic era. The aggregator surfaces this caveat textually on every per-group summary.
Additionally: the dataset is aggregate-research-only; per-image rows are not appropriate for individual classification, identification, or surveillance applications. Phenotype variation within ethnic groups overlaps substantially with variation between groups; per-group averages should not be read as individual predictions. See the ethics page for the explicit scope statement.