Ethics

The phenotype catalog is explicit about what it is and what it is not. This page documents the intended uses, out-of-scope uses, source-bias caveats, and the historical-misuse boundary the dataset operates against.

Intended uses

Bias auditing of vision systems. Structured per-group phenotype labels serve as ground truth for measuring the per-group accuracy of vision systems under audit (face attribute classifiers, skin-tone detectors, age estimators).
Fitzpatrick-balanced training subsets. The skin_tone field allows construction of training and evaluation subsets balanced across the Fitzpatrick I–VI distribution at higher granularity than the four-bucket race labels typical in existing benchmarks.
AI image generation, grounded in real distributions. The per-group phenotype distributions ground prompts to image generators, yielding within-group diversity instead of mode-collapsed stereotypical outputs.
Anthropological reference and education. The 484-group catalog with normalized homeland / language / ISO codes / religion fields supports cross-group queries, coursework, and educational use.
Multi-ethnic NLP prompt grounding. Structured demographic embeddings support grounding language-model outputs across ethnic groups.
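
As one illustration of the Fitzpatrick-balanced use above, the following is a minimal sketch of drawing an equal number of rows per skin-tone type. It assumes rows are plain dictionaries and that skin_tone holds the Roman-numeral strings "I" through "VI"; the actual encoding of the field may differ, and the function name balanced_subset is hypothetical.

```python
import random
from collections import defaultdict

# Assumed encoding of the skin_tone field; adjust to the real vocabulary.
FITZPATRICK_TYPES = ["I", "II", "III", "IV", "V", "VI"]


def balanced_subset(rows, per_type=None, seed=0):
    """Return a subset with the same number of rows for each Fitzpatrick type.

    rows     -- iterable of dicts carrying a "skin_tone" key
    per_type -- optional cap on rows per type; defaults to the rarest type's count
    seed     -- seed for reproducible sampling
    """
    by_type = defaultdict(list)
    for row in rows:
        tone = row.get("skin_tone")
        if tone in FITZPATRICK_TYPES:
            by_type[tone].append(row)

    # The rarest type caps the subset; a type with no rows yields an empty subset.
    smallest = min(len(by_type[t]) for t in FITZPATRICK_TYPES)
    k = smallest if per_type is None else min(per_type, smallest)

    rng = random.Random(seed)
    subset = []
    for t in FITZPATRICK_TYPES:
        subset.extend(rng.sample(by_type[t], k))
    return subset
```

Because the rarest type sets the subset size, a balanced subset can be much smaller than the full corpus. Rows whose skin_tone is missing or outside the assumed vocabulary are skipped rather than guessed.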

Out-of-scope uses

The dataset is not appropriate for any of the following applications:

Individual classification. Predicting an individual's ethnic group from their photograph. The dataset describes ethnic groups in aggregate; individual rows are not classifier training data.
Identification or surveillance. Matching individuals to groups via phenotype features. The per-image rows preserve source URLs for audit purposes; they are not designed for biometric matching, and any application that fingerprints individuals via per-image phenotype features is explicitly out of scope.
Profile-based discrimination. Phenotype-based filtering of individuals in hiring, lending, immigration, or any other consequential decision pipeline.
Hierarchical population claims. Per-group distribution differences in this dataset reflect Wikipedia's sampling frame, not population-level genetic or trait hierarchies. Claims that one group is "more" or "less" of any phenotype dimension are not supported by this data.
Causal genetics inference. Phenotype distributions are observational, not experimental. The dataset cannot support claims about which genetic loci drive which phenotype variations.

We chose these boundaries deliberately. The dataset card on HuggingFace, the methodology paper, and this ethics page all carry the same explicit out-of-scope statement.

Source-bias caveats

The dominant limitation of the dataset is the Wikipedia source frame. Every per-image row in the corpus is sourced from a public-domain Wikipedia photograph of a person covered by Wikipedia's "List of {Ethnicity} people" articles. This sample is:

Gender-skewed male. Public-life prominence on Wikipedia tilts male in most categories (politicians, scientists, athletes, historical figures); rough overall skew is ~70–75% male.
Public-life-biased. Scientists, politicians, entertainers, athletes, and historical figures are over-represented relative to the general population.
English-language-coverage-biased. Groups with stronger English-Wikipedia presence have more image observations; this is loosely correlated with Anglophone diaspora size, not native-population size.
Photographic-era-biased. Pre-photography figures have no infobox image, so groups whose Wikipedia presence skews historical have lower image-observation coverage.

Per-group aggregations on the live site surface these caveats explicitly whenever the source breakdown is 100% Wikipedia (which is currently every aggregation). Future releases that incorporate user-submitted images or a second public-domain source will dilute this skew.
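
The rule described above (show the caveats whenever an aggregation is 100% Wikipedia-sourced) can be sketched as a small check. This is an illustration under stated assumptions, not the live site's implementation: the source label "wikipedia", the shape of source_counts, and the function name caveats_for are all hypothetical.

```python
# Caveat labels taken from the list in this section.
WIKIPEDIA_CAVEATS = (
    "gender-skewed male",
    "public-life-biased",
    "English-language-coverage-biased",
    "photographic-era-biased",
)


def caveats_for(source_counts):
    """Return the caveats to display for one per-group aggregation.

    source_counts maps a source name to its number of image observations,
    for example {"wikipedia": 42}. The key "wikipedia" is an assumed label.
    """
    total = sum(source_counts.values())
    if total == 0:
        # No observations, so there is no source breakdown to caveat.
        return ()
    if source_counts.get("wikipedia", 0) == total:
        return WIKIPEDIA_CAVEATS
    return ()
```

A mixed-source aggregation returns no caveats in this sketch; a real implementation might prefer to scale or reword the caveats rather than drop them once a second source appears.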

Historical-misuse boundary

Several of the anatomical dimensions in the catalog use vocabulary with a history of misuse in 19th- and early 20th-century racial-typology pseudoscience. The most prominent is the cephalic index in the head-shape vocabulary — dolichocephalic / mesocephalic / brachycephalic categories were extensively misused to support hierarchical claims about populations.

Boas's 1912 study ("Changes in the Bodily Form of Descendants of Immigrants," American Anthropologist 14(3): 530–562) demonstrated that cephalic index responds to environment within a single generation, undermining any use of cranial morphology as a fixed population marker. Contemporary craniofacial anthropology uses these dimensions as developmental, clinical, and individual descriptors only. The vocabulary file's framing-caveats block disclaims the historical misuse explicitly and scopes the dimension to clinical (craniosynostosis screening, pediatric orthotics), forensic (individual identification), and individual-descriptor uses.

Downstream consumers of this dataset should resist mapping cephalic-index categories — or any other dimension in the catalog — onto racial or population-typological claims. Per-group distributions are documented for completeness, not as definitions or distinctions.

Genital anatomy schemas

Three vocabulary files cover external genital anatomy (vulva, penis, pubic-region). These are published academically — citation-backed, ontology-interoperable, defensible to reviewers — but carry a file-level observations_source_policy: "internal_only" flag. No observations against these dimensions are populated from public-domain photographs and none reach the public dataset.

The rationale: Wikipedia and similar public-domain sources contain no relevant photographs; adult-content sources have consent and provenance issues that make them inappropriate for academic dataset construction; clinical-photograph populations require IRB-equivalent consent that an open dataset cannot guarantee. The schema is published so the academic taxonomy is transparent; observations stay internal until consent-cleared sources are available.
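
A minimal sketch of how a publishing step might honor the file-level flag follows. Only the observations_source_policy key and its "internal_only" value come from this page; the "dimensions" key on a vocabulary file, the "dimension" key on an observation, and both function names are hypothetical.

```python
def is_publishable(vocabulary):
    """True unless the vocabulary file is flagged internal-only."""
    return vocabulary.get("observations_source_policy") != "internal_only"


def public_observations(observations, vocabularies):
    """Drop observations whose dimension belongs to an internal-only vocabulary.

    vocabularies -- list of dicts, each with an assumed "dimensions" list and
                    an optional "observations_source_policy" flag
    observations -- list of dicts, each with an assumed "dimension" key
    """
    blocked = set()
    for vocab in vocabularies:
        if not is_publishable(vocab):
            blocked.update(vocab.get("dimensions", []))
    return [obs for obs in observations if obs["dimension"] not in blocked]
```

Filtering on the file-level flag, rather than on a hand-maintained list of dimension names, means a dimension added to a flagged vocabulary file later is excluded without further changes.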

License and reuse

Code is published under Apache License 2.0; the dataset is published under CC BY 4.0. Image URLs in the dataset reference Wikipedia / Wikimedia Commons content under their own per-image licenses (typically CC BY-SA, public domain, or other Creative Commons variants); consult each row's reference URL before redistributing actual image bytes.

Permissive licensing is intentional. Academic researchers, fairness-research teams, and AI developers who want to mirror the dataset under different attribution or rebrand it for domain-specific use are explicitly welcome to do so. The artifacts are designed to outlive any single host or branding context.
