Global Artificial Intelligence News Headlines (GAIN-H) Corpus (doi:10.7910/DVN/7C6FNO)

Document Description

Citation

Title:

Global Artificial Intelligence News Headlines (GAIN-H) Corpus

Identification Number:

doi:10.7910/DVN/7C6FNO

Distributor:

Harvard Dataverse

Date of Distribution:

2026-06-05

Version:

1

Bibliographic Citation:

Samuel, Jim; Siritha Chidipothu; Khanna, Tanya; Lakra, Ashish; Vidhi Gala, 2026, "Global Artificial Intelligence News Headlines (GAIN-H) Corpus", https://doi.org/10.7910/DVN/7C6FNO, Harvard Dataverse, V1, UNF:6:Axjbvx2xbD9mrqy6cDc5Jw== [fileUNF]

Study Description

Citation

Title:

Global Artificial Intelligence News Headlines (GAIN-H) Corpus

Identification Number:

doi:10.7910/DVN/7C6FNO

Authoring Entity:

Samuel, Jim

Siritha Chidipothu

Khanna, Tanya

Lakra, Ashish

Vidhi Gala

Other identifications and acknowledgements:

Jim Samuel

Tanya Khanna

Ashish Lakra

Vidhi Gala

Distributor:

Harvard Dataverse

Date of Deposit:

2026-06-03

Holdings Information:

https://doi.org/10.7910/DVN/7C6FNO

Study Scope

Keywords:

Computer and Information Science, Social Sciences

Abstract:

The Global Artificial Intelligence News Headlines (GAIN-H) Corpus is an open-access public informatics collection of three complementary datasets containing over 2.5 million artificial intelligence-related news headlines gathered from global news sources across multiple languages, countries, and time periods. The repository was created to support interdisciplinary research on how artificial intelligence is represented, framed, and discussed within the public sphere. The collection includes: (1) a metadata-rich corpus with temporal, linguistic, and URL-structural features; (2) a large-scale longitudinal corpus optimized for temporal analysis; and (3) an extended multilingual corpus containing search-term metadata that enables keyword-stratified analysis of AI discourse. Together, these datasets span more than two decades of AI-related news coverage and provide researchers with resources for studying media framing, sentiment, public discourse, AI governance, communication, computational social science, and natural language processing. The repository is intended for researchers, policymakers, educators, journalists, and practitioners seeking to examine trends in AI-related media coverage across time, geography, language, and thematic domains. The datasets are released to promote transparency, reproducibility, and evidence-based research on the societal implications of artificial intelligence. The datasets were developed as part of the RAISE (Rethinking AI for Shared Empowerment) initiative at the MPI Program, Bloustein School, Rutgers University, and the AIXosphere AI behavioral trends research.

Methodology and Processing

Sources Statement

Data Access

Notes:

CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0)

Other Study Description Materials

File Description--f13987192

File: Global Artificial Intelligence News Headlines (GAIN-H) Corpus Dataset 1.tab

  • Number of cases: 60168

  • No. of variables per record: 19

  • Type of File: text/tab-separated-values

Notes:

UNF:6:QQyeZ85+ugQm8lFmZK+2kg==
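
As a rough illustration of how the case and variable counts above might be checked after download, the following Python sketch reads a tab-separated file with the standard library and reports the number of data rows and columns. The function name `count_cases_and_variables` and the in-memory sample are hypothetical; the 19 column names are taken from the Variable Description in this codebook, and the sample values are placeholders, not real records.

```python
import csv
import io

# Column names as listed in the Variable Description for file f13987192.
DATASET1_COLUMNS = [
    "title", "link", "date", "source", "country", "language",
    "translated_title", "Day_of_Week", "Month", "Year", "Quarter",
    "Is_Weekend", "Is_Holiday", "Final_URL", "Domain", "Subdomain",
    "URL_Depth", "TLD", "URL_Length",
]


def count_cases_and_variables(fileobj):
    """Return (number of data rows, number of columns) for a tab-separated file."""
    reader = csv.reader(fileobj, delimiter="\t")
    header = next(reader)               # first row holds the variable names
    n_cases = sum(1 for _ in reader)    # every remaining row is one case
    return n_cases, len(header)


# Placeholder two-row sample standing in for the real file (values are not real data).
sample = io.StringIO(
    "\t".join(DATASET1_COLUMNS) + "\n"
    + "\t".join(["x"] * len(DATASET1_COLUMNS)) + "\n"
    + "\t".join(["y"] * len(DATASET1_COLUMNS)) + "\n"
)
cases, variables = count_cases_and_variables(sample)
print(cases, variables)
```

For the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved; the codebook reports 60168 cases and 19 variables for Dataset 1. The encoding is an assumption, since the codebook does not state it.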

File Description--f13989194

File: Global Artificial Intelligence News Headlines (GAIN-H) Corpus Dataset 2.tab

  • Number of cases: 277428

  • No. of variables per record: 5

  • Type of File: text/tab-separated-values

Notes:

UNF:6:XXazCN8Yef4e/4Y+gOXORw==

Variable Description

List of Variables:

Variables

title

f13987192 Location:

Variable Format: character

Notes: UNF:6:26wk+LUGYCDgWILd4tnztQ==

link

f13987192 Location:

Variable Format: character

Notes: UNF:6:5E7XNebf6GEquZbtPkNlFA==

date

f13987192 Location:

Variable Format: character

Notes: UNF:6:5gQnKs9SP/zTsWmgTdGMmA==

source

f13987192 Location:

Variable Format: character

Notes: UNF:6:TLn38oRKPcjCl9kYC2AZ+Q==

country

f13987192 Location:

Variable Format: character

Notes: UNF:6:gBn/G7CYcefEuaCSgz3pPQ==

language

f13987192 Location:

Variable Format: character

Notes: UNF:6:WgrKGkA/mUfMd4rwECAHRQ==

translated_title

f13987192 Location:

Variable Format: character

Notes: UNF:6:TfbqA5k7fZYhb5AfoNnAig==

Day_of_Week

f13987192 Location:

Variable Format: character

Notes: UNF:6:o0kOWNKx6gXOfOlxvuhvEA==

Month

f13987192 Location:

Summary Statistics: Valid 60168.0; Min. 1.0; Max. 12.0; Mean 6.914589150378958; StDev 3.338992393720864

Variable Format: numeric

Notes: UNF:6:RLP/1zR+OsX6M6o1nsthgg==

Year

f13987192 Location:

Summary Statistics: Valid 60168.0; Min. 2020.0; Max. 2023.0; Mean 2022.3023201701901; StDev 0.8736259495169177

Variable Format: numeric

Notes: UNF:6:uBmEEb+IS6DlmzWqc4lQaQ==

Quarter

f13987192 Location:

Summary Statistics: Valid 60168.0; Min. 1.0; Max. 4.0; Mean 2.6526891370828385; StDev 1.116907303049801

Variable Format: numeric

Notes: UNF:6:U1ZY07K/Ifzc776gTdFGCA==

Is_Weekend

f13987192 Location:

Variable Format: character

Notes: UNF:6:D3hbJiTgOjm4/3+os4DJvw==
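
The codebook does not state how Day_of_Week, Month, Year, Quarter, and Is_Weekend were derived, nor the format of the `date` variable. The sketch below shows one plausible derivation, assuming an ISO-formatted date string (YYYY-MM-DD) and calendar quarters. The function `temporal_features` is hypothetical, and because Is_Weekend is stored as a character variable whose encoding is not documented, the boolean returned here is only a stand-in.

```python
from datetime import date

# Fixed English day names, so the result does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def temporal_features(iso_date: str) -> dict:
    """Derive calendar features from an ISO date string (assumed format)."""
    d = date.fromisoformat(iso_date)
    return {
        "Day_of_Week": DAY_NAMES[d.weekday()],   # weekday(): Monday is 0
        "Month": d.month,
        "Year": d.year,
        "Quarter": (d.month - 1) // 3 + 1,       # calendar quarters, 1 to 4
        "Is_Weekend": d.weekday() >= 5,          # Saturday or Sunday
    }


features = temporal_features("2023-07-15")
print(features)
```

Is_Holiday is left out of the sketch because it depends on a holiday calendar that the codebook does not identify.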

Is_Holiday

f13987192 Location:

Variable Format: character

Notes: UNF:6:H45u+iRQBnz6QycMimIhzQ==

Final_URL

f13987192 Location:

Variable Format: character

Notes: UNF:6:YTRLSKdM3xzD9VXNSEah3Q==

Domain

f13987192 Location:

Variable Format: character

Notes: UNF:6:a9YZ9zXdzoSaLYveliccPQ==

Subdomain

f13987192 Location:

Variable Format: character

Notes: UNF:6:QIRgzLt9nXawZodYUuKUdA==

URL_Depth

f13987192 Location:

Variable Format: character

Notes: UNF:6:UHlDOvSxIPqidhBKqU0iBw==

TLD

f13987192 Location:

Variable Format: character

Notes: UNF:6:nMqIqkm849U3F8xPRFzewg==

URL_Length

f13987192 Location:

Variable Format: character

Notes: UNF:6:YHnZD3W3tnsO3eFxvxjysA==
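
The URL-structural variables (Domain, Subdomain, URL_Depth, TLD, URL_Length) are not defined in this codebook. The following sketch shows one way such features could be computed from a URL such as Final_URL, using Python's `urllib.parse`. The definitions are assumptions made for illustration: Domain is taken as the last two host labels, TLD as the last label, URL_Depth as the number of non-empty path segments, and URL_Length as the character count of the full URL. The function `url_features` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute simple structural features of a URL (assumed definitions)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    # Non-empty path segments, e.g. "/tech/ai/story" has three.
    path_segments = [seg for seg in parsed.path.split("/") if seg]
    return {
        "Domain": ".".join(labels[-2:]),      # naive: last two host labels
        "Subdomain": ".".join(labels[:-2]),   # whatever precedes the domain
        "TLD": labels[-1] if labels else "",
        "URL_Depth": len(path_segments),
        "URL_Length": len(url),
    }


example_url = "https://news.example.com/tech/ai/story"  # hypothetical URL
features = url_features(example_url)
print(features)
```

The last-two-labels rule misreads hosts with multi-label public suffixes (for example, names ending in `co.uk`); a public-suffix list would be needed to handle those cases, and the dataset's own values may follow a different convention.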

No

f13989194 Location:

Summary Statistics: Valid 277428.0; Min. 0.0; Max. 277427.0; Mean 138713.5; StDev 80086.70957780698

Variable Format: numeric

Notes: UNF:6:rGhDWMhiBGv3ZHaAaoThqg==
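
The summary statistics reported for `No` are consistent with a zero-based row index running from 0 to 277427, although the codebook does not describe the variable. For the integers 0 to n-1, the mean is (n-1)/2 and the sample standard deviation is sqrt(n(n+1)/12). The short check below compares these closed-form values with the reported ones; it is an arithmetic illustration only and does not read the data file.

```python
import math

n = 277428  # number of cases reported for file f13989194

# Closed-form statistics for the integers 0, 1, ..., n-1.
expected_mean = (n - 1) / 2
expected_stdev = math.sqrt(n * (n + 1) / 12)  # sample (n-1 denominator) form

# Values reported in the codebook for the variable `No`.
reported_mean = 138713.5
reported_stdev = 80086.70957780698

mean_matches = expected_mean == reported_mean
stdev_matches = math.isclose(expected_stdev, reported_stdev, rel_tol=1e-9)
print(mean_matches, stdev_matches)
```

Agreement on both values suggests, but does not prove, that `No` simply numbers the rows; confirming this would require inspecting the file itself.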

date

f13989194 Location:

Variable Format: character

Notes: UNF:6:jQlztzPc0mMwwxHlm8Puvw==

title

f13989194 Location:

Variable Format: character

Notes: UNF:6:+uLI5FQyWW0pa866Tv8pWA==

source

f13989194 Location:

Variable Format: character

Notes: UNF:6:dg6f0KiurzvTSZwOvxOdLg==

language

f13989194 Location:

Variable Format: character

Notes: UNF:6:0epXYrTBdWLrDj790OJRCw==

Other Study-Related Materials

Label:

Global Artificial Intelligence News Headlines (GAIN-H) Corpus Dataset 3.csv

Notes:

text/comma-separated-values