All WARC and no playback: The materialities of data-centered web archives research

This paper examines the Web ARChive (WARC) file format, revealing how the format has come to play a central role in the development and standardization of interoperable tools and methods for the international web archiving community. In the context of emerging big data approaches, I consider the soc...

Bibliographic Details
Main Author: Emily Maemura
Format: Article
Language: English
Published: SAGE Publishing 2023-01-01
Series: Big Data & Society
Online Access: https://doi.org/10.1177/20539517231163172
collection DOAJ
description This paper examines the Web ARChive (WARC) file format, revealing how the format has come to play a central role in the development and standardization of interoperable tools and methods for the international web archiving community. In the context of emerging big data approaches, I consider the sociotechnical relationships between material construction of data and information infrastructures for collecting and research. Analysis is inspired by Star and Griesemer's historical case of the Museum of Vertebrate Zoology which reveals how boundary objects and methods standardization are used to enroll actors in the work of collecting for natural history. I extend these concepts by pairing them with frameworks for studying digital materiality and the representational qualities of data artifacts. Through examples drawn from fieldwork observations studying two data-centered research projects, I consider how the materiality of the WARC format influences research methods and approaches to data extraction, selection, and transformation. Findings identify three modalities researchers use to configure WARC data for researcher needs: using indexes to support search queries, constructing derivative formats designed for certain types of analysis, and generating custom-designed datasets tailored for specific research purposes. Findings additionally reveal similarities in how these distinct methods approach automated data extraction by relying upon the WARC's standardized metadata elements. By interrogating whose information needs are being met and taken into account in the design of the WARC's underlying information representation, I reveal effects on the emerging field of web history, and consider alternative approaches to knowledge production with archived web data.
id doaj.art-2277213278d545558b66e60a2932db77
institution Directory of Open Access Journals
issn 2053-9517