Bibliothèque nationale de France
bibnum.bnf.fr

The WARC File Format (ISO 28500) - Information, Maintenance, Drafts

Purpose

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects) into one long file, each record consisting of a set of simple text headers and an arbitrary data block.
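
The record layout described above can be illustrated with a minimal Python sketch. It is a simplified illustration, not a conformant WARC writer or reader: the function names write_record and read_records are hypothetical, the "WARC/1.0" version line and the WARC-Type and Content-Length header names follow the published standard as commonly implemented, and other mandatory fields (such as WARC-Record-ID and WARC-Date) are omitted for brevity.

```python
import io

CRLF = b"\r\n"

def write_record(out, headers, block):
    """Append one record: version line, text headers, blank line, data block."""
    out.write(b"WARC/1.0" + CRLF)
    for name, value in headers.items():
        out.write(f"{name}: {value}".encode("utf-8") + CRLF)
    # The declared length lets a reader skip the block without parsing it.
    out.write(f"Content-Length: {len(block)}".encode("utf-8") + CRLF)
    out.write(CRLF)          # blank line ends the headers
    out.write(block)         # arbitrary bytes, never interpreted
    out.write(CRLF + CRLF)   # two line breaks close the record

def read_records(stream):
    """Yield (headers, block) for each record in a concatenated stream."""
    while True:
        version = stream.readline()
        if not version:      # end of file
            return
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break
            name, _, value = line.decode("utf-8").rstrip("\r\n").partition(": ")
            headers[name] = value
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)       # skip the closing CRLF CRLF
        yield headers, block

# Usage: two records of different types concatenated into one in-memory "file".
buf = io.BytesIO()
write_record(buf, {"WARC-Type": "resource"}, b"<html>hello</html>")
write_record(buf, {"WARC-Type": "resource"}, bytes([0, 255, 13, 10]))
buf.seek(0)
records = list(read_records(buf))
```

Because the reader relies only on the declared length, the second block can contain raw binary bytes, including line-break characters, without confusing the parser. This is the sense in which the container needs only minimal knowledge of the objects it carries.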

Context

For many years, memory organizations have sought the most appropriate ways to collect and keep track of World Wide Web material using web-scale tools such as web crawlers. At the same time, those organizations have a growing need to archive large numbers of born-digital and digitized files. What was needed was a container format that allows a single file to carry, simply and safely, a very large number of constituent data objects for the purposes of storage, management, and exchange. Those data objects must be of unrestricted type (including many binary types), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC file format offers all these capabilities. It is an extension of the ARC file format, which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The ARC file format was created in 1996 by Brewster Kahle and Mike Burner of the Internet Archive to manage billions of objects, and is used today by several national libraries.

The motivation to extend the ARC format arose from the discussions and experience of the International Internet Preservation Consortium (IIPC - http://netpreserve.org), whose mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. The IIPC Standards Working Group submitted a draft presenting the WARC file format to ISO TC46/SC4/WG12. The draft was accepted as a new Work Item by ISO in May 2005.
The format was published as an international standard in May 2009.

WARC file format maintenance

Requests for modifications to, or a revision of, the WARC file format can be sent to: normalisation@bnf.fr
Any needed revision of ISO 28500 will follow the normal revision procedure.
The procedure is monitored by ISO TC46/SC4/WG12.

Latest draft

PDF file: Information and documentation - The WARC File Format - ISO 28500 / Draft as of November 2008
Word file: Information and documentation - The WARC File Format - ISO 28500 / Draft as of November 2008


Last updated: 2009-05