Home

WARCEX is an extensible command-line tool for extracting structured data out of WARC and WACZ files, developed by the Digital Observatory, as part of the Australian Internet Observatory (AIO).

AIO received co-investment (doi.org/10.3565/hjrp-b141) from the Australian Research Data Commons (ARDC) through the HASS and Indigenous Research Data Commons. The ARDC is enabled by the National Collaborative Research Infrastructure Strategy (NCRIS).

Installation

Install from GitHub using pip:

pip install git+https://github.com/QUT-Digital-Observatory/warcex.git

Usage

To get an overview of available commands, run:

warcex --help

You can see what plugins are available by running:

warcex plugins

To get more information about a plugin, including instructions on the web archiving activity, run:

warcex info <plugin-name>

Extracting data:

warcex --plugin fb-groups extract my_input_file.wacz my_output_folder/

You can specify more than one.
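
To process a whole folder of archives one file at a time, the extract command above can be wrapped in a small shell function. This is a minimal sketch, not part of WARCEX: the function name extract_all, its two arguments, and the DRY_RUN variable are hypothetical, and creating the output folder up front is an assumption rather than a documented requirement.

```shell
# Sketch only: extract_all, its arguments and DRY_RUN are hypothetical helpers,
# not WARCEX features. The warcex call is the extract command shown above.
extract_all() {
    local archive_dir="$1" output_root="$2"
    local archive name out_dir
    for archive in "$archive_dir"/*.wacz; do
        [ -e "$archive" ] || continue          # skip if the glob matched nothing
        name="$(basename "$archive" .wacz)"    # e.g. my_input_file
        out_dir="$output_root/$name/"          # one output folder per archive
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it.
            echo "warcex --plugin fb-groups extract $archive $out_dir"
        else
            # Assumption: creating the folder first is harmless if warcex
            # would also create it itself.
            mkdir -p "$out_dir"
            warcex --plugin fb-groups extract "$archive" "$out_dir"
        fi
    done
}
```

Running `DRY_RUN=1 extract_all archives extracted` prints the commands without executing them, which is a cheap way to check paths before a long run. Drop DRY_RUN to perform the extraction.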