Packages from San Francisco
When Maik Fröbe orders a package from California, he isn’t awaiting the delivery of a smartphone, laptop or tablet. Nor is he expecting a Shiraz or Zinfandel. No, for the computer scientist it’s not about devices or delicacies, it’s about data. The most recent delivery from San Francisco contained 300 terabytes, and the transfer took over a month despite a download rate of up to one gigabit per second. “That’s as fast as it gets,” he says, adding, “we don’t want to overload the lines.”
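A rough back-of-envelope calculation - not part of the original article, and assuming decimal terabytes and a sustained rate of exactly one gigabit per second - confirms that such a transfer does indeed take about a month:

```python
# Back-of-envelope check (not from the article): how long do 300 terabytes
# take at a sustained rate of one gigabit per second?
volume_bits = 300e12 * 8        # 300 TB (decimal) expressed in bits
rate_bits_per_s = 1e9           # 1 Gbit/s
seconds = volume_bits / rate_bits_per_s
print(seconds / 86400)          # ~27.8 days, i.e. roughly a month at full speed
```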
Fröbe is a research assistant at the Institute of Computer Science at MLU and is involved in the “Immersive Web Observatory” (IWO). His boss, Professor Matthias Hagen from the research group “Big Data Analytics”, initiated this innovative project four years ago together with Professor Martin Potthast from the University of Leipzig and Professors Benno Stein and Bernd Fröhlich from the Bauhaus-Universität Weimar. The IWO is funded by the Federal Ministry of Education and Research.
Knowledge repository with access barriers
“The Web is the largest repository of data and information there is,” says Matthias Hagen. “This makes it interesting not only for private and commercial use, but also for research.” Many areas of computer science deal with algorithms for storing and retrieving data - for example, to investigate how knowledge management can be linked with artificial intelligence. The Web is also invaluable for the digital humanities, a rapidly growing field of the humanities and social sciences. As an independent medium, it provides a snapshot of a section of society, showing how we communicate with each other, which topics determine our discourse, and which stakeholders get to have their say.
When conducting such analyses, however, researchers in the fields of computer science, sociology or history encounter two fundamental hurdles: First, the structure of the Web is very heterogeneous; it cannot be combed through like a single dataset - even using the best search engines. Second, research also requires historical data that maps the development of the Web itself. This is particularly difficult because an average webpage is estimated to be online for 60 to 90 days at most before it is updated or even deleted. According to Hagen, “If you don’t work in the development department of Google or another internet corporation, and don’t have a ready-to-use copy of the Web at your fingertips, you have little chance of gaining access to this historical data.”
Transferring eight quadrillion bytes
But computer scientists have found a way to get around this hurdle by tapping into a source of inestimable value: the web archive of the Internet Archive. In 1996, two years before Google was founded, the American computer scientist Brewster Kahle began regularly archiving all the content available on the Web. Prominent websites such as news sites, which are constantly changing and accessed by millions of people, are copied several times a day; less important websites are copied less often. The library is stored on 20,000 hard drives in four data centres in San Francisco and now contains around 500 billion webpages as well as more than 29 million books and texts, almost seven million videos and films, 14 million audio files, and almost four million image files.
The research group is now bringing a representative cross-section of the web archive to Germany. Eight petabytes, or eight quadrillion bytes, will be transferred from the data centres in California to IWO’s 78 servers at the Bauhaus-Universität. “We started downloading in 2019 and downloaded one petabyte by the end of January 2021,” says Maik Fröbe. The data is not being transferred continuously; on request, staff at the Internet Archive assemble individual data packages, which are then retrieved from Germany. The transfer is expected to take until the end of 2022. By then the servers in Weimar will hold as many as twelve petabytes of data, because the data is stored in multiple copies to guard against possible data loss. To illustrate: 12,000 standard PCs, each with a one-terabyte hard drive, would be needed to store this amount of data.
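To put these orders of magnitude in perspective, a similar back-of-envelope sketch (again assuming decimal units and an idealised, uninterrupted one-gigabit-per-second link, which is not how the transfer actually runs) shows why the schedule spans several years:

```python
# Rough orders-of-magnitude check (decimal units; the uninterrupted 1 Gbit/s
# link is an idealisation used purely for illustration).
transfer_bytes = 8e15                 # 8 petabytes to be copied to Weimar
seconds = transfer_bytes * 8 / 1e9    # bits divided by bits per second
print(seconds / 86400 / 365)          # ~2.0 years even at full, constant speed

stored_bytes = 12e15                  # 12 petabytes kept, including redundant copies
print(stored_bytes / 1e12)            # 12000.0 -> the 12,000 one-terabyte PCs from the text
```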
Indexing and analysing big data in Halle
When the transatlantic data transfer is complete, the work of Halle’s computer scientists will be far from over: “The copy of the Web is initially nothing more than an unstructured collection of information,” says Matthias Hagen. “We will need to employ effective analytical tools to unlock the content.” An indexing cluster is to be created at MLU, a keyword system that enables structured searches prompted by scientific questions. With the help of big data analysis, it will be possible to search for patterns in large amounts of data from different sources - for example, to find text, audio or image files on specific events or by particular authors.
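The article does not describe the indexing software itself; purely as an illustrative sketch, the keyword system mentioned above can be thought of as an inverted index that maps each term to the documents containing it:

```python
from collections import defaultdict

# Toy keyword (inverted) index: each term maps to the identifiers of the
# documents that contain it. The real IWO indexing cluster is, of course,
# far larger and more sophisticated; this only illustrates the principle.
documents = {
    "doc1": "climate debate on the web",
    "doc2": "web archive preserves public debate",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A structured search then becomes a lookup plus an intersection of document sets.
print(index["debate"] & index["web"])   # {'doc1', 'doc2'} (order may vary)
```

At the scale of several petabytes, such an index naturally has to be spread across many machines - which is precisely what the planned cluster at MLU is for.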
The digital humanities will particularly benefit from the work of the computer scientists from Central Germany. The research questions are as diverse as society itself: Who publishes on the Web? Has the culture of discourse changed? Can the Web serve as a source of data for historians? Can the monetary value of platforms like Wikipedia be measured? How has the self-portrayal of public institutions, companies and individuals developed on the Web? Matthias Hagen expects a high degree of involvement from the research community. “We provide the impetus, create access and ensure good searchability. However, representatives from the different professional communities know best which raw data is the most relevant. That is why we are explicitly calling upon anyone interested to enter into constructive dialogue with us.”
Professor Matthias Hagen
Institute of Computer Science
Telephone: +49 345 55-24708
Mail: matthias.hagen@informatik.uni-halle.de