Norway's National Library Digitizes Its Entire Collection

The current interview with Svein Arne Solbakk,
Director of IT and Digitization at the National Library of Norway

More and more libraries are safeguarding their irreplaceable holdings through digitization, but no country has yet proceeded as consistently as Norway. The National Library of Norway is the first national library in the world to digitize all books, newspapers and journals ever printed in its country. Over the coming years, more than 200 million pages of books, newspapers and journals will be digitized in a purpose-built digitization centre near the Arctic Circle in northern Norway.

   The large-scale project, initiated by the National Library and funded by the government, is scheduled to take 15–25 years, depending on the allocation of funds. During this period, 60 million newspaper pages, more than 60 million book pages and 80 million journal pages will be digitized. For two of the largest newspapers, the National Library of Norway has already stopped producing microfilm; these holdings are now digitized exclusively and made available on the Internet as PDF files. An evaluation is currently under way to determine whether the microfilming of all other Norwegian newspapers should also end, starting with the 2008 volumes.

Svein Arne Solbakk

   The Hamburg-based IT company CCS GmbH was commissioned at the beginning of 2007 to supply the workflow software and is equipping the in-house digitization centre with its technology. With its docWORKS software, the company is an internationally sought-after specialist in the true-to-original digitization of texts from books, newspapers and journals. CCS prevailed over several competitors in this project by offering the best price-performance ratio, achieved through a high degree of automation. The flexibility to process all document types and the high digitization quality were also cited as reasons for choosing CCS.

   Svein Arne Solbakk, the project lead and Director of IT and Digitization at the National Library of Norway, and B.I.T.-online editor Angelika Beyreuther discussed the details of the large-scale project in October.

The National Library of Norway is actually located both in Oslo and in the small town of Mo i Rana, close to the Arctic Circle. Currently, 200 of the 370 employees at the National Library have their main office in Mo i Rana. Also, the main secure storage magazines (stacks) are located in Mo i Rana, and thus most of the material to be digitized is already stored there. We are, however, also transporting some material from Oslo to Mo i Rana for digitization, since it is not cost-effective to establish two complete digitization units at the National Library. We have a small digitization activity in Oslo, to digitize unique and rare material that is too fragile to be transported to Mo i Rana.

Also, it should be mentioned that the microfilming of newspapers as well as work on conversion of audiovisual materials to new formats are located in Mo i Rana. These activities are closely related to the digitization, and we have developed the competence of the staff in the direction of digitization.

First, I would like to mention an organizational change that was carried out as a consequence of our strategic decision to digitize everything. Currently, ICT development, ICT maintenance and all the digitization work are organized into one department in the National Library. The ICT and digitization department currently has 84 people. We wanted to use ICT knowledge to streamline the digitization, so that we could utilize the potential of people and equipment as much as possible. Also, we wanted to ensure that all digital content was taken care of in a secure digital repository, and that we make use of the digital content in our digital library services. Close organizational ties between the digitization activity and the IT activity have made it easier to achieve these goals.

However, the digitization programme affects the workflow in the National Library as a whole, and a lot of people in the organization are involved. The people closest to the collections set the priorities for digitization and prepare the material for the digitization process. The cataloguing department catalogues and organizes the material so that it can be retrieved after digitization. The dissemination department is involved in developing the digital library services that make use of the digital content and the metadata. All the departments are involved in fetching material from the magazines and bringing it to the digitization areas.

The current focus is on digitizing books, manuscripts, photos, radio and other sound recordings, and video. In the near future, we will also start digitizing periodicals. Most of the digitization is taking place in Mo i Rana, but there is also a small unit in Oslo digitizing books, manuscripts and other rare and fragile material, in addition to digitization on request from our users. The Oslo unit counts six persons, and they use four I2S digibook scanners (three A1 and one A0), as well as one Hasselblad camera (H3D-39). The Oslo unit is located together with our conservators.

In Mo i Rana there are approximately 40 people working with digitization. In addition, there are approximately 30 IT people working on activities related to the digitization, and there is additional staff fetching material from the magazines for digitization, doing the necessary cataloguing, and working on copyright matters. One example of the IT activity is systems development for streamlining the digitization. We are building production chains that integrate the whole process, from ordering material from the vaults all the way to storing the digitized material in the digital repository in preservation quality. Making the digital content searchable and accessible in our digital library is a part of the production chain. The development of our digital library service, NBdigital, also requires effort. In addition, quite a few people are involved in maintaining the necessary IT infrastructure and the running applications.

The digitization in Mo i Rana is split into five subunits: mass digitization of printed material, object digitization of printed material, photo, sound recordings and radio, and video. Eight people are working on digitizing sound recordings, radio and video. In this area we are digitizing a large number of formats, and for some of them it is very hard to find working equipment that can still read them. Digitization for preservation is therefore of great importance; otherwise the contents of these rare formats will be lost forever.

For the mass digitization of printed material we cut open the books and use two automated scanners (Agfa MF S 655) for the book pages. In addition, we scan the covers on I2S Copibooks. With these scanners we digitize approximately 1,000 books every week, with an average of approximately 200 pages per book. On average, five to six people handle this production.

With 15–20 megabytes per page, this amounts to up to 4 terabytes of data every week that need to be transported through the local network to secure storage.
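The weekly figure follows directly from the numbers quoted above. The short Python sketch below reproduces the arithmetic; the function name is only illustrative, and it assumes decimal units (1 terabyte = 1,000,000 megabytes), which the interview does not specify.

```python
def weekly_volume_tb(books_per_week, pages_per_book, mb_per_page):
    """Estimate the weekly scan volume in terabytes.

    Assumes decimal units (1 TB = 1,000,000 MB); the interview
    does not state which convention is meant.
    """
    pages_per_week = books_per_week * pages_per_book
    return pages_per_week * mb_per_page / 1_000_000


# Figures quoted in the interview: about 1,000 books per week,
# about 200 pages per book, 15-20 MB per page.
low_tb = weekly_volume_tb(1000, 200, 15)
high_tb = weekly_volume_tb(1000, 200, 20)

print(f"Pages per week: {1000 * 200}")
print(f"Weekly volume: {low_tb} to {high_tb} TB")
```

With these inputs the estimate comes out at 3 to 4 terabytes per week, which matches the "up to 4 terabytes" stated in the text.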

Object digitization of printed material focuses on the material that cannot be disassembled. In Mo i Rana we have two I2S digibook scanners, as well as one DL 3000 from 4DigitalBooks, which automatically turns the pages of books and newspapers during digitization. In addition, we are still microfilming quite a lot of newspapers. However, it is likely that in the near future we will terminate the microfilming and instead digitize, in preservation quality, the new newspapers that are not deposited with us as high-quality PDF files. Scaling up the digital deposit is of course also an issue at hand. Object digitization also includes the digitization of photos in various formats. Altogether, five people digitize approximately 60,000 photos per year for preservation.

It should be mentioned that the digital preservation of these large amounts of digital content is a major challenge. The storage capacity of our digital repository currently exceeds one petabyte of digital content. Every digital object is stored in three copies, one on disk and two in tape libraries in two different buildings. The maintenance of one petabyte of disk storage and two petabytes of tape storage is more challenging than we were able to foresee. Also, the very large number of digital files is challenging to handle. The challenges related to digital preservation come on top of that. The estimated growth of digital content is 750 terabytes per year for the next three-year period. After that we expect to exceed 1 petabyte per year due to an increase in digital film content.
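To give a feel for what these figures imply, the following Python sketch projects the repository size over the three-year period mentioned above. It is only an illustration under stated assumptions: the starting point is taken as exactly 1 petabyte (the text says only that it "exceeds" this), growth is treated as a constant 750 terabytes per year, every object is kept in three copies, and decimal units are used (1 PB = 1,000 TB). The function name is hypothetical.

```python
def project_repository(start_tb, growth_tb_per_year, years, copies=3):
    """Project unique content and raw storage need, year by year.

    Returns a list of (year, content_tb, raw_storage_tb) tuples, where
    raw storage is the content multiplied by the number of copies kept.
    """
    rows = []
    content_tb = start_tb
    for year in range(1, years + 1):
        content_tb += growth_tb_per_year
        rows.append((year, content_tb, content_tb * copies))
    return rows


# Assumptions (not exact figures from the interview):
# start at 1 PB = 1,000 TB, constant growth of 750 TB per year,
# three copies of every object (one on disk, two on tape).
projection = project_repository(start_tb=1000, growth_tb_per_year=750, years=3)

for year, content_tb, raw_tb in projection:
    print(f"Year {year}: {content_tb} TB of content, {raw_tb} TB across all copies")
```

Under these assumptions the unique content would pass 3 petabytes after three years, and the storage needed across all three copies would approach 10 petabytes.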

We want to be able to do full-text search in all printed material we digitize. With our large volume of digitization, this requires a powerful tool for OCR and structural analysis. In an open tender, CCS offered us docWorks as a solution for this challenge. They won the bid because the solution they offered had the best price/functionality ratio. For us, it is also reassuring that CCS themselves use docWorks extensively in their own production. Their know-how is very valuable in the implementation of the processes at the National Library. The scalability of docWorks is important for us. We currently run eight servers that automatically handle OCR of our digitization production. In addition, we have a few workstations for manual handling of exceptions, and for more extensive structural analysis and proofreading of selected titles.

After making the strategic decision to digitize the complete collection, the National Library of Norway started by reallocating people and money to this strategic activity within our existing budget. Based on this, as well as on concrete results from the activity and our careful explanation of the potential services that may be established on the basis of a large digital collection, the government has, in its proposed budget for 2009, matched our reallocation of resources with fresh money. We believe that this approach has proven more successful than making ourselves fully dependent on fresh resources before starting this important work.

We also discussed collaboration with some of the commercial firms up front, but the terms for such a collaboration were not acceptable for us. In some areas, e.g. for radio broadcasts and newspapers, the owners of the material collaborate with the National Library on the digitization. We have been able to agree on the quality level on the digitization as well as on the OCR and structural data. The cost related to the digitization and post-processing is split between the National Library and the owner of the material, and thus we are able to digitize much more than would be possible if we did the job separately.

We are working closely together with the copyright holders' organizations. At an early stage of the digitization programme, we established a pilot together with these organizations. In the pilot we make around 1,000 titles that are still protected by copyright freely available on the Internet. We have carefully monitored the use of these works, and thus we have been able to discuss with the copyright holders the consequences of giving free access to digitized works. The first impression is that, for books that are not in high demand on the market, such access promotes the commercial use of the material rather than competing with it.

The copyright law in Norway states that the National Library can make its digital collection available for research and studies in the reading rooms of the National Library. Due to this, all the digitized material can be made available in the Library. All material that is free from copyright is made freely available for search and access in our digital library. We also plan to make it easy to download this material as PDF files. Copyright-protected material is only made available outside the Library when we have agreements with the copyright holders that make this possible.

We are continuously making improvements in the processes. We have a lot of experience now that would have been very nice to have when we started the programme! Our first learning experience was that we were too optimistic about how fast we would be able to establish fully operational large-scale digitization. It took more than six months just to carry out the necessary tenders and make the investments in scanner equipment. During this process we made changes in our organization, adapted the necessary areas for the digitization workflow, and developed the first steps in the production chain to be able to start the digitization. During the first four to six months of production we had to make a lot of adjustments to come even close to our production goals. Even though we knew the data volumes in advance, the IT infrastructure had to be optimized further to handle the large data volumes efficiently. The people doing the digitization work had mainly been doing microfilming before, and the work process was very different. Keeping the scanners running continuously, even during lunch breaks, for example, required a change of mindset that we had underestimated. Also, the scanners were not usually used for the kind of material we had. Old books contain a lot of dust, which required more extensive cleaning of the scanners than normal.

After six months of production, however, we reached our production goals. This was in the beginning of 2007, and since then the digitization has been very stable at an acceptable production level.

Another experience is that it is important to be flexible regarding technology and the digitization method, while the chosen quality level and the choice of formats must ensure that the digital material can have a very long life. Rescanning is a very expensive option. When new scanner technology or relevant applications become available, the production chains must allow changes that make it possible to utilize the new inventions.

Especially in the automated digitization, where we have the largest volumes, we have been working quite a lot to reach a preservation quality level in the resulting images. We are scanning everything at 400 dpi with 24-bit colour depth. However, getting an authentic image of the book, especially regarding colours, has proven difficult. We are therefore now doing a colour correction of approximately 12 million images (using Color Factory from Fotoware) to bring the images closer to the look of a new book rather than focussing on making an authentic image of the book being digitized. The focus is on the contents (i.e. text and layout) rather than on the condition of the book being digitized. However, for our object digitization, we focus on authenticity in relation to the original book.

In general, we proceed according to our plans. Next year we will expand the digitization of printed material to handle journals and magazines in addition to the books. This introduces a new level of complexity in the handling of OCR and document structure.

The presence of an extensive database of digital culture and knowledge is important in a world that is getting more and more digital by the day, to maintain freedom of expression, to support education at all levels, and to make our culture and knowledge easily available in our society in general. Also, such a resource will make it possible to enhance the quality of research, and even open new areas of research that have not been feasible before.

A danger we experience even today is that knowledge that is not available in digital form on the Internet is not taken into account at all by a steadily increasing number of researchers and students. If important national resources are not available in digital form, this is a threat both to the quality of the research and to the use and knowledge of our national heritage.

Librarians are specialists in handling, classifying and retrieving relevant information. These are valuable skills in a digital society, where the enormous amounts of information can be a hindrance to getting access to the relevant parts. Therefore, it is important to match the knowledge of the technological opportunities in information handling with the knowledge organization skills to get the best from both professions. Obviously, when a collection is available in a digital format, we must reconsider the ways we catalogue and retrieve information. However, the profession of a librarian is much broader than the cataloguing rules.

Also, I believe that the book in itself is an ingenious device. Who would replace the book with a computer in bed, on the train, or in their best chair on a Sunday afternoon? I believe that libraries need to change to adapt to the changing needs of our citizens. Digital resources will play an important role both in the libraries and in the homes of the citizens. However, I do not believe in bookless libraries, nor that the libraries will disappear. Therefore there will always be a need for service-minded and skilful librarians.