A plan for libraries to aggregate metadata into one central portal

A guest post from Annie Schutte, a librarian, teacher and consultant for the Knight Foundation

The following is part of a series that looks at The Digital Public Library of America – the first national effort to aggregate existing records in state and regional digital libraries so that they are searchable from a single portal.

Digital Public Library of America Director of Content Emily Gore has helped set an ambitious agenda for the DPLA launch in April – one that she says should “turn the people out there who are skeptics into believers.” The project’s mission, she explains, “really is to become the national digital library of the United States.” DPLA is making that happen not by building its own collection, but by bringing together the metadata from existing large institutions and state and regional aggregations around the country into one central portal.

The United States has seen large aggregation efforts before – The Digital Library Federation’s Aquifer portal, for example – but nothing on this scale with this mission has been attempted. Gore explains that many other countries have had an easier time getting their national digital libraries off the ground because they “have a natural organizational government structure where libraries report up through a national library, so it’s a more natural relationship … we don’t have that in the United States.”

Gore attributes DPLA’s success so far to “taking the time to plan and taking the time to really involve the community from the very beginning … so that there are people invested beyond just the people running the project.” Part of that planning has been working out how to build DPLA to accommodate all of the metadata variations in the United States, and also thinking globally so that DPLA can interact with other digital libraries around the world.

DPLA has already partnered with Europeana—the digital library of Europe—to launch a joint exhibition on immigration from Europe to the United States. The project “Leaving Europe: A New Life in America” has already become Europeana’s most popular exhibit, and Gore anticipates more collaboration in the future. Europeana is already planning, for example, to use DPLA’s open API in April to build a joint-search tool that would allow users to search both aggregations at once.

The Europeana partnership is possible in part because of DPLA’s commitment to open data—a commitment that Europeana shares. Gore explains DPLA’s model, saying, “one of the fundamental principles is openness, and that’s in the code base; that’s in the metadata; that’s in the communication; that’s in the community.” What that means is that DPLA will open its back door so that anyone can use the metadata for new creations. Gore hopes that this openness will make the DPLA collections more shareable and accessible to all—and open up the possibility of a truly global digital library in the future.

In this interview, Gore talks about the challenges of aggregating diverse metadata into DPLA, thinking globally in building a code base, making the collection as comprehensive and inclusive as possible, and how to build long-term sustainability for this exciting, new project.

Could you tell us about your mission for the Digital Public Library of America?

The mission really is to become the national digital library for the United States. Many countries have done this already, and many have a natural organizational government structure where libraries report up through a national library, so it’s a more natural relationship. And we don’t have that in the United States. We operate very independently. We do have the Library of Congress, but they have their own separate mission, and it’s not to be over other libraries like the organizational structure that exists in many other national libraries. That makes our challenge even greater here because we don’t have a lot of top down to be able to do this.

We feel like the time is right to begin to pull together this content that exists in the United States—pull together this metadata and provide broad access to this content.

DPLA is launching in April with seven service hubs. Can you explain what it means to be a service hub and how DPLA selected those pilot programs?

The seven service hubs are going to offer a suite of services. They’re going to first, for the April launch, take the metadata that they have already in their aggregations and share that with the DPLA. They’re doing a little bit of massaging of that metadata, getting it ready, doing some normalization in concert with us, and that is the first phase: to prepare the content that they already have and share that with us for the initial launch.

And then, post-launch, they will bring on some new content in their projects. That can potentially mean taking on some new partners—people and organizations they haven’t worked with in the past. They all have different scenarios, if you will, but they were all existing infrastructures that were either in the six states or one region that we have participating.

We’re really building off of existing infrastructure to get this thing going, and in many cases, they had existing partners. Minnesota, for example, has something in the range of 150 partners. They work with all kinds of cultural heritage organizations in their state, and they already know—even with working with 150 partners—that the capabilities of bringing on more are clearly there. There are more people that need materials digitized; there are more organizations who can bring this kind of content to the table.

Were you looking for geographic diversity, diversity of collections? Were there certain factors in selecting that group?

Sure. Geographic diversity was absolutely one. Some differences in the way they were organized was another, so whether or not they held stuff centrally—Georgia, Kentucky are both good examples; they have a centralized model where they have a central repository. South Carolina is an example of a distributed model where there are multiple repositories around the state, and they come together under a metadata aggregation. So, geographic diversity, differences in what they brought to the table regarding their setup, their staffing, those kinds of things.

DPLA also envisions having content hubs—large repositories that contribute content directly to DPLA. What partnerships have you made in this area so far and how do you anticipate that expanding?

Before I answer this, let me go back and say that part of what we are going to do with the service hubs in addition to the digitization and bringing on new partners and stuff like that, is the community engagement, which is a key part of the Knight Foundation’s vision for the project. And so, every service hub is planning community engagement activities around themes or what have you. They’re going to work with, in most cases, local public libraries in their communities to really get them involved in the project, so to get that down to the people, if you will. So that’s another piece of the service hubs.

But on to the content hubs. We have already announced that Harvard will be the first content hub, so the Harvard Libraries are aggregating various collections that they have, and they’ll be sharing those with us. And we anticipate being able to announce very soon that we’ll have some large-scale collaborators in the Smithsonian and the National Archives and Records Administration and the New York Public Library, as well as a couple of large academic libraries.

So, we’re looking at a scale, with the National Archives and the Smithsonian alone, of more than a million records being shared from those institutions, so we’re excited about that. And then, really, marrying those together with somewhere in the neighborhood of a million plus from the hubs, and then bringing in these other folks like the New York Public Library and some of the academic institutions. ARTstor is also coming to the table as a content hub; we anticipate them being able to share their content, and that announcement should come out fairly soon.

Do you think any of the content-hub content will be available at the April release?

Absolutely. I anticipate that all the folks that I listed will be available at the April release.

DPLA is also launching with an exhibition in collaboration with Europeana. Is this the start of the Digital Public Library of not just America, but the world?

The folks at Europeana are amazing to work with, and we have launched an initial exhibition on immigration, and that has been extremely successful. We visited Europeana, and they were sharing with us that [the collaborative exhibition] accounted for over 30 percent of the traffic to their website … We certainly envision continuing to work closely with Europeana.

So, not only will we have this portal in April, but we’ll also have an open API where people can get at the data in the data store and build apps or their own interfaces to access the content. So, we have talked to Europeana about making that API available to them when we feel like it is stable and has all the data that is going to be represented for April. And they would like to take that, and then take their API, and create a mashup so that we soon have what’s in DPLA and what’s in Europeana available in a cross-search app, so that you can search both at the same time.

It’s the beginning of an amazing partnership.

Hopefully, together with them, we can push these things forward, and hopefully others will join… so that we can have global interoperability.
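
To make the cross-search idea concrete, here is a minimal sketch in Python. It does not call the real DPLA or Europeana APIs; the two collections are small in-memory stand-ins, and the names `search_collection` and `cross_search` are hypothetical, chosen only for illustration. It shows one possible way to run a single query against several sources and interleave the hits so that neither source crowds out the other.

```python
from itertools import zip_longest


def search_collection(records, query, source):
    """Case-insensitive title match over one in-memory collection (a stand-in for a real API call)."""
    needle = query.lower()
    return [
        {"source": source, "title": record["title"]}
        for record in records
        if needle in record["title"].lower()
    ]


def cross_search(query, collections):
    """Run the same query against every collection and interleave the hits by source."""
    per_source = [
        search_collection(records, query, name)
        for name, records in collections.items()
    ]
    merged = []
    # zip_longest pads shorter result lists with None; skip the padding.
    for group in zip_longest(*per_source):
        merged.extend(hit for hit in group if hit is not None)
    return merged


# Invented sample records, not real catalog data.
collections = {
    "DPLA": [
        {"title": "Immigrants arriving at Ellis Island"},
        {"title": "Minnesota farm, 1905"},
    ],
    "Europeana": [
        {"title": "Emigrants leaving Hamburg"},
        {"title": "Ship carrying immigrants to America"},
    ],
}

results = cross_search("immigrants", collections)
for hit in results:
    print(hit["source"], "-", hit["title"])
```

In a real mashup, each call to `search_collection` would presumably be replaced by a request to the corresponding aggregation's API, and the merge step would have to reconcile differing metadata fields before results could be shown side by side.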

Can you talk more about the decision to build DPLA on a completely open-source platform — what drove that decision, and what types of challenges and opportunities does it present?

What drives that decision is that it’s the right thing to do. I just think it’s the right thing to do, and I think most people in our community involved in these kinds of projects would tell you the same thing. It’s easily replicable, shareable; there are not roadblocks for sharing. Working with proprietary systems typically means there are roadblocks to getting to the content, and so if we can tear down those roadblocks right from the beginning—or not put them in place—then we’re setting up the framework for sharing.

One of the fundamental principles of the DPLA is openness, and that’s in the code base; that’s in the metadata; that’s in the communication; that’s in the community. An open community. We have invited everyone from the community that wants to be a part of this, so it’s not only open-source software or a framework; the whole vision of the DPLA is about openness as a participatory platform.

The idea of having a national digital library or archive has been talked about and worked on in various forms before. What do you think has allowed the DPLA to gain traction as a large-scale project now, and what makes it different from other aggregation projects?

I think we’ve had aggregation projects. I don’t know that we’ve ever had the mission and the vision and the planning efforts to say: we are building a national digital library. We’ve had very large-scale aggregations: we’ve had OAIster; we’ve had DLF Aquifer; we’ve had lots of different projects that were large-scale aggregations and large-scale projects. But to my knowledge, no one has ever set off with this mission from the very beginning to say: this is what we’re doing; community, you are all welcome; please come to the table, bring your ideas, and let’s do this together…

I feel like we’re starting out right. We’re starting out by taking the time to plan and taking the time to really involve the community from the very beginning in this project so that there are people invested beyond just the people running the project. I think that really makes the difference. It’s building a network; it’s building a large-scale community instead of just building a project.

What have been the biggest challenges so far in bringing the DPLA project to fruition?

From my perspective as director of content, it’s always a challenge to deal with an aggregation of this scale, because although we obviously have standards in the field for metadata, everyone gives them their own slightly different interpretation. So, when we bring all this content together, and we try to make it work harmoniously, different fields may be implemented in very different ways.

Metadata is a huge challenge from the very beginning, working with that and making sure that we do some normalization work. We work with others to say alright, this may have worked in your local environment, but this is what it looks like in a global environment, and can we work together to make this better for a global environment? So, those kinds of things, just back and forth, that’s a challenge.
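
The kind of normalization described here can be pictured with a small Python sketch. This is an illustration under stated assumptions, not DPLA's actual pipeline: the alias table, the field names, and the sample records are all invented. It shows two hubs labeling the same information differently and a single function mapping both onto one shared set of field names.

```python
# Hypothetical alias table: for each shared field, the local names a hub might use.
FIELD_ALIASES = {
    "title": ("title", "dc:title", "Title"),
    "creator": ("creator", "dc:creator", "author"),
    "date": ("date", "dc:date", "created"),
}


def normalize_record(record):
    """Return a new dict that uses only the shared field names."""
    normalized = {}
    for target, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            value = record.get(alias)
            if value:
                # Collapse stray whitespace so equal values compare equal.
                normalized[target] = " ".join(str(value).split())
                break
    return normalized


# Invented records: the same photograph as two hubs might describe it.
hub_a = {"dc:title": "  Main Street,  1910 ", "dc:creator": "Unknown"}
hub_b = {"Title": "Main Street, 1910", "author": "Unknown", "created": "1910"}

print(normalize_record(hub_a))
print(normalize_record(hub_b))
```

Renaming fields is the easy part; the harder cases Gore alludes to are those where two institutions use the same field name to mean different things, which a lookup table like this cannot detect on its own.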

Another broad-scale challenge is the skepticism. A lot of people in the community are very skeptical about the project, and I think that is because there have been aggregations that have really failed in the past, and I think that when we have something to show, and we do have this initial launch, and we have this API that people are able to interact with, I’m very hopeful that that skepticism will really begin to change, and people will begin to see the possibility of DPLA, and we’ll turn the people out there who are skeptics into believers. And hopefully they’ll be interested in sharing and being a part of this platform, as well.

What challenges do you see going forward as DPLA continues to expand post April launch?

Obviously, comprehensiveness: making sure that we are broadly represented and that every community has a stake in this, whether libraries, archives, museums, historical societies, you name it. The cultural heritage record would like to be represented, so I think comprehensiveness is something that I want to be very conscious of in terms of content coverage, and I think that is certainly a large-scale challenge: making sure that we have all kinds of content that meets the needs of all the communities who are interested in participating in this platform. So, that is certainly a big thing for the future.

There are tons of challenges. I think, honestly, it’s not something I worry about every night, but I think sustainability — financial, as well as staffing, and moving the project forward, is a key concern. Making sure that we’re building something that’s sustainable. In contrast to Europeana, we don’t have funding that comes from the government, so they have funding that’s allotted from their government every year. They raise money, as well, but they do have this solid funding base from their government, and that’s something that we don’t have as DPLA. Making sure that we have a constant, appropriate funding base, and that we’re out there working on that constantly, and working on making sure that the project, we’re able to sustain it. Maybe one day we will have government funding—that would be great—but we have to have success before that.

What’s your big-vision hope for the future of the DPLA?

First, I want to get as much content online in the US as possible. After that huge challenge, I really want us to think about what we can do globally—what we can do on a real, global scale and how what we’re doing operates in that world. I think about a researcher, or even an amateur historian, who is looking for something, and the possibility that they could find not only the cultural heritage content that might exist in United States institutions, but European institutions, or African institutions, or you name it.

I think the potential power of that is pretty amazing. But I think we can do a lot here in our own country first. That’s just my long vision — how great would that be on such a large scale?

By Annie Schutte, a librarian, teacher and consultant for Knight Foundation.

Photo of Emily Gore at DPLA Midwest (October 2012), courtesy of Joseph Mornin.