Fluid identity in repositories
The business of a library is to establish authoritative identities for the works it makes available. That is why libraries put together authority files, which give authors unambiguous names: those are the names books are indexed under, and searched under, in library catalogues. The advantages of having an unambiguous identity for an author are obvious. A researcher who wants credit for their work (or the department whose funding depends on it) doesn’t want credit to go to another researcher with the same name. Anyone collecting royalties on their published work will want their identity to be unambiguous as well, though not all fields of research make it as worthwhile to chase after residuals.
Library users also appreciate disambiguation: if I am looking for works by or about the contemporary German novelist Richard Wagner (1952- ), I’d like to avoid the deluge of works by or about the slightly more famous German composer Richard Wagner (1813-1883). And a library catalogue is being helpful when it includes the dates of birth to differentiate between the two Richard Wagners—just as Wikipedia is, when it refers to Richard_Wagner_(novelist).
Making those kinds of distinctions depends on having good enough metadata on the authors. If you’ve published a dead-tree book in the past few decades, your national library has been in cahoots with your publisher to make sure it has that metadata. *I* don’t remember giving the Library of Congress my year of birth, but it avoids a car dealer in Florida getting credit for any books I’ve written. (See Libraries Australia.)
Uncertainty in identity
Good metadata on authors is not always available though. With antiquity, for example, we’re driving blind, and the identity of authors is more fluid than most of us are used to. There are abundant mentions of authors with the same name, who we can’t be sure are the same person.
For example, there’s an ancient Oppian who wrote a book on hunting, and an ancient Oppian who wrote a book on fishing. We distinguish them by where they were born, which is metadata: Anazarbus (or Corycus) and Apamea (aka Pella). We’re not convinced the two Oppians were actually both called Oppian, though, and we can’t go to the Anazarbus (or Corycus) Town Hall and find out. (Our metadata is shaky enough, we don’t even know whether to look in Anazarbus or Corycus.) English Wikipedia doesn’t bother giving them separate entries either, though the French Wikipedia does: of Corycus, of Apamea.
Worse still, there are 13 ancient medical authors called Apollodorus (alongside 16 other ancient authors called Apollodorus); they may not be 13 different people, but they may not all be the same person either.
Faced with this lack of information on author identities, scholars can only shrug their shoulders: until we get more metadata, we don’t know who’s distinct from whom. And being Classicists, they shrug their shoulders in Latin: f.i.q., Fortasse idem qui: “possibly the same person as”. In classical bibliographies like the Thesaurus Linguae Graecae Canon, that abbreviation sees a lot of use, because the identity of a lot of ancient authors is in doubt.
As some librarians could tell you (if you got them drunk enough), the deluge of information we now have means we know more about more, but with less certainty. If you don’t have the meticulous metadata that a National Library gathers, you’re left with papers or grant applications by several Apollodoruses (or John Smiths), that you can’t be sure are by the same person. Supplying Date Of Birth for disambiguation is not a prerequisite to publishing a journal article. Even institutional affiliation doesn’t always work: not all publications feature the institutional byline; institutions change names; researchers aren’t beholden to working at the same institution exclusively for their entire professional life; institutions can employ people with the same name.
And there is a lot of variability in what researchers choose to call themselves. A researcher may have published as Kate Mansfield; Katherine A. Mansfield; Katherine Alice Mansfield née Beauchamp; K. Mansfield Beauchamp; Dr Kate Beauchamp; Katherine Mansfield PhD. Sorting through that variability is deduplication work. Without deduplication, you end up with situations like the current pilot of Research Data Australia, which lists these as four different researchers at the Australian Institute of Marine Science:
- Dr Miles Furnas
- Furnas, Miles
- Furnas, M, Dr
- Furnas, Miles, Dr.
Of course Research Data Australia is still in pilot, and is continuing to work on improving its deduplication; there are no longer five variations of Alongi, Daniel M., Dr at the same institution. Selecting any one of the duplicate identities risks missing the research listed under another duplicate, and is a serious problem in using the registry, which is why ANDS has identified this as a priority concern.
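To make the Furnas case concrete, here is a minimal sketch of the crudest kind of deduplication heuristic: strip titles, work out which token is the surname, and group names by surname plus first initial. The function names and the `TITLES` list are made up for illustration; this is not how Research Data Australia or any other registry actually does it.

```python
from collections import defaultdict

# Hypothetical list of honorifics and suffixes to ignore; a real list would be longer.
TITLES = {"dr", "prof", "mr", "mrs", "ms", "phd"}


def name_key(raw):
    """Reduce a display name to a crude (surname, first initial) key."""
    def is_title(token):
        return token.rstrip(".").lower() in TITLES

    if "," in raw:
        # "Surname, Given, Title" style: the surname comes first.
        parts = [p.strip() for p in raw.split(",") if p.strip()]
        parts = [p for p in parts if not is_title(p)]
        surname, given = parts[0], " ".join(parts[1:])
    else:
        # "Title Given Surname" style: the surname comes last.
        tokens = [t for t in raw.split() if not is_title(t)]
        surname, given = tokens[-1], " ".join(tokens[:-1])
    return (surname.lower(), given[:1].lower())


def group_candidates(names):
    """Group names sharing a key; each group is a set of candidate duplicates."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


variants = ["Dr Miles Furnas", "Furnas, Miles", "Furnas, M, Dr", "Furnas, Miles, Dr."]
groups = group_candidates(variants)
print(groups)
```

All four variants land under the single key `("furnas", "m")`. The key is deliberately loose: it would just as happily lump a Miles Furnas together with a Mary Furnas, and it has no way of connecting a Mansfield with a Beauchamp. So the groups it produces are candidates for someone to review, not merges to apply blindly.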
(Persistent identifiers for authors, by the way, are not a silver bullet for that problem. Having a Global Unique Persistent identifier for “Katherine Mansfield PhD”, and a different Global Unique Persistent identifier for “Dr Katherine Mansfield”, is if anything worse than just using names: the identifier implies that the authoritative deduplicating has already been done, and it may not have been.)
Because of the deluge of data, and the lack of resources to deal with the data, deduplication is often done by computer through heuristics. But deduplication is fallible, and identifying Kate Mansfield with K. Mansfield Beauchamp, based on harvested data, is something tentative: if more metadata turns up, we may discover we’ve made the wrong identification. In fact we may discover that, even if we’ve done the deduplication by hand.
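A heuristic that takes its own fallibility seriously might look something like the sketch below: instead of answering yes or no, it returns a hedged verdict together with the evidence behind it, so the decision can be revisited when more metadata turns up. The `Record` fields, the verdict labels and the cut-off are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One harvested author mention. All field names are hypothetical."""
    surname: str
    given: str
    affiliation: str = ""
    coauthors: frozenset = frozenset()


def compare(a, b):
    """Return (verdict, evidence) for two records; no verdict is final."""
    if a.surname.lower() != b.surname.lower():
        # Not proof of two people: surnames change, as Mansfield/Beauchamp shows.
        return "no link found", ["surnames differ"]
    if a.given and b.given and a.given[0].lower() != b.given[0].lower():
        return "no link found", ["first initials differ"]

    evidence = ["surname and first initial are compatible"]
    if a.affiliation and a.affiliation == b.affiliation:
        evidence.append("same affiliation")
    shared = a.coauthors & b.coauthors
    if shared:
        evidence.append("shared co-authors: " + ", ".join(sorted(shared)))

    # Arbitrary cut-off: name compatibility plus two corroborations.
    verdict = "likely same" if len(evidence) >= 3 else "possibly same"
    return verdict, evidence


kate = Record("Mansfield", "Kate", "Example University", frozenset({"L. Example"}))
initial_only = Record("Mansfield", "K.", "", frozenset({"L. Example"}))
print(compare(kate, initial_only))
```

With only a shared co-author to go on, the pair comes out as “possibly same”, which is f.i.q. in all but name. Note that the weakest verdict is “no link found” rather than “different”: the absence of evidence for a match is not evidence of two people.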
Fluidity of identity
That, though, takes us back to the Classicists’ annotation f.i.q., “possibly the same person as”: the identification of two authors as the same person is not always a given. It could be wrong, and it may need to be undone. Once we allow that, we no longer have an authoritative notion of identity. What we do have is a claim of identity by *an* authority. But then, all metadata are claims, and all claims are claims made by *an* authority. This may be reminiscent of “wikiality”: the notion that a fact is a fact only if enough people say it is on Wikipedia.
(Or, if you’d like a more erudite reference, cf. Bakunin’s anarchist take on the authority of bootmakers. Which makes for salutary reading anyway, particularly by predicting the Bloviating Internet Pundit and the intrinsic danger they pose, back in 1871.)
But reality is subject to consensus between authorities more often than we care to admit, especially when we’re dealing with bulk data. The dependency on multiple authorities means identity is fluid: it depends on who’s asking, who’s telling, and it’s subject to revision. The author identity claimed today may not be the same as tomorrow; and the works we attribute to that identity—which is why we care about author identity in the first place—may be broken up among different new identities, or merged with others’.
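One way to picture identity as a set of claims is the sketch below. Each “same person” assertion is stored with the authority that made it, and the identities themselves are never stored: they are recomputed from whichever claims the person asking chooses to trust. The class, the function and the authority labels are hypothetical, and the grouping is a plain union-find over names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SameAsClaim:
    """A claim that two names denote one person, made by one authority."""
    name_a: str
    name_b: str
    authority: str


def identities(names, claims, trusted):
    """Partition names into identities, using only claims from trusted authorities."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of this name's group.
        parent.setdefault(name, name)
        while parent[name] != name:
            name = parent[name]
        return name

    for claim in claims:
        if claim.authority in trusted:
            parent[find(claim.name_a)] = find(claim.name_b)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


names = ["Kate Mansfield", "K. Mansfield Beauchamp", "Dr Kate Beauchamp"]
claims = [
    SameAsClaim("Kate Mansfield", "K. Mansfield Beauchamp", "harvester-heuristic"),
    SameAsClaim("K. Mansfield Beauchamp", "Dr Kate Beauchamp", "repository-librarian"),
]
print(identities(names, claims, trusted={"repository-librarian"}))
print(identities(names, claims, trusted={"repository-librarian", "harvester-heuristic"}))
```

Someone who trusts only the librarian sees two people; someone who also trusts the harvester’s heuristic sees one. Retracting a claim means removing it from the list and recomputing, and because the partition is derived rather than stored, there is nothing to un-merge.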
From the perspective of authority files and national libraries, this fluidity is merely an artefact of poor metadata: if we only knew the institutional affiliation, date of birth, and CV of the authors, surely we could deduplicate the world. That’s what national libraries try to do. But the fragmentation of identity providers in cyberspace means people routinely have multiple identities to manage, and multiple identities to write under. An institution’s citation metrics need all those identities reconciled, so all the researcher’s work can be tracked. Normally the researcher wants that too. (We’ll explore in a future blog post the case where they do not.)
Approaches to identity: NicNames
The ambiguity of author names is an old problem, and the old solution has been to work out an unambiguous identity, through authority files, as far as your resourcing will allow. But with an overwhelming number of identities to hack through, and limited resourcing, the more fluid notion of identity has started to be explored in repositories, particularly as cyberspace has made people familiar with fluid notions of identity.
Names are not just about software. They’re about people.
And this is where NicNames comes in.
The NicNames project, which is now concluding, has been working on deduplicating authors in institutional repositories. (Here’s an anecdote on their own Apollodorus problem: two Charles Darwins, an alias used to confuse the guilty.) In itself, that’s nothing new: all repositories have to do deduplication, whether or not they actually do it successfully.
But NicNames confronts the reality of researchers who may go by a variety of names, like the Katherine Mansfields we saw above. It addresses this reality with automated heuristics and analysis of social networks, as is becoming mainstream in deduplication. But it also addresses it with legwork—the social network of the institution itself is more effective than any number-crunching, and institutional libraries employ people who can actually get on the phone, and work out who’s who. The goal of NicNames is to effect a culture change in repositories: using not just authority files, but the authority of local knowledge.
Even more importantly, NicNames leaves researcher identity fluid. Even if the local legwork has established that the author J. Ron Black is the person University Payroll knows as Jonathan R. Black, NicNames does not have a notion of a primary identity. If you search for Jonathan R. Black, you’ll get results under Jonathan R. Black; but if you search under J. Ron Black—as the researcher prefers to use in publications—that’s the identity you will get search results for. The identities of researchers are acknowledged to be contingent and subject to revision: they are not hardcoded, but imposed over publications as an added layer.
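What “an added layer” could mean in practice is sketched below. This is one reading of the behaviour described above, not a description of how NicNames is implemented, and every name in it is hypothetical. Publications keep their bylines exactly as printed; the identity layer is a separate, revisable list of name sets with no member flagged as primary; and a search is presented under whichever name was asked for.

```python
# Publications keep the byline exactly as printed; nothing here is ever rewritten.
publications = [
    {"title": "Paper A", "byline": "J. Ron Black"},
    {"title": "Paper B", "byline": "Jonathan R. Black"},
    {"title": "Paper C", "byline": "J. R. Black"},
]

# The identity layer: sets of names currently believed to be one person.
# No member of a set is marked as the primary name.
alias_sets = [frozenset({"J. Ron Black", "Jonathan R. Black"})]


def search(name, publications, alias_sets):
    """Return titles attributed to `name`, presented under the name searched for."""
    names = {name}
    for group in alias_sets:
        if name in group:
            names |= group
    titles = [p["title"] for p in publications if p["byline"] in names]
    return {"shown_as": name, "titles": titles}


print(search("J. Ron Black", publications, alias_sets))
print(search("Jonathan R. Black", publications, alias_sets))

# Revising the layer changes the results without touching any publication.
revised = [frozenset({"J. Ron Black", "Jonathan R. Black", "J. R. Black"})]
print(search("J. Ron Black", publications, revised))
```

Under the first layer, both names retrieve Papers A and B, each shown under the name that was searched for, while Paper C stays with the unlinked “J. R. Black”. Under the revised layer, all three papers turn up. That is the snapshot idea in miniature: the publications are the fixed record, and the attribution is whatever the layer says today.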
That’s not payroll’s thinking or the grant agencies’, and NicNames *is* addressing their needs for establishing which publications are attributed to which researchers. But while they are accustomed to static records, what NicNames gives them is a snapshot; and the next snapshot may be different, because the world may be different.
NicNames does deduplication in just one repository, but it is eager to explore ways of sharing this deduplication among repositories. This would help avoid a lot of redundant work, given the widespread nature of scholarly collaboration. The model that suggests itself is a federation of NicNames instances, sharing resources, and consuming authority files from other initiatives as additional resources for their deduplication. Peter Sefton at the CAIRSS blog proposes one approach, involving the NLA. We are also investigating ways of moving the activity forwards.
In a future blog post, we will look at a complementary approach to deduplicating author identity, coming out of the UKOLN/DRIVER workshop on international repository infrastructure.


