Modelling identity for different purposes
Registries of data—whether in research, learning, government, or other domains, and whether repositories, data warehouses, Learning Management Systems, or libraries—typically contain metadata not just on the content itself, but on who the data came from. The people responsible for the data are of interest to the people consuming the data; so registries need to record information about them as well. The primary kind of people (or groups of people) that are of interest are the authors of the data—or, where that concept is not as applicable, the contributors or compilers of the data. (Because institutions and organisations can also claim authorship, we prefer to refer to parties rather than people, following the ISO 2146 information model for registries.) But many parties can be responsible for data ending up in a registry, in the form it does; a registry can track a range of parties involved with data, in a range of roles: publisher, editor, validator, annotator, designer.
Because it is important to record information about parties, lots of registries record that information, in lots of ways, and to widely varying extents of detail. That means that there are a variety of information models at play for parties in registries. That doesn’t mean that all information models are rigorous and well thought out. Whacking in just the login name of an uploader, as YouTube does, is itself an information model for a party involved with the content, even if the amount of thought that went into it was not overwhelming.
But that does not mean YouTube’s information model is wrong. How much information you capture on parties for a registry depends on what use that information will be put to in the registry. The information model for parties is driven by the business requirements of the registry.
That of course is no great surprise, and working out what information is required is not particularly onerous: people may not put a lot of thought into it when they put registries together, but often enough they don’t need to. Still, especially if you are shopping for standards on representing parties, it is worth spending a couple of minutes working out what you need—and as importantly, what you don’t need.
Because after all, there is such a thing as Too Much Information. For several reasons:
- Date of Birth and Tax File Number as a requirement to upload YouTube videos may well have proven a non-starter: registering information about people comes up against privacy concerns, and they usually can’t be wished away.
- The more information registered on a party, the more information needs to be kept up to date and accurate, especially if the registry is being treated seriously as a registry. So more data introduces more data management.
- If data is going to be transferred between registries, inconsistencies between different local profiles of data standards will matter, and different data standards will matter even more. So different kinds of information also raise interoperability concerns.
- Contributors that are adequately housetrained may well think nothing of filling out ten pages of forms in detail. But most contributors aren’t housetrained (or repository staff), and they can’t be bothered. They can be even less bothered if they don’t see what the relevance is of the metadata they need to provide. So the business relevance of metadata (or rather, the *perception* of relevance) is a motivator to people entering more or less metadata. And so is sloth.
We can tease out the considerations in business relevance some more, by presenting the reasons users might want to know about parties, and the information about parties required by those uses.
So, why should we record information about authors (or contributors or compilers)? The answer seems obvious: “I want to know who did this”. The answer is not as obvious as that; but even that answer implies more than just whacking a first name and surname into a text field. If we want to know who wrote something, we need enough metadata to identify that party uniquely from other candidates. In most naming conventions and most contexts, a name is enough to do that; or more precisely, naming conventions evolve to accommodate the usual contexts in their societies. (Surnames are not particularly old, and middle names even less so; Roman names evolved to include disambiguating nicknames.) But with any large enough registry, there will be people with the same name, a problem we’ve already looked at extensively in a previous post. And there needs to be access to disambiguating information, to tell them apart.
The disambiguating information is critical. As we’ve argued in past projects, it’s the real definition of identifier resolution: if your registry involves two different Charles Darwins, resolving identifier A to the string “Charles Darwin” and identifier B to the string “Charles Darwin” is no more useful than using “Charles Darwin” to begin with. It’s only if A and B can take you to disambiguating information, such as Current Institutional Affiliation or Embarrassing Middle Name, that the identifiers become meaningful. The registry may not have to be responsible for gathering that disambiguating information itself; it is responsible for making sure someone does.
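The point about resolution can be sketched in a few lines of Python. This is only an illustration, not any particular registry’s API: the `Party` record, the `REGISTRY` dictionary, the identifiers `A` and `B`, and the affiliation values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    """Hypothetical party record: a display name plus disambiguating facts."""
    name: str
    disambiguators: tuple = ()  # e.g. (("affiliation", "University X"),)


# Hypothetical registry holding two different parties who share one name.
REGISTRY = {
    "A": Party("Charles Darwin", (("affiliation", "University X"),)),
    "B": Party("Charles Darwin", (("affiliation", "University Y"),)),
}


def resolve_to_name(identifier: str) -> str:
    """Resolution that hands back nothing but the name string."""
    return REGISTRY[identifier].name


def resolve_to_record(identifier: str) -> Party:
    """Resolution that hands back the record, disambiguators included."""
    return REGISTRY[identifier]


def distinguishable(id_a: str, id_b: str, resolver) -> bool:
    """True if the resolver gives different answers for the two identifiers."""
    return resolver(id_a) != resolver(id_b)
```

With name-only resolution, `distinguishable("A", "B", resolve_to_name)` is false: the identifiers have bought us nothing over the bare string. With `resolve_to_record` it is true, because the affiliations differ.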
We should note that ambiguity is a problem YouTube doesn’t have: users all have a unique login name. Of course, a login name like Matrix141414 doesn’t tell you much about the person who uploaded the clip: where they live, their employer, their professional background, their related activities. But the question didn’t involve any of that. The question was simply “I want to know who did this”. And if you’ve never met or heard of Zóltan Székely, then whether I tell you it was Matrix141414 or Zóltan Székely leaves you none the wiser: both are legitimate answers to the question.
What good *does* it do you to know it’s Zóltan Székely rather than Matrix141414? To google them? To write to them? To link the registry objects up with your HR system? To find other stuff by them? YouTube’s already got the last one covered—so long as you’re limited to searching YouTube; but did you want to find stuff by them elsewhere?
But if you want any of that, you’re no longer asking “I want to know who did this”. You’re after different things, which require different models of identity, relying on gathering different kinds of information.
- Bridging Identity: If I know the party under one identity, and the party is being identified under another identity, I would usually want it made clear to me that the two identities are the same. So if I already know Zóltan Székely, it is meaningful for me to map Matrix141414 to Zóltan Székely. Why that is meaningful is a different issue, and goes back to why you want to know who did this in the first place.
- Audit: We wish to account for someone’s intellectual output, whichever identity they happened to have published under. This is something the Research Office and the funding agencies are if anything more interested in than the researcher.
- Authority: We wish to establish whether the claims in the intellectual output should be trusted, by finding out more about who has made those claims.
- Attribution: We wish to explicitly acknowledge the authorship of some intellectual output, giving credit to its author.
- Discovery: More commonly than one might think, we already know that a certain author does good work or work in a field we are interested in, and we want to discover more intellectual outputs by the same author—which we think we will also be interested in. That means that discovery of intellectual output runs along the lines of social networking.
- Subject Matter: In the humanities in particular, parties may not just be responsible for contributing intellectual output to a registry: they may form the subject matter of those outputs.
If you are gathering information on a party as subject matter, you’re moving towards an online biography of the party; the information you’re gathering is open-ended, and no longer metadata, but data proper. If your requirements are audit, your metadata needs are minimal: you just need enough metadata to establish the unambiguous identity of someone you already have plenty of metadata on in your own HR systems. If your requirements are attribution, you don’t need much more: you presumably need to know the author’s affiliation, since the affiliated institution gets some of the credit.
But if you’re using the identity to establish authority, then the identity you need isn’t limited to disambiguation: you need access to the credentials which allow you to trust the author. In fact, you need at least some of the author’s CV, which will have been gathered in a biography. Again, the registry may not gather that information, but it needs to provide access to an identity construction that does; that means that scholarly repositories have to upload content attributed to Zóltan Székely or szekel@akademia.edu.hu instead of Matrix141414, and that Zóltan Székely has an online CV, or at least consistently publishes under that name.
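The requirements so far can be written down as a small lookup from purpose to the party metadata it needs. This is a rough sketch under stated assumptions: the field names (`unambiguous_id`, `affiliation`, `credentials`, `biography`) are invented for illustration and do not come from ISO 2146 or any other standard.

```python
from enum import Enum


class Purpose(Enum):
    SUBJECT_MATTER = "subject matter"
    AUDIT = "audit"
    ATTRIBUTION = "attribution"
    AUTHORITY = "authority"


# Illustrative field names only; a real registry would pick its own.
REQUIRED_FIELDS = {
    Purpose.AUDIT: {"unambiguous_id"},                   # HR already has the rest
    Purpose.ATTRIBUTION: {"unambiguous_id", "affiliation"},
    Purpose.AUTHORITY: {"unambiguous_id", "credentials"},
    Purpose.SUBJECT_MATTER: {"unambiguous_id", "biography"},
}


def missing_fields(record: dict, purposes) -> set:
    """Return the fields the record still lacks for the given purposes."""
    needed = set().union(*(REQUIRED_FIELDS[p] for p in purposes))
    return {name for name in needed if not record.get(name)}


# A YouTube-style record: a login name and nothing else.
login_only = {"unambiguous_id": "Matrix141414"}
```

Here `missing_fields(login_only, [Purpose.AUDIT])` is empty, while asking for attribution and authority as well reports `affiliation` and `credentials` as missing. The login name was never wrong; it just answers a narrower question.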
And if you’re using the identity for discovery, it is critical that the author published consistently under that name, or that some system is deduplicating the identities the author publishes under: the metadata becomes the contributions associated with the author, and how they are associated. Discovery is critical to navigating data: once people have established the authority of an author from one contribution, it’s easy for them to carry it over to that author’s other contributions. (Bakunin might have some things to say about this implicit authority, but you can ponder that in your own time.) It means that the registry must make the author identity searchable, not just visible.
That applies to aggregators like Research Data Australia even more than to aggregated registries, since the point of aggregators is to enable discovery across a range of repositories: such collections have to capture party identities, deduplicate them, and expose them to search, if they are to be used for discovery as they are intended to be.
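Capture, deduplicate, and expose to search can be sketched as follows. Everything here is hypothetical: the repository names, the titles, the `party:1` key, and above all the `SAME_AS` table, which this sketch assumes someone else has already curated rather than working it out itself.

```python
from collections import defaultdict

# Hypothetical harvested records: (repository, identity as published, title).
RECORDS = [
    ("video-site", "Matrix141414", "Clip 1"),
    ("repo-a", "Zóltan Székely", "Dataset 1"),
    ("repo-b", "szekel@akademia.edu.hu", "Paper 1"),
    ("repo-b", "Someone Else", "Paper 2"),
]

# Assumed to be curated elsewhere: published identity -> one party key.
SAME_AS = {
    "Matrix141414": "party:1",
    "Zóltan Székely": "party:1",
    "szekel@akademia.edu.hu": "party:1",
}


def build_index(records, same_as):
    """Group contributions under a deduplicated party key."""
    index = defaultdict(list)
    for repository, identity, title in records:
        # Identities with no known mapping simply stand for themselves.
        party = same_as.get(identity, identity)
        index[party].append((repository, title))
    return index


def works_by(identity, index, same_as):
    """Find every contribution by whoever is behind this identity."""
    return index.get(same_as.get(identity, identity), [])


INDEX = build_index(RECORDS, SAME_AS)
```

Searching on any one of the three identities, for instance `works_by("Matrix141414", INDEX, SAME_AS)`, returns all three contributions across the three repositories. Without the `SAME_AS` table, the same search would find only the clip.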


