Of Icebergs and the Internet


The use of online databases and search tools in family history research has provoked a kind of backlash among more traditional genealogists. It is often said that the Internet is just the tip of the iceberg, with real research taking place in libraries, family history centers and archives. And it really is true that only a fraction of the genealogical data available can be found online. That is changing, of course. More information becomes available online every day, and the amount of data available now was undreamed of a few years ago. But creating new digital repositories is no easy task, and it’s not free. So, for the foreseeable future, we should expect family history to involve working with microfilm, reference works, and even physical papers stored in libraries, churches and private collections.

But the value of digital libraries should not be underestimated. They really have revolutionized genealogical research. In part, I think, there is a kind of nostalgia for traditional methods and archives, and it is thoroughly understandable. But depending on whether you identify more strongly with the digital camp or the traditional camp, you may find yourself either exaggerating or understating both the sheer amount of information available in digital form and the relative comprehensiveness of that information. A bit of explanation is in order here: no matter how much data is available online, if the information you’re looking for is not among it, the sheer volume won’t matter (to you). Comprehensiveness is the degree to which an archive or digital repository includes all of the data you might need, and not just certain resources or data of a particular type. Right now, comprehensiveness is the Achilles’ heel of digital repositories. Sooner or later, you’re going to find yourself needing data that hasn’t been digitized and indexed, or documents that haven’t been scanned or photographed. Sure, there will be plenty of data out there to keep you busy, but there will always be questions that remain unanswered until you start digging into special collections at the library, or spend some time ordering and reviewing microfilm at your local family history center.

So, how do genealogical records end up online? New records are probably stored in databases to begin with. The extent to which data (in the form of public records) are made available through the Internet or print publications is governed by the type of information and applicable law. Not surprisingly, many of the records available on websites like Ancestry.com (such as census records, voter registration, land titles, etc.) are public records. Other countries have their own laws governing public records; in Great Britain, for example, there is the Public Records Act 1958. In the United States, state and local open access laws often provide that certain records be public, and states may make them available digitally. But what about old records such as parish registers, immigration records and the like? Before computers existed, there was no alternative to pen and paper or print publications. Today, many such records are photographed and stored on microfilm or microfiche. More and more of those records are made available on public servers, particularly when there is a strong need for access on the part of researchers. But placing records online is not free, so it stands to reason that only a portion of these documents is being placed on the web. Fortunately, as interest grows, as more people become willing to support projects like these, and as technology makes bandwidth and storage less and less expensive, the proportion of records that are made available continues to grow. In the future, it may be that most, if not all, research will be done online. But we’re not there yet.

Now for an obvious question: how do we go from images of documents to actual text that can be searched and stored in structured records? It’s true that there are technologies like optical character recognition (OCR) that can help in this process, but usually it comes down to people actually reading the documents and transcribing them. Facsimiles of the actual manuscripts are also useful. This is a bit of a sidebar, but it doesn’t take much research before you start encountering conflicting information. How can these discrepancies be resolved? It depends, but in some cases, mistakes in records can be traced to transcription errors. I remember years ago looking at a photograph of a page of John Woodhouse’s pioneer journal on which he gave his place of birth as “Adwick Lest.” I now know that it was an abbreviation for Adwick Le Street, Yorkshire, England, which is the location of the church where he was christened, but I’ve seen all kinds of transcriptions by people who did not know what to make of Adwick Lest. I’ve seen other place names that I can find no trace of on maps or in online databases and that are, most likely, misspellings.
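When I suspect that a place name is a misspelling or an abbreviation, one thing that can help is comparing it against a list of known places and seeing which ones come closest. Here is a minimal sketch of that idea in Python, using the standard library's difflib module. The short gazetteer list and the suggest_places function are my own inventions for illustration; a real list of place names would be far longer, and a close match is only a hint to follow up on, not proof.

```python
import difflib

# A tiny, hand-made list of known places (an assumption for this sketch;
# a real gazetteer would contain many thousands of entries).
GAZETTEER = ["Adwick le Street", "Doncaster", "Sheffield"]

def suggest_places(transcribed, known_places, cutoff=0.6):
    """Return known place names that closely resemble a transcribed name."""
    # Compare in lowercase so capitalization differences do not matter.
    by_lower = {place.lower(): place for place in known_places}
    matches = difflib.get_close_matches(
        transcribed.lower(), list(by_lower), n=3, cutoff=cutoff
    )
    # Map the lowercase matches back to their original spelling.
    return [by_lower[m] for m in matches]

print(suggest_places("Adwick Lest", GAZETTEER))
```

Raising the cutoff makes the suggestions stricter, and lowering it casts a wider net at the cost of more false leads.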

But let’s return to the question of transcribing and publishing historical documents. In some cases, documents are transcribed in full, but in other cases they are merely indexed. That’s just what it sounds like. A card catalog in a library contains only a small amount of information for each book, but it tells you what is there and, just as importantly, where to find it. A library card catalog is an archetypical example of an index. In general, an index is (typically) a printed record that tells you, for example, that the 1840 census for Cleveland, Ohio can be found on certain rolls of microfilm. It may not tell you exactly where to look, but it narrows things down enough that you can find the person you’re looking for (you hope!). This is analogous to an index in a book, which may give you a page number on which a key word can be found, but you still need to scan the page to find what you’re looking for.
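If it helps to see the idea in concrete terms, an index can be thought of as a simple lookup table: you give it a few key details and it tells you where to look. The following is only a rough sketch in Python. The entries, the roll labels (which are invented placeholders, not real microfilm numbers), and the function names are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical index entries: (census year, place, microfilm roll label).
# The roll labels are invented placeholders, not real catalog numbers.
ENTRIES = [
    (1840, "Cleveland, Ohio", "roll-A"),
    (1840, "Cleveland, Ohio", "roll-B"),
    (1850, "Cleveland, Ohio", "roll-C"),
]

def build_index(entries):
    """Group roll labels under a (year, place) key."""
    index = defaultdict(list)
    for year, place, roll in entries:
        index[(year, place.lower())].append(roll)
    return dict(index)

def lookup(index, year, place):
    """Return the rolls to search, or an empty list if nothing is indexed."""
    return index.get((year, place.lower()), [])

index = build_index(ENTRIES)
print(lookup(index, 1840, "Cleveland, Ohio"))
```

Notice that the lookup returns rolls, not people. Just as with a printed index, you still have to go through the rolls themselves to find the person you are after.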

So, how are indexes produced for digital libraries? If we’re lucky, the records will be indexed when they are created, but this is not an option for old archives. It turns out that the source of many indexes is something you may not expect: volunteer labor! One well-known project is FamilySearch.org Indexing. This website was created by the Church of Jesus Christ of Latter-day Saints (or simply the LDS church), but it is open to everyone, as is the indexing project. You may know this project by its older name, Name Extraction, but the two are basically the same thing. You can participate by reading digital images (of a difficulty level you select) and transcribing them using free software provided at the web site. As a quality control measure, other people will review your work for accuracy. This strategy, sometimes called “crowdsourcing,” takes a seemingly overwhelming project and divides the load among a large pool of participants. There is no direct benefit for you, of course, but the idea is that if everyone works together toward a common good, everyone will benefit. There are other similar projects. The popular web site Ancestry.com maintains the World Archives Project. The idea is similar, and it works the same way: you download free software, select batches of items to transcribe, and upload the results using their software. Both of these are good projects and, whichever you choose, your work will benefit the greater genealogical community (not to mention anyone interested in history). Finally, there are other projects, such as FindAGrave.com, that are not traditionally thought of as indexing projects. But think about it: these projects provide online directories of cemeteries that can be searched by name and year to locate the grave site of a specific person. Volunteers provide photographs and other details. So, yes, it’s an indexing project.
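To give a feel for how a review step might catch mistakes, here is a small Python sketch that compares two volunteers' transcriptions of the same record and flags the fields where they disagree, so that a reviewer can look at the original image. I don't know how FamilySearch or Ancestry.com actually implement their quality control, so treat this as one plausible approach and nothing more. The field names, sample values, and the find_disagreements function are all invented for illustration.

```python
def find_disagreements(first, second):
    """Return the fields where two transcriptions of one record differ."""
    flagged = {}
    for field in sorted(set(first) | set(second)):
        a = first.get(field, "")
        b = second.get(field, "")
        # Ignore differences in capitalization and surrounding spaces.
        if a.strip().lower() != b.strip().lower():
            flagged[field] = (a, b)
    return flagged

# Invented sample transcriptions of the same (hypothetical) record.
volunteer_a = {"name": "John Woodhouse", "year": "1830"}
volunteer_b = {"name": "John Woodhause", "year": "1830"}

print(find_disagreements(volunteer_a, volunteer_b))
```

Fields on which both volunteers agree pass through quietly, so the reviewer's time goes only to the entries that are actually in doubt.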

