If you are familiar with genealogy, then you will almost certainly have come across GEDCOM file “formats”. I put the word in quotation marks because not only are there many versions of the format, every vendor has customised it to some extent, partly to ensure lock-in, partly because it contains errors and incompatibilities, and partly because the limitations of the format forces the decision upon them.

There are many problems with GEDCOM:

it is a data *transfer* format, not a data processing or data storage format
it centres on a “nuclear family” entity, which doesn’t really exist in the real world
some links need to be stored twice, allowing for broken and crossed links
adoption events belong to an individual and link to a “family”, instead of being an event between two people
it is repetitive; if a census record lists ten people, then the record, the source citations, etc. need to be stored ten times
there is no way of linking multiple people to the same event, such as census or residence
there is no way to attach sources to relationships; you must assume the relationship first, then add supporting evidence
there is little support for places
the name structure is very anglo-centric and cannot cope with patronyms, toponyms, surname declension, etc.

So why does the entire genealogical world seem to revolve around GEDCOM files? Because there is no real alternative. There are many XML based formats under development, but none has gained any noticable level of support. The Gentech Data Model shows what happens when you give data modellers too much caffeine. While it may model every aspect of genealogical research, the number of entities and levels of abstraction make it impractical for use in a real application, a fact the authors note themselves.

In the absence of a better solution, I have decided to implement one myself. This article documents my transition from GEDCOM to a relational database model that can be used as the basis for a usable genealogy application. The project is called “genesis”, a reference to the name of my family history data file: genes.ged.

Since we live in a GEDCOM world, this project has an important prerequisite; it must be able to export data in GEDCOM format.

Identifiers

In GEDCOM, all records have cross-reference identifiers called XREFs. These consist of up to 20 alphanumeric characters. A common approach is to use an alphabetic prefix and a numeric suffix, such as M123 or S456. The first problem with this is collation. Is R456 the same as r456? Can both R456 and r456 exist in the same GEDCOM? Is the prefix “R” always used for repositories?

In genesis, we will identify everything with integers. Databases like integer identifiers. They make good primary keys. Each type of object will have a separate sequence, so we can have both a source #123 as well as a repository #123. When we export to GEDCOM, we’ll add prefixes to these, so source #123 will become S123 and repository #123 will become R123.

Privacy

Since we will want to apply privacy-filtering to our data, either for displaying on-line or transmitting excerpts to fellow genealogists, we need a privacy model. This will be based on the GEDCOM restriction concept. Every object in the database will have one of three privacy restriction levels:

public – this record is public
private – this record contains information that should be restricted to family members; it corresponds to a GEDCOM “RESN privacy” tag
secret – this record contains information that should not be shared at all; it corresponds to a GEDCOM “RESN confidential” tag

Notes

Genealogists will always need to store unstructured notes and comments. This is the simplest GEDCOM structure. GEDCOM offers two ways to store notes. The simplest form is inline:

1 NOTE This is an inline note. 2 CONT It spans many lines, us 2 CONC ing a mixture of contin 2 CONC uation and concatenatio 2 CONT n tags.

The alternative, introduced in later versions of the specification, are record-based:

0 @N1@ NOTE This is a record-based note. 1 CONT Unlike other GEDCOM records, the 2 CONC data starts at the level 0 line, 2 CONT not in level 1 tags.

Record-based notes can be referenced by multiple objects, and any number of notes can be attached to a record or fact. We will only be using the record-based approach, and will impose a limit of one note per object. If you have two things to say, say them both in the same note. With this done, we can now store our notes in a table.

CREATE TABLE genesis_note ( note_id INTEGER NOT NULL AUTO_INCREMENT, note_text TEXT NOT NULL, restriction ENUM('public', 'private', 'secret') NOT NULL DEFAULT 'public', changed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (note_id) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

We’ve given the note a unique identifier (note_id), allowed up to 65535 characters for the note itself (note_text), assigned it a privacy level, and let the database take care of last modification time (changed_on).

Media Objects

Media objects serve two purposes in genealogy; firstly to illustrate people and their life events; secondly to document sources such as scans of censuses and birth certificates.

As with NOTE records, GEDCOM allows OBJE records to be both inline and record-based, and will restrict ourselves to record-based objects. GEDCOM also allows a media record to contain multiple media files – perhaps a .MP3 file of an interview with a grandparent together with a transcript of the interview. We will restrict ourselves to a single file per record, but will allow multiple records to be attached to people and facts (as illustrations) and to source-citations (as documentation).

Firstly, we need a little lookup table

CREATE TABLE genesis_mime_type ( mime_type VARCHAR(255) NOT NULL, extension VARCHAR(6) NOT NULL, PRIMARY KEY (mime_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; INSERT INTO genesis_mime_type (mime_type, extension) VALUES ('application/msword', 'doc'), ('application/pdf', 'pdf'), ('application/rtf', 'rtf'), ('application/vnd.ms-excel', 'xls'), ('application/x-gzip', 'gz'), ('application/zip', 'zip'), ('audio/mpeg', 'mp3'), ('audio/x-wav', 'wav'), ('image/bmp', 'bmp'), ('image/gif', 'gif'), ('image/jpeg', 'jpeg'), ('image/png', 'png'), ('image/svg+xml', 'svg'), ('image/tiff', 'tiff'), ('text/html', 'html'), ('text/plain', 'txt'), ('video/mpeg', 'mpeg'), ('video/quicktime', 'mov');

Our table will look like this:

CREATE TABLE genesis_media ( media_id INTEGER NOT NULL AUTO_INCREMENT, filename VARCHAR(255) NOT NULL, description VARCHAR(255) NOT NULL, mime_type VARCHAR(32) DEFAULT 'application/octet-stream', archive_obj MEDIUMBLOB DEFAULT NULL, display_png MEDIUMBLOB DEFAULT NULL, thumb_png BLOB DEFAULT NULL, note_id INTEGER DEFAULT NULL, restriction ENUM('public', 'private', 'secret') NOT NULL DEFAULT 'public', changed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (media_id), UNIQUE KEY description (description), UNIQUE KEY filename (filename), FOREIGN KEY (note_id) REFERENCES genesis_note (note_id) ON DELETE SET NULL ON UPDATE CASCADE, FOREIGN KEY (mime_type) REFERENCES genesis_mime_type (mime_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

We have three versions of the object – the orignal (archive quality), a lower resolution or cropped version for display and a smaller version for use as a thumbnail. [TODO: store image sizes, watermarked images, etc.]

Sources and Citations

Before proceeding, we need to clear up a few issues surrounding sources. In GEDCOM, sources and citatations are combined into a single entity, a SOUR tag with subordinate PAGE/TEXT tags. I’m going to separate these into two separate entities.

GEDCOM also allows you to assign sources to both facts and individuals. It doesn’t really make any sense to apply a source to an individual, so I will restrict sources to facts. The need for this restriction will become more apparent when we consider “persona” or “person-fragment” records. If you don’t know what a persona is, don’t worry. We won’t be dealing with them in this article.

This gives us the following logical structure:

A person has zero or more facts.

A fact has zero or more citations

A citation comes from exactly one source

A source can be found in one or more repositories.

This is a fairly rigid structure, and if you have been used to GEDCOM’s fairly lax approach to sourcing data, you may find it a little restrictive at first. Sources can be divided into two categories; complete and fragments. A complete source might be a gravestone and its inscription. A fragment may be a single entry from a parish register. For the latter, the citation will need some form of reference, such as volume/page numbers. For the former, there will be no such reference – the citation is the entire source.

Example One

A gravestone is a source. It is found in just one repository (the cemetery), and will probably have a repository “call number”, probably a plot number. The citation from this source will include the entire source, and include a transcript and photograph. The citation can then be used to assert a number of facts. As well as the burial, it may well tell us the dates of birth and death.

Example Two

A parish register is a source. It can be found in many repositories, including the county Public Records Office. The PRO will have a reference number for it (the call number). A citation from this will include details such as the volume (Baptisms 1775-1795) and a page number. The citation can be used to assert a number of facts, such as the birth and baptism of the child and the marriage of the parents.

Example Three

An email from a distant cousin, with details of their family, is a source. It can be found in two repositories (your inbox and their outbox). My repository is called “Greg Roach’s Family Tree Archives” and includes all my records and correspondence, and will have a call number “email from cousin Doris, received 14 Feb 2004”. I’ll create a repository “Doris Claggett’s Family Tree Archives”, and reference the source to it. This way, I can link all emails from cousin Doris. Depending on the length of the email, and the amount of information enclosed, this could be used in either one or many citations, and each citation could be used to assert a number of sources.

Repositories

Our repository table is straightforward:

CREATE TABLE genesis_repository ( repository_id INTEGER NOT NULL AUTO_INCREMENT, repository_name VARCHAR(255) NOT NULL, repository_address VARCHAR(255) DEFAULT NULL, repository_url VARCHAR(255) DEFAULT NULL, note_id INTEGER DEFAULT NULL, call_number_1 VARCHAR(16) DEFAULT NULL, call_number_2 VARCHAR(16) DEFAULT NULL, call_number_3 VARCHAR(16) DEFAULT NULL, call_number_4 VARCHAR(16) DEFAULT NULL, call_number_format VARCHAR(255) DEFAULT NULL, restriction ENUM('public', 'private', 'secret') NOT NULL DEFAULT 'private', changed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (repository_id), UNIQUE KEY (repository_name), FOREIGN KEY (note_id) REFERENCES genesis_note (note_id) ON DELETE SET NULL ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

The “call_number” columns probably need some explaining. This is how a repository defines its call numbers, and how it displays them. For example, a cemetery may specify its graves using a “lawn number” and a “plot number” within the lawn. For this, we’d set call_number_1 to “Lawn”, call_number_2 to “Plot” and call_number_format to “Lawn: {Lawn}, Plot: {Plot}”. This allows us to store the call number in a structured format, run queries against it, and so on. Of course, some repositories may not use call numbers, and some may have unstructured references. These can just have call_number_1=”Reference” and call_number_format “Reference: {Reference}”.

The more observant amongst you will now be muttering “normal form” under your breath. Yes I know. I know. I may create a separate table for the call number fields at a later date. For the time being, this is simple and easy for an application to use/implement.

Sources

The source table is similar to the repository table. As with the repository call number, we have a source citation.

CREATE TABLE genesis_source ( source_id INTEGER NOT NULL AUTO_INCREMENT, source_title VARCHAR(255) NOT NULL, source_author VARCHAR(255) DEFAULT NULL, source_publisher VARCHAR(255) DEFAULT NULL, citation_1 VARCHAR(16) DEFAULT NULL, citation_2 VARCHAR(16) DEFAULT NULL, citation_3 VARCHAR(16) DEFAULT NULL, citation_4 VARCHAR(16) DEFAULT NULL, citation_format VARCHAR(255) DEFAULT NULL, note_id INTEGER DEFAULT NULL, restriction ENUM('public', 'private', 'secret') NOT NULL DEFAULT 'private', changed_on TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (source_id), FOREIGN KEY (note_id) REFERENCES genesis_note (note_id) ON DELETE SET NULL ON UPDATE CASCADE, UNIQUE KEY (source_title) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_source_location ( source_id INTEGER NOT NULL, repository_id INTEGER NOT NULL, call_number_1 VARCHAR(64) DEFAULT NULL, call_number_2 VARCHAR(64) DEFAULT NULL, call_number_3 VARCHAR(64) DEFAULT NULL, call_number_4 VARCHAR(64) DEFAULT NULL, PRIMARY KEY (source_id, repository_id), UNIQUE KEY (repository_id, source_id), FOREIGN KEY (source_id ) REFERENCES genesis_source (source_id ) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (repository_id) REFERENCES genesis_repository (repository_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

Finally, citations. These have a transcript, and a number of media items.

CREATE TABLE genesis_citation ( citation_id INTEGER NOT NULL AUTO_INCREMENT, source_id INTEGER NOT NULL, citation_1 VARCHAR(64) DEFAULT NULL, citation_2 VARCHAR(64) DEFAULT NULL, citation_3 VARCHAR(64) DEFAULT NULL, citation_4 VARCHAR(64) DEFAULT NULL, transcript text COLLATE utf8_unicode_ci, PRIMARY KEY (citation_id), UNIQUE KEY (source_id, citation_id), FOREIGN KEY (source_id) REFERENCES genesis_source (source_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_citation_media ( citation_id INTEGER NOT NULL, media_id INTEGER NOT NULL, PRIMARY KEY (citation_id,media_id), UNIQUE KEY (media_id,citation_id), FOREIGN KEY (citation_id) REFERENCES genesis_citation (citation_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_id ) REFERENCES genesis_media (media_id ) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

Facts and Individuals

Finally, we come to the most important tables – facts and individuals.Â Note that unlike GEDCOM, we don’t have a “family” entity.Â Instead, we have facts that link two individuals, such as marriage or adoption.Â We’ll start with a simple table; a list of all the facts we’ll be dealing with.Â By adding various extra fields here, an application will be able to process facts in a nice generic way, making it easier to add new facts.

CREATE TABLE genesis_fact_type ( fact_typeÂ Â Â Â Â Â  VARCHAR(15) NOT NULL, fact_type_group ENUM ('birth', 'death', 'marriage', 'divorce', 'education', 'occupation') DEFAULT NULL, role1Â Â Â Â Â Â Â Â Â Â  ENUM ('child', 'parent', 'spouse', 'godparent') DEFAULT NULL, role2Â Â Â Â Â Â Â Â Â Â  ENUM ('child', 'parent', 'spouse', 'godparent') DEFAULT NULL, has_dateÂ Â Â Â Â Â Â  TINYINT NOT NULL, has_placeÂ Â Â Â Â Â  TINYINT NOT NULL, has_agencyÂ Â Â Â Â  TINYINT NOT NULL, has_causeÂ Â Â Â Â Â  TINYINT NOT NULL, PRIMARY KEY (fact_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; INSERT INTO genesis_fact_type ( fact_type, fact_type_group, role1, role2, has_date, has_place, has_agency, has_cause ) VALUES ('_MDCL', NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('_MILI', NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('_NMAR', NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('_NMR',Â  'marriage',Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('_SEPR', 'divorce',Â Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('_TODO', NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('ADOP',Â  NULL,Â Â Â Â Â Â Â Â  'child',Â  'parent',Â Â Â  1, 1, 1, 0), ('AFN',Â Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('ALIA',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('BAPL',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('BIRT',Â  'birth',Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('BURI',Â  'death',Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('CENS',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('CHR',Â Â  'birth',Â Â Â Â Â  NULL,Â Â Â Â  'godparent', 1, 1, 0, 0), ('CONF',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('CREM',Â  'death',Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('DEAT',Â  'death',Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 1), ('DIV',Â Â  'divorce',Â Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('DIVF',Â  'divorce',Â Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('EDUC',Â  'education',Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 1, 0), ('EMIG',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('ENGA',Â  'marriage',Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('FCOM',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('IMMI',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 0, 0), ('MARB',Â  'marriage',Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('MARR',Â  'marriage',Â Â  'spouse', 'spouse',Â Â Â  1, 1, 0, 0), ('NAME',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('NATU',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('NCHI',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('NMR',Â Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('OCCU',Â  'occupation', NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 1, 0), ('ORDN',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('PROB',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('PROP',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('RELI',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('RESI',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0), ('RETI',Â  'occupation', NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  1, 1, 1, 0), ('WILL',Â  NULL,Â Â Â Â Â Â Â Â  NULL,Â Â Â Â  NULL,Â Â Â Â Â Â Â  0, 0, 0, 0);

Before we can create the facts table, we’ll need a couple more tables.Â These don’t contain much at the moment, but we’ll make better use of them in the future.

CREATE TABLE genesis_date ( Â date_id INTEGER NOT NULL AUTO_INCREMENT, Â date_text VARCHAR(80) NOT NULL, Â PRIMARY KEY (date_id), Â UNIQUE KEY (date_text) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_place_name ( place_name_id INTEGER NOT NULL AUTO_INCREMENT, place_name VARCHAR(255) NOT NULL, PRIMARY KEY (place_name_id), UNIQUE KEY (place_name) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

We can now create the facts table.Â Note we have various attribute columns (date, place, agency, cause, address, etc.) which correspond to GEDCOM level 2 records.

CREATE TABLE genesis_fact ( fact_idÂ Â Â Â Â Â  INTEGERÂ Â Â Â Â  NOT NULL AUTO_INCREMENT, fact_typeÂ Â Â Â  VARCHAR(15)Â  NOT NULL, fact_valueÂ Â Â  VARCHAR(128) DEFAULT NULL, gedcomÂ Â Â Â Â Â Â  TEXT, fact_addressÂ  VARCHAR(128) DEFAULT NULL, fact_phoneÂ Â Â  VARCHAR(255) DEFAULT NULL, fact_emailÂ Â Â  VARCHAR(255) DEFAULT NULL, fact_agencyÂ Â  VARCHAR(64)Â  DEFAULT NULL, fact_causeÂ Â Â  VARCHAR(255) DEFAULT NULL, date_idÂ Â Â Â Â Â  INTEGERÂ Â Â Â Â  DEFAULT NULL, place_name_id INTEGERÂ Â Â Â Â  DEFAULT NULL, note_idÂ Â Â Â Â Â  INTEGERÂ Â Â Â Â  DEFAULT NULL, privacyÂ Â Â Â Â Â  ENUM ('public', 'private', 'secret') NOT NULL DEFAULT 'public', changed_onÂ Â Â  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (fact_id), FOREIGN KEY (place_name_id) REFERENCES genesis_place_name (place_name_id) ON DELETE RESTRICT ON UPDATE CASCADE, FOREIGN KEY (note_idÂ Â Â Â Â  ) REFERENCES genesis_noteÂ Â Â Â Â Â  (note_idÂ Â Â Â Â  ) ON DELETE SET NULL ON UPDATE CASCADE, FOREIGN KEY (fact_typeÂ Â Â  ) REFERENCES genesis_fact_typeÂ  (fact_typeÂ Â Â  ) ON DELETE RESTRICT ON UPDATE CASCADE, FOREIGN KEY (date_idÂ Â Â Â Â  ) REFERENCES genesis_dateÂ Â Â Â Â Â  (date_idÂ Â Â Â Â  ) ON DELETE RESTRICT ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

Since facts can have media items and source citations, we’ll need a couple of intersection tables to implement the many-to-many nature of the relationships

CREATE TABLE genesis_fact_citation ( fact_idÂ  INTEGER NOT NULL, citation_id INTEGER NOT NULL, PRIMARY KEY (fact_id, citation_id), UNIQUE KEY (citation_id, fact_id), FOREIGN KEY (fact_id ) REFERENCES genesis_fact (fact_id ) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (citation_id) REFERENCES genesis_citation (citation_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_fact_media ( fact_id int(11) NOT NULL, Â media_id int(11) NOT NULL, PRIMARY KEY (fact_id, media_id), UNIQUE KEY (media_id, fact_id), FOREIGN KEY (fact_id ) REFERENCES genesis_fact (fact_id ) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_id) REFERENCES genesis_media (media_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

The individual table is pretty simple.

CREATE TABLE genesis_individual ( individual_id INTEGER NOT NULL AUTO_INCREMENT, mother_id INTEGER DEFAULT NULL, father_id INTEGER DEFAULT NULL, sex ENUM ('M','F','U') NOT NULL DEFAULT 'U', name VARCHAR(64) NOT NULL, note_id INTEGER DEFAULT NULL, privacy ENUM ('public','private','secret') NOT NULL DEFAULT 'public', changed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (individual_id), FOREIGN KEY (mother_id) REFERENCES genesis_individual (individual_id) ON DELETE SET NULL ON UPDATE CASCADE, FOREIGN KEY (father_id) REFERENCES genesis_individual (individual_id) ON DELETE SET NULL ON UPDATE CASCADE, FOREIGN KEY (note_id ) REFERENCES genesis_note (note_id ) ON DELETE SET NULL ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

Unlike GEDCOM, which allows an individual to have any number of sets of parents, we just have the one.Â These are the biological parents.Â Adoptive parents are indicated using an adoption event.Â Unlike GEDCOM, an individual can only have one sex record.

Since individuals can have many names, the name in this table is the one the genealogist prefers.Â We’ll store all names, (which probably includes this one) in a separate table.Â The “gedcom_name” column is the name in gedcom format – i.e. with slashes around the surname.Â In the future, we may give a bit more structure to this.Â Names are a complicated issue, and need a separate discussion.

CREATE TABLE genesis_individual_name ( name_id INTEGER NOT NULL AUTO_INCREMENT, individual_id INTEGER NOT NULL, name_type ENUM ('birth', 'marriage', 'religious', 'adopted', 'change'), gedcom_name VARCHAR(64) NOT NULL, note_id INTEGER DEFAULT NULL, PRIMARY KEY (name_id), FOREIGN KEY (individual_id) REFERENCES genesis_individual (individual_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (note_id) REFERENCES genesis_note (note_id) ON DELETE SET NULL ON UPDATE CASCADE );

Since individuals can have associated media items, we’ll need intersection table to link them.Â The link to facts needs a few extra attributes.Â Firstly the age depends on the individual, not the fact (e.g. a census event applies to people with different ages), and secondly facts that create a relationship between two people need a role.

CREATE TABLE genesis_individual_media ( individual_id INTEGER NOT NULL, media_idÂ Â Â Â Â  INTEGER NOT NULL, PRIMARY KEY (individual_id, media_id), UNIQUEÂ  KEY (media_id, individual_id), FOREIGN KEY (individual_id) REFERENCES genesis_individual (individual_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_idÂ Â Â Â  ) REFERENCES genesis_mediaÂ Â Â Â Â  (media_idÂ Â Â Â  ) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_individual_fact ( individual_id INTEGER NOT NULL, fact_idÂ Â Â Â Â Â  INTEGER NOT NULL, roleÂ Â Â Â Â Â Â Â Â  ENUM ('child','parent','spouse','godparent') DEFAULT NULL, age_typeÂ Â Â Â Â  ENUM ('=','<','>') DEFAULT NULL, age_yearsÂ Â Â Â  TINYINT DEFAULT NULL, age_monthsÂ Â Â  TINYINT DEFAULT NULL, age_daysÂ Â Â Â Â  TINYINT DEFAULT NULL, PRIMARY KEY (individual_id, fact_id), UNIQUEÂ  KEY (fact_id, individual_id), FOREIGN KEY (individual_id) REFERENCES genesis_individual (individual_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (fact_idÂ Â Â Â Â  ) REFERENCES genesis_factÂ Â Â Â Â Â  (fact_idÂ Â Â Â Â  ) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

Export to gedcom

Finally, we need to be able to export this back to GEDCOM format.

DELIMITER // CREATE FUNCTION `gedcom_INDI` ( p_individual_id INTEGER, p_mother_id INTEGER, p_father_id INTEGER, p_sex ENUM('M','F','U'), p_note_id INTEGER, p_privacy ENUM('public','private','secret'), p_changed_on TIMESTAMP ) RETURNS TEXT DETERMINISTIC READS SQL DATA BEGIN DECLARE l_gedcom TEXT; DECLARE c_name CURSOR FOR SELECT gedcom_name, name_type, note_id FROM `genesis_individual_name` WHERE individual_id=p_individual_id; DECLARE c_fact CURSOR FOR SELECT fact_id, CONCAT_WS('' COLLATE utf8_unicode_ci, '1 ', fact_type, CASE WHEN fact_value<>'' THEN CONCAT(' ', fact_value, 'n') WHEN has_date AND has_place AND date_id IS NULL AND place_name_id IS NULL THEN ' Yn' ELSE 'n' END, CONCAT('2 DATE ', date_text, 'n'), CONCAT('2 PLAC ', place_name, 'n'), CONCAT('2 ADDR ', fact_address, 'n'), CONCAT('2 PHON ', fact_phone, 'n'), CONCAT('2 EMAIL ', fact_email, 'n'), CONCAT('2 CAUS ', fact_cause, 'n'), CONCAT('2 AGNC ', fact_agency, 'n'), CONCAT('2 NOTE @N', f.note_id, '@n'), CASE f.privacy WHEN 'public' THEN NULL WHEN 'private' THEN '2 RESN privacyn' WHEN 'secret' THEN '2 RESN confidentialn' END, GROUP_CONCAT(DISTINCT CONCAT('2 OBJE @M', media_id, '@n') SEPARATOR ''), GROUP_CONCAT(DISTINCT CONCAT( '2 SOUR @S', source_id, '@n', IFNULL(CONCAT( '3 PAGE ', CONCAT(s.citation_1, ': ', c.citation_1), IFNULL(CONCAT(', ', s.citation_2, ': ', c.citation_2), ''), IFNULL(CONCAT(', ', s.citation_3, ': ', c.citation_3), ''), IFNULL(CONCAT(', ', s.citation_4, ': ', c.citation_4), ''), 'n3 _CITATION_ID ', citation_id, 'n'), ''), IFNULL(CONCAT('3 DATAn4 TEXT ', REPLACE(transcript, 'n', 'n5 CONT '), 'n'), '') ) SEPARATOR '') ) AS gedcom FROM `genesis_individual_fact` JOIN `genesis_fact` f USING (fact_id) JOIN `genesis_fact_type` USING (fact_type) LEFT JOIN `genesis_date` USING (date_id) LEFT JOIN `genesis_place_name` USING (place_name_id) LEFT JOIN `genesis_fact_media` USING (fact_id) LEFT JOIN `genesis_fact_citation` USING (fact_id) LEFT JOIN `genesis_citation` c USING (citation_id) LEFT JOIN `genesis_source` s USING (source_id) WHERE individual_id=p_individual_id AND (role IS NULL OR role<>'spouse') GROUP BY fact_id; SET l_gedcom := CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @I', p_individual_id, '@ INDIn1 SEX ', p_sex, 'n', CONCAT('1 NOTE @N', p_note_id, '@n'), CASE p_privacy WHEN 'public' THEN NULL WHEN 'private' THEN '1 RESN privacyn' WHEN 'secret' THEN '1 RESN confidentialn' END, UCASE(DATE_FORMAT(p_changed_on, '1 CHANn2 DATE %d %b %Yn3 TIME %Tn')) ); OPEN c_name; BEGIN DECLARE l_gedcom_name TEXT; DECLARE l_name_type TEXT; DECLARE l_note_id INTEGER; DECLARE EXIT HANDLER FOR NOT FOUND CLOSE c_name; LOOP FETCH c_name INTO l_gedcom_name, l_name_type, l_note_id; SET l_gedcom := CONCAT(l_gedcom, '1 NAME ', l_gedcom_name, 'n'); IF l_name_type <> 'birth' THEN SET l_gedcom := CONCAT(l_gedcom, '2 TYPE ', l_name_type, 'n'); END IF; IF l_note_id IS NOT NULL THEN SET l_gedcom := CONCAT(l_gedcom, '2 NOTE @N', l_note_id, '@n'); END IF; END LOOP; END; OPEN c_fact; BEGIN DECLARE l_fact_id INTEGER; DECLARE l_fact TEXT; DECLARE EXIT HANDLER FOR NOT FOUND CLOSE c_fact; LOOP FETCH c_fact INTO l_fact_id, l_fact; SET l_gedcom := CONCAT( l_gedcom, l_fact, ( SELECT IFNULL(GROUP_CONCAT(CONCAT('2 ASSO @I', individual_id, '@n') SEPARATOR ''), '') FROM `genesis_individual_fact` WHERE fact_id=l_fact_id AND individual_id<>p_individual_id ) ); END LOOP; END; IF p_father_id IS NOT NULL OR p_mother_id IS NOT NULL THEN SET l_gedcom := CONCAT(l_gedcom, '1 FAMC @F', IFNULL(p_father_id, '0'), '_', IFNULL(p_mother_id, '0'), '@n'); END IF; IF p_sex='M' THEN SET l_gedcom := CONCAT( l_gedcom, ( SELECT IFNULL(GROUP_CONCAT(DISTINCT '1 FAMS @F', IFNULL(father_id, '0'), '_', IFNULL(mother_id, '0'), '@n' SEPARATOR ''), '') FROM `genesis_individual` WHERE father_id=p_individual_id ) ); ELSE SET l_gedcom := CONCAT( l_gedcom, ( SELECT IFNULL(GROUP_CONCAT(DISTINCT '1 FAMS @F', IFNULL(father_id, '0'), '_', IFNULL(mother_id, '0'), '@n' SEPARATOR ''), '') FROM `genesis_individual` WHERE mother_id=p_individual_id ) ); END IF; RETURN l_gedcom; END // DELIMITER ; DELIMITER // CREATE FUNCTION `gedcom_FAM`( p_husband_id INTEGER, p_wife_id INTEGER ) RETURNS TEXT DETERMINISTIC READS SQL DATA BEGIN DECLARE l_gedcom TEXT; DECLARE c_fact CURSOR FOR SELECT fact_id, CONCAT_WS('' COLLATE utf8_unicode_ci, '1 ', fact_type, CASE WHEN fact_value<>'' THEN CONCAT(' ', fact_value, 'n') WHEN has_date AND has_place AND date_id IS NULL AND place_name_id IS NULL THEN ' Yn' ELSE 'n' END, CONCAT('2 DATE ', date_text, 'n'), CONCAT('2 PLAC ', place_name, 'n'), CONCAT('2 ADDR ', fact_address, 'n'), CONCAT('2 PHON ', fact_phone, 'n'), CONCAT('2 EMAIL ', fact_email, 'n'), CONCAT('2 CAUS ', fact_cause, 'n'), CONCAT('2 AGNC ', fact_agency, 'n'), CONCAT('2 NOTE @N', f.note_id, '@n'), CASE f.privacy WHEN 'public' THEN NULL WHEN 'private' THEN '2 RESN privacyn' WHEN 'secret' THEN '2 RESN confidentialn' END, GROUP_CONCAT(DISTINCT CONCAT('2 OBJE @M', media_id, '@n') SEPARATOR ''), GROUP_CONCAT(DISTINCT CONCAT( '2 SOUR @S', source_id, '@n', IFNULL(CONCAT( '3 PAGE ', CONCAT(s.citation_1, ': ', c.citation_1), IFNULL(CONCAT(', ', s.citation_2, ': ', c.citation_2), ''), IFNULL(CONCAT(', ', s.citation_3, ': ', c.citation_3), ''), IFNULL(CONCAT(', ', s.citation_4, ': ', c.citation_4), ''), 'n3 _CITATION_ID ', citation_id, 'n'), ''), IFNULL(CONCAT('3 DATAn4 TEXT ', REPLACE(transcript, 'n', 'n5 CONT '), 'n'), '') ) SEPARATOR '') ) AS gedcom FROM `genesis_individual_fact` JOIN `genesis_fact` f USING (fact_id) JOIN `genesis_fact_type` USING (fact_type) LEFT JOIN `genesis_date` USING (date_id) LEFT JOIN `genesis_place_name` USING (place_name_id) LEFT JOIN `genesis_fact_media` USING (fact_id) LEFT JOIN `genesis_fact_citation` USING (fact_id) LEFT JOIN `genesis_citation` c USING (citation_id) LEFT JOIN `genesis_source` s USING (source_id) WHERE individual_id=p_husband_id AND role='spouse' GROUP BY fact_id; SET l_gedcom := CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @F', IFNULL(p_husband_id, 0), '_', IFNULL(p_wife_id, 0), '@ FAMn', CONCAT('1 HUSB @I', p_husband_id, '@n'), CONCAT('1 WIFE @I', p_wife_id, '@n'), ( SELECT GROUP_CONCAT('1 CHIL @I', individual_id, '@n' SEPARATOR '') FROM `genesis_individual` WHERE father_id=p_husband_id AND mother_id=p_wife_id ) ); OPEN c_fact; BEGIN DECLARE l_fact_id INTEGER; DECLARE l_fact TEXT; DECLARE EXIT HANDLER FOR NOT FOUND CLOSE c_fact; LOOP FETCH c_fact INTO l_fact_id, l_fact; SET l_gedcom := CONCAT(l_gedcom, l_fact); END LOOP; END; RETURN l_gedcom; END // DELIMITER ; CREATE VIEW `gedcom_family_subquery_view` AS SELECT if1.individual_id AS husband_id, if2.individual_id AS wife_id FROM genesis_individual_fact if1 JOIN genesis_individual_fact if2 USING (fact_id, role) JOIN genesis_individual i1 ON (if1.individual_id=i1.individual_id) WHERE if1.individual_id<>if2.individual_id AND role='spouse' AND i1.sex='M' GROUP BY if1.individual_id, if2.individual_id UNION SELECT father_id, mother_id FROM genesis_individual WHERE father_id IS NOT NULL AND mother_id IS NOT NULL; CREATE VIEW `gedcom_family_view` AS SELECT `gedcom_FAM`(husband_id, wife_id) AS gedcom FROM `gedcom_family_subquery_view`; CREATE VIEW `gedcom_individual_view` AS SELECT `gedcom_INDI`(individual_id, mother_id, father_id, sex, note_id, privacy, changed_on) AS gedcom FROM `genesis_individual`; CREATE VIEW `gedcom_note_view` AS SELECT CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @N', note_id, '@ NOTEn', '1 CONC ', REPLACE(note_text, 'n', 'n1 CONT '), 'n', CASE privacy WHEN 'public' THEN NULL WHEN 'private' THEN '1 RESN privacyn' WHEN 'secret' THEN '1 RESN confidentialn' END, UCASE(DATE_FORMAT(changed_on, '1 CHANn2 DATE %d %b %Yn3 TIME %Tn')) ) AS gedcom FROM `genesis_note`; CREATE VIEW `gedcom_media_view` AS SELECT CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @M' COLLATE utf8_unicode_ci, media_id, '@ OBJEn', '1 FILE ', filename, 'n2 FORM ', extension, 'n2 TITL ', description, 'n', CASE privacy WHEN 'public' THEN '' WHEN 'private' THEN '1 RESN privacyn' WHEN 'secret' THEN '1 RESN confidentialn' END, CONCAT('1 NOTE @N', note_id, '@n'), UCASE(DATE_FORMAT(changed_on, '1 CHANn2 DATE %d %b %Yn3 TIME %Tn')) ) AS gedcom FROM `genesis_media` JOIN `genesis_mime_type` USING (mime_type); CREATE VIEW `gedcom_repository_view` AS SELECT CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @R', repository_id, '@ REPOn', '1 NAME ', repository_name, 'n', CONCAT('1 ADDR ', repository_address,'n'), CONCAT('1 WWW ', repository_url,'n'), CONCAT('1 NOTE @N',note_id,'@n'), CASE privacy WHEN 'public' THEN '' WHEN 'private' THEN '1 RESN privacyn' WHEN 'secret' THEN '1 RESN confidentialn' END, UCASE(DATE_FORMAT(changed_on, '1 CHANn2 DATE %d %b %Yn3 TIME %Tn')) ) AS gedcom FROM `genesis_repository`; CREATE VIEW `gedcom_source_view` AS SELECT CONCAT_WS('' COLLATE utf8_unicode_ci, '0 @S', source_id, '@ SOURn', '1 TITL ', source_title, 'n', CONCAT('1 AUTH ', source_author, 'n'), CONCAT('1 PUBL ', source_publisher, 'n'), CONCAT('1 NOTE @N', s.note_id,'@n'), GROUP_CONCAT(CONCAT( '1 SOUR @S', repository_id, '@n', IFNULL( CONCAT( '1 CALN ', REPLACE( REPLACE( REPLACE( REPLACE( r.call_number_format, CONCAT('{', IFNULL(r.call_number_4,''),'}'), IFNULL(l.call_number_4,'') ), CONCAT('{', IFNULL(r.call_number_3,''),'}'), IFNULL(l.call_number_3,'') ), CONCAT('{', IFNULL(r.call_number_2,''),'}'), IFNULL(l.call_number_2,'') ), CONCAT('{', IFNULL(r.call_number_1,''),'}'), IFNULL(l.call_number_1,'') ), 'n' ), '' ) ) SEPARATOR ''), CASE s.privacy WHEN 'public' THEN '' WHEN 'private' THEN '1 RESN privacyn' WHEN 'secret' THEN '1 RESN confidentialn' END, UCASE(DATE_FORMAT(s.changed_on, '1 CHANn2 DATE %d %b %Yn3 TIME %Tn')) ) AS gedcom FROM `genesis_source` s LEFT JOIN `genesis_source_location` l USING (source_id) LEFT JOIN `genesis_repository` r USING (repository_id) GROUP BY source_id; CREATE VIEW gedcom_header_view AS SELECT CONCAT_WS('' COLLATE utf8_unicode_ci, '0 HEADn', '1 SOUR genesisn2 VERS 1.0.0n', '1 DEST DISKETTEn', '1 CHAR UTF-8n', '1 GEDCn2 VERS 5.5.1n2 FORM Lineage-Linkedn', UCASE(DATE_FORMAT(NOW(), '1 DATE %d %b %Yn2 TIME %Tn')) ) AS gedcom; CREATE VIEW gedcom_trailer_view AS SELECT '0 TRLRn' COLLATE utf8_unicode_ci AS gedcom; CREATE VIEW `gedcom_view` AS SELECT gedcom FROM `gedcom_header_view` UNION SELECT gedcom FROM `gedcom_individual_view` UNION SELECT gedcom FROM `gedcom_family_view` UNION SELECT gedcom FROM `gedcom_note_view` UNION SELECT gedcom FROM `gedcom_media_view` UNION SELECT gedcom FROM `gedcom_repository_view` UNION SELECT gedcom FROM `gedcom_source_view` UNION SELECT gedcom FROM `gedcom_trailer_view`;

2 Responses to Genesis – In the beginning was GEDCOM…

Dave says:

5 April 2011 at 15:42

“Record-based notes can be referenced by multiple objects, and any number of notes can be attached to a record or fact. We will only be using the record-based approach, and will impose a limit of one note per object. If you have two things to say, say them both in the same note.”

This does not facilitate colaboration does it? What if I make a note, then my Aunt Jamima wants to add her note, and my cousin Vinney another? Do they append to my note, possibly removing mine? Wouldn’t multiple notes make sense? This is allowed in GEDCOM output as you state. And what then about attribution and date of entry? Is this to be included in the body of the note? I can see this structure getting messy.

Some note text. – Note by Me entered on April 05 2011

Some more note text. – Aunt Jamima, Apr 5, 2011

Even more note text. – Cousin Vinney on 4/5/11

Maybe an intersection table as with media would help solve this problem? Maybe I’m missing something and you’ve account for this. I admit it was a rough read and I haven’t gone through it completely. Just a knee-jerk reaction after reading the one note statement. And of course I realize it’s YOUR database and just my obversation but a real life one I think.
- fisharebest says:
  
  5 April 2011 at 16:26
  
  I feel that “notes” are often misused and abused in applications such as webtrees or phpGedView. They are used to add narrative descriptions (wrong – this should be a media document of type text/plain or text/html). They are used to add transcripts of censuses, etc. (wrong – this should be in the source citation).
  
  If it contains genealogical information, it is not a note, it is a source.
  If it contains descriptive or narrative text, it is not a note, it is a media object.
  
  It is quite likely that what aunt Jamima and cousin VInney have to say is actually a source. I have many sources of the form “Email from XXX, received YYY”. These sources exist in the repository “Greg’s mail folder”, and are used to produce several citations each.
  
  To my mind, notes are short, simple, and frequently *temporary* annotations upon structured data. For example:
  
  “Need to check this in the Smallville public records office”
  “Sources Foo and Bar disagree on the date. Foo is generally more reliable, and this date is assumed”.
  
  Do they append to my note, possibly removing mine?
  
  Yes. In a collaborative system, they would presumably be able to delete existing notes, as well as adding their own.
  
  If the end system has an “approve edits”, or “undo changes” process, then the original notes will never actually be lost.

Comments are closed.

Genesis – In the beginning was GEDCOM…

Identifiers

Privacy

Notes

Media Objects

Sources and Citations

Repositories

Sources

Facts and Individuals

Export to gedcom

2 Responses to Genesis – In the beginning was GEDCOM…

Recent Posts

Recent Comments

Archives

Categories

Meta