Exodus – Leaving GEDCOM behind…

In an earlier article, I discussed ideas for a relational database to replace GEDCOM for storing genealogical data. Since writing it, I have put my words into action, and migrated all my genealogy research into this database. This has revealed a few deficiencies in the data model which needed a bit of tweaking. So, it’s time for an update…

Notes

The genesis_note table has been abandoned. Notes are no longer first-class objects; instead they are attributes of other objects. Notes in genealogy should be comments on the data, not part of the data. Each object that might be ‘noteworthy’ now gets its own ‘note’ column. If you have two things to say about a object, both can go in the same note.

Media

As before, we’ll start with a table to define the mime-types of the (multimedia) data we’ll be storing.

CREATE TABLE genesis_mime_type ( mime_type VARCHAR(255) NOT NULL, extension VARCHAR(6) NOT NULL, signature_precedence INTEGER NOT NULL, signature_type ENUM('regexp','like') NOT NULL, signature VARBINARY(64) NOT NULL, icon_png BLOB, PRIMARY KEY (mime_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

I’ve added some simple type-detection data, similar to “mime.magic”. This allows us to identify JPEG, PNG, AVI, etc. files. I’ve also added a default icon for the type. This way, we’ll have something to display for non-image types such as PDF.

The media table itself has become this:

CREATE TABLE genesis_media ( media_id INTEGER NOT NULL AUTO_INCREMENT, md5 CHAR(32) NOT NULL, description VARCHAR(255) NOT NULL, mime_type VARCHAR(32) NOT NULL DEFAULT 'application/octet-stream', object LONGBLOB, thumb_png BLOB, thumb_x INTEGER DEFAULT NULL, thumb_y INTEGER DEFAULT NULL, note TEXT, privacy ENUM('public','private','secret') NOT NULL DEFAULT 'public', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (media_id), UNIQUE KEY (md5), FOREIGN KEY (mime_type) REFERENCES genesis_mime_type (mime_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

I have abandonded filenames. Instead, we store the MD5 hash of the object. This allows a quick/easy way to prevent duplicates being inserted. It also gives us a unique, system-generated, filename – <MD5>.<extension>.

The mime-type column is technically redundant. We could theoretically derive it from the signature data in the mime-type table. (In fact, this is how we initially populate it). But, with several GB of media data, this becomes inefficient, so we duplicate it here.

I have also abandonded the distinction between ‘archive-quality’ and ‘web-quality’ versions of the image. The overhead of generating the ‘web-quality’ versions on-demand is not that high. Displaying full-sized copies of media images is an infrequent operation. Mostly we just display thumbnails. Genealogy is a long-term endeavour, and Moore’s Law will no doubt overtake us.

Repositories, Sources and Citations

The design for sources has remained largely unchanged.

CREATE TABLE genesis_repository ( repository_id INTEGER NOT NULL AUTO_INCREMENT, repository_name VARCHAR(255) NOT NULL, repository_address VARCHAR(255) DEFAULT NULL, repository_url VRACHAR(255) DEFAULT NULL, call_number_format VARCHAR(255) DEFAULT NULL, note TEXT DEFAULT NULL, privacy ENUM('public','private','secret') NOT NULL DEFAULT 'private', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (repository_id), UNIQUE KEY (repository_name) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_source ( source_id INTEGER NOT NULL AUTO_INCREMENT, title VARCHAR(255) NOT NULL, author VARCHAR(255) DEFAULT NULL, publisher VARCHAR(255) DEFAULT NULL, citation_format VARCHAR(255) DEFAULT NULL, note TEXT DEFAULT NULL, privacy ENUM('public','private','secret') DEFAULT 'private', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (source_id), UNIQUE KEY (title) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_citation ( citation_id INTEGER NOT NULL AUTO_INCREMENT, source_id INTEGER NOT NULL, citation_1 VARCHAR(64) DEFAULT NULL, citation_2 VARCHAR(64) DEFAULT NULL, citation_3 VARCHAR(64) DEFAULT NULL, citation_4 VARCHAR(64) DEFAULT NULL, transcript TEXT, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (citation_id), FOREIGN KEY (source_id) REFERENCES genesis_source (source_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_citation_media ( citation_id INTEGER NOT NULL, media_id INTEGER NOT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (citation_id, media_id), UNIQUE KEY (media_id, citation_id), FOREIGN KEY (citation_id) REFERENCES genesis_citation (citation_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_id ) REFERENCES genesis_media (media_id ) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci

Facts and Individuals

There is just one major change to the facts and individuals. I’ll give the table definitions first:

CREATE TABLE genesis_fact_type ( fact_type VARCHAR(15) NOT NULL, fact_category ENUM('attribute','event') NOT NULL, fact_type_group ENUM('birth','death','marriage','divorce','education','occupation') DEFAULT NULL, role1 ENUM('child','godchild','spouse','self') DEFAULT NULL, role2 ENUM('parent','godparent','spouse','self') DEFAULT NULL, fact_attributes SET('address','age','agency','cause','date','email','phone','place','value') NOT NULL, PRIMARY KEY (fact_type) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_gedcom_date ( gedcom_date_id INTEGER NOT NULL AUTO_INCREMENT, date_text VARCHAR(80) NOT NULL, start_date DATE DEFAULT NULL, end_date DATE DEFAULT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (gedcom_date_id), UNIQUE KEY (date_text) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_place_name ( place_name_id INTEGER NOT NULL AUTO_INCREMENT, place_name VARCHAR(255) NOT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (place_name_id), UNIQUE KEY (place_name) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_fact ( fact_id INTEGER NOT NULL AUTO_INCREMENT, fact_type VARCHAR(15) NOT NULL, fact_value VARCHAR(128) DEFAULT NULL, fact_address VARCHAR(128) DEFAULT NULL, fact_phone VARCHAR(255) DEFAULT NULL, fact_email VARCHAR(255) DEFAULT NULL, fact_agency VARCHAR(64) DEFAULT NULL, fact_cause VARCHAR(255) DEFAULT NULL, gedcom_date_id INTEGER DEFAULT NULL, place_name_id INTEGER DEFAULT NULL, note TEXT, privacy ENUM('public','private','secret') NOT NULL DEFAULT 'public', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (fact_id), FOREIGN KEY (place_name_id) REFERENCES genesis_place_name (place_name_id) ON UPDATE CASCADE, FOREIGN KEY (gedcom_date_id) REFERENCES genesis_gedcom_date (gedcom_date_id), FOREIGN KEY (fact_type) REFERENCES genesis_fact_type (fact_type) ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_fact_citation ( fact_id INTEGER NOT NULL, citation_id INTEGER NOT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (fact_id,citation_id), UNIQUE KEY (citation_id,fact_id), FOREIGN KEY (fact_id) REFERENCES genesis_fact (fact_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (citation_id) REFERENCES genesis_citation (citation_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_fact_media ( fact_id INTEGER NOT NULL, media_id INTEGER NOT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (fact_id,media_id), UNIQUE KEY (media_id,fact_id), FOREIGN KEY (fact_id ) REFERENCES genesis_fact (fact_id ) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_id) REFERENCES genesis_media (media_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_individual ( individual_id INTEGER NOT NULL AUTO_INCREMENT, sex ENUM('M','F','U') NOT NULL DEFAULT 'U', note TEXT, privacy ENUM('public','private','secret') NOT NULL DEFAULT 'public', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (individual_id) ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_name ( name_id INTEGER NOT NULL AUTO_INCREMENT, individual_id INTEGER NOT NULL, name_type ENUM('birth','married','religious','adopted','change') NOT NULL DEFAULT 'birth', gedcom_name VARCHAR(64) NOT NULL DEFAULT '//', surname VARCHAR(32) NOT NULL DEFAULT '', note text TEXT, privacy ENUM('public','private','secret') NOT NULL DEFAULT 'public', updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (name_id), FOREIGN KEY (individual_id) REFERENCES genesis_individual (individual_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci; CREATE TABLE genesis_individual_media ( individual_id INTEGER NOT NULL, media_id INTEGER NOT NULL, updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (individual_id,media_id), FOREIGN KEY (individual_id) REFERENCES genesis_individual (individual_id) ON DELETE CASCADE ON UPDATE CASCADE, FOREIGN KEY (media_id) REFERENCES genesis_media (media_id) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB COLLATE=utf8_unicode_ci;

The different (which you spotted, didn’t you!), is that we have abandoned all references to the concept of ‘family’. We just have individuals and facts/events. Since we can share facts between individuals, we can create a GEDCOM ‘FAMS’ relationship by having two individuals share a ‘MARR’ or ‘_NMR’ event with a role of ‘spouse’. Similarly, we can create a GEDCOM ‘FAMC’ by having two (or three!) individuals share a ‘BIRT’ or ‘ADOP’ event with roles of ‘child’ and ‘parent’.

Into Practice

A data model, however elegant, is of no use unless you can use it. I have successfully integrated this into the webtrees application.  Using the techniques described previously (strored procedures and views), I have created ‘wt_individualsâ’, ‘wt_families’, ‘wt_sources’, ‘wt_other’, and ‘wt_media’ views.  Using these it is possible to use most of the screens – I still need to work on the ‘wt_name’ and ‘wt_link’ tables to achieve full functionality.  However, this is enough to run the export function, so a full gedcom file can be created.  I have also created a replacement for the ‘Personal Facts’ tab, which displays the new data directly, and provides editing facilities using in-place editing techniques.

This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

4 Responses to Exodus – Leaving GEDCOM behind…

  1. Very interesting, and sounds promising. I particularly like the PLAC level concept, as we’ve discussed.

  2. fisharebest says:

    @Stephen – one small “ahem” problem with the place model is that it relies on MySQL’s geo-spatial extention functions, and these weren’t properly implemented until MySQL5.6. I’m hoping these will be back-ported into 5.5 and 5.1.

  3. Wes Groleau says:

    “abandonded the distinction between ‘archive-quality’ and ‘web-quality’ versions of the image. The overhead of generating the ‘web-quality’ versions on-demand is not that high”

    Are you sure? Seems to me we’ve had many occasions in the webtrees forums to explain that “the thumbnail creation failed because your image file is too big”

    • fisharebest says:

      @Wes – You will always be limited by your web host. There are hosts that doesn’t change the PHP default of 2M for the size of the largest file upload. There are hosts with such low bandwidth that large files are impractical.

      I’m designing for the future, and relying on Moores Law to help me out.

Comments are closed.