Building an email archive

As part of building a life archive, I've started working out how to structure my archived email. I have a few sources of historical email right now:

  • mbox-formatted email archive from college: I have these due to having a backup of a backup of a backup of a computer
  • Thunderbird email files: I have these from a CD backup I made a long time ago
  • Gmail

My goal is a reasonably timeless archive: one that is easy to use now and that will stay easy to access and preserve over time. Here's what I have so far.

Archive structure

In my mnemosyne folder I have a subfolder mail.

This folder is controlled by dovecot using maildir-style folders. The key line in the dovecot configuration file for doing this is:

mail_location=maildir:/path/to/mnemosyne/mail:LAYOUT=fs

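(:LAYOUT=fs means dovecot stores mailboxes as ordinary nested directories, e.g. received/2010, rather than the default Maildir++ layout, which puts a dot at the beginning of the mailbox folder names and flattens nested mailboxes into names like .received.2010.)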

With folders, the mailbox structure is going to look like this:

  • received
    • 2010
    • 2011
    • 2012
    • 2013
  • sent
    • 2010
    • 2011
    • 2012
    • 2013

That is, I'm going to reorganize my mail into folders based only on whether it was sent or received, and then on the year.
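The folder skeleton above can be sketched as a small dry-run generator. This is only an illustration: the function name print_create_commands is made up for this example, and it assumes you want the mailboxes created up front with doveadm mailbox create. It prints the commands without running them, so you can read them first and then pipe them to sh.

```shell
# Print one "doveadm mailbox create" command per direction and year.
# Nothing is executed here; review the output, then pipe it to sh.
# (print_create_commands is a hypothetical helper name.)
print_create_commands() {
  first_year=$1
  last_year=$2
  for direction in received sent; do
    year=$first_year
    while [ "$year" -le "$last_year" ]; do
      printf 'doveadm mailbox create %s/%s\n' "$direction" "$year"
      year=$((year + 1))
    done
  done
}

# The four years shown in the listing above
print_create_commands 2010 2013
```

Printing instead of executing keeps the step reversible: nothing touches the mail folder until the output has been checked.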

Pulling Gmail

For Gmail I'm using the excellent lieer tool to download all my email. This is my "working copy," not the archive: I use notmuch and lieer to move emails around in this copy (e.g. when I change their tags), and the copy is synchronized with Gmail in both directions. Even so, it is a reasonable starting place for creating an archive.

Importing

I am creating a temporary mailbox folder for each import, e.g. import-mbox.

For each set of mail, I first get the mail into a directory and then use doveadm import to bring it into the temporary folder. For example, the mbox export was in the folder ~/tmp/email as a bunch of files, e.g. ~/tmp/email/sent. I could import this with:

doveadm import ~/tmp/email import-mbox ALL

ALL is the search query: it matches every message, so all the mail is imported.

Restructuring the imported mail

I then used doveadm to search through this mail and restructure it, using commands like this:

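# SENTSINCE matches on or after the date and SENTBEFORE strictly before it,
# so this range is exactly the year 2010. The patterns are quoted so the shell
# passes them to doveadm unexpanded.
doveadm move sent/2010 mailbox 'import-mbox/sent*' SENTSINCE 2010-01-01 SENTBEFORE 2011-01-01
doveadm move received/2010 mailbox 'import-mbox/*' SENTSINCE 2010-01-01 SENTBEFORE 2011-01-01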
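Repeating that pair of commands for every year gets tedious, so here is a sketch of a generator for them. The function name print_move_commands and its argument order are invented for this example. It uses January 1 as the lower bound, on the understanding that SENTSINCE matches messages dated on or after the given day while SENTBEFORE matches strictly before it. Like any dry run, it only prints the commands.

```shell
# Print the two doveadm move commands for each year in a range.
# Nothing is executed; review the output, then pipe it to sh.
# (print_move_commands is a hypothetical helper name.)
print_move_commands() {
  import_box=$1    # temporary mailbox, e.g. import-mbox
  first_year=$2
  last_year=$3
  year=$first_year
  while [ "$year" -le "$last_year" ]; do
    since="$year-01-01"             # inclusive lower bound
    before="$((year + 1))-01-01"    # exclusive upper bound
    # Sent mail goes first, so the catch-all pattern only sees what is left.
    printf "doveadm move sent/%s mailbox '%s/sent*' SENTSINCE %s SENTBEFORE %s\n" \
      "$year" "$import_box" "$since" "$before"
    printf "doveadm move received/%s mailbox '%s/*' SENTSINCE %s SENTBEFORE %s\n" \
      "$year" "$import_box" "$since" "$before"
    year=$((year + 1))
  done
}

print_move_commands import-mbox 2010 2013
```

The order within each year matters: the received command uses a pattern that also matches the sent folders, so it has to run after the sent mail for that year has already been moved out.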

Even once that has worked, dovecot still separates the "unread" email from the "read" email, since after all dovecot is designed to be an IMAP server, not a mail archive manager. On the filesystem, "unread" email sits in folders called new and "read" email sits in folders called cur. I therefore went into the folders and moved the messages manually with mv, using a command similar to:

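# Move every message from new/ to cur/ within each mailbox
cd /path/to/mnemosyne/mail
for dir in */*/new
do
  for msg in "$dir"/*
  do
    [ -e "$msg" ] || continue          # new/ was empty; the glob matched nothing
    mv -i "$msg" "${dir%/new}/cur/"    # strip only the trailing /new from the path
  done
done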

I can use doveadm mailbox list and doveadm mailbox status all '*' to see what's happening within each mailbox directory.
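As a filesystem-level complement to those doveadm commands, a quick check can confirm that nothing is left in any new folder after the manual move. This is a minimal sketch; count_new_messages is a name made up for this example, and it simply counts regular files under directories called new.

```shell
# Count message files still sitting in any new/ directory below a mail root.
# (count_new_messages is a hypothetical helper name.)
count_new_messages() {
  mail_root=$1
  find "$mail_root" -type f -path '*/new/*' | wc -l | tr -d ' '
}

# Usage (path as in the post):
#   count_new_messages /path/to/mnemosyne/mail
```

A result of 0 means every message now lives in a cur folder.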

Deduplicating email

I am trying doveadm deduplicate ALL to remove duplicate email, but I don't yet know how well it works. I'm not sure how it decides that two messages are duplicates, so it may not do what I want it to do.
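One way to get a feel for the duplicates before deleting anything is to look for Message-ID values that occur in more than one file. The sketch below does that directly on the maildir files; duplicate_message_ids is a name invented for this example. It only reads the first Message-ID header of each message and ignores folded headers, so treat its output as a rough preview. As I read the doveadm-deduplicate man page, the default comparison is by message GUID and the -m option switches to the Message-Id header, which is worth verifying against your dovecot version.

```shell
# Print each Message-ID that appears in more than one message file.
# (duplicate_message_ids is a hypothetical helper name.)
duplicate_message_ids() {
  mail_root=$1
  # sed runs once per file: stop at the blank line that ends the headers,
  # or print the first non-empty Message-ID value and stop there.
  find "$mail_root" -type f \( -path '*/cur/*' -o -path '*/new/*' \) \
    -exec sed -n '
      /^$/q
      /^[Mm]essage-[Ii][Dd]:/{
        s/^[^:]*:[[:space:]]*\(..*\)$/\1/p
        q
      }
    ' {} \; | sort | uniq -d
}

# Usage (path as in the post):
#   duplicate_message_ids /path/to/mnemosyne/mail
```

This never modifies the mail, so it is safe to run before and after a deduplication attempt to see what changed.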

Using the archive

To use the archive I created a second notmuch config file at /path/to/mnemosyne/mail/notmuch-config, then pointed notmuch at it and built the index:

export NOTMUCH_CONFIG=/path/to/mnemosyne/mail/notmuch-config
notmuch new

This indexes all the email. The best thing about notmuch in this case is that notmuch creates its own index and does not modify the source email. I can therefore throw away and recreate notmuch's indexes if needed.
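The contents of that second config file aren't shown here. A minimal sketch, assuming the only setting that has to differ from the everyday config is the database path, might be generated like this. The function name write_archive_notmuch_config is made up for this example, and a real config may well want more sections (user details, tags for new mail, and so on).

```shell
# Write a bare-bones notmuch config that points at the archive's mail root.
# (write_archive_notmuch_config is a hypothetical helper name; only the
# [database] path setting is written, which is an assumption of this sketch.)
write_archive_notmuch_config() {
  config_file=$1
  mail_root=$2
  printf '[database]\npath=%s\n' "$mail_root" > "$config_file"
}

# Usage (paths as in the post):
#   write_archive_notmuch_config /path/to/mnemosyne/mail/notmuch-config /path/to/mnemosyne/mail
```

Because the index can be rebuilt from the mail at any time, a config this small is cheap to throw away and regenerate along with it.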

I can then use notmuch's search and show commands to find and export email into other reading programs.
