At this point in my life I've had a goodly number of computers and other technology equipment. I have emails dating back to the late 1990s. The historical data I've kept has been all over the place: random Google Drive folders; Dropbox; my local hard drive; my web sites.

What follows describes what I've figured out so far about how to create a conscious "life archive." The idea behind this project is to have one place to store ALL my digital content, especially the archival content.

I'm trying to apply lessons learned from my own archives but also from my work: I've worked in library IT; I've worked with digitizing paper records; I've done a large number of server transitions moving data+technology from 20 years ago into new systems; I've restructured and reorganized knowledge bases and file shares.

Principles

  • allow unorganized content that will be structured "later"
  • keep a relatively structured layout for content that has been "processed"/ingested
  • file formats that will last
  • do not worry about the size of the archive

Location

A limiter for me before this project has been my computer's local storage: between my old mp3s, movies, and other content, I've consistently had 300+ GB of content that I'm trying to corral.

Right now I'm experimenting with using Dropbox Plus and its "Smart Sync" feature. Smart Sync works similarly to Google Drive File Stream: files can be online-only, so the data isn't actually on your computer. I'm doing this so that my archive is not constrained by the size of my local computer's hard drive.

I subscribed to Dropbox Plus and created a folder within Dropbox for the archive, which I've called mnemosyne.

Warning re: structuring content

At least for Dropbox, I've found that its performance depends on the number of files, not just the total size of the files. If you have a large number of small files, consider creating a tarball/zip file of them and putting that into the life archive instead. For example, I had a few web archive folders with tens of thousands of tiny files in them. By compressing each of these archives into one file, I gained much faster backups at the (small) expense of not being able to run file comparisons on or search the archive content.
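
As a rough sketch, the bundling step could look like this in Python, using only the standard-library tarfile module. The function names and the folder names in the usage note are hypothetical, chosen for illustration; this is one way to do it, not a description of the exact commands used for this archive.

```python
import tarfile
from pathlib import Path


def bundle_folder(folder: Path, dest_dir: Path) -> Path:
    """Pack `folder` into one gzip-compressed tarball inside `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / f"{folder.name}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps the paths inside the tarball relative to the folder name
        tar.add(str(folder), arcname=folder.name)
    return archive_path


def count_members(archive_path: Path) -> int:
    """Count regular files in the tarball, as a sanity check before removing the originals."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())
```

For example, bundle_folder(Path("old-site-2009"), Path("mnemosyne/web-archives")) would write mnemosyne/web-archives/old-site-2009.tar.gz. Comparing count_members against the number of files in the original folder is a cheap check that nothing was skipped.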

Structure

Within the mnemosyne folder, I've structured content like so:

  • ingest/: This folder is where I put unstructured content. Right now, for example, I have:
    • Google-Drive-archive/: this is my personal Google Drive "archive" folder that I've had for a long time but never dealt with. Notably, Google Drive lets you download folders and will convert Google Docs/Sheets into MS Word/MS Excel files.
    • old-phone-20130503/: this is a copy of my old Android phone that I never did anything with.
    • pre-2020 dropbox/: this is old Dropbox content that I don't want to deal with.
    • etc: I add other folders here, one folder per data source.

And then structured folders for content. These folders usually have some date representation underneath them for their content. I create new folders whenever I have a different type of structured data.

  • art/: Non-photographs that I've made or that I like.
  • backgrounds/: Desktop backgrounds
  • books/: Digital books that I own.
  • camps/: Camps for my kid by year (e.g. camps/2016 for camps they went to in 2016).
  • certificates/: Professional certificate documentation, e.g. my ITIL expert certificate.
  • code/: Archived programming code, named YYYY-MM projectname e.g. 2019-12 day-one-converter. Ideally these are the git repos.
  • contacts/: Ideally, historic copies of my address contacts. [1]
  • eve-pvp/: When I played EVE I recorded myself in PVP, mainly because I was really bad and panicky and wanted to learn. This folder exists outside of videos/ because I have so many of these videos and they're really specific.
  • evernote/: Quarterly backups of my Evernote export.
  • guitar/: Guitar tabs.
  • itunes/: Copies of my historic iTunes metadata.
  • job-searches/: Turns out I have kept copies of my cover letters and resumes for a bunch of jobs.
  • keys/: Historic GPG/other security-related keys.
  • letters/: Letters I've written.
  • mail/: Eventually this will hold Maildir-formatted emails, maybe structured as YYYY/sent and YYYY/received. This will be where I put the "archival copy" of my emails. I am probably going to populate this from Gmail and then purge Gmail of my email from prior to, say, 2018. This isn't too big of a deal for me because I now use other systems for email searching, such as notmuch(1).
  • medical/: Medical records.
  • minecraft-saves/: Archival copies of Minecraft worlds.
  • music/: My mp3s. This is structured the way iTunes structures music: Artist/Album/song.mp3.
  • my-journal/: Any journal entries I've written, in text or text-compatible format (e.g. markdown, org-mode). Apparently I was journaling using Day One for a few years; I exported my Day One archives into text and then I wrote a Perl program to split out the entries into their own files.
  • organizations/: Records related to specific organizations, such as the NCSU Perl Mongers chapter.
  • people/: Random files pertaining to specific people.
  • pictures/: Pictures, structured into subfolders by YYYY/MM/DD.
  • pinboard/: I export my pinboard bookmarks quarterly with a date prefix. Those archives go here.
  • playlists/: Song playlists.
  • presentations/: Any presentations I've given, structured as YYYY-MM presentation.
  • projects/: Artifacts related to projects or project-like things, structured as YYYY-MM project name. For example, I have project content for each of our house moves.
  • receipts/: Random receipts.
  • recordings/: Audio recordings in folders, prefixed with YYYY-MM e.g. 2018-02 song. This way the recordings sort. One folder could potentially have many recordings.
  • resume/: My authoritative resume plus past copies of the resume.
  • school/: Records related to my going to school.
  • screenshots/: I have more screenshots than I expected! Organized by what the screenshot is about, e.g. screenshots from Minecraft.
  • social-media/: Exports from social media, e.g. my Twitter export.
  • software/: Copies of software that I bought but may not be easy to find in the future. Right now this is mostly versions of TurboTax going back to 2009.
  • taxes/: Tax returns.
  • teaching/: Records related to classes I've taught.
  • toodledo/: Archive of my tasks from Toodledo, the task management system.
  • travel/: Records related to travel, e.g. itineraries.
  • videos/: Video content that I've created (vs. movies/), structured into folders as YYYY-MM video name.
  • web-archives/: Archives of web sites over time, including my web site as well as websites I've maintained for others.
  • writing/: Anything I've written that's longer form such as an article or report, in folders named YYYY-MM name.

I will add more folders as I learn about other structured content types.
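
The naming conventions above (a "YYYY-MM name" folder for things like code, projects, and writing, and YYYY/MM/DD subfolders for pictures) are easy to get subtly wrong by hand. Here is a small sketch of helpers that build those paths consistently. The helper names are hypothetical, and the sketch assumes the date of each item is already known.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_folder(category: str, when: date, name: str) -> PurePosixPath:
    """Build a 'category/YYYY-MM name' path, the pattern used for code/, projects/, writing/."""
    return PurePosixPath(category) / f"{when:%Y-%m} {name}"


def picture_folder(when: date) -> PurePosixPath:
    """Build the 'pictures/YYYY/MM/DD' path for a picture taken on `when`."""
    return PurePosixPath("pictures") / f"{when:%Y}" / f"{when:%m}" / f"{when:%d}"


print(dated_folder("code", date(2019, 12, 1), "day-one-converter"))
# prints: code/2019-12 day-one-converter
print(picture_folder(date(2013, 5, 3)))
# prints: pictures/2013/05/03
```

Zero-padding the month and day is what makes these folders sort chronologically in a plain file listing.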

What doesn't go in the life archive?

I am thinking about this archive as being for content that's not going to change. So code I'm actively developing, for example, wouldn't go in there. This wiki wouldn't go in there, unless I was taking a backup copy.

My idea is this archive is where stuff goes that I want to be able to refer to later, but I am not going to edit. Ideally, the archive would be "append-only."
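
One way to check the "append-only" ideal is to record a checksum manifest and compare it against a later one: new files are fine, but anything removed or modified is flagged. The following is a minimal sketch under that assumption. The function names are hypothetical, and nothing here is a feature of Dropbox or Arq.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large videos don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def append_only_violations(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were removed or modified; newly added files are allowed."""
    violations = []
    for rel_path, digest in old.items():
        if rel_path not in new:
            violations.append(f"removed: {rel_path}")
        elif new[rel_path] != digest:
            violations.append(f"modified: {rel_path}")
    return violations
```

Because hashing reads every byte of every file, this is best run against a fully local copy of the archive, such as the external hard drive, rather than against online-only files.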

Backups

The "authoritative" copy of mnemosyne is on my Dropbox. However, I want to have redundant copies to ensure that I can still access my files even if something happens to Dropbox. To that end, I create two backups:

  1. external hard drive
  2. Arq backup of external hard drive

External hard drive

Backing up to the external hard drive makes extensive use of rsync. This is somewhat complicated by Smart Sync. Mainly, at least on OS X, you need to ensure that Dropbox's "Smart sync For OS X" setting is off. If it's on, trying to rsync online-only files will not work: all your files will come across as 0 bytes, so rsync will try to sync them again.

In my initial backup run, I told Dropbox to pull the Smart Sync folders down to my local machine, and then I ran rsync on them. My hope is that in subsequent runs I can just use rsync and it works.
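
A guard against the 0-byte problem might look like the sketch below: before calling rsync, scan the source for files that report a size of 0 bytes and stop if any turn up. This assumes online-only files show up as 0 bytes on the source side too, which is an assumption worth verifying, and the check is only a heuristic because legitimately empty files exist. The function names are hypothetical. The rsync options used (--archive, --verbose) are standard ones, not necessarily the ones used for this archive.

```python
import subprocess
from pathlib import Path


def zero_byte_files(root: Path) -> list[Path]:
    """List files under root that report a size of 0 bytes."""
    return [p for p in sorted(root.rglob("*")) if p.is_file() and p.stat().st_size == 0]


def rsync_command(source: Path, dest: Path) -> list[str]:
    """Build the rsync invocation; the trailing slash copies the folder's contents."""
    return ["rsync", "--archive", "--verbose", f"{source}/", str(dest)]


def backup(source: Path, dest: Path) -> None:
    """Refuse to run rsync while the source contains suspicious 0-byte files."""
    suspicious = zero_byte_files(source)
    if suspicious:
        raise RuntimeError(f"{len(suspicious)} zero-byte file(s) found; are they online-only?")
    subprocess.run(rsync_command(source, dest), check=True)
```

Failing loudly is the point of the design: a backup that silently copies thousands of empty placeholders looks successful until the day it is needed.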

Arq backup of external hard drive

I use (and love) Arq for backups. You can tell Arq to back up an external drive. I run the Arq backup.

Tinfoil hat time: protecting external drive

I am working on putting the external hard drive in a shielded container so that my backups are also protected in the event of a solar flare/EMP. I am pretty sure this is ridiculous.

Footnotes:

[1] I have not figured this out yet.