Data Imports with Migrate and Table Wizard

Many of the Drupal projects we're currently working on involve large imports of legacy data. The specific subject of this post is a user import I recently completed for NewsU.org, a project of the Poynter Institute funded by the Knight Foundation. In this post I'll be talking about the process I used to migrate their 110,000+ users from SQL Server to Drupal, but many of the concepts apply regardless of the source database and the type of content you're dealing with. If you'd like to read the less nerdy bits about the NewsU project, you can visit the project blog at next.newsu.org.

The Process

Let me begin by talking a bit about the process I've been using. My plan of attack has several steps:

1. Load the database backup into a SQL Server instance.
For this I've been using Amazon EC2 instances (thanks to greggles' suggestion). Many of the database backups I've been working with exceed the 2 gigabyte limit present in the developer version of SQL Server. Given the choice, we've been paying $1.50 an hour to use an appropriate Windows server with SQL Server 2008 on the EC2 platform rather than spending thousands of dollars on a SQL Server license.

2. Use the MySQL Migration Toolkit to create a MySQL schema and retrieve the data from SQL Server.
The MySQL Migration Toolkit will do a good bit of the heavy lifting involved in mapping your SQL Server schema to something appropriate for MySQL, but it will often still take a lot of manual massaging of the create table statements it provides to get something that will work for you.

3. Get the data into your Drupal database.
This is fairly straightforward. The result of the MySQL Migration Toolkit step is typically two MySQL import files: one for the table structure and one for the data. Import them as you would any other MySQL database dump (a sample invocation follows this list).

4. Perform the field mappings and run the import.
This is the most involved of all the steps. Use the Migrate and Table Wizard modules to Views-enable the tables you have imported, perform field mappings with the Migrate module, then import the data.
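
For the import in step three, it's just the standard mysql client invocation; the database and file names below are placeholders for whatever the Migration Toolkit produced:

mysql -u db_user -p drupal_site < newsu_structure.sql
mysql -u db_user -p drupal_site < newsu_data.sql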

The Export

In the interest of time, I'm going to skip over the process of spawning an EC2 server with SQL Server and restoring a backup on it. Instead, I'm going to start at the point where I already have SQL Server set up with the source database loaded. From there I can load up the MySQL Migration Toolkit and get to work.

The first step is to input the database credentials for my source database and my target database.

Once it has connection credentials, the Migration Toolkit will examine which tables and stored procedures are available to import.

In this case, because I only need a snapshot of this data to import from, I'm only interested in the table data. I'm not interested in maintaining data integrity over time, or in doing any of the dozens of other things the stored procedures might do; I just need the data exactly as it exists at this particular moment in time. After I select which tables I want to import, the Migration Toolkit will automatically take a shot at generating applicable MySQL create table statements.

I also have the opportunity to run the create table statements against my target database so I can immediately spot any errors in what the Migration Toolkit generates. This typically takes a couple of rounds of trial and error. Once I have something that works, I can proceed to exporting the data. There's not a whole lot to this step beyond letting the Migration Toolkit crunch away for a while, selecting data from SQL Server and transforming it into MySQL insert statements.
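
To give a flavor of the massaging involved, here's a hypothetical example (the table and column names are invented, not from the NewsU schema). A generated statement might carry over a default value on a TEXT column, which MySQL rejects, so the hand-tuned version drops it:

-- Generated by the Migration Toolkit; fails because MySQL TEXT columns
-- cannot have a DEFAULT value:
--   CREATE TABLE `tableUser` (`autUserID` INT, `strNotes` TEXT DEFAULT '');
-- Hand-tuned for MySQL:
CREATE TABLE `tableUser` (
  `autUserID` INT NOT NULL AUTO_INCREMENT,
  `strNotes` TEXT,
  PRIMARY KEY (`autUserID`)
) DEFAULT CHARSET=utf8;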

From there, I perform an import of the database dump into MySQL (or, if you transferred the objects online rather than saving database exports in the previous steps, the data will already be there).

The Import

At this point I can start taking advantage of the tools which Migrate and Table Wizard provide. I'll start by providing some background information on the modules, and what each is used for.

<dl>
  <dt><strong>Table Wizard</strong></dt>
    <dd>The Table Wizard module provides an interface for exposing any table in your database to Views. Furthermore, you can define relationships between tables by tagging columns as foreign keys. Although in this case I'm only using it to expose the data to Views for import purposes, it has a wide range of uses outside of data migrations. If you haven't tried it yet, <a href="http://drupal.org/project/tw">go get it</a>. Now. Really.</dd>
    <dt><strong>Migrate</strong></dt>
    <dd>Migrate provides a mechanism for mapping Views data to Drupal objects. Once you have whatever data you are importing exposed to Views, Migrate will allow you to map data in the Views fields to various properties on the object. If that sounds a bit abstract, it's because it is. Out of the box it provides support for importing data to users, content types, taxonomies, profiles and more. Furthermore it provides hooks for you to declare your own destination objects.</dd>
</dl>

The next step is to Views-enable the table I'd like to import from. To do that I'll navigate to the Table Wizard section under Administer -> Content and select the table I'd like to expose to Views.

Once this is done, I can go in and view the range of values represented by the data, comment on the purpose of various fields, or use Views to display the content of the table.

Now is where the real fun begins. With my data imported into MySQL and Views-enabled through Table Wizard, I can leverage the Migrate module to map values in the various database columns to my Drupal users. First, a bit of background. For profiles on this site, we needed a range of field types which weren't provided by the core profile module, so we decided to use content profile. The catch is that, from a Drupal perspective, both profile and user information were combined in the legacy database: I need to be able to simultaneously map values from the legacy database to both Drupal users and content profile fields.

Luckily, Migrate has built-in support for the content profile module and automatically adds the available mappings from profile nodes into the user import. Migrate is also smart enough to map and save data for the Drupal user before anything happens on the profile node, ensuring a Drupal uid is in place so the profile node can be correctly associated with the Drupal user. With this in mind I can head over to Admin -> Content -> Migrate and create a new migration set.

The migration set allows me to map column values from my views onto the appropriate Drupal fields. It should be noted that anything mapped here is treated as a raw value, with no additional processing done to it. If I need additional processing, like transforming a SQL date field into a Unix timestamp, I'll typically leave the field out of the mapping and process it in the hooks Migrate provides.
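
For example, here's a minimal sketch of that kind of transformation done in one of Migrate's prepare hooks rather than in the mapping UI. The dteCreated column is a hypothetical stand-in for a legacy SQL datetime field:

<?php
function example_migrate_prepare_user(&$account, $tblinfo, $row) {
  // Convert a legacy SQL datetime string (e.g. '2009-03-15 08:30:00')
  // into the Unix timestamp Drupal expects for the account creation date.
  if (!empty($row->dteCreated)) {
    $account['created'] = strtotime($row->dteCreated);
  }
}
?>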

If I have data which needs additional processing in some form, hope is not lost. Migrate exposes both the Drupal objects I'm working with and the Views data via hooks, so I can perform any additional processing I need. Here's a toned-down example of what I'm using for user profiles. In this case I'm pulling data from a handful of tables with many-to-one relationships to the legacy user data, as well as preserving legacy course enrollment data (the courses are organic groups in the new Drupal site).

<?php
// NOTE: The opening of this snippet was lost in the original post. The
// implementation name and signature below are reconstructed from Migrate's
// Drupal 6 hook API (hook_migrate_prepare_node()).
function newsu_user_import_migrate_prepare_node(&$node, $tblinfo, $row) {
  $errors = array();
  $node->taxonomy = array();

  // Determine user 'role'.
  $roles = db_query("
    SELECT t.tid FROM {tableRole} tr
    INNER JOIN {tableLinkRoleToUser} l ON l.intRoleID_fk = tr.autRoleID
    INNER JOIN {term_data} t ON t.name = tr.strRole
    WHERE l.intUserID_fk = %d", $row->autUserID
  );
  while ($role = db_fetch_object($roles)) {
    $node->taxonomy[] = $role->tid;
  }

  // Determine user 'medium'.
  $mediums = db_query("
    SELECT t.tid FROM {tableMedium} tm
    INNER JOIN {tableLinkMediumToUser} l ON l.intMediumID_fk = tm.autMediumID
    INNER JOIN {term_data} t ON t.name = tm.strMedium
    WHERE l.intUserID_fk = %d", $row->autUserID
  );
  while ($medium = db_fetch_object($mediums)) {
    $node->taxonomy[] = $medium->tid;
  }

  // Add in course enrollment.
  $user_courses = db_query("SELECT * FROM {LEGACY_COURSE_ROSTER} cr WHERE cr.USER_ID = '%s'", $row->userLegacy_strUsername);
  while ($user_course = db_fetch_object($user_courses)) {
    $course_nid = newsu_user_import_get_course_nid($user_course->COURSE_ID);
    if ($course_nid && $node->uid) {
      og_save_subscription($course_nid, $node->uid, array('is_active' => 1));
    }
  }

  // ....

  return $errors;
}
?>

Naturally, there's lots of other processing I might (and do!) need, but this should give you the basic idea of how the process works. Once I have both my mappings in place and my code for additional processing, I can start the data import using Migrate. Before I do, though, I need to keep in mind that Drupal loves sending emails to users when accounts are created. I'll want to disable that to prevent thousands of users from getting emails pointing them to my development site and informing them their accounts have been created. I'll do this by using the <a href="http://drupal.org/project/devel">devel</a> module and setting SMTP to log only on the devel settings page. Once that's done I can start the import by heading over to Admin -> Content -> Migrate -> Process and running the user import process. Again, and again, and again, and... well, you get the point. The cycle of running a sample import, verifying the data, making any adjustments, and running it again can be very time consuming, but it's integral to making sure the import works as planned.
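
If you'd rather flip that switch from code, the devel option works by pointing Drupal's smtp_library variable at devel.module, which supplies a mail wrapper that logs instead of sending. A minimal sketch, assuming devel 6.x is installed and enabled:

<?php
// Route all outgoing mail through devel's logging wrapper rather than
// actually sending it. Equivalent to choosing "log only" on the devel
// settings page in Drupal 6.
variable_set('smtp_library', drupal_get_path('module', 'devel') . '/devel.module');
?>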

Updates, Updates, Updates

So far I've covered the process of initially importing the data, which is useful while we develop the site, but what about a couple of months from now when the site is ready to go live? How do I handle changes to old accounts and import new users? It turns out there are two ways I can go about that. The first is quite simple: I use data from the view row in the migrate hook I'm implementing to figure out whether an object of that type already exists. In the case of this user import, here's the code I'm using to figure out whether the user being imported is already in the system.

<?php
// NOTE: The opening of this snippet was lost in the original post; the hook
// name and the uid lookup query are reconstructed from context.
function newsu_user_import_migrate_prepare_user(&$account, $tblinfo, $row) {
  if ($uid = db_result(db_query("SELECT uid FROM {users} WHERE name = '%s'", $row->userLegacy_strUsername))) {
    $account['uid'] = $uid;
  }
  // ....
}
?>

In the event the user has already been imported, it's easy enough: I just set the uid on the $account array to the existing Drupal uid, and the call to user_save() will perform an update rather than adding a new user.

The second possibility requires a bit more knowledge about how Migrate works. Every content set created by Migrate gets an entry in the migrate_content_sets table, which contains a unique migration content set id, or mcsid. Migrate also creates a table for each content set, named migrate_map_<mcsid>. This table contains two columns: a sourceid, which stores the primary key from the legacy data, and a destid, which stores the new id of the Drupal object. Migrate does this for two reasons. First, when importing a content set Migrate needs to keep track of what it has processed and what it hasn't, and this table provides an easy way to do so. Second, a later import of related data may need to preserve a relationship present in the legacy data, and the migrate_map_<mcsid> table provides an easy mechanism for that.

To use this method, run a full import once. You can then copy the data from migrate_map_<mcsid> into a new table, truncate the original (to make Migrate think it hasn't imported your data set yet), and run queries against your copy to match Drupal IDs to the legacy data's source ids. This is useful in situations where the primary key in the legacy data is your only means of ensuring the uniqueness of content, such as article content where titles and other fields can't be counted on to be unique.
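
In practice, that workflow might look something like this sketch, assuming a content set whose mcsid is 1 (the sourceid and destid columns are the ones described above):

<?php
// Preserve the legacy-to-Drupal ID map before re-running the import.
db_query("CREATE TABLE {migrate_map_1_copy} SELECT * FROM {migrate_map_1}");

// Make Migrate believe this content set has never been imported.
db_query("TRUNCATE TABLE {migrate_map_1}");

// Later, look up the Drupal ID assigned to a given legacy primary key.
$destid = db_result(db_query("SELECT destid FROM {migrate_map_1_copy} WHERE sourceid = %d", $legacy_id));
?>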

Wrapping it up

If you're still with me, I'm proud of you. That pretty much wraps things up. At this point I've covered exporting data from SQL Server to MySQL, setting up field mappings using Migrate and Table Wizard, writing additional code to process values which can't be directly mapped, and managing updates to data that has previously been imported. Thanks should also go to Mike Ryan and Moshe Weitzman of Cyrve for their work on Migrate and Table Wizard.
