Posts by Brad
Data Imports with Migrate and Table Wizard
Many of the Drupal projects we’re working on currently have large imports of legacy data. The specific subject of this post is a user import I’ve recently completed for NewsU.org, which is a project of the Poynter Institute and funded by the Knight Foundation. In this post I’ll be talking about the process I used to migrate their 110,000+ users from SQL Server to Drupal, but many of the concepts apply regardless of the source database and the type of content you are dealing with. If you’d like to read the less nerdy bits about the NewsU project, you can visit the project blog at next.newsu.org.
The Process
Let me begin by talking a bit about the process I’ve been using. My plan of attack has several steps:
- Load database backup into a SQL server instance
- For this I’ve been using Amazon EC2 instances (thanks to greggles suggestion). Many of the database backups I’ve been working with exceed the 2 gigabyte limit present in the developer version of SQL server. Left with a choice, we’ve been paying $1.50 an hour to use an appropriate Windows server with SQL Server 2008 on the EC2 platform, rather than spending thousands of dollars on a SQL Server license.
- Use the MySQL Migration Toolkit to create a MySQL schema and retrieve the data from SQL Server.
- The MySQL Migration Toolkit will do a good bit of the heavy lifting involved when mapping your SQL Server schema to something appropriate for MySQL, but it will often still take a lot of manual massaging of the create table statements it provides to get something that will work for you.
- Get the data into your Drupal database.
- This is fairly straightforward. The result of the MySQL Migration Toolkit step is typically two MySQL import files. One of these is for the table structure, and one is for the data. Import them as you would any other MySQL database dump.
- Field mappings and import process.
- This is the most involved of all steps. Use the Migrate + Table Wizard modules to Views-enable the required tables you have imported, perform field mappings with the migrate module, then import the data.
The Export
In the interest of time, I’m going to skip over the process of spawning a EC2 server with SQL Server and restoring a backup on it. Instead, I’m going to start at the point at which I already have SQL Server set up with the source database loaded. Once that is complete I can load up the MySQL Migration Toolkit and get to work.
The first step is to input the database credentials for my source database …

… and target database.

Once it has connection credentials the Migration Toolkit will examine what tables and stored procedures are available to import.

In this case, because I only need a snapshot of this data to perform an import from, I’m only interested in the table data. I’m not interested in maintaining data integrity over time, or doing any of the dozens of other things the stored procedures might do. I just need the data exactly how it exists at this particular moment in time to base this import on. After selecting which tables I want to import the Migration Toolkit will automatically take a shot at making an applicable MySQL create table statement.

I also have the opportunity to run the create table statements against my target database so I immediately spot any errors in what is automatically generated by the Migration Toolkit. This typically takes a couple rounds of trial and error to get something that will work. Once I have something that works, I can proceed to exporting the data. There’s not a whole lot to this, except to let the Migration Toolkit crunch away for a while, selecting data from SQL Server and transforming it into MySQL insert statements.

From there, I perform an import of the database dump into MySQL (or, if you transferred the objects online rather than saved database exports in the previous steps, it will already be there).
The Import
At this point I can start taking advantage of the tools which Migrate and Table Wizard provide. I’ll start by providing some background information on the modules, and what each is used for.
- Table Wizard
- The table wizard module provides an interface for exposing any table in your database to Views. Furthermore, you can define relationships between tables by tagging various columns as foreign keys. Although in this case I’m only using it to expose the data to Views for data import purposes, it has a wide range of use outside of data migrations. If you haven’t tried it yet, go get it. Now. Really.
- Migrate
- Migrate provides a mechanism for mapping Views data to Drupal objects. Once you have whatever data you are importing exposed to Views, Migrate will allow you to map data in the Views fields to various properties on the object. If that sounds a bit abstract, it’s because it is. Out of the box it provides support for importing data to users, content types, taxonomies, profiles and more. Furthermore it provides hooks for you to declare your own destination objects.
The next step is to Views-enable the table I’d like to import from. To do that I’ll navigate to the Table Wizard section under Administer -> Content and select the table I’d like to expose to Views.

Once this is done, I can go in and view the range of values represented by the data, comment on the purpose of various fields, or use Views to display the content of the table.

Now is where the real fun begins. Now that I’ve imported my data to MySQL, and used Table Wizard to Views-enable it, I can leverage the migrate module to map values in the various database columns to my Drupal users. First, a bit of background on that. For profiles on this site, we needed a range of field types which weren’t provided by the core profile module. For this reason we decided to use content profile. The catch is that from a Drupal perspective, both profile and user information were combined in the legacy database. I need to be able to simultaneously map values from the legacy database to both Drupal users and content profile fields. Luckily the migration module has built-in support for the content profile module and automatically adds available mappings from profile nodes into the user import. Migrate will also be smart enough to map and save data for the Drupal user before anything happens on the profile node, ensuring a Drupal uid is in place and the profile node can be correctly associated with the Drupal user. With this in mind I can head over to Admin -> Content -> Migrate and create a new migration set.
The migration set will allow me to map column values from my views into the appropriate Drupal fields. It should be noted that anything being mapped here is treated as a raw value, without any additional processing done to it. Typically if I need to do any additional processing, like transforming a SQL Date field into a Unix timestamp, I’ll just leave it out of the mapping here and process it in the hooks which migrate provides.

If I have data which needs additional processing in some form, hope is not lost. Migrate exposes both the Drupal objects I’m working with as well as the Views data via hooks, so I can perform any additional processing I need. Here’s a toned down example of what I’m using for user profiles. In this case I’m pulling data from a handful of tables with many to one relationships with the legacy user data, as well as preserving legacy course enrollment data (which are organic groups in the new Drupal site).
<?php /** * Implementation of hook_migrate_prepare_node(). * * Provides customizations for the user profile node before it gets sent off for saving. */ function newsu_user_import_migrate_prepare_node(&$node, $tblinfo, $row) { // …. $node->taxonomy = array(); // Determine user 'role'. $roles = db_query(" SELECT t.tid FROM {tableRole} tr INNER JOIN {tablelLinkRoleToUser} l ON l.intRoleID_fk = tr.autRoleID INNER JOIN {term_data} t ON t.name = tr.strRole WHERE l.intUserID_fk = %d", $row->autUserID ); while ($role = db_fetch_object($roles)) { $node->taxonomy[] = $role->tid; } // Determine user 'medium'. $mediums = db_query(" SELECT t.tid FROM {tableMedium} tr INNER JOIN {tableLinkkMediumToUser} l ON l.intMediumID_fk = tr.autMediumID INNER JOIN {term_data} t ON t.name = tr.strMedium WHERE l.intUserID_fk = %d", $row->autUserID ); while ($medium = db_fetch_object($mediums)) { $node->taxonomy[] = $medium->tid; } // Add in course enrollment. $user_courses = db_query("SELECT * from {LEGACY_COURSE_ROSTER} cr WHERE cr.USER_ID = '%s'", $row->userLegacy_strUsername); while ($user_course = db_fetch_object($user_courses)) { if ($course_nid = newsu_user_import_get_course_nid($user_course->COURSE_ID)) { if ($course_nid && $node->uid) { og_save_subscription($course_nid, $node->uid, array('is_active' => 1)); } } } // …. return $errors; } ?>
Naturally, there’s lots of other processing I might (and do!) need, but this should give you the basic idea of how the process works. Once I have both my mappings in place and my code for additional processing I can start the data import using Migrate. Before I do that though, I’ll want to keep in mind that Drupal loves sending emails to users when accounts are created, I’m going to want to make sure I disable that to prevent thousands of users from getting emails pointing them to my development informing them their accounts have been created. I’ll do this by using the a href=”http://drupal.org/project/devel”>devel module and setting SMTP to log only on the devel settings page. Once that’s done I can start the import. To do this I’ll head over to Admin -> Content -> Migrate -> Process and run the user import process. Again, and again, and again and.. well, you get the point. The cycle of running a sample import, verifying the data, making any adjustments needed, and running it again can be a very time consuming process. However, it’s integral to making sure the import is working as planned.
Updates, Updates, Updates
So far I’ve covered the process of initially importing the data, which might be useful as we develop the site, but what about a couple months from now when the site is ready to go live? How do I handle changes to old accounts, and import new users? Well, it turns out the solution to that is pretty easy. There are actually two ways I can go about that. The first is quite simple. I just use data from the view row in the migrate hook I am implementing to figure out if an object of that type already exists. In the case of this user import, here’s the code I’m using to figure out if the user I’m importing is already in the system.
<?php /** * Implementation of hook_migrate_prepare_user(). */ function newsu_user_import_migrate_prepare_user(&$account, $tblinfo, $row) { if ($uid = db_result(db_query("SELECT uid FROM {users} WHERE name = '%s'", $row->userLegacy_strUsername))) { $account['uid'] = $uid; } // …. } ?>
In the event the user has already been imported, easy enough, I just set the uid on the $account array to equal their existing Drupal uid and the call to user_save will perform an update, rather than adding a user.
The second possibility requires a bit more knowledge about how Migrate works. Every content set which gets created by Migrate gets an entry in the migrate_content_sets table, which contains a unique migration content set id, or mcsid. Migrate also creates a table for each content set, in the form of migrate_map_<mcsid>. This table contains two columns, a sourceid which stored the primary key from the legacy data, and the destid which stored the new id of the Drupal Object. Migrate does this for two reasons. The first is that when doing an import on a content set Migrate needs to keep track of what it has processed and what it hasn’t. This table provides an easy way for Migrate to track this information. The second reason is that there is also the possibility that a later import of related data will need to preserve a relationship which was present, and the migrate_map_<mcsid> table provides an easy mechanism for doing that. In either case, when using this method just run a full import once. You can then copy the data from migrate_map_<mcsid> into a new table, truncate the original (to make migrate think it hasn’t imported your data set yet) and run queries to get Drupal id’s against your copy, attempting to match against the legacy data’s source id. This is useful for situations where the primary key in the legacy data is your only source for ensuring the uniqueness of content, like article content where titles and other fields cannot be counted on to be unique.
Wrapping it up
If you’re still with me, I’m proud of you. That pretty much wraps things up, at this point I’ve covered exporting data from SQL server to MySQL, setting up field mappings using Migrate and Table Wizard, writing additional code to process values which can’t be directly mapped, and managing updates to data that’s previously been imported. Thanks should also go to Mike Ryan and Moshe Weitzman of Cyrve for their work on Migrate and Table Wizard.
Safely Querying CCK Data Types
I recently came upon an article by Matt Johnson talking about the dangers of querying CCK tables directly. Due to shifting data within the CCK schema depending on a number of factors, directly querying tables can be hazardous if not done correctly. It’s easy to forget this until you’ve done a site where you’ve written dozens of queries directly against CCK tables, only to later change fields on your content types and have them break. Let’s take the example of building a site which catalogs artist and album information. I might start by creating an artist content type, with a field on it which defines the record label the artist is current signed to. Upon doing this the table structure would look like this:

So if after I’ve developed for a while and written a handful of queries which depend on the “label” field to work correctly I decide it’s time to add a content type for albums, this type also needs a label field as artists might switch labels between albums. So after creating my artist content type and sharing the field, our table structure looks like this:

The catch when querying these tables directly is that even if you already have data in your fields and change the field definitions, the CCK will change the schema and move the data around as necessary, making direct queries against these fields hazardous. In his article Matt recommends using views as an abstraction layer for queries by instantiating the view objects and executing the query, here’s a simplified example:
<?php $view = views_get_view($view_name); $view->execute(); if ($view->result) { … } ?>
Let’s take a look at the pros and cons of this approach:
Pros:
- Queries which are more stable, since views handles any shifting within the CCK table schema automatically.
- Ability to take advantage of the recently committed views caching patch written by Jeff Eaton and Earl Miles. This can offer performance gains if you’re not already using some form of external caching.
Cons:
- Needing to write a different view for every query on the CCK schema you’re making.
- The need to export those views to code.
Although the views method works, and has some advantages, I find it a little cumbersome to create a view for every custom query I’m running on a site. An alternate method I’ve been using for some time is to find the table information directly through the CCK API and run the queries based on the information it provides. Heres an example:
<?php // Get an array of field settings for the record label field. $field_info = content_fields('field_label'); // Get the database table and column(s) information for the field. $db_info = content_database_info($field_info); // Get all nid's which have a label field value which matches the $record_label query parameter. $nids = db_query("SELECT n.nid FROM {node} n INNER JOIN {". $db_info['table'] ."} fd ON fd.vid = n.vid and fd.nid = n.nid WHERE fd.". $db_info['columns']['value']['column'] ." = '%s'", $record_label); ?>
I much prefer this method as it maintains all the advantages of the views method in terms of stability, yet avoids the need to clutter the views interface, and maintains portability without the need to export the views you define to code.
Platinum Sponsors at DrupalCamp Colorado!
We're proud to be Platinum Sponsors for this years DrupalCamp Colorado. The camp this year is shaping up to be an excellent one. It's been moved from the Denver Open Media facilities where about 90 people met last year, to the DPPA events center in downtown Denver which should have room for up to 400 attendees. In addition to regular camp activities, we'll also be joined this year by Ubercamp, as well as a media and documentation sprint.
Looking over the sessions list on drupalcampcolorado.org, it looks like this years camp is shaping up to be an exciting event. We'll see you there!
University of Baltimore Usability Testing Day 2
Quote of the day: "Taxonomy is like biology and stuffed deer."
So we've wrapped up the first two days of Drupal 7 usability testing at the University of Baltimore, and we're all really excited about the information we've gotten so far. I've been surprised to learn how stressful it can be sitting in a dark room on the other side of a two way mirror watching a red eye marker bounce around the page as someone tries to solve a particularly challenging task. There are definitely times you just want to cheer them on when they finally get through it.
As with most testing, we've uncovered some positive things:
... and a few negative things:

It's actually quite interesting how much we've uncovered. In fact, after seeing our list of participants, I was a little bit worried we'd have trouble uncovering many issues at all. An interesting aspect of the people we're testing is that as opposed to other test results I've seen, we're testing people who are experienced. In fact, they're really experienced. Of the 4 people we've tested so far, they have a combined 32 years of web development experience. Furthermore 3 of them have previous exposure to Drupal, and on average they've previously had exposure to 4 CMS's. These are not people who's nephew built a site for them which they occasionally post content to, these are people who should be able to use Drupal.
I should wrap this post up, since we'll have 4 test participants coming in tomorrow. But if you'd like to follow our progress by day, tune in to the twitter tag #drupalusability or check out a few of the usability team members who have been sending regular updates about the sprint:
- Addisun 'add1sun' Berry
- Brad 'beeradb' Bowman
- Nathaniel 'catch' Catchpole
- Matthew 'ultimateboy' Tucker
Also, if you're in the Baltimore area and want to hang out with the Usability Team on Monday night, we're getting together at the UB Lab between 4:00-6:00PM on Monday. Look for details here in the groups.drupal.org Maryland group.
Usability Testing at the University of Baltimore next month!
In late February I and a handful of other Drupal contributors will be headed towards Drupalcon a week early to do formal Usability Testing of Drupal at the University of Baltimore to help continue the research done by Becca Scollan and other graduate students there last year. We'll be doing 3 days of testing, followed by 3 more compiling the data and turning them into actionable items in the Drupal.org issue queues. Finally, we'll be presenting an overview of the issues we found at Drupalcon.
If you'd like to get involved there's a lot to get done between now and then. Specifically, the UX Team is working towards finding solutions to the following issues:
- Text links for 'Select all/None/Invert'
- Allow more users to see the node admin pages
- Help System - Patch #1
- Vertical Tabs
- Text format widgets
- And anything else tagged as a usability issue on Drupal.org.
Drupal 7 core maintainer Angela 'webchick' Byron has been nice enough to promise a focus of her review/commit time on usability issues this next month as we prepare for testing, so now's the time to focus on these issues.
We'll also be using this month to create and refine our test plan. A current working version of that document has been posted to the Usability group on https://groups.drupal.org. It's already starting to solicit some feedback, but we'd love to get as many eyes on it as possible.
And finally, if contributing isn't your forte, but you'd still like to know how you can help out, we're trying to raise sponsorship money to help fund the testing. Everyone attending the testing is volunteering their time, but we will have travel expenses as well as the need to put together a budget to pay test subjects for their time. If you're willing to help out financially you can do so using the ChipIn widget to the right.
