MySQL User Defined Function Example: Data Cleanup
(Updated June 9, 2016)
One of my current projects involves migrating a large amount of data away from an old custom Borland Paradox application into a new CiviCRM system. As with too many quickly-constructed apps, this old Paradox database wasn’t especially well designed and, among other the other challenges in salvaging the data, there were no restrictions on how dates could be entered. This means that, in multiple fields within each of the 20,000+ database records, I might see any of the following:
03-14-96 5/20/1975 10/02/83 09051975 07.19.1995 19830211121500
While CiviCRM does have a utility for importing data from CSV and other SQL tables, it was having quite a time with this collection and many of the dates were being mishandled. That’s if the records weren’t rejected entirely for other reasons. Data migration doesn’t often happen with just a few settings adjustments and a click of the Import button.
One of the problems you can see above is that many of the dates use two digits for the year. MySQL handles two-digit years by converting anything from 00-69 to the 2000’s and anything else to the 1900’s. The database has a lot of birth dates for older people and my client wouldn’t be happy to read that many of their clients had yet to be born. The Y2K bug strikes again!
Then I remembered that I was a programmer and decided to write my own function to handle things …
Every function starts of with a series of steps to be turned into code. In this case, it started out like this –
- Strip all leading zeros out of the date. It’s easier to standardize by subtracting information than adding.
- Change any separator other than a slash (“/”) to forward slashes. I discovered midway through coding that a variety of separators were being used, sometimes none at all.
- Determine, if possible, if the year uses four digits or two. If it’s four, leave it alone. If two and greater than the current year (i.e. 16), convert it to the last century.
This is not perfect, partly because of the bad condition of the data and partly because there is the remote chance of a 101 year old person in that collection who will suddenly be led to the Fountain of Youth by the new database (07-01-15 becomes 07/01/2015) but those aren’t disasters and can be cleaned up through later inspections. This is a start.
User-defined functions (UDFs) in MySQL can accept one or more input variables and then return values. They can also be used within SQL queries in order to run the code on entire sets of data. Once I was finished pounding my fingers and occasionally my head on the keyboard, this was the function I ended up with.
CREATE FUNCTION `standard_date_convert`(inputDate VARCHAR(255)) RETURNS date BEGIN DECLARE output date DEFAULT null; DECLARE result date DEFAULT null; DECLARE r_input varchar(255) DEFAULT ''; DECLARE y_length int default 0; DECLARE y_value INT(11) default 0; IF length(inputDate) > 0 THEN -- Trim leading spaces SET inputDate = TRIM(inputDate); -- Trim starting 0 if there is one. IF LEFT(inputDate, 1) = '0' THEN SET inputDate = RIGHT(inputDate, LENGTH(inputDate) - 1); END IF; -- Replace dashes and periods SET inputDate = REPLACE(inputDate, '-', '/'); SET inputDate = REPLACE(inputDate, '.', '/'); -- Isolate and process year value -- First, get year string by reversing string and finding first (last) separator. SET r_input = REVERSE(inputDate); SET y_length = LOCATE('/', r_input) - 1; -- Get the year value from the last two digits ... SET y_value = CONVERT(REVERSE(LEFT(r_input, 2)), UNSIGNED INT); -- If the year value is greater than or equal to the current year, assume last century, else this century. IF y_length > 0 AND y_length < 4 THEN IF y_value < 10 THEN SET inputDate = CONCAT(LEFT(inputDate, LENGTH(inputDate) - (y_length)), '200', y_value); ELSEIF y_value >= DATE_FORMAT(CURDATE(), '%y') THEN SET inputDate = CONCAT(LEFT(inputDate, LENGTH(inputDate) - (y_length)), '19', y_value); ELSE SET inputDate = CONCAT(LEFT(inputDate, LENGTH(inputDate) - (y_length)), '20', y_value); END IF; END IF; END IF; -- Remove extra zeroes SET inputDate = REPLACE(inputDate, '/0', '/'); SET result = STR_TO_DATE(inputDate, '%c/%e/%Y'); IF result IS NOT NULL THEN SET output = result; ELSE SET OUTPUT = '00/00/0000'; END IF; RETURN output; END
The TRIM() function removes extra spaces from both the front and the end of the string. I like this since some languages make you trim from the left and right separately.
SET inputDate = RIGHT(inputDate, LENGTH(inputDate) - 1);
Much of the work of the function is string manipulation and concatenation. The input string is repeatedly overwritten with new versions of itself as the function molds it into the correct format. Here, the leading zero at the beginning of a date is stripped off by leaving it behind while the rest of the string value is transferred into a new string with the same name. The RIGHT() function calculates the length of the string with the nested LENGTH() function, subtracts 1 and then takes that many characters from the right side of the string.
SET r_input = REVERSE(inputDate); SET y_length = LOCATE('/', r_input) - 1;
The REVERSE() function provides a shortcut for getting the number of characters after the last forward slash in the string which gives us the length of the year value.
'6/25/85' becomes '58/52/6' LOCATE('/', r_input) = 3 Subtract 1 for the slash y_length = 2 (two-digit year value)
One important consideration when doing string manipulation in any language is to know whether the character places are numbered from 0 or 1. In some languages, they are in fact zero-based. This makes a difference with functions like the LOCATE() function above. For example, if the value being tested above is “58/” (two digits for the year 1985, reversed). If SQL was zero-based, the slash separator would actually be at character position 2 since the first position would be zero. String character places in MySQL are one-based, however so the LOCATE() function returns a 3 and we subtract 1 from it to get a length of 2 for the year value.
SET y_value = CONVERT(REVERSE(LEFT(r_input, 2)), UNSIGNED INT);
Nested functions can get pretty hairy to look at but they allow you to do a lot on one line. Just be sure to close each set of parentheses. Regardless of the length of the year value, we need the last two digits to decide later what century it falls in so this line takes the two left-hand digits from the reversed date string, reverses the two digits again to get the two-digit year and then converts it from a string to to an unsigned integer so it can be evaluated as greater or less than a given year later on.
The final IF … THEN only acts if the year value is between 1 and 3 digits. If the year is already four digits, there’s no need to do anything and the function gets to skip over a lot of steps. If the LENGTH function returned 0 for the year, that means there are no separators in the string and there’s nothing to work with. The function then uses concatenation to add either 19 or 20 to the date string in the right place based on the year value.
The STR_TO_DATE() function converts the final product to a date that the function then returns to whatever other SQL called it. If the string cannot be converted to a date, this function returns NULL and I have the function return the date 0000-00-00 which is an obvious error that can be checked.
The function runs surprisingly fast for all the work it does. A query like the one shown on the left fetched 1000 records in 0.239 seconds. This makes it simple to add the function to the INSERT and UPDATE queries I’m using to move the import data into the new tables. Below is an example which updates the birth dates for all the spouses of contacts within the database.
UPDATE civicrm_value_spouse_2 S INNER JOIN ContactImport CI ON CI.CRM_ID = S.entity_id SET S.date_of_birth_9 = standard_date_convert(RI.SPOUSE_DOB);