Data Analysis with Microsoft Access
Tackling the Data
One of the strengths of Microsoft Access is its ability to import and link to data from different sources. With Access you can also query and report on the data in ways that a spreadsheet program like Excel doesn’t readily provide. This can come in handy as part of a database application but it’s also useful for quick analysis of new information.
I found a new example of this the other day when I was looking over the website statistics for my site. This data is stored in log files which record every incoming request to the site. Every time someone pulls up a page, like the one you’re reading now, their web browser generates a number of file requests that are stored in these logs.
This adds up to thousands of requests per day, each one stored as a separate line in the log file. The log records such basic information as the time and date of the request, the file name, the IP address of the computer making the request and the type of browser being used. This information comes in very handy for analyzing the site traffic. My hosting service provides analysis software that summarizes these logs, but it’s not customizable, I don’t fully trust its analysis and it doesn’t always answer the questions I have, like the one that came up the other day:
“Why are my visitors getting so many 404 errors??”
For those who don’t know, 404 is the error code you get when you try to pull up a page that doesn’t exist. The analysis software was telling me I had hundreds of these errors, which is worrisome because they’re not very attractive to visitors and can indicate problems with the site. The statistics didn’t tell me what page requests were generating the errors, however.
I decided to look at the logs directly and see what I could find out. I figured if the analysis software could read it, I could find a way to work with it. The initial view was not encouraging.
It only looks like it’s from NASA …
After I looked at the file for a few moments and turned off the word-wrap, I saw that it was a delimited text file. This is a text file that contains the same fields on each line, with each field separated by a specific character. That character could be a comma, a space or another character, so long as it doesn’t show up within any of the fields themselves. In this case, the character is a blank space. Every time a blank space occurs, it means one field has ended and the next one is beginning.
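To see how simple the delimited idea is, here’s a quick sketch in Python (the line and field names are made-up examples in the general style of a web log, not the exact fields from my file):

```python
# A made-up, space-delimited log line: date, time, file, IP address, status.
line = "2010-08-31 14:22:05 /index.shtml 192.0.2.10 200"

# Every blank space ends one field and starts the next.
fields = line.split(" ")
date, time, page, ip, status = fields

print(page)    # /index.shtml
print(status)  # 200
```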
Okay, that’s a little clearer.
Another type of text data file is a Fixed Width file, where each field is allocated a certain number of characters on the line. The fields do not need to be separated, although if the value for a field exceeds the allocated width, it might be cut off. If a field value falls short of its specified width, the remaining space may be padded with spaces or another filler character. Either way, this type of file is read by knowing the list of fields and their character allotments.
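A fixed-width reader works the other way around; instead of looking for a delimiter, it slices each line at known character positions. Here’s a sketch with a made-up line and hypothetical widths:

```python
# Hypothetical fixed-width layout: 10 characters for the date,
# 4 for the status code, and the rest of the line for the file name.
line = "2010-08-31404 /missing.html"

date = line[0:10]             # "2010-08-31"
status = line[10:14].strip()  # "404" -- its allotted width is padded with a space
page = line[14:]              # "/missing.html"
```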
Now I knew it was just a matter of importing the data somewhere and manipulating it as needed.
This Looks Like a Job for ACCESS!
I was working with Access 2002 for this project as shown by the screenshots. The Import Wizard is also present in Access 2007 and is virtually identical to the one shown here. This is not something that has changed a lot between Access versions.
Start with the Get External Data menu (Access 2002)
In Access 2007, the Import Wizard can be accessed from the External Data portion of the control ribbon.
Access provides a handy wizard that will guide you through the import process for a data file. It will import a variety of file types including other spreadsheet and database files. For this exercise, I just wanted to import the text file I’d saved the log contents to so I selected the text file in the Open File dialog that the wizard presented.
The wizard immediately recognizes the file contents as you can see below. The first screen presents a choice between the two file types, Delimited and Fixed Width.
The wizard makes its best guesses but, in the end, depends on the user to pick the right settings.
Clicking on the Advanced button brings up the Import Specification screen which provides additional settings that will help Access read the data. The below screenshot shows the settings I chose for the file. It’s a delimited file using a blank space for a field delimiter. From the above screenshots of the file, you can see the dates are shown with the four-digit year first and then the month and day and dashes in between. The months and days use leading zeroes (i.e. 2010-08-31). The timestamps use the colon to separate the hour and minute.
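Those settings describe a standard date pattern; as a sketch of what they mean, here’s how the same format reads outside of Access (the sample timestamp is made up):

```python
from datetime import datetime

# Four-digit year first, leading zeroes, dashes between the date parts
# and a colon separating the hours and minutes.
stamp = datetime.strptime("2010-08-31 14:22", "%Y-%m-%d %H:%M")

print(stamp.year, stamp.month, stamp.day)  # 2010 8 31
```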
I could use the Text Qualifier selection to specify whether text values are enclosed in single or double quotation marks, but I don’t need it here.
Under the field information, I could even specify the name, type and other information for each field. This would be useful as part of a regular import, if the data was going to be referenced by an application or if Access had trouble with part of the data. Since this is a one-time import and Access seems to have figured out the fields okay, I’m not bothering with that.
Using the command buttons on the right, you can even save and load specification settings for reuse.
The second screen of the wizard specifies the delimiter character that’s being used, although I already did that on the Advanced Settings screen. This screen also presents the option to use the first row of the data file as a set of field names, which is the case with this file.
The wizard can store the data in either a new or existing table. Using an existing table could be useful if you had a number of files to import and wanted to combine them all in a single table, or if you were importing similar data from another database.
The next screen is another chance to manually define the fields as was earlier shown on the Advanced Settings screen. You can also mark fields to be skipped in the import.
Access specializes in relational data so, unlike Excel, it wants to add a primary key to any tables that are created. This key acts as a unique identifier for each row in the table and can be used to link the table to other tables in the database. On the next screen of the wizard, you are prompted to set the table key.
- If you let Access add a primary key, the program will add an AutoNumber field to your table which will increment by 1 for each row in the table.
- With the second option, you can choose an existing field within the table that provides a unique value. This will be used as the identifier.
- The third option skips the primary key altogether.
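The AutoNumber idea isn’t unique to Access; most databases offer something similar. Here’s the same behavior sketched with SQLite standing in for Access (the table and field names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An INTEGER PRIMARY KEY in SQLite works much like an Access AutoNumber field:
# it supplies a unique, incrementing value for each row that's inserted.
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, page TEXT)")
conn.execute("INSERT INTO log (page) VALUES ('/index.shtml')")
conn.execute("INSERT INTO log (page) VALUES ('/about.shtml')")

rows = conn.execute("SELECT id, page FROM log ORDER BY id").fetchall()
print(rows)  # [(1, '/index.shtml'), (2, '/about.shtml')]
```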
The final screen in the wizard simply asks for a table name if the data is to be imported to a new table.
Once the import is completed, the wizard notifies you of the results. As you can see from the above message, things don’t always go perfectly. There were thousands of records in the log file I imported so that’s not a surprise. The data was still imported but not quite as it exists in the file. The wizard also creates an ImportErrors table with details on what went wrong.
Out of 3783 rows imported, it seems that 153 of the rows had at least one field that was too long for the field type selected by Access. I could start over and specify a memo type for the field but I’m not concerned with that particular field right now so I won’t bother.
For my purposes, I call that a successful import. Now let’s look at the data.
Under the Microscope
Once the import wizard has put the data into a nice neat table, all that’s left is to query it at will. Here’s the one that I used to find out where all the 404 errors were coming from.
After looking at the file, I was able to figure out that the cs-uri-stem field is the name of the file that’s being requested, whether it’s a page, a graphic or any other type of file.
The cs(Referer) field is used to hold the page from which the visitor followed a link to the requested page. This could be a page on another site or a page on this site. If the visitor typed the address of the page in manually (i.e. “http://www.drewslair.com/about.shtml”), the field is blank.
The sc-status field shows the status of the request. If the file was retrieved successfully, as in most cases, it should be 200. I’m specifying “404” as the criteria because those are the only rows I want to see.
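In SQL terms, the query is just a filter on sc-status. Here’s a sketch with SQLite standing in for Access and a few made-up rows (the real table has many more fields):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny stand-in for the imported log table.
conn.execute(
    'CREATE TABLE log ("cs-uri-stem" TEXT, "cs(Referer)" TEXT, "sc-status" INTEGER)'
)
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [("/index.shtml", "", 200),
     ("/old-page.shtml", "http://example.com/links.html", 404),
     ("/favicon.ico", "", 404)],
)

# Show only the requests that came back as 404, along with the referring page.
rows = conn.execute(
    'SELECT "cs-uri-stem", "cs(Referer)" FROM log WHERE "sc-status" = 404'
).fetchall()
print(rows)
```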
The results paid off and told me about a few things that were going on with the site.
- A problem with a relative URL reference caused one of the page borders to look to the wrong directory for the cascading style sheet. Knowing that all the errors were coming from pages in one directory helped me to track this one down. Fortunately, it probably wasn’t noticeable to visitors.
- Many browsers are looking for the favicon.ico file. That’s the file that supplies that little icon you see for so many websites. It shows up in your Favorites or Bookmarks menu and on the title bar of your browser when you’re viewing the site. Drewslair.com doesn’t have one. It probably should but it’s not a priority.
- I’ve republished a few articles elsewhere on the web and these articles link back to graphics and other files from the site. Some of those files have changed names since the Knols were published. Updating the Knols fixed the problem.
- Then there are the external sites linking to files that have long since been moved or deleted or just getting the address wrong altogether. There will always be a need for a 404 page.
Having fixed the problems that I could, I’ll keep an eye on the stats for a few days and look forward to seeing the number of 404 errors go down. In addition to tracking down this issue, it’s turning out to be a quick way to check the popularity of specific pages and downloads with confidence in the results.
Filling in the Gaps
This is not the first time I’ve used Microsoft Access to analyze data in a way the original software couldn’t. It’s a great tool for both rapid development and quick analysis of many types of data. Many more people have Excel on their desktops, and it also does a great job of manipulating external data, but if you have Access at your disposal, it’s worth your time to learn as much as you can about it beyond its native data format.