Here at thrudigital we recently struck an interesting problem when a client got in touch wanting us to create them a custom analytics solution that combined usage statistics from the forum section of their website with the whole-of-site statistics they already got from Google Analytics. They wanted to be able to see, for example, what proportion of pageviews are generated from the forum, as compared with the whole site.
We thought, ‘Too easy, we will just hook into the Google Analytics API and bring down the stats from the client’s account…”. Whoops, a quick search derailed that plan as we realised that Google Analytics can not be currently accessed via an API! Guess they just haven’t got around to that feature yet…
What Google Analytics does have is an option to to schedule emails containing data in a number of formats to an email address of your choice. The formats that are available are CSV, TSV, XML and PDF.
The CSV and TSV formats are limited to providing information about one particular metric, for example, ‘Whole-of-Site Pageviews’. The XML format is much more useful in that it provides the statistics for all the top-level metrics in a single document, which allows you to pick and choose what you want to extract.
We figured that what the client needed was a script that could:
- check a given mailbox for a new email from Google Analytics
- pull the XML attachment out of the email
- parse the XML and insert the data into a table in the database
This script could then be scheduled to run at any specified interval (via cron) in order to keep the database copy of the Google Analytics data up-to-date.
We wrote the script in Ruby, but it is simple enough that it could be easily ported to other languages. We noticed that there is even an API for Java for parsing the Google Analytics XML format, if you wanted to get into more complex stuff.
The script relies on a couple of assumptions, if you are planning on ‘dropping it in’ and using it:
- You are using a MySQL database
- You have the mysql and tmail Ruby gems installed (you can install them by typing gem install mysql tmail)
- You are sending your Google Analytics emails to a POP3 mailbox
Once you have the script you can simply include instantiate the ’sucker’ in a simple Ruby script like the following:
require 'ga_tools'
sucker = GoogleAnalyticsTools::Sucker.new({
:mail_host => 'somewhere.com',
:mail_user => 'analytics@somewhere.com',
:mail_password => 'password',
:mail_subject_string => '** Somewhere.com Analytics **',
:db_host => 'somewhere.com',
:db_user => 'analytics',
:db_password => 'password',
:db_schema => 'somewhere_db',
:db_table => 'google_analytics'
})
sucker.get_latest_data
sucker.update_database
Here is an explanation of the parameters that you need to feed into this thing:
- mail_host, mail_user and mail_password: The details for accessing your POP3 mailbox.
- mail_subject_string: A special string that the script looks out for in the subject line, that lets it know which emails to get the XML data from. You would have specified a subject when you set up your Google Analytics scheduled email.
- db_host, db_user, db_password, db_schema: The details of the database that you wish to load the Google Analytics data into.
- db_table: The name of the table that will hold your Google Analytics data. If it does not exist when you run the script, it will be created automatically.
The script itself follows. Any feedback is always appreciated.
ga_tools.rb
require 'net/pop'
require 'rexml/document'
require 'rexml/xpath'
require 'date'
# This module requires the tmail and mysql gems, install them by typing:
# gem install mysql tmail
require 'tmail'
require 'mysql'
module GoogleAnalyticsTools
# Provides an interface through which to:
# * communicate with a mailbox that contains emailed Google Analytics data; and
# * insert the data into a MySQL database table
class Sucker
public
# The attribute data_file_contents is read-only outside this class
attr_reader :data_file, :data
# Returns a new Sucker with the specified set of configuration parameters
# * mail_host
# * mail_port
# * mail_user
# * mail_password
# * mail_subject_string
# * db_host
# * db_user
# * db_password
# * db_schema
# * db_table
def initialize(parameters)
# Set any configuration defaults
@configuration = { :mail_port => 110 }
# Update configuration
@configuration.update(parameters)
end
# Checks the specified mailbox for the latest matching email and extracts
# the XML from its attachment, returning a hash of the data.
def get_latest_data
output_status "Connecting to #{@configuration[:mail_host]}..."
# Connect to mail server
Net::POP3.start(@configuration[:mail_host], @configuration[:mail_port], @configuration[:mail_user], @configuration[:mail_password]) do |pop|
# Check whether mailbox is empty or not
if pop.mails.empty?
output_status "No mail found in the mailbox of #{@configuration[:mail_user]}."
return false
end
output_status "Total #{pop.mails.size} messages in mailbox of #{@configuration[:mail_user]}."
# Get last email that has a subject containing the specified subject
# string
message = get_last_mail_with_subject(pop.mails, @configuration[:mail_subject_string])
# Abort if there was no matching message
if !message
output_status 'No emails found from Google Analytics, aborting.'
return false
end
# Get the first XML attachment from the message
xml = get_attachment(message, 'text/xml', 'application/xml')
# If no XML was found, abort and print a message
if xml
# Print status
output_status "Found XML in last Google Analytics email, posted at #{message.date}."
# Parse XML data file and store contents
@data_file = xml
parse_data_file
# Return hash of data from file
return @data
else
output_status "No XML found in email, aborting."
return false
end
end
end
# Updates the specified database table with the data currently stored in
# the data hash.
def update_database
output_status "Saving data to table #{@configuration[:db_table]} in database #{@configuration[:db_schema]}@#{@configuration[:db_host]}..."
# Connect to database
db = Mysql.new(@configuration[:db_host], @configuration[:db_user], @configuration[:db_password], @configuration[:db_name])
# Select database
db.query("USE #{@configuration[:db_schema]};")
# If the target table does not exist yet, create it
create_table(db)
# If it does, update the record to reflect the latest value
# If not, insert a new record
@data.each do |date, measures|
# For each date key in the data hash, check if data already exists for
# this date in the database
result = db.query("SELECT *
FROM #{@configuration[:db_table]}
WHERE date = '#{date}';")
if result and result.num_rows > 0
# If a record does already exist, update the record to reflect the
# latest value
statement = db.prepare("UPDATE #{@configuration[:db_table]}
SET page_views = '#{measures[:pageviews]}',
visits = '#{measures[:visits]}',
time_on_site = '#{measures[:tos]}'
WHERE date = '#{date}';")
statement.execute
statement.close
else
# If no record exists yet, insert a new record
statement = db.prepare("INSERT INTO #{@configuration[:db_table]} (date, page_views, visits, time_on_site)
VALUES ('#{date}', '#{measures[:pageviews]}', '#{measures[:visits]}', '#{measures[:tos]}');")
statement.execute
statement.close
end
# Free resources
result.free
end
# Free resources
db.close
end
# Convenience method for running a set of statements on the database.
def run_script(name, sql)
output_status "Running script '#{name}' on database #{@configuration[:db_schema]}@#{@configuration[:db_host]}..."
# Connect to database
db = Mysql.new(@configuration[:db_host], @configuration[:db_user], @configuration[:db_password], @configuration[:db_name])
# Run script
db.real_query("USE #{@configuration[:db_schema]};")
db.real_query('START TRANSACTION;');
sql.split(';').each do |statement|
db.real_query(statement) if !statement.strip.empty?
end
db.real_query('COMMIT;');
# Free resources
db.close
end
protected
attr :configuration
attr_writer :data_file, :data
# Prints a status message to standard output
def output_status(message)
puts message
$stdout.flush
end
# Finds the last mail in the supplied array with the specified string in
# the subject
def get_last_mail_with_subject(mails, string)
message = false
mails.each do |mail|
peek = TMail::Mail.parse(mail.pop)
if peek.subject.to_s.index(string)
message = peek
end
end
return message
end
# Gets the first attachment of the supplied mail that has one of the
# specified content types
def get_attachment(mail, *content_types)
attachment = false
if mail.multipart? then
mail.parts.each do |part|
if content_types.include?(part.content_type)
attachment = part.body
break
end
end
end
return attachment
end
# Parses the XML data file and populates the data hash
def parse_data_file
output_status 'Parsing XML data file...'
# Parse XML
document = REXML::Document.new(@data_file)
# Get starting and ending dates of document
date_from = Date.strptime(REXML::XPath.first(document, "//PrimaryDateRange").text.split(' - ').first, '%B %d, %Y')
date_to = Date.strptime(REXML::XPath.first(document, "//PrimaryDateRange").text.split(' - ').last, '%B %d, %Y')
output_status "Date range: #{date_from} to #{date_to}"
# Get page views from document
output_status 'Getting page views...'
pageviews = get_text_elements(document, '//Sparkline[@id="PageviewsSparkline"]/PrimaryValue')
# Get visits from document
output_status 'Getting visits...'
visits = get_text_elements(document, '//Sparkline[@id="VisitsSparkline"]/PrimaryValue')
# Get time on site from document
output_status 'Getting time on site...'
tos = get_text_elements(document, '//Sparkline[@id="TimeOnSiteSparkline"]/PrimaryValue')
# Iterate through each date in the date range, adding the page views,
# visits and time on site values to a hash
number_of_days = date_to - date_from
@data = Hash.new
for i in 0..number_of_days
# Use date as key for each set of values
date_key = (date_from + i).to_s
# Each date has its own hash of the three values
@data[date_key] = {
:pageviews => pageviews[i],
:visits => visits[i],
:tos => tos[i]
}
end
end
# Finds all elements within the supplied document that match the specified
# XPath expression and returns their text contents as an array
def get_text_elements(document, xpath)
text_elements = []
document.elements.each(xpath) do |element|
text_elements.push(element.text)
end
return text_elements
end
# Create table to hold Google Analytics data using the specified database
# connection, if one does not already exist
def create_table(connection)
statement = connection.prepare("CREATE TABLE IF NOT EXISTS #{@configuration[:db_table]} (
date DATE,
page_views int(11) NOT NULL,
visits char(8) NOT NULL,
time_on_site int(11) NOT NULL,
PRIMARY KEY (date)
);")
statement.execute
statement.close
end
end
end