Script for getting Google Analytics data into your database

Posted on March 24, 2009 Categories: Coding

post author

Written by: Charles

Charles has spent the past few years as the big cheese at thrudigital. On any normal day you will catch him with a milky cup of tea (no bubbles on top thank you very much) and at least 30 browser tabs open.

Here at thrudigital we recently struck an interesting problem when a client got in touch wanting us to create them a custom analytics solution that combined usage statistics from the forum section of their website with the whole-of-site statistics they already got from Google Analytics. They wanted to be able to see, for example, what proportion of pageviews are generated from the forum, as compared with the whole site.

We thought, ‘Too easy, we will just hook into the Google Analytics API and bring down the stats from the client’s account…”. Whoops, a quick search derailed that plan as we realised that Google Analytics can not be currently accessed via an API! Guess they just haven’t got around to that feature yet…

What Google Analytics does have is an option to to schedule emails containing data in a number of formats to an email address of your choice. The formats that are available are CSV, TSV, XML and PDF.

The CSV and TSV formats are limited to providing information about one particular metric, for example, ‘Whole-of-Site Pageviews’. The XML format is much more useful in that it provides the statistics for all the top-level metrics in a single document, which allows you to pick and choose what you want to extract.

We figured that what the client needed was a script that could:

  • check a given mailbox for a new email from Google Analytics
  • pull the XML attachment out of the email
  • parse the XML and insert the data into a table in the database

This script could then be scheduled to run at any specified interval (via cron) in order to keep the database copy of the Google Analytics data up-to-date.

We wrote the script in Ruby, but it is simple enough that it could be easily ported to other languages. We noticed that there is even an API for Java for parsing the Google Analytics XML format, if you wanted to get into more complex stuff.

The script relies on a couple of assumptions, if you are planning on ‘dropping it in’ and using it:

  • You are using a MySQL database
  • You have the mysql and tmail Ruby gems installed (you can install them by typing gem install mysql tmail)
  • You are sending your Google Analytics emails to a POP3 mailbox

Once you have the script you can simply include instantiate the ’sucker’ in a simple Ruby script like the following:

require 'ga_tools'

sucker = GoogleAnalyticsTools::Sucker.new({
  :mail_host => 'somewhere.com',
  :mail_user => 'analytics@somewhere.com',
  :mail_password => 'password',
  :mail_subject_string => '** Somewhere.com Analytics **',
  :db_host => 'somewhere.com',
  :db_user => 'analytics',
  :db_password => 'password',
  :db_schema => 'somewhere_db',
  :db_table => 'google_analytics'
})

sucker.get_latest_data
sucker.update_database

Here is an explanation of the parameters that you need to feed into this thing:

  • mail_host, mail_user and mail_password: The details for accessing your POP3 mailbox.
  • mail_subject_string: A special string that the script looks out for in the subject line, that lets it know which emails to get the XML data from. You would have specified a subject when you set up your Google Analytics scheduled email.
  • db_host, db_user, db_password, db_schema: The details of the database that you wish to load the Google Analytics data into.
  • db_table: The name of the table that will hold your Google Analytics data. If it does not exist when you run the script, it will be created automatically.

The script itself follows. Any feedback is always appreciated.

ga_tools.rb
require 'net/pop'
require 'rexml/document'
require 'rexml/xpath'
require 'date'

# This module requires the tmail and mysql gems, install them by typing:
# gem install mysql tmail
require 'tmail'
require 'mysql'

module GoogleAnalyticsTools

  # Provides an interface through which to:
  # * communicate with a mailbox that contains emailed Google Analytics data; and
  # * insert the data into a MySQL database table
  class Sucker
    public

      # The attribute data_file_contents is read-only outside this class
      attr_reader :data_file, :data

      # Returns a new Sucker with the specified set of configuration parameters
      # * mail_host
      # * mail_port
      # * mail_user
      # * mail_password
      # * mail_subject_string
      # * db_host
      # * db_user
      # * db_password
      # * db_schema
      # * db_table
      def initialize(parameters)

        # Set any configuration defaults
        @configuration = { :mail_port => 110 }

        # Update configuration
        @configuration.update(parameters)
      end

      # Checks the specified mailbox for the latest matching email and extracts
      # the XML from its attachment, returning a hash of the data.
      def get_latest_data
        output_status "Connecting to #{@configuration[:mail_host]}..."

        # Connect to mail server
        Net::POP3.start(@configuration[:mail_host], @configuration[:mail_port], @configuration[:mail_user], @configuration[:mail_password]) do |pop|

          # Check whether mailbox is empty or not
          if pop.mails.empty?
            output_status "No mail found in the mailbox of #{@configuration[:mail_user]}."
            return false
          end
          output_status "Total #{pop.mails.size} messages in mailbox of #{@configuration[:mail_user]}."

          # Get last email that has a subject containing the specified subject
          # string
          message = get_last_mail_with_subject(pop.mails, @configuration[:mail_subject_string])

          # Abort if there was no matching message
          if !message
            output_status 'No emails found from Google Analytics, aborting.'
            return false
          end

          # Get the first XML attachment from the message
          xml = get_attachment(message, 'text/xml', 'application/xml')

          # If no XML was found, abort and print a message
          if xml

            # Print status
            output_status "Found XML in last Google Analytics email, posted at #{message.date}."

            # Parse XML data file and store contents
            @data_file = xml
            parse_data_file

            # Return hash of data from file
            return @data
          else
            output_status "No XML found in email, aborting."
            return false
          end
        end
      end

      # Updates the specified database table with the data currently stored in
      # the data hash.
      def update_database
        output_status "Saving data to table #{@configuration[:db_table]} in database #{@configuration[:db_schema]}@#{@configuration[:db_host]}..."

        # Connect to database
        db = Mysql.new(@configuration[:db_host], @configuration[:db_user], @configuration[:db_password], @configuration[:db_name])

        # Select database
        db.query("USE #{@configuration[:db_schema]};")

        # If the target table does not exist yet, create it
        create_table(db)

        # If it does, update the record to reflect the latest value
        # If not, insert a new record
        @data.each do |date, measures|

          # For each date key in the data hash, check if data already exists for
          # this date in the database
          result = db.query("SELECT *
            FROM #{@configuration[:db_table]}
            WHERE date = '#{date}';")
          if result and result.num_rows > 0

            # If a record does already exist, update the record to reflect the
            # latest value
            statement = db.prepare("UPDATE #{@configuration[:db_table]}
              SET page_views = '#{measures[:pageviews]}',
                visits = '#{measures[:visits]}',
                time_on_site = '#{measures[:tos]}'
              WHERE date = '#{date}';")
            statement.execute
            statement.close
          else

            # If no record exists yet, insert a new record
            statement = db.prepare("INSERT INTO #{@configuration[:db_table]} (date, page_views, visits, time_on_site)
              VALUES ('#{date}', '#{measures[:pageviews]}', '#{measures[:visits]}', '#{measures[:tos]}');")
            statement.execute
            statement.close
          end

          # Free resources
          result.free
        end

        # Free resources
        db.close
      end

      # Convenience method for running a set of statements on the database.
      def run_script(name, sql)
        output_status "Running script '#{name}' on database #{@configuration[:db_schema]}@#{@configuration[:db_host]}..."

        # Connect to database
        db = Mysql.new(@configuration[:db_host], @configuration[:db_user], @configuration[:db_password], @configuration[:db_name])

        # Run script
        db.real_query("USE #{@configuration[:db_schema]};")
        db.real_query('START TRANSACTION;');
        sql.split(';').each do |statement|
          db.real_query(statement) if !statement.strip.empty?
        end
        db.real_query('COMMIT;');

        # Free resources
        db.close
      end

    protected

      attr :configuration
      attr_writer :data_file, :data

      # Prints a status message to standard output
      def output_status(message)
        puts message
        $stdout.flush
      end

      # Finds the last mail in the supplied array with the specified string in
      # the subject
      def get_last_mail_with_subject(mails, string)
        message = false

        mails.each do |mail|
          peek = TMail::Mail.parse(mail.pop)
          if peek.subject.to_s.index(string)
            message = peek
          end
        end

        return message
      end

      # Gets the first attachment of the supplied mail that has one of the
      # specified content types
      def get_attachment(mail, *content_types)
        attachment = false

        if mail.multipart? then
          mail.parts.each do |part|
            if content_types.include?(part.content_type)
              attachment = part.body
              break
            end
          end
        end

        return attachment
      end

      # Parses the XML data file and populates the data hash
      def parse_data_file
        output_status 'Parsing XML data file...'

        # Parse XML
        document = REXML::Document.new(@data_file)

        # Get starting and ending dates of document
        date_from = Date.strptime(REXML::XPath.first(document, "//PrimaryDateRange").text.split(' - ').first, '%B %d, %Y')
        date_to = Date.strptime(REXML::XPath.first(document, "//PrimaryDateRange").text.split(' - ').last, '%B %d, %Y')
        output_status "Date range: #{date_from} to #{date_to}"

        # Get page views from document
        output_status 'Getting page views...'
        pageviews = get_text_elements(document, '//Sparkline[@id="PageviewsSparkline"]/PrimaryValue')

        # Get visits from document
        output_status 'Getting visits...'
        visits = get_text_elements(document, '//Sparkline[@id="VisitsSparkline"]/PrimaryValue')

        # Get time on site from document
        output_status 'Getting time on site...'
        tos = get_text_elements(document, '//Sparkline[@id="TimeOnSiteSparkline"]/PrimaryValue')

        # Iterate through each date in the date range, adding the page views,
        # visits and time on site values to a hash
        number_of_days = date_to - date_from
        @data = Hash.new
        for i in 0..number_of_days

          # Use date as key for each set of values
          date_key = (date_from + i).to_s

          # Each date has its own hash of the three values
          @data[date_key] = {
            :pageviews => pageviews[i],
            :visits => visits[i],
            :tos => tos[i]
          }
        end
      end

      # Finds all elements within the supplied document that match the specified
      # XPath expression and returns their text contents as an array
      def get_text_elements(document, xpath)
        text_elements = []

        document.elements.each(xpath) do |element|
          text_elements.push(element.text)
        end

        return text_elements
      end

      # Create table to hold Google Analytics data using the specified database
      # connection, if one does not already exist
      def create_table(connection)
        statement = connection.prepare("CREATE TABLE IF NOT EXISTS #{@configuration[:db_table]} (
            date DATE,
            page_views int(11) NOT NULL,
            visits char(8) NOT NULL,
            time_on_site int(11) NOT NULL,
            PRIMARY KEY  (date)
          );")
        statement.execute
        statement.close
      end
  end
end
Follow us:

Leave a Reply

Archive

July 2010

June 2010

May 2010

April 2010

March 2010

February 2010

January 2010

December 2009

November 2009

May 2009

April 2009

March 2009

February 2009

January 2009

December 2008

November 2008

October 2008

September 2008

July 2008

June 2008

March 2008

February 2008

January 2008

December 2007

November 2007

October 2007

About Us

A team of nerds, creatives and strategy ninjas based in central London, building websites, social networks, widgets and social media apps.

We have a portfolio that is good enough to make a male peacock blush, and some killer outside-the-box products...in a box.
Ask us a Question

Blog posts