Pulling data directly from the data source itself unlocks plenty of possibilities. For those using DoubleClick Search (DS3), it's best to pull search engine marketing (SEM) data directly from DS3 if you're hoping to do some data manipulation.
Google's documentation isn't the best; I've seen better.
I've recently managed to pull data directly from DS3 into Python, and here I'll provide a basic guide and code samples on how to execute some simple, essential calls.
This guide focuses on downloading data (i.e. reports) from DS3 using Python, and here's the documentation that I'll be referring to a lot.
Before the fun part, let's install the client library and get authenticated. Firstly, run this pip install on your terminal/command prompt to install the Google API Client Library for Python.
$ pip install --upgrade google-api-python-client
Next is the painful step of authorisation. I won't be repeating the authorisation documentation (which you should still follow), but let me just highlight one step.
Even though you can use Python 3 to run the rest of the code that pulls from DS3 (FYI I only write in Python 3 ;P ), the authorisation requires installing dsApi.py, which only works on Python 2.7.
So if you're like me and only use Python 3, look for ways to create a virtual environment with Python 2. If you're using conda, creating one is easy.
$ conda create -n [yourEnvironmentName] python=2.7
$ source activate [yourEnvironmentName]
Alternatively, you can install both Python 2 and Python 3 on your machine, but I recommend a virtual environment since this is just a one-off step.
Make sure you install the cURL utility too. Look for your version under Packages and install it accordingly.
Don't worry about the step "Make dsApi.py executable"; you can simply run the script with "python dsApi.py" under Python 2.7.
Following that, please follow these authorisation steps. Make sure you keep your client ID, client secret and refresh token somewhere accessible and safe. We'll be using them later.
After all previous steps are done, we're now ready to build our script that calls data from DS3.
The scripts here are based on the "Set Up Your Application" documentation, so all credit goes there. I'll be repurposing it to make it easier to understand.
Firstly, we'll import some libraries for authentication.
# Import essential libraries for authentication
import httplib2
from googleapiclient.discovery import build
from oauth2client import GOOGLE_TOKEN_URI
from oauth2client.client import OAuth2Credentials
Next is the actual authentication. This is where you need to refer to your client ID, client secret and refresh token again.
# Authentication
client_id = 'your client ID' # Replace with your ID
client_secret = 'your client secret' # Replace with your secret
refresh_token = 'your refresh token' # Replace with your token
creds = OAuth2Credentials(access_token=None,
                          client_id=client_id,
                          client_secret=client_secret,
                          refresh_token=refresh_token,
                          token_expiry=None,
                          token_uri=GOOGLE_TOKEN_URI,
                          user_agent=None)
http = creds.authorize(httplib2.Http())
service = build('doubleclicksearch', 'v2', http=http)
The "service" object will serve as the main object to interact with the DoubleClick Search API.
Next is an optional piece that defines a time range. The snippet here will later help to pull the report for the most recent ___ days.
# Define time range to pull
import datetime
latest_date = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()  # Yesterday
no_of_days_before = 600  # Define no. of days before today here
earliest_date = (datetime.date.today() - datetime.timedelta(days=no_of_days_before)).isoformat()
Now, we're ready to define the attributes for the report. Here are some additional documentation links that will help you define your request.
We'll be referring to the Asynchronous Requests documentation due to its added flexibility.
# Define your request parameters
request = service.reports().request(
    body={
        'reportScope': {
            'agencyId': '[your agency id]',         # Replace with your ID
            'advertiserId': '[your advertiser id]'  # Replace with your ID
        },
        'reportType': 'campaign',
        'columns': [
            {'columnName': 'date'},
            {'columnName': 'campaign'},
            {'columnName': 'impr'},
            {'columnName': 'clicks'},
            {'columnName': 'cost'},
        ],
        'timeRange': {
            'startDate': earliest_date,
            'endDate': latest_date
        },
        'filters': [
            {
                'column': {'columnName': 'campaign'},
                'operator': 'containsSubstring',
                'values': [
                    '[something to filter your campaigns]'
                ]
            }
        ],
        'downloadFormat': 'csv',
        'maxRowsPerFile': 6000000,
        'statisticsCurrency': 'agency',
        'verifySingleTimeZone': False,
        'includeRemovedEntities': False
    }
)
WARNING: The snippet above pulls data broken down by day, going back 600 days. It's HUGE. Please redefine it for your use!
The snippet above uses the "service" object to request a report based on the attributes defined.
After the API receives the request, we'll execute it and get a unique ID that we can use to pull this same report again and again (if you want the report for the same date range repeatedly).
json_data = request.execute()
report_id = json_data['id']
Now, your report is... not actually ready yet. If you inspect the json_data, the "isReportReady" attribute will most likely be "False".
That is because you just (very recently) prompted DS3 to generate the report and it's still generating.
Once your "isReportReady" is "True", we can simply request it using the "service" object again.
Here's a script that repeatedly checks whether the report is ready and, once it is, pulls it into a pandas DataFrame for your perusal.
import time
import simplejson
import pandas as pd
from googleapiclient.errors import HttpError

csv_raw = []
for _ in range(10):
    try:
        request = service.reports().get(reportId=report_id)
        json_data = request.execute()
        if json_data['isReportReady']:
            print('Woohoo! The report is ready. Please wait a moment.')
            # Pull every report fragment into csv_raw
            for i in range(len(json_data['files'])):
                request_file = service.reports().getFile(reportId=report_id, reportFragment=str(i))
                csv_raw.append(request_file.execute().decode('utf-8'))
            csv_string = '\n'.join(csv_raw)  # Join fragments if more than one was generated
            # Turn the CSV text into a 2-D array
            data = [row.split(',') for row in csv_string.splitlines() if row]
            header = data[0]
            data = data[1:]
            # Commence DataFrame build
            df = pd.DataFrame(data=data, columns=header)
            df = df.dropna(axis=0, how='any')
            break
        else:
            print('Report is not ready. Damn. Please wait 10 seconds.')
            time.sleep(10)
    except HttpError as e:
        error = simplejson.loads(e.content)['error']['errors'][0]
        # See the Response Codes documentation
        print('HTTP code %d, reason %s' % (e.resp.status, error['reason']))
        break
You're basically done at this point and can skip ahead to the next, optional section. Below, I'll explain several steps in the long code above.
Sometimes when your request is too large, DS3 will segment the report into multiple files. You can check whether yours was segmented by looking at the "files" attribute of a json_data response whose "isReportReady" attribute is "True".
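To see whether your report was segmented, just count the entries under "files". A minimal sketch, with a made-up response dict standing in for the real json_data (the "url" and "byteCount" values shown are illustrative):

```python
# Made-up example of a ready-report response; the real json_data comes
# from service.reports().get(reportId=report_id).execute()
json_data = {
    'isReportReady': True,
    'files': [
        {'url': 'https://example.invalid/reports/abc/files/0', 'byteCount': '1024'},
        {'url': 'https://example.invalid/reports/abc/files/1', 'byteCount': '2048'},
    ],
}

fragment_count = len(json_data['files'])
print('Report was split into %d fragment(s)' % fragment_count)  # → Report was split into 2 fragment(s)
```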
The code appends each downloaded fragment to the csv_raw list as text, and then joins them using
csv_string = '\n'.join(csv_raw)
where "\n" is the newline character that separates CSV rows.
DS3 is very helpful. Essentially, you can turn the csv_string object into a CSV file and you're done for the day.
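For example, a couple of lines like these will write it straight to disk (the csv_string here is a made-up stand-in for the real joined report text):

```python
# Made-up sample standing in for the real csv_string
csv_string = 'date,campaign,impr,clicks,cost\n2019-01-01,Brand,100,10,5.00'

# Write the raw CSV straight to a file
with open('ds3_report.csv', 'w') as f:
    f.write(csv_string)
```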
But my objective is to turn it into a manipulable DataFrame in Python, which will unlock plenty of possibilities in terms of analysis, data manipulation etc.
Hence, the steps in this section will break the CSV data apart into a 2-D array based on CSV syntax.
To explain the parsing: the raw report text is split into rows at each newline (dropping any blank lines between fragments), and each row is then split on commas into a list of cells. The first row becomes the header and the remaining rows become the data.
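To make the parsing concrete, here's a self-contained sketch on made-up CSV text in the same shape as the report we requested earlier, ending with a quick aggregation once the metric columns are cast to numbers:

```python
import pandas as pd

# Made-up CSV text in the same shape as the report defined earlier
csv_string = (
    'date,campaign,impr,clicks,cost\n'
    '2019-01-01,Brand,100,10,5.00\n'
    '2019-01-02,Brand,200,15,7.50\n'
    '2019-01-01,Generic,50,2,1.00'
)

# Split into rows on newlines, then split each row on commas
data = [row.split(',') for row in csv_string.splitlines() if row]
header = data[0]
rows = data[1:]

df = pd.DataFrame(data=rows, columns=header)
# Values arrive as strings, so cast the metrics before aggregating
df[['impr', 'clicks', 'cost']] = df[['impr', 'clicks', 'cost']].astype(float)
print(df.groupby('campaign')[['clicks', 'cost']].sum())
```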
Most DS3 users will have custom columns built on Floodlight activities, i.e. conversion data.
To pull a column of your conversion data, you first need to find out how DS3 defines that custom column.
# To get the list of saved columns
request_list = service.savedColumns().list(agencyId = '[your agency id]', advertiserId = '[your advertiser id]') # Replace with your ID
json_list = request_list.execute()
As custom columns are defined at the advertiser level, the request above only requires the agency and advertiser IDs.
You'll find all the custom columns in the json_list object. Scan through the data, then define the ones you need in the earlier request object (the one from before we got the report ID).
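For a quick scan, a loop like this prints each saved column's name; the json_list here is a made-up stand-in whose "items" and "savedColumnName" fields follow the shape of the savedColumns resource:

```python
# Made-up savedColumns().list() response; the real json_list is the
# result of request_list.execute() above
json_list = {
    'kind': 'doubleclicksearch#savedColumnList',
    'items': [
        {'savedColumnName': 'myConversion', 'type': 'savedColumn'},
        {'savedColumnName': 'myRevenue', 'type': 'savedColumn'},
    ],
}

# Print every custom column name defined for this advertiser
for column in json_list.get('items', []):
    print(column['savedColumnName'])
```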
request = service.reports().request(
    body={
        'reportScope': {
            'agencyId': '[your agency id]',
            'advertiserId': '[your advertiser id]'
        },
        'reportType': 'campaign',
        'columns': [
            {'columnName': 'date'},
            {'columnName': 'campaign'},
            {'columnName': 'impr'},
            {'columnName': 'clicks'},
            {'columnName': 'cost'},
            {'savedColumnName': 'myConversion', 'headerText': 'conversions'}  # Define it here
        ],
        'downloadFormat': 'csv',
        'maxRowsPerFile': 6000000,
        'statisticsCurrency': 'agency',
        'verifySingleTimeZone': False,
        'includeRemovedEntities': False
    }
)
That's all for pulling data directly from DS3 into Python as a pandas DataFrame. For more information, feel free to visit the documentation referenced throughout this post.