Skip to main content

Google BigQuery Source

google-bigquery-icon

Google Cloud’s BigQuery is a fully managed enterprise data warehouse that helps you to manage and analyze your data, which also provides built-in features such as ML, geospatial analysis, and business intelligence. The Google BigQuery integration gets data from a Google BigQuery table via a provided query.

Data Source

The Google BigQuery Integration fetches results of a query using BigQuery API.

Setup and Configuration

Follow the below steps to get the Service Account's Credential JSON file to run BigQuery jobs:

  1. Open IAM & Admin under Google Cloud Console.
  2. Select the Service Account tab.
    Google_IAM_and_Admin
  3. From the project dropdown button, select the project where you will run the BigQuery jobs.
    Google_Project_Name
  4. Click on Create a Service Account and follow the instructions in Create service accounts google cloud docs.
    Google_Create_Service_Account
  5. Click on the email address provisioned during the creation and then click the KEYS tab.
    Google_Service_Account_Keys
  6. Click ADD KEY and choose Create new key.
    Google_Create_Service_Account
  7. Select key type as JSON.
    Google_Create_Service_Account
  8. Click Create. A JSON key file is downloaded to your computer.

States

Google BigQuery integration Source is a fully managed enterprise data warehouse that helps you to manage and analyze your data. When you create an Google BigQuery Source, it goes through the following stages:

  1. Pending. Once the Source is submitted, it is validated, stored, and placed in a Pending state.
  2. Started. A collection task is created on the Hosted Collector.
  3. Initialized. The task configuration is complete in Sumo Logic.
  4. Authenticated. The Source successfully authenticated with Google BigQuery.
  5. Collecting. The Source is actively collecting data from Google BigQuery.

If the Source has any issues during any one of these states, it is placed in an Error state.

When you delete the Source, it is placed in a Stopping state. When it has successfully stopped, it is deleted from your Hosted Collector. On the Collection page, the Health and Status for Sources is displayed. You can click the text in the Health column, such as Error, to open the issue in Health Events to investigate.

Create Google BigQuery Source

When you create an Google BigQuery Source, you add it to a Hosted Collector. Before creating the Source, identify the Hosted Collector you want to use or create a new Hosted Collector. For instructions, see Configure a Hosted Collector.

note

Before setting up the integration, test out the query with the checkpointing logic and a specific checkpoint value in the Google BigQuery console.

To configure an Google BigQuery Source:

  1. In Sumo Logic, select Manage Data > Collection > Collection
  2. On the Collection page, click Add Source next to a Hosted Collector.
  3. Search for and select Google BigQuery.
    google-bigquery-icon
  4. Enter a Name for the Source. The description is optional.
    google-bigquery-config-main.png
  5. (Optional) For Source Category, enter any string to tag the output collected from the Source. Category metadata is stored in a searchable field called _sourceCategory.
  6. (Optional) Fields. Click the +Add button to define the fields you want to associate. Each field needs a name (key) and value.
    • green check circle.png A green circle with a check mark is shown when the field exists in the Fields table schema.
    • orange exclamation point.png An orange triangle with an exclamation point is shown when the field doesn't exist in the Fields table schema. In this case, an option to automatically add the nonexistent fields to the Fields table schema is provided. If a field is sent to Sumo Logic that does not exist in the Fields schema it is ignored, known as dropped.
  7. Project ID. Enter the unique identifier number for your BigQuery project. You can find this from the Google Cloud Console.
  8. Checkpoint Field. Enter the name of the field in the query result to be used for checkpointing. This field has to be increasing and of type number or timestamp.
  9. Checkpoint Start. Enter the first value for the checkpoint that the integration will plug into the query.
  10. (Optional) Time Field. Enter the name of the field in the query result to be parsed as timestamp. If not provided, the current time will be used.
  11. Query. Enter the query that you need to run. You must include the phrase %CHECKPOINT% and sort the checkpoint field.
  12. (Optional) Query Interval. Enter the time interval to run the query in the format: Xm (for X minutes) or Xh (for X hours).
  13. Google BigQuery Credential. Upload the Credential JSON file downloaded from Google Cloud IAM & Admin.
  14. (Optional) Processing Rules for Logs. Configure any desired filters, such as allowlist, denylist, hash, or mask, as described in Create a Processing Rule.
  15. When you are finished configuring the Source, click Save.

Sample values for Query, Checkpoint, and Checkpoint Start fields

Each query must contain a phrase %CHECKPOINT%. Integration will extract and save the current checkpoint and use it in place of this phrase. The value of Checkpoint Start must be the same type as the Checkpoint Field.

note

Quote the phrase as "%CHECKPOINT%" if the Checkpoint Field is a timestamp string.

Following are some examples that demonstrate what values to use for the Query, Checkpoint, Time Field, and Checkpoint Start fields.

Example 1: Checkpoint Field is timestamp.

You can see double quotes for the timestamp as it is a string.

Select * from MyProject.MyDataSet.MyTable where timestamp > "%CHECKPOINT%"
FieldValue
Checkpoint Fieldtimestamp
Checkpoint Start2022-02-02 11:00:00.000+0700
Time Fieldtimestamp

Specific example on a public dataset:

SELECT base_url,source_url,collection_category,collection_number,timestamp(sensing_time) as sensing_time FROM bigquery-public-data.cloud_storage_geo_index.landsat_index where sensing_time > '%CHECKPOINT%' order by sensing_time asc LIMIT 100
FieldValue
Checkpoint Fieldsensing_time
Checkpoint Start2022-02-02 11:00:00.000+0700
Time Fieldsensing_time

Example 2: Checkpoint Field is a numeric field.

SELECT trip_id,subscriber_type,start_time,duration_minutes FROM bigquery-public-data.austin_bikeshare.bikeshare_trips where trip_id > %CHECKPOINT% order by start_time asc LIMIT 100
FieldValue
Checkpoint Fieldtrip_id
Checkpoint Start0
Time Fieldstart_time

Example 3: Query Gmail Logs

In the example below, you'll need to replace MyProject and MyDataSet with values matching your environment.

SELECT gmail.message_info,gmail.event_info,gmail.event_info.timestamp_usec AS TIMESTAMP FROM `MyProject.MyDataSet.activity` WHERE gmail.event_info.timestamp_usec > %CHECKPOINT% order by TIMESTAMP LIMIT 30000
FieldValue
Checkpoint FieldTIMESTAMP
Checkpoint Start1683053865563258
Time FieldTIMESTAMP

Note that the value of Checkpoint Start above is an epoch MICRO seconds timestamp (16 digits) for May 2, 2023 06:57:45.563258 PM GMT and the query also sorts by the checkpoint field (TIMESTAMP).

When setting up this source for Gmail logs for the first time and collecting historical Gmail logs, it is important to set the Checkpoint Start in epoch microseconds (16 digits), and sort the checkpoint field explicitly in your query. Also note that it might take a long time for the source (and many BigQuery queries to execute) to backfill if the starting point is set far in the past - depending on your Gmail logs volume.

Error Types

When Sumo Logic detects an issue, it is tracked by Health Events. The following table shows the three possible error types, the reason for the error, if the source attempts to retry, and the name of the event log in the Health Event Index.

TypeReasonRetriesRetry BehaviorHealth Event Name
ThirdPartyConfigNormally due to an invalid configuration. You'll need to review your Source configuration and make an update.No retries are attempted until the Source is updated.Not applicableThirdPartyConfigError
ThirdPartyGenericNormally due to an error communicating with the third-party service APIs.YesThe Source will retry indefinitely.ThirdPartyGenericError
FirstPartyGenericNormally due to an error communicating with the internal Sumo Logic APIs.YesThe Source will retry indefinitely.FirstPartyGenericError

Restarting your Source

If your Source encounters ThirdPartyConfig errors, you can restart it from either the Sumo Logic UI or Sumo Logic API.

UI

To restart your source in the Sumo Logic platform, follow the steps below:

  1. Open the Collection page, and go to Manage Data > Collection > Collection.
  2. Select the source and click the information icon on the right side of the row.
  3. The API usage information popup is displayed. Click the Restart Source button on the bottom left.
    restart-source-button
  4. Click Confirm to send the restart request.
    restart-source-confirm
  5. The bottom left of the platform will provide a notification informing you the request was successful.
    restart-source-initiated

API

To restart your source using the Sumo Management API, follow the instructions below:

  • Method: POST
  • Example endpoint:
    https://api.sumologic.com/api/v1/collectors/{collector_id}/sources/{source_id}/action/restart

Sumo Logic endpoints like api.sumologic.com are different in deployments outside us1. For example, an API endpoint in Europe would begin api.eu.sumologic.com. A service endpoint in us2 (Western U.S.) would begin service.us2.sumologic.com. For more information, see Sumo Logic Endpoints.

JSON Configuration

Sources can be configured using UTF-8 encoded JSON files with the Collector Management API. See how to use JSON to configure Sources for details. 

ParameterTypeRequiredDescriptionAccess
configJSON ObjectYesContains the configuration-parameters of the Source.na
schemaRefJSON ObjectYesUse {"type":"Google BigQuery"} for Google BigQuery Source.not modifiable
sourceTypeStringYesUse Universal for Google BigQuery.not modifiable

Config Parameters

ParameterTypeRequiredDescriptionAccess
nameStringYesType the desired name of the Source and it must be unique per Collector. This value is assigned to the metadata field _sourcemodifiable
descriptionStringNoType the description of the Source.modifiable
categoryStringNoType the category of the source. This value is assigned to the metadata field _sourceCategory.modifiable
fieldsJSON ObjectNoJSON map of key-value fields (metadata) to apply to the Collector or Source. Use the boolean field _siemForward to enable forwarding to SIEM.modifiable
projectIdStringYesThe project ID is the globally unique identifier for your project. For example, pelagic-quanta-364805.modifiable
credentialsJsonStringYesThis field contains the credential JSON of the Service Account used for accessing BigQuery service.modifiable
QueryStringYesThe query to be used in BigQuery. The special string %CHECKPOINT% will be replaced with the largest value seen in the checkpoint field.modifiable
timeFieldStringNoThe name of the column to be used to extract timestamp. If not specified, the C2C will use the current time for each row or record we collect. The TIMESTAMP data type is recommended, but any number type will be converted into a epoch milliseconds or epoch microseconds.modifiable
checkpointFieldStringYesThe column whose largest value will be used as the %CHECKPOINT% in the next search. The checkpoint field has to be of type number of timestamp.modifiable
checkpointStartStringYesThe very first value of the checkpoint to be used in the query.modifiable

JSON Example

{
"api.version":"v1",
"source":{
"schemaRef":{
"type":"Google BigQuery"
},
"config":{
"name":"MyBigQuerySource",
"checkpointField":"timestamp_usec",
"timeField":"timestamp_usec",
"checkpointStart":"0",
"query":"select message_info,event_info,event_info.timestamp_usec as timestamp_usec from `bigquery-dev-382704.BigQueryTest.GmailTest` where event_info.timestamp_usec > %CHECKPOINT% LIMIT 2",
"projectId":"********",
"fields":{
"_siemForward":false
},
"pollingInterval":"2m",
"credentialsJson":"********"
},
"state":{
"state":"Collecting"
},
"sourceType":"Universal"
}
}
Legal
Privacy Statement
Terms of Use

Copyright © 2023 by Sumo Logic, Inc.