
Google BigQuery Source


Google Cloud’s BigQuery is a fully managed enterprise data warehouse for managing and analyzing your data, with built-in features such as machine learning, geospatial analysis, and business intelligence. The Google BigQuery integration collects data from a Google BigQuery table using a query that you provide.

Note: This source is available in the Fed deployment.

Data collected

| Polling Interval | Data |
| --- | --- |
| 5 min | BigQuery API |

Setup

Vendor configuration

Follow the steps below to get the service account's credential JSON file, which is required to run BigQuery jobs:

  1. Open IAM & Admin under Google Cloud Console.
  2. Select the Service Account tab.
  3. From the project dropdown button, select the project where you will run the BigQuery jobs.
  4. Click Create Service Account and follow the instructions in Create service accounts in the Google Cloud documentation.
  5. Click the email address of the service account you created, then click the KEYS tab.
  6. Click ADD KEY and choose Create new key.
  7. Select key type as JSON.
  8. Click Create. A JSON key file is downloaded to your computer.
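
Before uploading the key, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch, not part of the integration: the function name `check_service_account_key` is hypothetical, and the check covers only four fields (`type`, `project_id`, `private_key`, `client_email`) that Google service account JSON keys contain.

```python
import json
from typing import List

# A subset of the fields found in a Google service account JSON key.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> List[str]:
    """Return a list of problems found in the key file contents (empty if none)."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level value is not a JSON object"]
    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    # A key created on the KEYS tab has type "service_account".
    if key.get("type") not in (None, "", "service_account"):
        problems.append(f"unexpected type: {key['type']!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical, truncated key contents used only for illustration.
    sample = json.dumps({
        "type": "service_account",
        "project_id": "my-project",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
        "client_email": "collector@my-project.iam.gserviceaccount.com",
    })
    print(check_service_account_key(sample))
```

In practice you would pass the contents of the downloaded file, for example `check_service_account_key(open(path).read())`, where `path` is wherever your browser saved the key.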

Source configuration

When you create a Google BigQuery Source, you add it to a Hosted Collector. Before creating the Source, identify the Hosted Collector you want to use or create a new Hosted Collector. For instructions, see Configure a Hosted Collector.

Note: Before setting up the integration, test the query, including the checkpointing logic and a specific checkpoint value, in the Google BigQuery console.

To configure a Google BigQuery Source:

  1. In Sumo Logic, select Manage Data > Collection > Collection.
  2. On the Collection page, click Add Source next to a Hosted Collector.
  3. Search for and select Google BigQuery.
  4. Enter a Name for the Source. The description is optional.
  5. (Optional) For Source Category, enter any string to tag the output collected from the Source. Category metadata is stored in a searchable field called _sourceCategory.
  6. (Optional) Fields. Click the +Add button to define the fields you want to associate. Each field needs a name (key) and value.
    • A green circle with a check mark is shown when the field exists in the Fields table schema.
    • An orange triangle with an exclamation point is shown when the field doesn't exist in the Fields table schema. In this case, you are given the option to automatically add the nonexistent fields to the Fields table schema. If a field that does not exist in the Fields schema is sent to Sumo Logic, it is ignored (dropped).
  7. Project ID. Enter the unique identifier of your BigQuery project. You can find it in the Google Cloud Console.
  8. Checkpoint Field. Enter the name of the field in the query result to use for checkpointing. The values in this field must be increasing and of type number or timestamp.
  9. Checkpoint Start. Enter the initial checkpoint value that the integration substitutes into the query on the first run.
  10. (Optional) Time Field. Enter the name of the field in the query result to be parsed as the timestamp. If not provided, the current time is used.
  11. Query. Enter the query that you need to run. You must include the phrase %CHECKPOINT% and sort by the checkpoint field.
  12. (Optional) Query Interval. Enter the time interval to run the query in the format: Xm (for X minutes) or Xh (for X hours).
  13. Google BigQuery Credential. Upload the Credential JSON file downloaded from Google Cloud IAM & Admin.
  14. (Optional) Processing Rules for Logs. Configure any desired filters, such as allowlist, denylist, hash, or mask, as described in Create a Processing Rule.
  15. When you are finished configuring the Source, click Save.
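
The constraints in the steps above can be checked locally before you save the Source. The sketch below is an assumption-laden pre-check, not the validation Sumo Logic itself performs: the function names `query_interval_seconds` and `check_query` are hypothetical, and the ORDER BY test is a simple text match rather than a SQL parser.

```python
import re
from typing import List

# Matches the two documented Query Interval forms: "Xm" (minutes) and "Xh" (hours).
_INTERVAL = re.compile(r"(\d+)([mh])")


def query_interval_seconds(text: str) -> int:
    """Convert a Query Interval such as '5m' or '2h' into seconds."""
    match = _INTERVAL.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"expected Xm or Xh, got {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * (60 if unit == "m" else 3600)


def check_query(query: str, checkpoint_field: str) -> List[str]:
    """Return a list of problems with the Query field (empty if none)."""
    problems = []
    if "%CHECKPOINT%" not in query:
        problems.append("query does not contain the phrase %CHECKPOINT%")
    # Text-level check only: looks for "order by <checkpoint field>".
    order_by = re.compile(r"order\s+by\s+" + re.escape(checkpoint_field) + r"\b", re.IGNORECASE)
    if order_by.search(query) is None:
        problems.append(f"query does not sort by the checkpoint field {checkpoint_field!r}")
    return problems


if __name__ == "__main__":
    print(query_interval_seconds("5m"))
    print(check_query(
        "SELECT trip_id FROM bigquery-public-data.austin_bikeshare.bikeshare_trips "
        "WHERE trip_id > %CHECKPOINT% ORDER BY trip_id ASC LIMIT 100",
        "trip_id",
    ))
```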

Sample values for Query, Checkpoint, and Checkpoint Start fields

Each query must contain the phrase %CHECKPOINT%. The integration extracts and saves the current checkpoint and substitutes it for this phrase. The value of Checkpoint Start must be of the same type as the Checkpoint Field.

Note: Quote the phrase as "%CHECKPOINT%" if the Checkpoint Field is a timestamp string.
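
To make the substitution concrete, here is a minimal sketch of how a checkpoint value could be plugged into a query and then advanced after each poll. It only models the behavior described above (replace the phrase, then keep the largest value seen in the checkpoint field); the function names `render_query` and `next_checkpoint` are hypothetical, and rows are plain dictionaries standing in for BigQuery results.

```python
from typing import Any, Dict, List

PLACEHOLDER = "%CHECKPOINT%"


def render_query(template: str, checkpoint: Any) -> str:
    """Substitute the saved checkpoint for the %CHECKPOINT% phrase."""
    return template.replace(PLACEHOLDER, str(checkpoint))


def next_checkpoint(rows: List[Dict[str, Any]], field: str, current: Any) -> Any:
    """Return the largest checkpoint-field value seen, or the current value if no rows came back."""
    values = [row[field] for row in rows]
    return max(values) if values else current


if __name__ == "__main__":
    # Numeric checkpoint: the phrase is left unquoted in the query.
    numeric = "SELECT trip_id FROM MyProject.MyDataSet.MyTable WHERE trip_id > %CHECKPOINT% ORDER BY trip_id"
    print(render_query(numeric, 0))

    # Timestamp string checkpoint: the phrase is quoted in the query.
    quoted = 'Select * from MyProject.MyDataSet.MyTable where timestamp > "%CHECKPOINT%"'
    print(render_query(quoted, "2022-02-02 11:00:00.000+0700"))

    # After a poll, the largest value becomes the checkpoint for the next run.
    print(next_checkpoint([{"trip_id": 3}, {"trip_id": 9}], "trip_id", 0))
```

Because the substitution is purely textual, the quoting has to live in the query itself, which is why the note above asks for "%CHECKPOINT%" when the checkpoint is a timestamp string.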

The following examples show what values to use for the Query, Checkpoint Field, Time Field, and Checkpoint Start fields.

Example 1: Checkpoint Field is a timestamp.

The %CHECKPOINT% phrase is wrapped in double quotes because the timestamp is passed as a string.

Select * from MyProject.MyDataSet.MyTable where timestamp > "%CHECKPOINT%" order by timestamp asc

| Field | Value |
| --- | --- |
| Checkpoint Field | timestamp |
| Checkpoint Start | 2022-02-02 11:00:00.000+0700 |
| Time Field | timestamp |

Specific example on a public dataset:

SELECT base_url,source_url,collection_category,collection_number,timestamp(sensing_time) as sensing_time FROM bigquery-public-data.cloud_storage_geo_index.landsat_index where sensing_time > '%CHECKPOINT%' order by sensing_time asc LIMIT 100

| Field | Value |
| --- | --- |
| Checkpoint Field | sensing_time |
| Checkpoint Start | 2022-02-02 11:00:00.000+0700 |
| Time Field | sensing_time |

Example 2: Checkpoint Field is a numeric field.

SELECT trip_id,subscriber_type,start_time,duration_minutes FROM bigquery-public-data.austin_bikeshare.bikeshare_trips where trip_id > %CHECKPOINT% order by trip_id asc LIMIT 100

| Field | Value |
| --- | --- |
| Checkpoint Field | trip_id |
| Checkpoint Start | 0 |
| Time Field | start_time |

Example 3: Query Gmail Logs

In the example below, you'll need to replace MyProject and MyDataSet with values matching your environment.

SELECT gmail.message_info,gmail.event_info,gmail.event_info.timestamp_usec AS TIMESTAMP FROM `MyProject.MyDataSet.activity` WHERE gmail.event_info.timestamp_usec > %CHECKPOINT% order by TIMESTAMP LIMIT 30000

| Field | Value |
| --- | --- |
| Checkpoint Field | TIMESTAMP |
| Checkpoint Start | 1683053865563258 |
| Time Field | TIMESTAMP |

Note that the value of Checkpoint Start above is an epoch microseconds timestamp (16 digits) for May 2, 2023 06:57:45.563258 PM GMT, and that the query sorts by the checkpoint field (TIMESTAMP).

When you set up this source for Gmail logs for the first time and want to collect historical Gmail logs, set Checkpoint Start in epoch microseconds (16 digits) and sort by the checkpoint field explicitly in your query. If the starting point is far in the past, the backfill can take a long time and require many BigQuery queries, depending on your Gmail log volume.
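
If you need to work out an epoch microseconds value for Checkpoint Start, the conversion can be sketched as follows. The helper names `to_epoch_micros` and `from_epoch_micros` are hypothetical; the arithmetic uses integer division on `timedelta` values to avoid floating-point rounding in the microsecond digits.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch microseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # timedelta // timedelta yields an exact integer count of microseconds.
    return (moment - EPOCH) // timedelta(microseconds=1)


def from_epoch_micros(value: int) -> datetime:
    """Convert epoch microseconds back to a UTC datetime."""
    return EPOCH + timedelta(microseconds=value)


if __name__ == "__main__":
    start = datetime(2023, 5, 2, 18, 57, 45, 563258, tzinfo=timezone.utc)
    print(to_epoch_micros(start))               # 1683053865563258
    print(from_epoch_micros(1683053865563258))  # 2023-05-02 18:57:45.563258+00:00
```

The first printed value matches the Checkpoint Start used in Example 3.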

JSON schema

Sources can be configured using UTF-8 encoded JSON files with the Collector Management API. See how to use JSON to configure Sources for details. 

| Parameter | Type | Value | Required | Description |
| --- | --- | --- | --- | --- |
| schemaRef | JSON Object | {"type":"Google BigQuery"} | Yes | Define the specific schema type. |
| sourceType | String | "Universal" | Yes | Type of source. |
| config | JSON Object | Configuration object | Yes | Source type specific values. |

Configuration Object

| Parameter | Type | Required | Default | Description | Example |
| --- | --- | --- | --- | --- | --- |
| name | String | Yes | null | Type a desired name of the source. The name must be unique per Collector. This value is assigned to the metadata field _source. | "mySource" |
| description | String | No | null | Type a description of the source. | "Testing source" |
| category | String | No | null | Type a category of the source. This value is assigned to the metadata field _sourceCategory. See best practices for details. | "mySource/test" |
| fields | JSON Object | No | null | JSON map of key-value fields (metadata) to apply to the Collector or Source. Use the boolean field _siemForward to enable forwarding to SIEM. | {"_siemForward": false, "fieldA": "valueA"} |
| projectId | String | Yes | null | The project ID is the globally unique identifier for your project. For example, pelagic-quanta-364805. | |
| credentialsJson | String | Yes | null | This field contains the credential JSON of the Service Account used for accessing BigQuery service. | |
| query | String | Yes | null | The query to be used in BigQuery. The special string %CHECKPOINT% will be replaced with the largest value seen in the checkpoint field. | |
| timeField | String | No | null | The name of the column from which to extract the timestamp. If not specified, the C2C uses the current time for each row or record collected. The TIMESTAMP data type is recommended, but any number type will be converted into an epoch milliseconds or epoch microseconds timestamp. | |
| checkpointField | String | Yes | null | The column whose largest value will be used as the %CHECKPOINT% in the next search. The checkpoint field has to be of type number or timestamp. | |
| checkpointStart | String | Yes | null | The very first value of the checkpoint to be used in the query. | |

JSON example

{
  "api.version":"v1",
  "source":{
    "schemaRef":{
      "type":"Google BigQuery"
    },
    "config":{
      "name":"MyBigQuerySource",
      "checkpointField":"timestamp_usec",
      "timeField":"timestamp_usec",
      "checkpointStart":"0",
      "query":"select message_info,event_info,event_info.timestamp_usec as timestamp_usec from `bigquery-dev-382704.BigQueryTest.GmailTest` where event_info.timestamp_usec > %CHECKPOINT% LIMIT 2",
      "projectId":"********",
      "fields":{
        "_siemForward":false
      },
      "pollingInterval":"2m",
      "credentialsJson":"********"
    },
    "state":{
      "state":"Collecting"
    },
    "sourceType":"Universal"
  }
}
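
A payload like the one above can also be assembled programmatically. The sketch below is illustrative only: the function name `build_source_payload` is hypothetical, it passes the key file contents as a string because the table lists `credentialsJson` as a String, and it omits the `state` block from the example on the assumption that this is not something you need to set yourself.

```python
import json
from typing import Optional


def build_source_payload(
    name: str,
    project_id: str,
    query: str,
    checkpoint_field: str,
    checkpoint_start: str,
    key_file_contents: str,
    time_field: Optional[str] = None,
    polling_interval: Optional[str] = None,
) -> str:
    """Assemble a Google BigQuery Source definition as a JSON string."""
    if "%CHECKPOINT%" not in query:
        raise ValueError("query must contain the phrase %CHECKPOINT%")
    config = {
        "name": name,
        "projectId": project_id,
        "query": query,
        "checkpointField": checkpoint_field,
        "checkpointStart": str(checkpoint_start),  # listed as a String in the table
        "credentialsJson": key_file_contents,      # key file contents passed as a string
    }
    # Optional parameters are included only when provided.
    if time_field is not None:
        config["timeField"] = time_field
    if polling_interval is not None:
        config["pollingInterval"] = polling_interval
    payload = {
        "api.version": "v1",
        "source": {
            "schemaRef": {"type": "Google BigQuery"},
            "config": config,
            "sourceType": "Universal",
        },
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    print(build_source_payload(
        name="MyBigQuerySource",
        project_id="my-project",  # hypothetical project ID
        query="select event_info.timestamp_usec as timestamp_usec from `MyProject.MyDataSet.activity` "
              "where event_info.timestamp_usec > %CHECKPOINT% order by timestamp_usec",
        checkpoint_field="timestamp_usec",
        checkpoint_start="0",
        key_file_contents='{"type": "service_account"}',  # truncated placeholder
        time_field="timestamp_usec",
        polling_interval="2m",
    ))
```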


Terraform example

resource "sumologic_cloud_to_cloud_source" "google_bigQuery_source" {
  collector_id = sumologic_collector.collector.id
  schema_ref = {
    type = "Google BigQuery"
  }
  config = jsonencode({
    "name":"MyBigQuerySource",
    "checkpointField":"timestamp_usec",
    "timeField":"timestamp_usec",
    "checkpointStart":"0",
    "query":"select message_info,event_info,event_info.timestamp_usec as timestamp_usec from `bigquery-dev-382704.BigQueryTest.GmailTest` where event_info.timestamp_usec > %CHECKPOINT% LIMIT 2",
    "projectId":"********",
    "fields":{
      "_siemForward":false
    },
    "pollingInterval":"2m",
    "credentialsJson":"********"
  })
}
resource "sumologic_collector" "collector" {
  name = "my-collector"
  description = "Just testing this"
}


FAQ

info

Click here for more information about Cloud-to-Cloud sources.


Copyright © 2024 by Sumo Logic, Inc.