Cloud OnBoard
Module 1: Introducing Google Cloud Platform
Module 2: Compute & Storage Fundamentals
Module 3: Data Analysis on the Cloud
Module 4: Scaling Data Analysis
Module 5: Machine Learning
Module 6: Data Processing Architecture
Summary | Continue learning with Google Cloud

Introducing

Google Cloud Platform:

Big Data and Machine Learning

Agenda

What is Google Cloud Platform

Google Cloud Big Data products

Cloud computing is a continuation of a long-term
shift in how computing resources are managed

First Wave (1980s): Server on-premises
You own everything. It is yours to manage.

Second Wave (2000): Data centers
You pay for the hardware but rent the space. Still yours to manage.

First Generation Cloud (Now): Virtualized data centers
You don't rent hardware and space, but still control and configure virtual
machines. Pay for what you provision.

Third Wave (Next): Managed service
Completely elastic storage, processing, and machine learning so that you can
invest your energy in great apps. Pay for what you use.

Agenda

What is Google Cloud Platform

Google Cloud Big Data products

Google's mission is to organize the world's information and make
it universally accessible and useful

To organize the world's information, Google has been building
the most powerful infrastructure on the planet

In terms of hardware, Google Cloud has the largest cloud network, with over
100 points of presence and hundreds of thousands of miles of fiber optic cable.

Network sea cable investments:
Unity (US, JP) 2010
SJC (JP, HK, SG) 2013
FASTER (US, JP, TW) 2016
Monet (US, BR) 2017
Junior (Rio, Santos) 2017
Tannat (BR, UY, AR) 2017
PLCN (HK, LA) 2019
Indigo (SG, ID, AU) 2019

Edge points of presence: >100
Edge node locations: >1000

The network connects 15 regions,
with 3 more coming

[Map: current and future regions, each labeled with its number of zones.
Regions shown: Oregon, Los Angeles, Iowa, S Carolina, N Virginia, Montreal,
São Paulo, London, Belgium, Netherlands, Frankfurt, Finland, Mumbai,
Singapore, Hong Kong, Taiwan, Tokyo, Sydney.]

In terms of software, organizing the world's information
has meant that Google needed to invent data processing methods

[Timeline, 2002 to 2018: GFS, MapReduce, Bigtable, Dremel, Colossus, Flume,
Megastore, Spanner, Millwheel, Pub/Sub, F1, TensorFlow, TPU]

Google Cloud opens up that innovation and infrastructure to you

[Timeline, 2002 to 2018, showing the Google Cloud product that corresponds to
each technology: Cloud Storage, Dataproc, Bigtable, BigQuery, Datastore,
Dataflow, Pub/Sub, Cloud Spanner, ML Engine, Auto ML]

A suite of products that can be put together for data processing

Foundation: Compute Engine, Cloud Storage
Databases: Cloud Spanner, Cloud SQL, Cloud Bigtable
Analytics and ML: BigQuery, Cloud Datalab, ML APIs
Data-handling frameworks: Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc

...

Spotify illustrates the typical journey of companies that come to
Google Cloud: From lower costs to increased reliability to business
transformation

1 Spend less: No-ops, Pay for use, Secure
2 Flexible, Complete
3 Innovative, Powerful

A suite of products that can be put together for data processing

Change where you compute
Improve scalability and reliability
Change how you compute

Atomic Fiction lowered their costs with per-minute
(now per-second) billing

Change where you compute

FIS was able to improve reliability and scalability
on a massive data-processing challenge

[Infographic: FIS workload throughput, stated in gigabytes per second,
terabytes written per hour, and billions of market events per hour.]

The Consolidated Audit Trail (CAT) is a data repository of all equities and options
orders, quotes, and events; FIS processed the CAT to organize 100 billion market events
into an "order lifecycle" in a 4-hour window using Cloud Bigtable.

Rooms to Go transformed its business with data and machine learning

[Diagram: data is collected from Google Analytics Premium (landing pages, data
views) and from the CRM, or Customer Relationship Manager (customer
demographics, past purchases), then combined and analyzed in BigQuery to
produce completely designed room packages.]

In summary, Google Cloud offers you ways to…

Spend less on ops and administration
Incorporate real-time data into apps and architectures
Apply machine learning broadly and easily
Become a truly data-driven company

Module Review

Module review

Google Cloud Platform is:
(select all of the correct options)

❏ Operated by Google on the same infrastructure it uses

❏ Most cost-effective if you pre-purchase instances on a yearly basis

❏ A set of modular services from which you can compose cloud-based applications

❏ A platform on which to host scalable and fast distributed applications


Resources
Google Cloud Platform

Datacenters

Google IT security
CommonSecurity-WhitePaper-v1.4.pdf
Why Google Cloud Platform?

Pricing Philosophy

Compute & Storage

Fundamentals

Agenda

CPUs on demand + Demo

A global filesystem + Demo

Google Cloud provides an earth-scale computer

Networking
Data storage

Compute power

Custom/changeable machine types, preemptible machines,
and automatic discounts lead to simplicity and agility

Demo:

Create a Compute Engine instance

Demo: Create a Compute Engine Instance

In this demo, we will:

1. Create a Compute Engine instance

2. SSH into the instance

3. Install the software package git (for source code version control)

Agenda

CPUs on demand + Demo

A global filesystem + Demo

Use Cloud Storage for persistent storage and as staging
ground for import to other Google Cloud products

[Diagram: 1 Ingest/Extract on Compute Engine + Disk, 2 Transform,
3 Store/Stage raw data (any format) in Cloud Storage, from which it can be
imported into Cloud SQL, BigQuery, or Dataproc.]

Create a bucket and copy the data over using the Cloud
SDK; blobs are referenced through a gs://.../ URL

[Diagram: within a Google Cloud Platform project, objects (data and metadata)
are copied from one bucket to another.]

gsutil cp sales*.csv gs://acme-sales/data/
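
The gs://.../ URL in the command above names a bucket followed by an object path. A minimal sketch of how such a URL splits into its two parts, using only the Python standard library. The helper name split_gs_url and the file name sales-2017.csv are made up for illustration, and nothing here talks to Cloud Storage.

```python
from urllib.parse import urlparse


def split_gs_url(url):
    """Split a gs:// URL into (bucket, object_path)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("not a gs:// URL: " + url)
    # The bucket is the host part; the object path is everything after it.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_gs_url("gs://acme-sales/data/")

# Hypothetical file name: where one copied CSV file would end up.
destination = "gs://{}/{}{}".format(bucket, prefix, "sales-2017.csv")
print(bucket, prefix, destination)
```

The trailing slash in data/ is part of the object path; Cloud Storage has a flat namespace, and "folders" are a naming convention on object names.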

Cloud Storage gives you durability,
reliability, and global reach

Control access at project, bucket and/or object level

Transfer Services are useful for ingest

Use Cloud Storage as staging area

[Diagram: Ingest (Compute Engine), Store (Cloud Storage), Import (Cloud SQL),
Publish]

Control latency and availability
with zones and regions

Choose the closest zone/region so as to reduce latency.

Distribute your apps and data across zones to reduce service disruptions.

Distribute your apps and data across regions for global availability.

Region: North America    Zone: us-central1-a, ...
Region: Europe           Zone: europe-west1-b, ...
Region: ...              Zone: ...
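
A minimal sketch of the zone/region relationship, assuming zone names are a region name plus a letter suffix, as in us-central1-a. The helper names and the zones us-central1-b and us-central1-c are assumptions for illustration; no Google Cloud API is called.

```python
def region_of(zone):
    """Return the region part of a zone name such as 'us-central1-a'."""
    region, _, zone_letter = zone.rpartition("-")
    if not region or not zone_letter:
        raise ValueError("unexpected zone name: " + zone)
    return region


def spread_across_zones(replica_count, zones):
    """Assign replicas to zones round-robin so one zone outage hits fewer of them."""
    return {i: zones[i % len(zones)] for i in range(replica_count)}


# Hypothetical zone list within a single region.
zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
placement = spread_across_zones(4, zones)
print(region_of("europe-west1-b"), placement)
```

Spreading replicas over zones addresses service disruptions inside one region; global availability would additionally need zones drawn from more than one region.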

Demo:

Interact with

Cloud Storage

Demo: Interact with Cloud Storage

In this demo, we carry out the steps of an ingest-transform-and-publish
data pipeline manually

1. Ingest data into a Compute Engine instance

2. Transform data on the Compute Engine instance

3. Store the transformed data on Cloud Storage

4. Publish Cloud Storage data to the web

Ingest-Transform-Publish using core infrastructure

[Diagram: Step 1 Ingest/Extract (Compute Engine), Step 2 Store (Cloud
Storage), Step 3 Import (Cloud SQL), Step 4 Publish]

Cloud Shell gives you an easy command-line

Click

Do Now

Cloud Shell comes pre-installed with the tools, libraries,
and so on you need to interact with Google Cloud Platform

Module Review

Module review (1 of 2)

Compute nodes on GCP are:
(select the correct option)

❏ Allocated on demand, and you pay for the time that they are up.

❏ Expensive to create and tear down

❏ Pre-installed with all the software packages you might ever need.

❏ One of ~50 choices in terms of CPU and memory

Module review answers (1 of 2)

Compute nodes on GCP are:
(select the correct option)

➔ Allocated on demand, and you pay for the time that they are up.

❏ Expensive to create and tear down

❏ Pre-installed with all the software packages you might ever need.

❏ One of ~50 choices in terms of CPU and memory

Module review (2 of 2)

Google Cloud Storage is a good option for storing data that:
(select all of the correct options)

❏ Is ingested in real-time from sensors and other devices

❏ Will be frequently read/written from a compute node

❏ May be required to be read at some later time

❏ May be imported into a cluster for analysis

Module review (2 of 2)

Google Cloud Storage is a good option for storing data that:
(select all of the correct options)

❏ Is ingested in real-time from sensors and other devices

❏ Will be frequently read/written from a compute node

➔ May be required to be read at some later time

➔ May be imported into a cluster for analysis

Resources

Google Cloud Platform
Datacenters

Pricing

Cloud Launcher
Pricing Philosophy

Data Analysis

on the Cloud

Agenda

Stepping stones to transformation

Your SQL database in the cloud + Demo
Managed Hadoop in the cloud + Demo

Google Cloud Platform began in 2008, with App Engine,
a serverless way to run web applications

[Diagram: 1 Develop your code, 2 Upload it to App Engine, 3 App Engine
autoscales and runs it reliably.]

App Engine
App Engine Flex
Container Engine
Compute Engine

There [was] something fundamentally wrong with what we were doing in 2008
… We didn't get the right stepping stones into the cloud …
-- Eric Schmidt, Executive Chairman, Google

GCP now consists of a suite of products that together provide these
stepping stones in a business' transformative journey

Change where you compute
Cost effective virtual machines, storage, Hadoop, and MySQL to migrate your
current workloads to the public cloud.

Flexibility, scalability and reliability
Reliable, autoscaling messaging, data processing, and storage.

Change how you compute
Fully managed products for data warehousing, data analysis, streaming, and
machine learning.

Machine learning. This is the next transformation … the programming
paradigm is changing. Instead of programming a computer, you teach a
computer to learn something and it does what you want.

-- Eric Schmidt, Executive Chairman, Google

WIRED's headline

"If you want to teach a neural network to
recognize a cat, for instance, you don't
tell it to look for whiskers, ears, fur,
and eyes. You simply show it thousands
and thousands of photos of cats, and
eventually it works things out."

Machine Learning is not new,
but it is now mainstream

Search
People who bought ...
Spam filtering
Suggest next video
Route planning
Smart Reply

? What's common to all of these use cases of Machine Learning?

There are three components in a recommendation system

Rating: Users rate a few houses explicitly or implicitly

Training: A machine learning model is created to predict a user's rating of
a house

Recommending: For each user, the model is applied to every unrated house and
the top 5 houses for that user are saved.

? What else is needed?

The ML algorithm essentially clusters users and items

1 Who is like this user?

2 Is this a good house?

3 Predict rating: Is this house similar to houses that people similar to
this user like?

Predicted rating = user-preference * item-quality

? How often do you need to compute the predicted ratings?
Where would you save them?
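
The predicted-rating formula and the "top 5 houses" step can be sketched in a few lines of Python. This is only an illustration under assumptions: user preferences and item qualities are taken to be small hand-written vectors (in practice a trained model would learn them), the product is read as a dot product, and every name and number below is made up.

```python
def predict_rating(user_preference, item_quality):
    """Predicted rating = user-preference * item-quality, read as a dot product."""
    return sum(u * q for u, q in zip(user_preference, item_quality))


def top_k(user_preference, houses, already_rated, k=5):
    """Score every unrated house for one user and keep the k best."""
    scored = [
        (predict_rating(user_preference, quality), name)
        for name, quality in houses.items()
        if name not in already_rated
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical two-factor vectors.
user_preference = [0.9, 0.1]
houses = {
    "loft": [4.0, 1.0],
    "cabin": [1.0, 5.0],
    "villa": [5.0, 2.0],
    "studio": [2.0, 2.0],
}
print(top_k(user_preference, houses, already_rated={"loft"}, k=2))
```

Because top_k scores every unrated house for each user, the cost grows with users times houses, which is why the questions about how often to recompute and where to save the results matter.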

In addition to the ML algorithm, you also need
sophisticated data management

Data Collection: Scalable front end to collect customer actions

Data Analysis: Data that is accessible and not siloed

Machine Learning: (Re-)training and experimentation

Serving: Scalable, real-time system to serve recommendations

Agenda

Stepping stones to transformation

Your SQL database in the cloud + Demo
Managed Hadoop in the cloud + Demo

Choose your storage solution based on your access pattern

Product | Capacity | Access metaphor | Read | Write | Update granularity | Usage
Cloud Storage | Petabytes+ | Like files in a file system | Have to copy to local disk | One file | An object (a "file") | Store blobs
Cloud SQL | Gigabytes | Relational database | SELECT rows | INSERT row | Field | No-ops SQL database on the cloud
Datastore | Terabytes | Persistent Hashmap | Filter objects on property | put object | Attribute | Structured data from AppEngine apps
Bigtable | Petabytes | Key-value(s), HBase API | scan rows | put row | Row | No-ops, high throughput, scalable, flattened data
BigQuery | Petabytes | Relational | SELECT rows | Batch/stream | Field | Interactive SQL* querying fully managed warehouse

Cloud SQL is a fully managed database service

Cloud SQL
Google-managed MySQL or Postgres

Flexible pricing
Familiar
Managed backups
Automatic replication
Fast connection from GCE & GAE
Connect from anywhere
Google Security

Demo:

Set up rentals data

in Cloud SQL

Demo: Set up rentals data in Cloud SQL

In this demo, we populate rentals data in Cloud SQL for the recommendation
engine to use:

1. Create Cloud SQL instance
2. Create database tables by importing .sql files from Cloud Storage
3. Populate the tables by importing .csv files from Cloud Storage
4. Allow access to Cloud SQL
5. Explore the rentals data using SQL statements from Cloud Shell

[Diagram: files are imported from Cloud Storage into Cloud SQL; an external
machine connects to Cloud SQL.]

Agenda

Stepping stones to transformation

Your SQL database in the cloud

Dataproc reduces the cost and complexity associated with
Spark and Hadoop clusters

Dataproc
Google-managed: Hadoop, Pig, Hive, Spark

Image Versioning
Familiar
Resize in seconds
Automated cluster mgmt
Integrates with Google Cloud
Flexible VMs
Google Security

Demo:

Recommendations ML

with Dataproc

Demo: Recommendations ML with Cloud Dataproc

In this demo, we implement machine learning recommendations using
Cloud Dataproc:

1. Launch Dataproc
2. Train and apply ML model written in PySpark to create product
recommendations
3. Explore inserted rows in Cloud SQL

[Diagram: 1 Dataproc, 2 Train model, 3 Cloud SQL: show recommendations]

Module Review

Module review (1 of 2)

Relational databases are a good choice when you need:
(select all of the correct options)

❏ Streaming, high-throughput writes

❏ Fast queries on terabytes of data

❏ Aggregations on unstructured data

❏ Transactional updates on relatively small datasets

Module review (1 of 2)

Relational databases are a good choice when you need:
(select all of the correct options)

❏ Streaming, high-throughput writes

❏ Fast queries on terabytes of data

❏ Aggregations on unstructured data

✓ Transactional updates on relatively small datasets

Module review (2 of 2)

Cloud SQL and Cloud Dataproc offer familiar tools (MySQL and
Hadoop/Pig/Hive/Spark). What is the value-add provided by Google Cloud Platform?
(select all of the correct options)

❏ It's the same API, but Google implements it better

❏ Google-proprietary extensions and bug fixes to MySQL, Hadoop, and so on

❏ Fully-managed versions of the software offer no-ops

❏ Running it on Google infrastructure offers reliability and cost savings

Module review (2 of 2)

Cloud SQL and Cloud Dataproc offer familiar tools (MySQL and
Hadoop/Pig/Hive/Spark). What is the value-add provided by Google Cloud Platform?
(select all of the correct options)

❏ It's the same API, but Google implements it better

❏ Google-proprietary extensions and bug fixes to MySQL, Hadoop, and so on

✓ Fully-managed versions of the software offer no-ops

✓ Running it on Google infrastructure offers reliability and cost savings

Resources

Cloud SQL
Cloud Dataproc
Cloud Solutions

Scaling Data Analysis

Version #1.1

Agenda

Fast random access

Warehouse and interactively query petabytes
Interactive, iterative development + Demo

Choosing where to store data on GCP

[Decision diagram: unstructured data goes to Cloud Storage, or to Firebase
Storage if you need mobile SDKs. Structured data splits by workload. For a
transactional workload, choose SQL or No-SQL: with SQL, use Cloud SQL if one
database is enough and Cloud Spanner if you need horizontal scalability; with
No-SQL, use Cloud Datastore, or Firebase Realtime DB if you need mobile SDKs.
For a data analytics workload, use Cloud Bigtable for millisecond latency and
BigQuery for latency in seconds.]

Use Cloud Spanner if you need globally consistent data or more
than one Cloud SQL instance

Source: quizlet-cloud-spanner

Comparing storage options: technical details

Product | Type | Transactions | Complex queries | Capacity | Unit size
Cloud Datastore | NoSQL document | Yes | No | Terabytes+ | 1 MB/entity
Bigtable | NoSQL wide column | Single-row | No | Petabytes+ | ~10 MB/cell, ~100 MB/row
Cloud Storage | Blobstore | No | No | Petabytes+ | 5 TB/object
Cloud SQL | Relational SQL for OLTP | Yes | Yes | 500 GB | Determined by DB engine
Cloud Spanner | Relational SQL for OLTP | Yes | Yes | Petabytes | 10,240 MiB/row
BigQuery | Relational SQL for OLAP | No | Yes | Petabytes+ | 10 MB/row

Comparing storage options: use cases

Product | Type | Best for | Use cases
Cloud Datastore | NoSQL document | Getting started, App Engine applications | Getting started, App Engine applications
Bigtable | NoSQL wide column | "Flat" data, heavy read/write, events, analytical data | AdTech, Financial and IoT data
Cloud Storage | Blobstore | Structured and unstructured binary or object data | Images, large media files, backups
Cloud SQL | Relational SQL for OLTP | Web frameworks, existing applications | User credentials, customer orders
Cloud Spanner | Relational SQL for OLTP | Large-scale database applications (> ~2 TB) | Whenever high I/O, global consistency is needed
BigQuery | Relational SQL for OLAP | Interactive querying, offline analytics | Data warehousing

Bigtable is meant for high throughput data where access is primarily
for a range of Row Key prefixes

Row Key | Column data
NASDAQ#1426535612045 | MD:SYMBOL: ZXZZT | MD:LASTSALE: 600.58 | MD:LASTSIZE: 300 | MD:TRADETIME: 1426535612045 | MD:EXCHANGE: NASDAQ
... | ... | ... | ... | ... | ...

Tables should be tall and narrow
Store changes as new rows
Bigtable will automatically compact the table
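
Bigtable keeps rows sorted by row key, which is what makes a scan over a row-key prefix cheap. A minimal local sketch of that idea, using a sorted Python list as a stand-in for the table. It does not use any Bigtable or HBase client, the helper names are made up, and the second and third keys are invented for illustration.

```python
import bisect


def row_key(exchange, timestamp_ms):
    """Build a row key of the form EXCHANGE#TIMESTAMP, as on the slide."""
    return "{}#{}".format(exchange, timestamp_ms)


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, using two binary searches."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after the characters used in these keys, so it closes the range.
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


keys = sorted([
    row_key("NASDAQ", 1426535612045),
    row_key("NYSE", 1426535612047),    # hypothetical row
    row_key("NASDAQ", 1426535612050),  # hypothetical row
])
nasdaq_rows = prefix_scan(keys, "NASDAQ#")
print(nasdaq_rows)
```

The same sketch hints at the hotspot concern: when every new key shares one prefix and a rising timestamp, all writes land next to each other in the sorted order.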

Short meaningful column names reduce storage and RPC overhead

Design row key with most common query in mind
Column families are a quick way to get some hierarchy

Row Key | Column data
NASDAQ#1426535612045 | MD:SYMBOL: ZXZZT | MD:LASTSALE: 600.58 | MD:LASTSIZE: 300 | MD:TRADETIME: 1426535612045 | MD:EXCHANGE: NASDAQ

Design row key to minimize hotspots
Use short column names
Designed for sparse tables

Can work with Bigtable using the HBase API

import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.*;

byte[] CF = Bytes.toBytes("MD"); // column family
Connection connection = ConnectionFactory.createConnection(...);
Table table = null;
try {
    table = connection.getTable(TABLE_NAME);
    // Row key: exchange, symbol, and timestamp joined with '#'
    Put p = new Put(Bytes.toBytes("NASDAQ#GOOG#1234561234561"));
    p.addColumn(CF, Bytes.toBytes("SYMBOL"), Bytes.toBytes("GOOG"));
    p.addColumn(CF, Bytes.toBytes("LASTSALE"), Bytes.toBytes(742.03d));
    ...
    table.put(p);
} finally {
    if (table != null) table.close();
}


Agenda

Fast random access

Warehouse and interactively query petabytes
Interactive, iterative development + Demo

BigQuery is a fully managed data warehouse that lets you do ad-hoc
SQL queries on massive volumes of data

[Diagram: the BigQuery Service contains projects (Project X, Project Y); each
project contains datasets (Dataset A and Dataset B in Project X, Dataset C and
Dataset D in Project Y); each dataset contains tables (Table 1, Table 2).]

A demo of BigQuery on a 10 billion-row dataset shows what it is
and what it can do

#standardsql
SELECT
  language, SUM(views) as views
FROM `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title like "%google%"
GROUP by language
ORDER by views DESC

Familiar, SQL 2011 query language
Interactive ad-hoc analysis of petabyte-scale databases
No need to provision clusters

Three ways of loading data into BigQuery

Files on disk or Cloud Storage
Stream Data
Federated data source

[Diagram labels: CSV, JSON, AVRO; POST; Google Sheets; Serverless ETL]
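
The shape of the demo query (filter, group, sum, order) can be tried locally on a handful of rows. The sketch below uses Python's built-in sqlite3 module as a stand-in; it is not BigQuery, the table name wiki and its four rows are invented, and the aggregate is aliased total_views for clarity. One difference worth noting: SQLite's LIKE ignores ASCII case by default, whereas BigQuery's LIKE is case-sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki (language TEXT, title TEXT, views INTEGER)")
# Invented sample rows; the real demo table has about 10 billion.
conn.executemany(
    "INSERT INTO wiki VALUES (?, ?, ?)",
    [
        ("en", "Google", 120),
        ("en", "Google Maps", 30),
        ("de", "Google", 40),
        ("en", "Python", 500),
    ],
)

rows = conn.execute(
    """
    SELECT language, SUM(views) AS total_views
    FROM wiki
    WHERE title LIKE '%google%'
    GROUP BY language
    ORDER BY total_views DESC
    """
).fetchall()
print(rows)
conn.close()
```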

With Federated data sources, you can directly query files on
Cloud Storage, without having to ingest them into BigQuery

Also: Google Drive, Bigtable
Also: JSON/Avro/Google Sheet
Can also pass in a schema

Agenda

Fast random access

Warehouse and interactively query petabytes
Interactive, iterative development + Demo

Increasingly, data analysis and machine learning are carried
out in self-descriptive, shareable, executable notebooks

A typical notebook contains code, charts, and explanations

[Figure labels: Share, Code, Output, Markup]
Image Source: Git Logo from Wikipedia

Datalab is an open-source notebook built on Jupyter (IPython)

Analyze data in BigQuery, Compute Engine or Cloud Storage
Datalab is free; just pay for Google Cloud resources
Use existing Python packages

Datalab notebooks are developed in an iterative, collaborative process

Development Process in Cloud Datalab

PHASE 1: Write code in Python
PHASE 2: Run cell (Shift+Enter)
PHASE 3: Examine Output
PHASE 4: Write commentary in markdown
PHASE 5: Share and collaborate

Datalab supports BigQuery

[Screenshot: a %%sql cell that queries BigQuery, with the result converted
to Pandas.]

Demo:

Create ML dataset

with BigQuery

Demo: Create ML dataset with BigQuery

In this demo, we use BigQuery to create a dataset that we later use to build
a taxi demand forecast system using Machine Learning.

● What kinds of things affect taxi demand?
● What are some ways to measure "demand"?

Demo: Create ML dataset with BigQuery

In this demo, we use BigQuery to create a dataset that we later
use to build a taxi demand forecast system using Machine Learning.

1. Use BigQuery and Datalab to explore and visualize data
2. Build a Pandas dataframe that will be used as the training
dataset for machine learning using TensorFlow

Module Review

Module review

Match the use case on the left with the product on the right

Global consistency needed                              1. Datalab
High-throughput writes of wide-column data             2. BigTable
Warehousing structured data                            3. BigQuery
Develop Big Data algorithms interactively in Python    4. Spanner

Module review

Match the use case on the left with the product on the right

Global consistency needed (4)                              1. Datalab
High-throughput writes of wide-column data (2)             2. BigTable
Warehousing structured data (3)                            3. BigQuery
Develop Big Data algorithms interactively in Python (1)    4. Spanner

Machine Learning

Agenda

Machine learning with TensorFlow + Demo

Pre-built machine learning models + Demo

TensorFlow is an open source library that underlies many Google products

Demo: Playing with neural networks to learn what they are

Supervised machine learning requires features and labels

[Diagram: input features feed a neural network, which outputs a prediction;
a cost is computed from the prediction.]
Neural network image by Dake, Mysid [CC BY 1.0], via Wikimedia Commons

Machine Learning with TensorFlow involves four steps:

1 Gather Data: Gather training data (input features and labels)
2 Create: Create model
3 Train: Train the model based on input data
4 Use: Use the model on new data

Gather training data and select input features

1 Gather Data

[Table screenshot: columns marked as input features, discard, or target]

All input features need to be numeric

1 Gather Data

[Table screenshot: columns marked "Use as-is" or "One-hot encoding"]
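
One-hot encoding turns a categorical value into a list of 0s and 1s with a single 1 in the position of that category. A minimal sketch in plain Python, assuming dayofweek is the categorical feature (numbered 1 to 7) and that mintemp, maxtemp, and rain are used as-is; which columns get which treatment is an assumption here, not something the slides specify.

```python
def one_hot(value, categories):
    """Return a list with 1.0 in the position of value and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError("unknown category: {!r}".format(value))
    return [1.0 if value == category else 0.0 for category in categories]


# Assumption: days are numbered 1 through 7.
DAYS = list(range(1, 8))


def encode_row(dayofweek, mintemp, maxtemp, rain):
    """Build an all-numeric feature vector: one-hot day plus as-is numbers."""
    return one_hot(dayofweek, DAYS) + [float(mintemp), float(maxtemp), float(rain)]


features = encode_row(dayofweek=4, mintemp=60, maxtemp=80, rain=0)
print(features)
```

Encoding the day this way avoids implying that day 6 is "twice" day 3, which is what using the raw number as-is would suggest to the model.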

Create a neural network model, defining the number of feature columns
and hidden units

2 Create

[Diagram: a network with npredictors inputs, nhidden hidden units, and
noutputs outputs]

estimator = DNNRegressor(hidden_units=[5], feature_columns=[...])

Train the model on the collected data

3 Train

[Diagram: npredictors inputs feed the model, which outputs the predicted value
of taxicab demand; the cost compares it with the true value of taxicab demand;
the model is updated based on the cost.]

..., targets, steps=1000)
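
The training loop in the diagram (predict, compute cost against the true value, update the model based on the cost) can be sketched without TensorFlow. The code below is plain gradient descent on a one-feature linear model with a mean-squared-error cost; it is a deliberately simpler stand-in for DNNRegressor, and the data, learning rate, and function names are all made up for illustration.

```python
def predict(weights, bias, features):
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


def mean_squared_error(weights, bias, rows, targets):
    """Cost: average squared gap between predictions and true values."""
    errors = [predict(weights, bias, r) - t for r, t in zip(rows, targets)]
    return sum(e * e for e in errors) / len(errors)


def train(rows, targets, steps=1000, learning_rate=0.1):
    """Repeat: predict, measure the error, nudge the model downhill."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, target in zip(rows, targets):
            error = predict(weights, bias, features) - target
            grad_b += 2 * error / n
            for j, x in enumerate(features):
                grad_w[j] += 2 * error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data following target = 2 * x + 1.
rows = [[0.0], [0.5], [1.0]]
targets = [1.0, 2.0, 3.0]
cost_before = mean_squared_error([0.0], 0.0, rows, targets)
weights, bias = train(rows, targets)
cost_after = mean_squared_error(weights, bias, rows, targets)
print(weights, bias, cost_before, cost_after)
```

A neural network follows the same loop; it has more parameters and nonlinear hidden units, and the library computes the gradients for you.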

Train the model on the collected data

4 Use

[Diagram: inputs such as rain and max temp feed the model, which outputs the
predicted value of taxicab demand; during training, the cost compares it with
the true value of taxicab demand and the model is updated based on the cost.]

Train the model on the collected data

4 Use

import pandas as pd

# new data to predict on, one list entry per day
input = pd.DataFrame.from_dict(data =
    {'dayofweek' : [4, 5, 6],
     'mintemp' : [60, 15, 60],
     'maxtemp' : [80, 80, 65],
     'rain' : [0, 0.8, 0]})

# read trained model from /tmp/trained_model
estimator = DNNRegressor(model_dir='/tmp/trained_model',
                         hidden_units=[5])

pred = estimator.predict(input.values)
print(pred)

Demo 2, Part 2: Carry out ML with TensorFlow

In this demo, we build a neural network to predict taxicab demand on a day-by-day basis using TensorFlow.

Inputs -> Neural Network -> Prediction

Agenda

Machine learning with TensorFlow + Demo

Pre-built machine learning models + Demo

The accuracy of an ML model is driven largely by the size and quality of the dataset; this is why ML requires massive compute

Chart labels: Scale of Compute Problem, Accuracy, Size of dataset

Cloud ML Engine simplifies the use of Distributed TensorFlow

Figure label: Size of dataset

ML APIs are pre-trained ML models (trained on Google's data) for common tasks; they are accessible through REST APIs

Use your own data to train models:
TensorFlow
Cloud Machine Learning Engine

Machine Learning as an API:
Cloud Vision API
Cloud Speech API
Cloud Natural Language API
Cloud Translation API
Cloud Video Intelligence API

Logo Detection

Face detection

"faceAnnotations" : [
  {
    "headwearLikelihood" : "VERY_UNLIKELY",
    "surpriseLikelihood" : "VERY_UNLIKELY",
    "rollAngle" : -4.6490049,
    "angerLikelihood" : "VERY_UNLIKELY",
    "landmarks" : [
      {
        "type" : "LEFT_EYE",
        "position" : {
          "x" : 691.97974,
          "y" : 373.11096,
          "z" : 0.000037421443
        }
      },
      ...
    ],
    "boundingPoly" : {
      "vertices" : [
        {
          "x" : 743,
          "y" : 449
        },
        ...
    "detectionConfidence" : 0.93568963,
    "joyLikelihood" : "VERY_LIKELY",
    "panAngle" : 4.150538,
    "sorrowLikelihood" : "VERY_UNLIKELY",
    "tiltAngle" : -19.377356,
    "underExposedLikelihood" : "VERY_UNLIKELY",
    "blurredLikelihood" : "VERY_UNLIKELY"
    ...

Web annotations

{
  "entityId": "/m/0gff2yr",
  "score": 5.92256,
  "description": "ArtScience Museum"
}
{
  "entityId": "/m/0h898pd",
  "score": 7.4162,
  "description": "Harry Potter (Literary Series)"
}
{
  "entityId": "/m/016ms7",
  "score": 1.44038,
  "description": "Ford Anglia"
}

Image: CC-BY 2.0, Rev Stan

Try it in the browser with your own images

The Translation API supports 100+ languages

Wootric uses the Cloud Natural Language API (entity and sentiment) to make sense of qualitative customer feedback

Extracted entities are tied into a knowledge graph

Joanne "Jo" Rowling, pen names J. K. Rowling and Robert Galbraith, is a British novelist, screenwriter and film producer best known as the author of the Harry Potter fantasy series

{
  "name": "Joanne 'Jo' Rowling",
  "type": "PERSON",
  "metadata": {
    "mid": "/m/042xh",
    "wikipedia_url": ""
  }
}
{
  "name": "British",
  "type": "LOCATION",
  "metadata": {
    "mid": "/m/07ssc",
    "wikipedia_url": ""
  }
}
{
  "name": "Harry Potter",
  "type": "PERSON",
  "metadata": {
    "mid": "/m/078ffw",
    "wikipedia_url": ""
  }
}

When you analyze sentiment, you get a score (positive/negative) as well as a magnitude (how intense?)

The food was excellent, I would definitely go back!

{
  "documentSentiment": {
    "score": 0.8,
    "magnitude": 0.8
  }
}
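One way to read the score and magnitude together is sketched below. The function name `describe_sentiment` and both thresholds (`neutral_band`, `strong_magnitude`) are hypothetical choices for illustration; the slides do not prescribe cut-off values, and sensible values depend on the length and kind of text being analyzed.

```python
import json


def describe_sentiment(document_sentiment, neutral_band=0.25, strong_magnitude=2.0):
    """Turn a documentSentiment object into (polarity, intensity) labels.

    The thresholds are illustrative assumptions, not values from the API.
    """
    score = document_sentiment["score"]
    magnitude = document_sentiment["magnitude"]
    if score > neutral_band:
        polarity = "positive"
    elif score < -neutral_band:
        polarity = "negative"
    else:
        polarity = "neutral"
    intensity = "strong" if magnitude >= strong_magnitude else "mild"
    return polarity, intensity


# The response shown on the slide for the restaurant review.
response = json.loads('{"documentSentiment": {"score": 0.8, "magnitude": 0.8}}')
label = describe_sentiment(response["documentSentiment"])
print(label)
```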

The Cloud Speech API can be used to transcribe audio to text

Like the Vision API, the Video Intelligence API can identify labels in a video, along with a timestamp

{
  "description": "Bird's-eye view",
  "language_code": "en-us",
  "locations": {
    "segment": {
      "start_time_offset": 71905212,
      "end_time_offset": 73740392
    },
    "confidence": 0.96653205
  }
}

Demo 2, Part 3: Machine Learning APIs

Use several of the Machine Learning APIs (Vision, Translate, Natural Language Processing, Speech) from Python

“How much is this car worth?”

“Thanks to the Google Cloud Platform, Ocado was able to use the power of cloud computing and train our models in parallel.”

Improves natural language processing of customer service claims

“Hi Ocado,
I love your website. I have children so it’s easier for me to do the shopping online.
Many thanks for saving my time!
Regards”

Labels: Feedback, Customer is happy

50% of enterprises will be spending more per annum on bots and chatbot creation than traditional mobile app development by 2021 (Gartner)

Custom image model to price cars
Build off NLP API to route customer emails
Use Vision API as-is to find text in memes
Use Dialogflow to create a new shopping experience

Introducing Cloud AutoML

A technology that can automatically create a Machine Learning Model

DATA PREPROCESSING -> ML MODEL DESIGN -> TUNE ML MODEL PARAMETERS -> EVALUATE -> DEPLOY -> UPDATE

Cloud AutoML Vision

Upload and label images
Train your model in minutes or one day
Evaluate

Example labels: Handbag, Shoe, Hat

Model is now trained and ready to make predictions.
This model can scale as needed to adapt to customer demands.

Module review

Match the use case on the left with the product on the right

Use cases:
Create, test new machine learning methods
No-ops, custom machine learning applications at scale
Automatically reject inappropriate image content
Build application to monitor Spanish Twitter feed
Transcribe customer support calls

Products:
1. Vision API
2. TensorFlow
3. Speech API
4. Cloud ML
5. Translation API

Module review (answers)

Match the use case on the left with the product on the right

Create, test new machine learning methods (2)
No-ops, custom machine learning applications at scale (4)
Automatically reject inappropriate image content (1)
Build application to monitor Spanish Twitter feed (5)
Transcribe customer support calls (3)

1. Vision API
2. TensorFlow
3. Speech API
4. Cloud ML
5. Translation API

Resources (1 of 2)

Cloud Spanner
Cloud Bigtable
Google BigQuery
Cloud Datalab
TensorFlow

Resources (2 of 2)

Cloud Machine Learning
Vision API
Translation API
Speech API
Video Intelligence API …intelligence

Cloud OnBoard

Data Processing Architecture

Agenda

Message-oriented architectures

Serverless data pipelines

GCP Reference Architecture

Asynchronous processing is useful for long-lived tasks or to have loose coupling between two systems

Figure: Producers (P1, P2, P3) -> Message Queue -> Consumers (C1, C2, C3)

Potential use cases:
1. Send an SMS
2. Train ML model
3. Process data from multiple sources
4. Weekly reports …
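The producer, queue, consumer shape in the figure can be sketched with the Python standard library. An in-process `queue.Queue` stands in for a managed message queue here, so this shows only the decoupling idea and is not the Cloud Pub/Sub API. The names `producer`, `consumer`, and `run` and the message format are hypothetical.

```python
import queue
import threading


def producer(name, message_queue, count):
    """Publish `count` messages without knowing who will consume them."""
    for i in range(count):
        message_queue.put("%s-task-%d" % (name, i))


def consumer(message_queue, results, lock):
    """Process messages until a None sentinel arrives."""
    while True:
        message = message_queue.get()
        if message is None:
            break
        with lock:
            results.append(message.upper())  # stand-in for real work


def run(num_producers=3, num_consumers=3, per_producer=4):
    message_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    consumers = [threading.Thread(target=consumer,
                                  args=(message_queue, results, lock))
                 for _ in range(num_consumers)]
    producers = [threading.Thread(target=producer,
                                  args=("P%d" % (i + 1), message_queue, per_producer))
                 for i in range(num_producers)]
    for thread in consumers + producers:
        thread.start()
    for thread in producers:
        thread.join()
    # One sentinel per consumer tells each of them to stop.
    for _ in consumers:
        message_queue.put(None)
    for thread in consumers:
        thread.join()
    return sorted(results)


processed = run()
print(len(processed))
```

Producers and consumers share only the queue, so either side can be scaled or replaced without changing the other.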

For robust asynchronous processing, you need:

1. A global, highly available queue
2. Scale without over-provisioning
3. Queue must be interoperable
4. Reliable delivery of messages

Figure: P1, P2, P3 -> queue -> C1, C2, C3

Pub/Sub provides a no-ops, serverless global message queue

Agenda

Message-oriented architectures

Serverless data pipelines

GCP Reference Architecture

Dataflow offers NoOps data pipelines in Java and Python

Open-source API (Apache Beam) can also be executed on Flink, Spark, etc.

Pipeline stages: Input -> Read -> Transform 1 -> Transform 2 -> Group -> Transform 3 -> Transform 4 -> Write -> Output

p = beam.Pipeline(options=options)
lines = p | …('gs://…')                                  # Read
traffic = (lines
    | beam.Map(parse_data).with_output_types(unicode)    # Transform 1
    | beam.Map(get_speedbysensor)  # (sensor, speed)     Transform 2: Map
    | beam.GroupByKey()            # (sensor, [speed])   Group-By
    | beam.Map(avg_speed)          # (sensor, avgspeed)  Transform 3: Reduce
    | beam.Map(lambda tup: '%s: %d' % tup))              # Transform 4
output = traffic | …('gs://...]')                        # Write

Each of these steps is run in parallel and autoscaled by the execution framework
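The Map, Group-By, Reduce sequence in the pipeline can be traced on a handful of records in plain Python, without Apache Beam. The function names `get_speedbysensor` and `avg_speed` come from the slide, but their bodies are not shown there, so the implementations and the "sensor,speed" input format below are assumptions. `group_by_key` is a hypothetical stand-in for `beam.GroupByKey()`.

```python
from collections import defaultdict


def get_speedbysensor(line):
    """Map: turn one 'sensor,speed' text line into a (sensor, speed) pair."""
    sensor, speed = line.split(",")
    return sensor, float(speed)


def group_by_key(pairs):
    """Group-By: collect all values that share a key, as (key, [values])."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def avg_speed(item):
    """Reduce: collapse (sensor, [speed]) into (sensor, avgspeed)."""
    sensor, speeds = item
    return sensor, sum(speeds) / len(speeds)


lines = ["s1,60", "s2,40", "s1,70", "s2,50", "s1,80"]
pairs = [get_speedbysensor(line) for line in lines]
averages = [avg_speed(item) for item in group_by_key(pairs)]
output = ["%s: %d" % tup for tup in averages]
print(output)
```

In a real pipeline each of these stages runs in parallel across workers; here they run one after another on a list.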

Same code does real-time and batch

Figure: Cloud Pub/Sub and Cloud Storage -> Cloud Dataflow -> BigQuery, Cloud Pub/Sub, and Cloud Storage

options = PipelineOptions(pipeline_args)
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)
lines = p | …
traffic = (lines
    | beam.Map(parse_data).with_output_types(unicode)
    | beam.Map(get_speedbysensor)  # (sensor, speed)
    | beam.WindowInto(window.FixedWindows(15, 0))
    | beam.GroupByKey()            # (sensor, [speed])
    | beam.Map(avg_speed)          # (sensor, avgspeed)
    | beam.Map(lambda tup: '%s: %d' % tup))
traffic | …

Dataflow does ingest, transform, and load; consider using it instead of Spark
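The streaming version of the pipeline differs from the batch version mainly in the `beam.WindowInto(window.FixedWindows(15, 0))` step, which buckets elements by time before grouping. The sketch below imitates that bucketing in plain Python, assuming the window size and offset are in seconds. The names `fixed_window` and `window_averages` and the event tuples are hypothetical.

```python
def fixed_window(timestamp, size=15, offset=0):
    """Return the [start, end) fixed window that contains `timestamp`."""
    start = timestamp - (timestamp - offset) % size
    return start, start + size


def window_averages(events, size=15):
    """Average speed per (window, sensor) for (timestamp, sensor, speed) events."""
    grouped = {}
    for timestamp, sensor, speed in events:
        key = (fixed_window(timestamp, size), sensor)
        grouped.setdefault(key, []).append(speed)
    return {key: sum(speeds) / len(speeds) for key, speeds in grouped.items()}


# Made-up events: (seconds since start, sensor id, speed).
events = [(1, "s1", 60.0), (14, "s1", 70.0), (15, "s1", 90.0), (29, "s2", 40.0)]
result = window_averages(events)
print(result)
```

Grouping within a window is what lets the same averaging logic produce results on an unbounded stream: each window closes and emits its averages instead of waiting for the input to end.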

Agenda

Message-oriented architectures

Serverless data pipelines

GCP Reference Architecture

Choosing where to store data on GCP

unstructured: Cloud Storage
structured, transactional workload, SQL, one database enough: Cloud SQL
structured, transactional workload, SQL, horizontal scalability: Cloud Spanner
structured, transactional workload, No-SQL: Cloud Datastore
structured, data analytics workload, millisecond latency: Cloud Bigtable
structured, data analytics workload, latency in seconds: BigQuery
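The storage decision chart can be written as a small function. This encodes one reading of the chart's branches; the function name `choose_storage` and its parameter names are hypothetical, and real choices usually weigh more factors (cost, consistency, data volume) than the chart shows.

```python
def choose_storage(structured, workload="transactional", needs_sql=True,
                   needs_horizontal_scale=False, needs_millisecond_latency=False):
    """Walk the decision chart and return the suggested GCP storage product."""
    if not structured:
        return "Cloud Storage"
    if workload == "transactional":
        if not needs_sql:
            return "Cloud Datastore"
        return "Cloud Spanner" if needs_horizontal_scale else "Cloud SQL"
    if workload == "analytics":
        return "Cloud Bigtable" if needs_millisecond_latency else "BigQuery"
    raise ValueError("workload must be 'transactional' or 'analytics'")


choice = choose_storage(structured=True, workload="analytics")
print(choice)
```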

Run Spark/Hadoop jobs on Cloud Dataproc

Figure labels:
API Client
Cloud Dataproc
Applications on cluster
Direct access
Input and output connectors
Input and Output Data Sources: Cloud Storage, Cloud Bigtable, BigQuery

On GCP, you can have the same data processing pipeline for processing both batch and stream

Figure labels:
Stream: Events, metrics, and so on -> Cloud Pub/Sub
Batch: Raw logs, files, assets, Google Analytics data, and so on -> Cloud Storage
Cloud Dataflow
BigQuery, Cloud Bigtable
Cloud Datalab, Cloud ML Engine, Data Studio (Dashboards/BI)
Co-workers
Applications and Reports

Module review

Match the use case on the left with the product on the right

Use cases:
A. Decoupling producers and consumers of data in large organizations and complex systems
B. Scalable, fault-tolerant multi-step processing of data

Products:
1. Cloud Dataflow
2. Cloud Pub/Sub

Resources (1 of 2)

Cloud Pub/Sub
Cloud Dataflow
Processing media using Cloud Pub/Sub and Compute Engine: …dia-processing-pub-sub-compute-engine

Resources (2 of 2)

Reverse Geocoding of Geolocation Telemetry in the Cloud Using the Maps API: …geocoding-geolocation-telemetry-cloud-maps-ap
Using Cloud Pub/Sub for Long-running Tasks: …ing-cloud-pub-sub-long-running-tasks

Cloud OnBoard

Summary

Version #1.1

An Evolving Cloud

1st Wave
Your kit, someone else’s building.
Yours to manage.

2nd Wave
Standard virtual kit, for rent.
Still yours to manage.

3rd Wave
Invest your energy in great apps

Google Cloud provides a way to take advantage of Google’s investments in infrastructure and data processing innovation

Timeline (2002 to 2018) of products: Cloud Storage, Datastore, Pub/Sub, Cloud Spanner, Dataproc, Bigtable, BigQuery, Dataflow, ML Engine, AutoML

Typical Big Data Processing

Programming
Resource provisioning
Handling growing scale
Reliability
Deployment & configuration
Utilization improvements
Performance tuning
Monitoring

Big Data with Google: Focus on insight, not infrastructure.

Programming

In summary, GCP offers you ways to...

Spend less on ops and administration
We’ve “automated out” the complexity of building and maintaining data and analytics systems.

Incorporate real-time data into apps and architectures
To get the most out of data and secure competitive advantage.

Apply machine learning broadly and easily
We make it simple and practical to incorporate machine learning models within custom applications.

Create citizen data scientists
Transform your organization into a truly data driven company. Putting tools into hands of domain experts.

Next Steps on your Google Cloud learning journey

1. Today: Google Cloud Platform Fundamentals: Big Data and Machine Learning
2. Tomorrow: Complete hands-on labs: Baseline: Data, ML, AI quest
3. Future: Find more training online

Complete 10 hands-on labs free on Qwiklabs by 30 April, and receive $200 in GCP credits

[Only for Cloud OnBoard Attendees]

1. Receive a follow up email after event
2. Create Qwiklabs account with the email you used to register for Cloud OnBoard
3. Open your email and confirm account
4. Return to Qwiklabs and log in
5. Enroll in the Baseline: Data, ML, AI quest and take your first lab!
6. Complete all 10 labs and we will send you an email after 30 April with instructions to redeem the $200 credits. Make sure you opt-in to receive emails from Qwiklabs.

To help you get started

Activate your voucher now for a free course worth $99!

1. Go to …
2. Activate voucher and sign up for a free account
3. Enroll in Serverless Data Analysis with Google BigQuery and Cloud Dataflow for Free - Limited period offer!

Explore other Courses at …

Make Google Cloud certification your goal!

Find study guides, tips, practice exams, and testing sites

Google Cloud Startup Program

$3,000 in credits

Google Cloud is a perfect fit for launching and scaling your early-stage startup.

A special offer for Cloud Onboard Singapore attendees:
Visit … before May 18th to enroll, and eligible startups* receive $3,000 in Google Cloud and Firebase credits.

What’s an eligible startup?
• Raised no more than a Series A
• Less than 5 years old
• Are located in our approved countries
• Have not participated in the Google Cloud Startup program before

Be part of the GCP User Group SG Community!

Connect: Network, share, learn - all about Google Cloud
Learn: Learn from leads, users, and tech experts
Access: Gain access to the Google Cloud team and the latest capabilities

Resources

Big data and machine learning blog
Google Cloud Platform blog
Google Cloud Platform curated articles

