5 min read·Lesson 9 of 10

Operations Suite: Monitoring and Observability

Learn how Google Cloud's operations suite — Cloud Monitoring, Cloud Logging, Cloud Trace, Error Reporting, and Cloud Profiler — provides full-stack observability.

The Google Cloud Operations Suite (formerly Stackdriver) provides integrated monitoring, logging, tracing, and diagnostics for applications running on GCP, other clouds, or on-premises. A healthy production environment requires visibility across all four signals: metrics, logs, traces, and errors.

Cloud Monitoring

Cloud Monitoring collects and visualises metrics from GCP resources, custom applications, and third-party systems.

Key capabilities:

  • Metrics Explorer: Query and chart any metric in real time
  • Dashboards: Custom dashboards combining metrics from multiple resources
  • Uptime checks: Synthetic monitoring — verify that URLs, TCP ports, or services respond correctly from multiple global locations
  • Alerting policies: Define conditions (threshold, rate of change, absence) and notification channels (email, PagerDuty, Slack, Pub/Sub)
  • Service Monitoring: SLO/SLI tracking for GKE, Cloud Run, and App Engine services
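The SLO/SLI tracking mentioned above comes down to simple arithmetic on good and bad request counts. The following is a minimal sketch of that arithmetic for a request-based SLO. The function name `error_budget` and the sample numbers are illustrative assumptions, not part of any Google Cloud API.

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> dict:
    """Summarise error-budget consumption for a request-based SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests must be good).
    This is a hypothetical helper for illustration, not a Cloud Monitoring call.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= bad_requests <= total_requests:
        raise ValueError("bad_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that were good in the window.
    sli = (total_requests - bad_requests) / total_requests
    # Error budget: how many bad requests the SLO tolerates in the window.
    allowed_bad = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "allowed_bad_requests": allowed_bad,
        "budget_consumed": bad_requests / allowed_bad,
        "slo_met": sli >= slo_target,
    }


# Made-up numbers: a 99.9% SLO over one million requests with 250 failures.
report = error_budget(0.999, 1_000_000, 250)
print(report)
```

With these sample numbers the SLO tolerates about 1,000 bad requests, so 250 failures use roughly a quarter of the budget. An alerting policy could then fire when the consumed fraction grows too quickly.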

Cloud Logging

Cloud Logging ingests, stores, and analyses log entries from GCP services and custom applications. Logs are automatically collected from most GCP services — no agent needed for managed services.
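On managed runtimes such as Cloud Run, a common way to emit richer log entries without an agent is to write one JSON object per line to standard output. The sketch below assumes the runtime parses such lines and treats the `severity` and `message` keys specially. The helper names `format_log_entry` and `log`, and the sample fields, are hypothetical.

```python
import json
import sys


def format_log_entry(severity: str, message: str, **fields) -> str:
    """Return one JSON line carrying a severity, a message, and custom fields."""
    entry = {"severity": severity, "message": message, **fields}
    return json.dumps(entry)


def log(severity: str, message: str, **fields) -> None:
    """Write a structured entry to stdout, where the platform can collect it."""
    print(format_log_entry(severity, message, **fields), file=sys.stdout, flush=True)


# Illustrative values only: the extra keys become queryable custom fields.
log("ERROR", "payment failed", order_id="A-1001", http_status=500)
```

Keeping custom data in separate keys, rather than interpolating it into the message string, is what makes later filtering on those fields practical.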

Key capabilities:

  • Logs Explorer: Query logs using the Logging query language, filtering by resource, severity, time range, and custom fields
  • Log-based metrics: Create metrics from log patterns (e.g., count of 500 errors)
  • Log sinks: Export logs to Cloud Storage (archival), BigQuery (analysis), or Pub/Sub (real-time processing)
  • Log buckets: Control retention (default 30 days for most logs, configurable)
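To make the query capability above concrete, here is a small sketch that assembles a filter string in the Logging query language from a resource type, a minimum severity, a start time, and optional custom fields. The builder function `build_log_filter` and the sample values are assumptions for illustration. The same kind of filter could plausibly back a log-based metric or a sink.

```python
# Severity levels recognised by Cloud Logging, lowest to highest.
SEVERITIES = (
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
)


def build_log_filter(resource_type, min_severity, since_rfc3339, extra=None):
    """Join comparison clauses with AND to form a query filter.

    Hypothetical helper: values are inserted as-is, so callers must not
    pass strings that contain double quotes.
    """
    if min_severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {min_severity}")
    clauses = [
        f'resource.type="{resource_type}"',
        f"severity>={min_severity}",
        f'timestamp>="{since_rfc3339}"',
    ]
    for field, value in (extra or {}).items():
        clauses.append(f'{field}="{value}"')
    return " AND ".join(clauses)


# Illustrative values: errors from Cloud Run revisions for one order.
query = build_log_filter(
    "cloud_run_revision",
    "ERROR",
    "2024-01-01T00:00:00Z",
    {"jsonPayload.order_id": "A-1001"},
)
print(query)
```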

Cloud Trace

Cloud Trace is a distributed tracing system that tracks request latency across microservices. When a user request flows through multiple services, Trace shows the full call chain and identifies slow segments.

  • Automatically integrated with App Engine, Cloud Run, and Cloud Functions
  • Instrumented with OpenTelemetry or the Trace client libraries for custom services

Error Reporting

Error Reporting automatically groups and aggregates application exceptions from logs, counts occurrences, and alerts when new errors appear. It supports Node.js, Python, Go, Java, Ruby, .NET, and PHP.
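The grouping idea can be illustrated locally. The sketch below buckets exceptions by their type and the function that raised them, then counts occurrences per bucket. This is a deliberately simplified stand-in, not Error Reporting's actual grouping algorithm, and every function name in it is hypothetical.

```python
import traceback
from collections import Counter


def group_key(exc: BaseException) -> str:
    """Build a grouping key from the exception type and the raising function."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = frames[-1].name if frames else "<unknown>"
    return f"{type(exc).__name__} in {where}"


def count_groups(exceptions) -> Counter:
    """Count how many exceptions fall into each group."""
    return Counter(group_key(exc) for exc in exceptions)


# Two small functions that fail in different ways (illustrative only).
def parse_price(text):
    return float(text)


def lookup(user_id):
    return {"alice": 1}[user_id]


def capture(func, arg):
    """Call func(arg) and return the exception it raises, if any."""
    try:
        func(arg)
    except Exception as exc:
        return exc
    return None


errors = [capture(parse_price, "abc"), capture(parse_price, "n/a"), capture(lookup, "bob")]
groups = count_groups(exc for exc in errors if exc is not None)
print(groups)
```

The two `parse_price` failures carry different messages but land in the same group, which is the behaviour that keeps a noisy error from appearing as many separate issues.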

Cloud Profiler

Cloud Profiler continuously analyses CPU and memory usage of your application in production with minimal overhead. It identifies which functions consume the most resources, enabling targeted optimisation.

Cloud Audit Logs

Cloud Audit Logs record who did what, where, and when across GCP resources. Every admin action generates an audit log entry. There are four types:

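| Log Type       | What It Captures                                                                | Enabled By Default   |
|----------------|---------------------------------------------------------------------------------|----------------------|
| Admin Activity | API calls that modify configuration (create, delete, modify)                    | Yes                  |
| Data Access    | API calls that read or write data (e.g., reading a Cloud SQL row)               | No (except BigQuery) |
| System Event   | Automated GCP system actions (e.g., live migration)                             | Yes                  |
| Policy Denied  | Requests denied for violating a security policy (e.g., VPC Service Controls)    | Yes                  |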

Audit logs are essential for security investigation, compliance (PCI DSS, HIPAA), and forensics. Admin Activity logs cannot be disabled.
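During an investigation, each audit log type is typically queried by its log name. The sketch below builds such a filter for a given project and type, optionally narrowed to one API method. The helper `audit_log_filter`, the project ID `my-project`, and the example method name are illustrative assumptions.

```python
# URL-encoded log IDs used by Cloud Audit Logs for each log type.
AUDIT_LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com%2Factivity",
    "data_access": "cloudaudit.googleapis.com%2Fdata_access",
    "system_event": "cloudaudit.googleapis.com%2Fsystem_event",
    "policy_denied": "cloudaudit.googleapis.com%2Fpolicy",
}


def audit_log_filter(project_id: str, log_type: str, method_name: str = None) -> str:
    """Return a Logging query filter selecting one audit log type.

    Hypothetical helper: it only builds the filter string and does not
    call any Google Cloud API.
    """
    if log_type not in AUDIT_LOG_IDS:
        raise ValueError(f"unknown audit log type: {log_type}")
    clauses = [f'logName="projects/{project_id}/logs/{AUDIT_LOG_IDS[log_type]}"']
    if method_name:
        # protoPayload holds the audit record; methodName is the API method called.
        clauses.append(f'protoPayload.methodName="{method_name}"')
    return " AND ".join(clauses)


# Illustrative: who deleted Compute Engine instances in a placeholder project?
admin_filter = audit_log_filter("my-project", "admin_activity", "v1.compute.instances.delete")
print(admin_filter)
```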

Key Takeaways

  • Cloud Monitoring collects metrics and hosts dashboards, uptime checks, and alerting policies.
  • Cloud Logging ingests logs from GCP services, Kubernetes, VMs, and custom applications.
  • Cloud Trace provides distributed tracing to identify latency bottlenecks across microservices.
  • Error Reporting automatically groups and surfaces application exceptions.
  • Cloud Audit Logs provide an immutable record of admin activity by default, and of data access activity where Data Access logs are enabled.
