From 78e6d3eda7dcfd8047417a592dd658a1a65108a9 Mon Sep 17 00:00:00 2001
From: giraffekey
Date: Mon, 19 Aug 2024 07:25:28 -0700
Subject: [PATCH 1/2] Add detailed guide for telemetry

---
 docs/telemetry.md | 196 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 163 insertions(+), 33 deletions(-)

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 2c4cd0fc9..6620e6b76 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -1,17 +1,17 @@
---
-title: GraphQL Telemetry
+title: Using the @telemetry Directive
description: "Learn how to configure observability support using OpenTelemetry for insights into logs, metrics, and traces. Discover practical integration examples for platforms like Honeycomb.io, New Relic, and Datadog."
slug: graphql-telemetry-guide
sidebar_label: Telemetry
---

-This guide will walk you through observability support in Tailcall i.e. how to collect and analyze telemetry data with different observability backends. In this guide you'll learn:
+Observability is a critical aspect of maintaining and optimizing modern applications. In this guide, we will explore how to enable and configure observability in Tailcall, focusing on the collection and analysis of telemetry data using different observability backends. By the end of this guide, you will learn how to:

-- How to enable generation of telemetry data in Tailcall?
-- How to update config to forward telemetry data to your chosen observability platforms?
-- See some examples of integration with existing observability tools?
+- Enable telemetry data generation in Tailcall.
+- Configure Tailcall to forward telemetry data to various observability platforms.
+- Integrate Tailcall with popular observability tools using real-world examples.

-Let's get started!
+Let’s get started!

## What is Observability

Observability is essential for maintaining the health and performance of your application. It relies on telemetry data such as:

- **Metrics** are numerical data that measure different aspects of your system's performance, such as request rates or memory usage.
- **Traces** show the journey of requests through your system, highlighting how different parts of your application interact and perform.

-Tailcall provides observability support by integrating OpenTelemetry specification into it with help of provided SDKs and data formats.
+### OpenTelemetry

-[OpenTelemetry](https://opentelemetry.io) is a toolkit for collecting telemetry data in a consistent manner across different languages and platforms. It frees you from being locked into a single observability platform, allowing you to send your data to different tools for analysis, such as New Relic or Honeycomb.
+Tailcall integrates with the [OpenTelemetry](https://opentelemetry.io) specification, a standardized toolkit for collecting telemetry data across different platforms and languages. OpenTelemetry allows you to avoid being locked into a single observability platform, enabling you to send your data to various tools for analysis, such as New Relic or Honeycomb.

## Comparison with Apollo Studio

-While [Apollo studio](./apollo-studio.md) telemetry also provides analytics tools for your schema but when choosing between it and OpenTelemetry integration consider next points:
+While [Apollo Studio](./apollo-studio.md) offers telemetry and analytics tools specifically for GraphQL schemas, there are key differences when compared to OpenTelemetry integration in Tailcall:

-- OpenTelemetry is more generalized observability framework that could be used for cross-service analytics while Apollo Studio can provide insights related purely to graphQL
-- OpenTelemetry is vendor-agnostic and therefore you could actually use different observability platforms depending on your needs and don't rely on single tool like Apollo Studio
-- OpenTelemetry integration in Tailcall can provide more analytical data that is out of scope of graphQL analytics provided by Apollo Studio
+- **Generalized vs. Specific Insights**: OpenTelemetry provides generalized observability capabilities that can be applied across multiple services, whereas Apollo Studio focuses solely on GraphQL insights.
+- **Vendor-Agnostic Flexibility**: OpenTelemetry is vendor-agnostic, allowing you to choose different observability platforms as needed, unlike Apollo Studio, which is tied to its own ecosystem.
+- **Broader Analytical Scope**: Tailcall’s integration with OpenTelemetry allows for more comprehensive analytical data, extending beyond the scope of GraphQL-specific metrics provided by Apollo Studio.

## Prerequisites

@@ -73,11 +75,11 @@ We will update that config with telemetry integration in following sections.

## GraphQL Configuration for Telemetry

-By default, telemetry data is not generated by Tailcall since it requires some setup to know where to send this data and also that affects performance of server that could be undesirable in some cases.
+By default, telemetry data is not generated by Tailcall, since it requires setup to know where to send the data. Generating it also adds some overhead to the server, which may be undesirable in some cases.

-Telemetry configuration is provided by [`@telemetry`](/docs/directives.md#telemetry-directive) directive to setup how and where the telemetry data is send.
+Telemetry configuration is provided by the [`@telemetry`](/docs/directives.md#telemetry-directive) directive, which defines how and where the telemetry data is sent.

-To enable it we can update our config with something like config below:
+To enable it, we can update our config with something like the config below:

```graphql
schema
  @telemetry(
    export: {otlp: {url: "http://your-otlp-compatible-backend.com"}}
  ) {
  query: Query
}
```

-Here, `export` specifies the format of generated data and endpoint to which to send that data. Continue reading to know more about different options for it.
+In this configuration:
+
+- The `export` option specifies the format and endpoint where the telemetry data will be sent.
+- Replace `http://your-otlp-compatible-backend.com` with the URL of your observability platform that supports OTLP.
+
+## Exporting Telemetry Data

### Export to OTLP

-[OTLP](https://opentelemetry.io/docs/specs/otlp/) is a vendor agnostic protocol that is supported by growing [number of observability backends](https://opentelemetry.io/ecosystem/vendors/).
+#### What is OTLP?
+
+[OTLP (OpenTelemetry Protocol)](https://opentelemetry.io/docs/specs/otlp/) is a vendor-agnostic standard for exporting telemetry data. It is widely supported by a growing [number of observability backends](https://opentelemetry.io/ecosystem/vendors/).

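As a concrete sketch, the OTLP exporter can point directly at a hosted backend's ingest endpoint. In the example below, the endpoint URL and the `x-api-key` header name are placeholders, and the `headers` list and `{{.env.*}}` templating are assumptions to verify against the [`@telemetry` directive reference](/docs/directives.md#telemetry-directive) for your Tailcall version:

```graphql
schema
  @telemetry(
    export: {
      otlp: {
        # Placeholder ingest URL; most hosted OTLP backends publish their own endpoint.
        url: "https://otlp.your-observability-vendor.example:4317"
        # Hypothetical auth header; the exact header name depends on the vendor.
        headers: [{key: "x-api-key", value: "{{.env.OTLP_API_KEY}}"}]
      }
    }
  ) {
  query: Query
}
```

Reading the key from an environment variable keeps credentials out of the committed configuration.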
+ +#### Using OpenTelemetry Collector + +[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) is a robust solution for receiving, processing, and exporting telemetry data in OTLP format. While Tailcall can directly send data to OTLP-compatible platforms, using the OpenTelemetry Collector provides additional benefits: + +- **Scalability**: The Collector is designed to handle high loads and complex setups. +- **Flexibility**: It can export data in multiple formats, such as Jaeger or Datadog, and is well-suited for large-scale environments. -#### OpenTelemetry Collector +**Configuration Example**: +```yaml +receivers: + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 -[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) is a vendor-agnostic way to receive, process and export telemetry data in OTLP format. +exporters: + otlp: + endpoint: otelcol:4317 -Although, tailcall can send the data directly to the backends that supports OTLP format using Otel Collector could be valuable choice since it's more robust solution well-suited for a high-scale, more flexible settings and ability to export in different formats other than OTLP. +service: + pipelines: + traces: + receivers: [otlp] + processors: [batch] + exporters: [otlp] +``` + +### Export to Prometheus + +Prometheus is a popular open-source monitoring solution focused on metrics. It is well-suited for applications that require detailed metrics monitoring. + +To export metrics to Prometheus, Tailcall needs to expose metrics in a format that Prometheus can scrape. This is typically done by adding a special route to the GraphQL server: + +```graphql +schema + @telemetry( + export: { + prometheus: {path: "/metrics"} + } + ) { + query: Query +} +``` -In summary, if you're gonna to use OTLP compatible platform or [prometheus](#export-to-prometheus) and your load is not that massive you could send the data directly to platforms. From the other side, if you need to export to different formats (like Jaeger or Datadog) or your application involves high load consider using Otel Collector as an export target. +### Export to stdout -### Export to prometheus +Tailcall can also output telemetry data to stdout, which is ideal for testing or local development environments. -[Prometheus](https://prometheus.io) is a metric monitoring solution. Please note that prometheus works purely with metrics and other telemetry data like traces and logs won't be sent to it. +```graphql +schema @telemetry( + export: { + stdout: { + pretty: true + } + } + ) { + query: Query +} +``` -Prometheus integration works by adding a special route for the GraphQL server's router that outputs generated metrics in prometheus format consumable by prometheus scraper. +## Using Telemetry Data -## Data generated +### Data Generated -You can find a reference of type of info generated by Tailcall in the [`@telemetry` reference](/docs/directives.md#telemetry-directive) or consult examples in the next section, in order to gain some understanding. +Tailcall generates various types of telemetry data, including: -### Relation with other services +- **Metrics**: Such as request counts, error rates, and latencies. +- **Traces**: Showing the flow of requests across different services. +- **Logs**: Detailed event logs for specific actions or errors. 
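To actually see these metrics, something has to collect them. If you enabled the Prometheus exporter described earlier, point a scraper at the configured `/metrics` path. The following `prometheus.yml` is only a minimal sketch: the `localhost:8000` target and the 15-second interval are assumptions to adjust for your own deployment.

```yaml
# prometheus.yml (sketch): scrape the metrics route exposed by Tailcall.
scrape_configs:
  - job_name: "tailcall"
    # Must match the `path` configured in the @telemetry prometheus exporter.
    metrics_path: "/metrics"
    scrape_interval: 15s
    static_configs:
      # Assumed host and port of the Tailcall server; adjust to your deployment.
      - targets: ["localhost:8000"]
```

Once Prometheus is scraping, the request counts, error rates, and latencies listed above can be graphed and alerted on from Prometheus or any dashboard that reads from it.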
-Tailcall fully supports [Context Propagation](https://opentelemetry.io/docs/concepts/context-propagation/) functionality and therefore you can analyze distributed traces across all of your services that are provides telemetry data. +### Context Propagation and Distributed Tracing -That may look like this: +Tailcall fully supports [context propagation](https://opentelemetry.io/docs/concepts/context-propagation/) functionality. Context propagation allows you to track distributed traces across multiple services, providing a comprehensive view of how a request flows through your system. + +Here's an example of using context propagation with Honeycomb: ![honeycomb-propagation](../static/images/telemetry/honeycomb-propagation.png) -Where Tailcall is a part of whole distributed trace +### Customization -### Customize generated data +In some cases you may want to customize the data that was added to telemetry payload in order to have more control over the analyzing process. Tailcall allows you to customize metrics by using properties like [`requestHeaders`](/docs/directives.md#requestheaders), which can be used to segment data by specific headers. -In some cases you may want to customize the data that were added to telemetry payload to have more control over analyzing process. Tailcall supports that customization for specific use cases described below. For eg. the metric [`http.server.request.count`](/docs/directives.md#metrics) can be customized with the [`requestHeaders`](/docs/directives.md#requestheaders) property to allow splitting the overall count by specific headers. +**Example**: +```graphql +schema @telemetry( + requestHeaders: ["X-User-Id"] + ) { + query: Query +} +``` :::important -The value of specified headers will be sent to telemetry backend as is, so use it with care to prevent of leaking any sensitive data to third-party services you don't have control over. +The value of specified headers will be sent to the telemetry backend as is. Be cautious when including sensitive information in telemetry data to avoid unintentional data leaks. ::: + +## Troubleshooting and Debugging Telemetry Issues + +Despite the robust telemetry capabilities in Tailcall, you may encounter issues that require troubleshooting and debugging. This section provides guidance on how to diagnose and resolve common telemetry problems. + +### Common Telemetry Issues + +1. **No Telemetry Data is Being Collected** + - **Check Configuration**: Ensure that the `@telemetry` directive is correctly configured in your GraphQL schema. Verify that the export endpoints are correctly specified and reachable. + - **Network Connectivity**: Confirm that there is network connectivity between your Tailcall server and the observability backend. Check firewall rules, DNS settings, and endpoint URLs. + - **Telemetry Data Volume**: If the telemetry data volume is too low, you may not see data immediately. Generate additional traffic to the application to verify data collection. + +2. **Incomplete or Missing Traces** + - **Context Propagation Issues**: Verify that context propagation is correctly configured. In distributed systems, missing traces often result from improper context propagation across services. + - **Instrumentation Gaps**: Ensure all necessary services and components are properly instrumented with OpenTelemetry. Missing or incomplete instrumentation can lead to gaps in tracing data. + +3. 
**High Latency or Performance Degradation** + - **Resource Overhead**: Telemetry collection can introduce overhead, especially if you are exporting a large volume of data. Consider optimizing the frequency of data collection, reducing the amount of data exported, or using more efficient telemetry backends. + - **Collector Bottlenecks**: If using OpenTelemetry Collector, monitor its performance to ensure it is not a bottleneck. Adjust configuration settings or scale the Collector to handle larger volumes of telemetry data. + +4. **Incorrect Metrics or Logs** + - **Validation of Telemetry Data**: Compare the telemetry data against expected values. Discrepancies could be due to misconfiguration of metric instruments or log formats. + - **Export Configuration**: Ensure that the export configuration, such as Prometheus scraping paths or OTLP endpoints, matches the telemetry backend's requirements. Misconfigured paths or formats can lead to incorrect data being collected. + +### Debugging Steps + +1. **Enable Debug Logging** + - OpenTelemetry has a debug logging mode that can be enabled to provide more detailed output. Use these logs to identify where issues are occurring in the telemetry pipeline. + +2. **Use Local Development Tools** + - For quick testing, configure Tailcall to export telemetry data to `stdout`. + +3. **Validate Export Endpoints** + - Test the telemetry export endpoints directly (e.g., using `curl` for HTTP endpoints) to ensure they are reachable and correctly configured. This helps rule out network issues or misconfigured endpoints. + +4. **Check Compatibility with Observability Tools** + - Ensure that the observability tool you are using is fully compatible with the telemetry format being exported. Refer to the observability tool's documentation to confirm compatibility with OTLP, Prometheus, or other formats you are using. + +5. **Monitor Resource Usage** + - Use system monitoring tools to observe resource usage by the OpenTelemetry Collector. High CPU, memory, or network usage could indicate inefficiencies in your telemetry setup that need to be addressed. + +### Advanced Debugging + +If the above steps do not resolve the issue, consider using more advanced debugging techniques: + +- **Trace Logs**: Enable trace-level logging in OpenTelemetry to get detailed logs of each step in the telemetry data flow. +- **Distributed Tracing**: Use distributed tracing to follow the telemetry data as it passes through different services and components in your architecture. This can help identify where data is lost or delayed. +- **Profiling**: Use performance profiling tools to identify bottlenecks in your telemetry pipeline, particularly if you suspect high overhead or resource contention. + +By following these troubleshooting and debugging strategies, you can resolve common telemetry issues in Tailcall and ensure that your observability setup functions optimally. + +## Conclusion + +In this guide, we have covered the essentials of enabling and configuring observability in Tailcall. You should now be able to: + +- Generate telemetry data and forward it to your preferred observability platforms. +- Customize telemetry configurations to meet specific needs. +- Troubleshoot common issues related to telemetry in Tailcall. + +As a next step, we encourage you to experiment with these features in your own projects and explore further integration with advanced observability tools. Happy monitoring! 
From eb1ebf203dd0b0a7b16ed4a1c5911193d353c31e Mon Sep 17 00:00:00 2001 From: giraffekey Date: Tue, 27 Aug 2024 11:16:44 -0700 Subject: [PATCH 2/2] Format telemetry guide --- docs/telemetry.md | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/docs/telemetry.md b/docs/telemetry.md index 6620e6b76..3a961656c 100644 --- a/docs/telemetry.md +++ b/docs/telemetry.md @@ -94,7 +94,7 @@ schema In this configuration: -- The `export` option specifies the format and endpoint where the telemetry data will be sent. +- The `export` option specifies the format and endpoint where the telemetry data will be sent. - Replace `http://your-otlp-compatible-backend.com` with the URL of your observability platform that supports OTLP. ## Exporting Telemetry Data @@ -113,6 +113,7 @@ In this configuration: - **Flexibility**: It can export data in multiple formats, such as Jaeger or Datadog, and is well-suited for large-scale environments. **Configuration Example**: + ```yaml receivers: otlp: @@ -142,11 +143,7 @@ To export metrics to Prometheus, Tailcall needs to expose metrics in a format th ```graphql schema - @telemetry( - export: { - prometheus: {path: "/metrics"} - } - ) { + @telemetry(export: {prometheus: {path: "/metrics"}}) { query: Query } ``` @@ -156,13 +153,7 @@ schema Tailcall can also output telemetry data to stdout, which is ideal for testing or local development environments. ```graphql -schema @telemetry( - export: { - stdout: { - pretty: true - } - } - ) { +schema @telemetry(export: {stdout: {pretty: true}}) { query: Query } ``` @@ -190,10 +181,9 @@ Here's an example of using context propagation with Honeycomb: In some cases you may want to customize the data that was added to telemetry payload in order to have more control over the analyzing process. Tailcall allows you to customize metrics by using properties like [`requestHeaders`](/docs/directives.md#requestheaders), which can be used to segment data by specific headers. **Example**: + ```graphql -schema @telemetry( - requestHeaders: ["X-User-Id"] - ) { +schema @telemetry(requestHeaders: ["X-User-Id"]) { query: Query } ``` @@ -209,15 +199,18 @@ Despite the robust telemetry capabilities in Tailcall, you may encounter issues ### Common Telemetry Issues 1. **No Telemetry Data is Being Collected** + - **Check Configuration**: Ensure that the `@telemetry` directive is correctly configured in your GraphQL schema. Verify that the export endpoints are correctly specified and reachable. - **Network Connectivity**: Confirm that there is network connectivity between your Tailcall server and the observability backend. Check firewall rules, DNS settings, and endpoint URLs. - **Telemetry Data Volume**: If the telemetry data volume is too low, you may not see data immediately. Generate additional traffic to the application to verify data collection. 2. **Incomplete or Missing Traces** + - **Context Propagation Issues**: Verify that context propagation is correctly configured. In distributed systems, missing traces often result from improper context propagation across services. - **Instrumentation Gaps**: Ensure all necessary services and components are properly instrumented with OpenTelemetry. Missing or incomplete instrumentation can lead to gaps in tracing data. 3. **High Latency or Performance Degradation** + - **Resource Overhead**: Telemetry collection can introduce overhead, especially if you are exporting a large volume of data. 
Consider optimizing the frequency of data collection, reducing the amount of data exported, or using more efficient telemetry backends. - **Collector Bottlenecks**: If using OpenTelemetry Collector, monitor its performance to ensure it is not a bottleneck. Adjust configuration settings or scale the Collector to handle larger volumes of telemetry data. @@ -228,15 +221,19 @@ Despite the robust telemetry capabilities in Tailcall, you may encounter issues ### Debugging Steps 1. **Enable Debug Logging** + - OpenTelemetry has a debug logging mode that can be enabled to provide more detailed output. Use these logs to identify where issues are occurring in the telemetry pipeline. 2. **Use Local Development Tools** + - For quick testing, configure Tailcall to export telemetry data to `stdout`. 3. **Validate Export Endpoints** + - Test the telemetry export endpoints directly (e.g., using `curl` for HTTP endpoints) to ensure they are reachable and correctly configured. This helps rule out network issues or misconfigured endpoints. 4. **Check Compatibility with Observability Tools** + - Ensure that the observability tool you are using is fully compatible with the telemetry format being exported. Refer to the observability tool's documentation to confirm compatibility with OTLP, Prometheus, or other formats you are using. 5. **Monitor Resource Usage**