This procedure is part of a lab that teaches you how to monitor your Kubernetes cluster with Pixie.
Each procedure in the lab builds upon the last, so make sure you've completed the last procedure, Instrument your cluster, before starting this one.
Until now, you've been working with application services that don't have bugs (we hope). You've been able to access TinyHat.me and render Bob Ross with one or more silly hats without a hitch. But it's time to deploy some new code to your cluster.
Change to the
scenario-1 directory and set up your environment:
$cd ../scenario-1$./setup.shPlease wait while we update your lab environment.deployment.apps/fetch-service configureddeployment.apps/simulator configuredDone!
If you're a windows user, run the PowerShell setup script:
Uh oh! You look on social media and see some confused customers:
What's wrong with TinyHat.me? Use Pixie to find out.
Reproduce the issue
You've been notified by your users that they can't see a particular hat on TinyHat.me. Before you start debugging your code, reproduce the issue for yourself.
Look up your frontend's external IP address:
$kubectl get servicesNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGEadd-service ClusterIP 10.109.114.34 <none> 80/TCP 20madmin-service ClusterIP 10.110.29.145 <none> 80/TCP 20mfetch-service ClusterIP 10.104.224.242 <none> 80/TCP 20mfrontend-service LoadBalancer 10.102.82.89 10.102.82.89 80:32161/TCP 20mgateway-service LoadBalancer 10.101.237.225 10.101.237.225 80:32469/TCP 20mkubernetes ClusterIP 10.96.0.1 <none> 443/TCP 20mmanipulation-service ClusterIP 10.107.23.237 <none> 80/TCP 20mmoderate-service ClusterIP 10.105.207.153 <none> 80/TCP 20mmysql ClusterIP 10.97.194.23 <none> 3306/TCP 20mupload-service ClusterIP 10.108.113.235 <none> 80/TCP 20m
Paste the IP in your browser:
This looks the same as it did before, except that there's a new hat style, called PIXIE.
Select a number of hats, the PIXIE style, and Hat me:
Oops! Where did Bob go? Instead of rendering Bob Ross with your hat selection, the frontend served no image at all.
The number of hats you choose to display has no effect on the result.
Observe a network request in your browser's developer tools:
Here, the response from the image request says, "This hat style does not exist! If you want this style - try submitting it."
You know that the PIXIE hat style most certainly does exist because you chose it from the selector. But for some reason, the application can't render the image.
For good measure, try to render a different hat style:
It worked! Your users were right. There's something wrong with your application.
Solve the mystery with Pixie
The bad news is that you've confirmed there's an error in your application. The good news is that you recently instrumented your cluster with Pixie! Go to New Relic and sign into your account, if you haven't already.
From the New Relic homepage, go to Kubernetes:
Choose your tiny-hat cluster:
Then click Live Debugging with Pixie:
This is Pixie's live, code-level debugger:
You use it to drill down and learn more about the services in your cluster.
When you go to the live debugger, you may see an error saying your cluster is disconnected. This is normal, as it takes a little while for New Relic to start seeing your Pixie data. Wait a few more minutes and refresh the debugger.
Notice the script dropdown menu at the top of the debugger:
Pixie's live debugger renders data based on open source scripts written in PxL, Pixie's proprietary query language. The default script is
px/cluster, which shows cluster-level information including:
- A service map
Scroll down to see the error rates for your services:
Yikes! You have three services returning a high percentage of errors:
You know that the website lives at
frontend-service. You can reasonably rule this out as the culprit because you know it renders other hats just fine. That leaves two potential problem services:
To decide which service to look at first, scroll up to NAMESPACES and choose
Your app services live in the default namespace. This helps you filter the service map to show more useful nodes.
Scroll up to the service map:
Here, you see that the
frontend-service requests data from
gateway-service. In turn,
gateway-service requests data from
fetch-service. So, following that order of operations, focus first on the
You can click and drag the nodes in the service map to make it more readable.
Double click the gateway-service in your service map to learn more about it:
Notice that Pixie's live debugger has seamlessly replaced namespace data with service data by changing the script:
You're now using the
px/service script filtered down to
default/gateway-service. In this service, you see a graph with the errors, but not much about where those errors are coming from.
Click the script selector to switch to the
px/service_stats script. Filter the svc to
This gives a better picture of the errors in your service.
Scroll down to the Incoming Traffic and Outgoing Traffic tables:
The gateway service is returning the same amount of errors on inbound requests as it's receiving from outbound requests to the fetch service. This is a good indicator that you need to look at what's happening upstream.
Switch to the
px/http_data_filtered script, targeting
default/fetch-service and requests with a 400 response status code:
Click on a row to learn more about a request to the fetch service that resulted in an error:
Here, you see that the request path looks like
/fetch?style=PIXIE&number=1. This looks right, because the hat style you chose is called
PIXIE. So if the fetch service is still returning 400s, something wrong is happening when it tries to find the hat.
Switch to the
px/mysql_data script and add a source filter for
Many of these queries returned no results.
Click on one with no results, and look at the
req_body to see the query:
SELECT * FROM main.images WHERE BINARY description='pixie' AND approve='true'
There's the problem! The
BINARY type cast effectively makes the
WHERE condition case sensitive. Since the hat's style is called
PIXIE, this condition fails to find it. Now that you know, you can fix this query in your fetch service.
To recap, you observed an error in your application and used Pixie in New Relic to:
- Understand your services' relationships
- Review the error percentages for each of your services
- Look at individual response bodies
- Find a semantic error in a query within one of those services
And you didn't even need to individually install agents in any of your services. Pixie was able to deliver all the information you needed!
This lesson is part of a lab that teaches you how to monitor your Kubernetes cluster with Pixie. Next, try to figure out why some APIs have high latency.