perf: context deadline exceeded using Go driver on bblfshd #209

Open
erizocosmico opened this issue Nov 8, 2018 · 11 comments

@erizocosmico
Contributor

We're using bblfshd extensively to process files in gitbase, which is a multi-threaded process. After processing for a while, bblfshd becomes unresponsive and starts returning "context deadline exceeded" errors, and it does not recover from there. Sometimes it even makes Docker itself hang.

time="2018-11-08T13:12:44Z" level=error msg="request processed content 2631 bytes error: rpc error: code = Canceled desc = context canceled" elapsed=15.091048125s language=go

Using a single core does not fully mitigate the problem, but bblfshd does recover after a few errors and continues.

How to reproduce this:

  • Make a folder with two repositories: gitbase and gitbase-spark-connector
  • Run a bblfshd instance with the Go driver installed, listening on port 9432
  • Run a gitbase instance:
gitbase server -v --host=127.0.0.1 --port=3306 -d path/to/repos --index=indexes

Run the following script:

package main

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql"
)

const connectionString = "root:@tcp(127.0.0.1:3306)/db"

const query = `
SELECT f.file_path, uast(f.blob_content, language(f.file_path))
FROM ref_commits r
INNER JOIN commit_files c ON r.commit_hash = c.commit_hash
AND r.repository_id = c.repository_id
INNER JOIN files f ON c.file_path = f.file_path AND c.tree_hash = f.tree_hash
AND f.blob_hash = c.blob_hash
AND f.repository_id = c.repository_id
WHERE r.ref_name = 'HEAD' AND language(f.file_path) = 'Go' AND NOT is_binary(f.blob_content)
`

func main() {
	db, err := sql.Open("mysql", connectionString)
	if err != nil {
		panic(err)
	}

	rs, err := db.Query(query)
	if err != nil {
		panic(err)
	}
	defer rs.Close()

	// Drain the rows; we only care about triggering UAST parsing on the
	// server side, not about the results themselves.
	for rs.Next() {
	}

	if err := rs.Err(); err != nil {
		panic(err)
	}
}
@dennwc added the bug label Nov 8, 2018
@bzz
Contributor

bzz commented Nov 19, 2018

I haven't managed to reproduce this yet, but it would be interesting to see the output of bblfshctl status and bblfshctl instances under such conditions.

@bzz bzz changed the title Context deadline exceeded using Go driver on bblfshd perf: context deadline exceeded using Go driver on bblfshd Dec 3, 2018
@ajnavarro

@erizocosmico could you give @bzz the requested info, please?

@erizocosmico
Contributor Author

Sure @ajnavarro

@erizocosmico
Contributor Author

erizocosmico commented Dec 12, 2018

> docker exec -it bblfshd bblfshctl status
+----------+----------------+---------------+---------+--------+
| LANGUAGE | SUCCESS/FAILED | STATE/DESIRED | WAITING | EXITED |
+----------+----------------+---------------+---------+--------+
| go       | 439/6          | 1/1           | 0       | 0      |
+----------+----------------+---------------+---------+--------+
Response time 4.2804ms
> docker exec -it bblfshd bblfshctl instances
+-------------+-------------------------+---------+-----------+------+
| INSTANCE ID |         DRIVER          | STATUS  |  CREATED  | PIDS |
+-------------+-------------------------+---------+-----------+------+
| 01cyhbkec5  | bblfsh/go-driver:latest | Running | 2 minutes | 15   |
+-------------+-------------------------+---------+-----------+------+

This was just after it started happening. If you wait a little bit longer, not even those commands are responsive.

After this happens, not even docker rm -f bblfshd can stop the container or make it responsive, and I have to restart the Docker runtime.

/cc @bzz

@smola
Member

smola commented Dec 12, 2018

@erizocosmico Could you run du -csh /var/lib/bblfshd/* inside the bblfshd container?
Docker hanging could be related to resource exhaustion: #168

@erizocosmico
Contributor Author

Sure @smola

@erizocosmico
Contributor Author

@smola docker exec -it bblfshd du -csh /var/lib/bblfshd/* does not respond either.

@smola
Member

smola commented Dec 13, 2018

I can reproduce this by sending N (= NumCPU) concurrent requests for huge files that take forever to parse; all subsequent requests then fail with context deadline exceeded. It might be that with gitbase, huge files accumulate in parse requests and end up clogging all driver instances.

The unexpected timeouts were supposed to be fixed here: #227

I think the other part would be to limit the file size we send from gitbase to bblfsh. That is, any file above a given size would immediately return a NULL UAST, with the limit being configurable. This is something I already discussed with @ajnavarro; I'm not sure if it's done. A sketch of the idea:
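
This is only a minimal sketch of such a gate on the gitbase side; maxUASTBytes and uastOrNull are hypothetical names, and parse is a stand-in for the real bblfsh client call, not gitbase's actual API:

package main

import "fmt"

// maxUASTBytes is a hypothetical configurable limit; files larger than this
// are never sent to bblfshd and yield a NULL UAST instead.
const maxUASTBytes = 1 << 20 // 1 MiB

// uastOrNull calls parse (a stand-in for the real bblfsh client call) only
// for files under the limit, and returns nil (a NULL UAST) otherwise.
func uastOrNull(content []byte, parse func([]byte) ([]byte, error)) ([]byte, error) {
	if len(content) > maxUASTBytes {
		return nil, nil // deterministic NULL, bblfshd is never contacted
	}
	return parse(content)
}

func main() {
	big := make([]byte, 2<<20) // 2 MiB, over the limit
	uast, err := uastOrNull(big, func([]byte) ([]byte, error) {
		return []byte("uast"), nil
	})
	fmt.Println(uast, err) // prints: [] <nil>
}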

Other than that, I don't think there's anything else to do here. What do you think?

EDIT: I'm using bblfshd 2.11.0, which was supposed to fix the timeouts problem, so either the issue is not fixed or this is a different issue.

By the way, for quick reproduction, you can use this code (warning: it will saturate all your CPUs!):

package main

import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"runtime"
	"sync"
	"sync/atomic"

	bblfsh "gopkg.in/bblfsh/client-go.v3"
)

func main() {

	url := `https://raw.githubusercontent.com/src-d/enry/master/data/frequencies.go`
	resp, err := http.DefaultClient.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	bs, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	code := string(bs)

	client, err := bblfsh.NewClient("0.0.0.0:9432")
	if err != nil {
		panic(err)
	}

	wg := sync.WaitGroup{}
	var count int32
	for i := 0; i < runtime.NumCPU()*2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // unreachable in practice: the loop never exits
			for {
				// := gives each goroutine its own err instead of racing
				// on the variable declared in main.
				_, _, err := client.
					NewParseRequest().
					Context(context.Background()).
					Language("go").
					Content(code).
					UAST()
				if err != nil {
					fmt.Println(err)
				}

				newCount := atomic.AddInt32(&count, 1)
				if newCount%100 == 0 {
					fmt.Printf("Requests: %d\n", newCount)
				}
			}
		}()
	}
	wg.Wait()
}

@creachadair
Contributor

CPU saturation can cause structural problems for gRPC even if there are no explicit deadlines. If the machine is so CPU-starved that it can't keep up with channel maintenance (e.g., health checks), the channel can be closed.

I haven't tried it in this case—it's very difficult to arrange in isolation—but this is why it's important not to let the client demand unbounded work, but to shed load, proxy, or queue. The fact that it happens under heavy CPU demand suggests this is something we should be thinking about in the design of our servers.

@smola
Member

smola commented Dec 18, 2018

Actually, if CPU is saturated, we may want to set up bblfshd with fewer driver instances in the pool. Other than that, it seems reasonable that if bblfshd cannot accept more work (e.g. a fixed-size request queue is full), the request returns an error right away, and that error is easily identifiable by the client, so it can decide whether to retry with backoff, reduce parallelism, etc. A sketch of that is below.
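
A sketch of that fail-fast behaviour, using a buffered channel as the fixed-size admission queue; ErrOverloaded and the limiter type are illustrative, not bblfshd's actual code:

package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded is the kind of identifiable error a client could match on
// to decide between backing off, retrying, or reducing parallelism.
var ErrOverloaded = errors.New("bblfshd: request queue full")

// limiter admits at most cap(slots) in-flight requests and rejects the rest
// immediately instead of letting them wait until a deadline fires.
type limiter struct{ slots chan struct{} }

func newLimiter(n int) *limiter { return &limiter{slots: make(chan struct{}, n)} }

func (l *limiter) do(work func() error) error {
	select {
	case l.slots <- struct{}{}: // a slot is free: admit the request
		defer func() { <-l.slots }()
		return work()
	default: // queue full: fail fast with an identifiable error
		return ErrOverloaded
	}
}

func main() {
	l := newLimiter(1)
	l.slots <- struct{}{} // simulate one request already in flight
	err := l.do(func() error { return nil })
	fmt.Println(err) // bblfshd: request queue full
}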

I recommended that the gitbase team limit file size on their side for UAST parsing. This can mitigate some problems in a predictable way for the user. The user can control this limit and then knows that UAST will return NULL for files that are too big, where "too big" is configurable.

It may also be the case that some files, such as minified or obfuscated code, are too expensive to process even within the size limits. I think it would be ideal if bblfshd errors in this situation were deterministic, not based on time. For example, if we consider that processing a UAST with more than 10k nodes is too much, then once we have processed 10k nodes we could stop and return an error. My point is that for applications like gitbase it's important that results are deterministic. If, given a certain setup, bblfshd always returns an error for a given file, that's OK. If it depends on the load, it's a mess: the only way for gitbase to provide determinism would be to fail the whole query whenever there is a timeout after a few retries, and timeouts happen often enough that this is not usable.
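
A sketch of that kind of deterministic cut-off, walking a toy tree with a node budget; the node type, errTooManyNodes, and the budget value are illustrative, not bblfshd internals:

package main

import (
	"errors"
	"fmt"
)

// node is a toy stand-in for a UAST node.
type node struct{ children []*node }

var errTooManyNodes = errors.New("uast exceeds node limit")

// countNodes walks the tree and fails as soon as the budget is exhausted,
// so the same input always fails at the same point, regardless of load.
func countNodes(n *node, budget *int) error {
	if *budget <= 0 {
		return errTooManyNodes
	}
	*budget--
	for _, c := range n.children {
		if err := countNodes(c, budget); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &node{children: []*node{{}, {}, {}}}
	budget := 2 // deterministic limit; e.g. 10k in practice
	fmt.Println(countNodes(root, &budget)) // uast exceeds node limit
}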

@creachadair
Contributor

> Actually, if CPU is saturated, we may want to set up bblfshd with fewer driver instances in the pool. Other than that, it seems reasonable that if bblfshd cannot accept more work (e.g. a fixed-size request queue is full), the request returns an error right away, and that error is easily identifiable by the client, so it can decide whether to retry with backoff, reduce parallelism, etc.

Agreed. I think there are two pieces we need: one is for the server to push back when there is too much work, so that the client can tell it's asking for too much and back off; the other is for the client to make use of that signal.

For the short term we can probably work around it by forcing the clients to rate-limit themselves (e.g., to some fixed QPS), as sketched below. Making the server behaviour more deterministic will be an important OKR for the LA team in Q1, and then we should be able to do something less hacky.
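
A sketch of that client-side workaround, pacing requests with a time.Ticker at a fixed QPS; the 10 QPS figure is a placeholder, and the loop body stands in for the real parse request:

package main

import (
	"fmt"
	"time"
)

func main() {
	const qps = 10 // hypothetical fixed budget agreed with the server
	tick := time.NewTicker(time.Second / qps)
	defer tick.Stop()

	for i := 0; i < 3; i++ {
		<-tick.C // wait for the next slot before issuing a request
		// stand-in for client.NewParseRequest()...UAST()
		fmt.Println("request", i, "sent at", time.Now().Format("15:04:05.000"))
	}
}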

> I recommended that the gitbase team limit file size on their side for UAST parsing. This can mitigate some problems in a predictable way for the user. The user can control this limit and then knows that UAST will return NULL for files that are too big, where "too big" is configurable.

Yes, limiting file size is probably a good start. Generated or transpiled code often has weird properties. But I think we should also consider how to make the API more forgiving. As far as I know, the reason for this limit is not that Babelfish runs out of RAM, but that gRPC imposes artificial limits on message size in the channel.

I think the "right" solution is to make an API that allows chunked requests/responses, so that large requests can be passed without having to worry about gRPC. However that will take time, so I think for now @smola's suggestion is the right practical step.

> I think it would be ideal if bblfshd errors in this situation were deterministic, not based on time.

I agree completely. I'd rather have results that are deterministic and wrong than non-deterministic. It's hard to program clients when you have to port backoff and timing heuristics from one to another. I feel like we should iron out these issues ASAP.
