Data Portraits

This portrait is a sketch on The Stack v2. Enter a query to check whether parts of your code appear in the portion of The Stack used to train StarCoder2. Use long strings for best results.

Note: to facilitate exact string matches, whitespace is normalized before checking for overlap.
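As a rough illustration of why normalization helps, the sketch below collapses whitespace before comparing two snippets. The exact rule used by this tool is not stated here, so the function `normalize_whitespace` and its behavior (collapse every whitespace run to one space, trim the ends) are assumptions for illustration only.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical normalizer: the tool's actual rule may differ."""
    # Collapse every run of spaces, tabs, and newlines into a single space,
    # then trim leading and trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()


# Two snippets that differ only in indentation and spacing.
indented = "def f(x):\n    return x  +  1"
one_line = "def f(x): return x + 1"

# After normalization they are the same string, so an exact match can succeed.
same = normalize_whitespace(indented) == normalize_whitespace(one_line)
```

Under this assumption, reformatting code with a different indentation style would not prevent an exact match.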

Enter your own text or use a prefill button.

Matching Text

Found spans are in grey. The longest span is in blue. Hovering over a character highlights the longest span that includes that character (there may be overlapping shorter spans). Clicking shows the component substrings below.

Hashes for each of these strings appear in the sketch. Click above to select a new span.

The top 20 longest chained matches.


Matches are at least 50 characters long. If a match starts or ends at an unusual boundary, it probably means the adjacent 50-character prefix or suffix doesn't fully match. Resolution can be customized for specific use cases.
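The boundary behavior described above can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: it assumes the corpus is hashed in non-overlapping 50-character chunks, that the query is checked at every offset, and that hits exactly 50 characters apart are chained into one span. A plain Python set stands in for the sketch, and the names `build_sketch`, `chained_matches`, and `chunk_hash` are hypothetical.

```python
import hashlib

WIDTH = 50  # minimum match length, as stated above


def chunk_hash(chunk: str) -> str:
    # Any stable hash works for this illustration.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def build_sketch(corpus: str, width: int = WIDTH) -> set:
    # Assumption: hash non-overlapping width-character chunks of the corpus.
    # A real sketch would likely be a compact probabilistic structure.
    last_start = len(corpus) - width
    return {chunk_hash(corpus[i:i + width]) for i in range(0, last_start + 1, width)}


def chained_matches(query: str, sketch: set, width: int = WIDTH) -> list:
    # Check every offset of the query against the sketch.
    hits = [i for i in range(len(query) - width + 1)
            if chunk_hash(query[i:i + width]) in sketch]
    hit_set = set(hits)
    spans = []
    for start in hits:
        if start - width in hit_set:
            continue  # this hit continues an earlier chain
        end = start + width
        while end in hit_set:  # extend while the next chunk also matches
            end += width
        spans.append((start, end))
    # Longest chained spans first.
    return sorted(spans, key=lambda s: s[1] - s[0], reverse=True)


# Toy corpus of 200 characters: "0000,0001,...,0039,"
corpus = "".join(f"{i:04d}," for i in range(40))
sketch = build_sketch(corpus)

# The query copies corpus[30:150]; only the chunks at 50 and 100 fully match.
spans = chained_matches(corpus[30:150], sketch)
```

In this toy case the first 20 characters of the query are copied from the corpus but are not reported, because the 50-character chunk they belong to is only partly present in the query. That is the "unusual boundary" effect under these assumptions.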

By Marc Marone and Ben Van Durme. See our paper, project pitch, and other datasets back at