Talking Points: Describing and Localizing Pixels

Abstract

We introduce Talking Points, a framework that generates rich, free-form descriptions of individual pixels and maps those descriptions back to pixel coordinates with high precision. Instead of relying on templated prompts or keypoint names, we generate natural language descriptions that uniquely identify pixel locations. We evaluate our descriptions not by comparing text, but by testing whether they can accurately guide localization.

Our approach consists of two complementary components: a Point Descriptor that generates detailed natural language descriptions of keypoints, and a Point Localizer that takes these descriptions and predicts the corresponding pixel coordinates. We further introduce a novel reinforcement learning approach that uses the localizer as a reward model to improve the descriptor's ability to generate localizable descriptions.

We also establish a new evaluation protocol: instead of comparing the text descriptions to ground truth, we use the localizer to determine how close the predicted point is to the ground truth point. Since no dataset exists to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. Experiments show that Talking Points outperforms baseline models.

Method

Architecture overview. The Point Descriptor (left) generates natural language descriptions of keypoints, while the Point Localizer (right) predicts pixel coordinates from descriptions.

Point Descriptor

Given an image and keypoint coordinates (x, y), we generate a Gaussian mask centered at the keypoint. This mask constrains the attention mechanism in each decoder layer by defining a boolean attention mask for the cross-attention operation: queries can only attend to image features within the Gaussian region around the keypoint. These fixed masks force the queries to gather information exclusively from the keypoint's immediate neighborhood, producing rich descriptions that capture semantic context, spatial relationships, and local visual attributes.
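
The masking step can be illustrated with a small sketch. It is not the released implementation: the function name, the `sigma` and `threshold` parameters, and the choice to binarize the Gaussian by thresholding are all assumptions made for illustration, and the keypoint is assumed to be already rescaled to feature-grid coordinates.

```python
import numpy as np

def gaussian_attention_mask(height, width, x, y, sigma=2.0, threshold=0.5):
    """Boolean mask over an H x W feature grid; True means a query may attend.

    sigma and threshold are hypothetical parameters, not values from the paper.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    # Isotropic Gaussian centred on the keypoint, with peak value 1.
    gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    # Assumed binarization: keep only cells whose Gaussian weight is high enough.
    return gauss >= threshold

# Keypoint at column 10, row 4 of a 16 x 16 feature grid.
mask = gaussian_attention_mask(height=16, width=16, x=10, y=4)
# Flatten row-major so there is one entry per image token.
flat_mask = mask.reshape(-1)
```

A cross-attention layer could then use `flat_mask` to block every image token outside the region, for example by setting the corresponding attention logits to negative infinity before the softmax.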

Point Localizer

The Point Localizer inverts the description task: given an image and textual description, it regresses keypoint coordinates. We encode the image via the vision encoder and project it to the language space. We perform a single forward pass and extract the hidden state corresponding to a special <SEG> token. This representation passes through a text-to-vision projection layer, followed by an MLP that maps to normalized coordinates.
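
The regression head described above can be sketched as follows. This is a toy stand-in under stated assumptions: the dimensions, the random weights, the ReLU hidden layer, and the final sigmoid that keeps outputs in [0, 1] are all hypothetical, and `localize` is an illustrative name rather than a function from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VISION, MLP_DIM = 64, 32, 16  # hypothetical sizes

# Randomly initialised stand-ins for learned weights.
W_proj = rng.normal(scale=0.1, size=(HIDDEN, VISION))  # text-to-vision projection
W1 = rng.normal(scale=0.1, size=(VISION, MLP_DIM))
b1 = np.zeros(MLP_DIM)
W2 = rng.normal(scale=0.1, size=(MLP_DIM, 2))
b2 = np.zeros(2)

def localize(hidden_states, token_ids, seg_token_id):
    """Map the <SEG> hidden state to normalized (x, y) coordinates."""
    seg_positions = np.flatnonzero(token_ids == seg_token_id)
    if seg_positions.size == 0:
        raise ValueError("sequence contains no <SEG> token")
    # Hidden state at the (last) <SEG> position from a single forward pass.
    seg_hidden = hidden_states[seg_positions[-1]]
    vision_feat = seg_hidden @ W_proj
    h = np.maximum(vision_feat @ W1 + b1, 0.0)  # ReLU hidden layer (assumed)
    logits = h @ W2 + b2
    # Assumed output squashing so both coordinates lie in [0, 1].
    return 1.0 / (1.0 + np.exp(-logits))

# Toy sequence of four tokens where id 7 plays the role of <SEG>.
token_ids = np.array([5, 9, 7, 3])
hidden_states = rng.normal(size=(4, HIDDEN))
xy = localize(hidden_states, token_ids, seg_token_id=7)
```

Multiplying `xy` by the image width and height would recover pixel coordinates.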

Reinforcement Learning with Localizer Reward

We introduce a novel training paradigm where the Point Localizer provides reward signals to the Point Descriptor. This creates a feedback loop that optimizes descriptions specifically for localizability, resulting in more precise and unambiguous point descriptions.
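
One way to picture the localizer-as-reward idea is the sketch below. The exact reward used in the paper is not stated here, so the exponential decay over normalized distance, the `scale` parameter, and the function names are assumptions. The localizer is passed in as a plain callable, and a lookup table stands in for it in the toy usage.

```python
import math

def localization_reward(pred_xy, gt_xy, scale=0.1):
    """Reward in (0, 1]: 1 for a perfect hit, decaying with distance.

    Coordinates are assumed normalized to [0, 1]; scale is hypothetical.
    """
    dist = math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])
    return math.exp(-dist / scale)

def score_descriptions(descriptions, localizer, image, gt_xy):
    """Score each candidate description by how well it can be localized."""
    return [localization_reward(localizer(image, d), gt_xy) for d in descriptions]

# Toy stand-in for the Point Localizer: a fixed lookup table.
def toy_localizer(image, description):
    table = {"on the paw": (0.70, 0.20), "on the cat": (0.50, 0.50)}
    return table[description]

rewards = score_descriptions(
    ["on the paw", "on the cat"], toy_localizer, image=None, gt_xy=(0.70, 0.22)
)
```

In this toy case the specific description lands closer to the ground truth and so receives the larger reward, which is the signal a policy-gradient update on the descriptor would push toward.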

Qualitative Results

Comparison of keypoint descriptions generated by different methods, and localizations using our Point Localizer. The red star is the ground truth keypoint. Our Talking Points (TP) produces detailed, localizable descriptions, while our baseline, OMG-LLaVA, generates generic statements that fail to identify the keypoint.

GT

The point is located on the cat that spans most of the image, specifically on its paw, which is situated to the right and close to the top edge of the cat, and within this paw, the point is slightly to the right and above the center, in a region featuring a small, dark, circular area with a lighter center, possibly a pupil or a small opening.

TP (Ours)

The point is located on the cat, which spans most of the horizontal and vertical axis of the image, specifically on the cat's paw, which is near the right and top edge of the cat, and within the paw, the point is slightly to the left and above the center, in a region that features a small, dark, curved shape with a smooth texture.

ChatGPT-5

The keypoint is located on the raised front paw of the cat, precisely at the outer edge of the paw’s upper surface where the fur is dark and contrasts with the bright light shining from the window behind. It sits just above the slightly curved claws that are silhouetted against the background, directly opposite the cat’s open mouth and extended tongue. The paw is angled upward, and the keypoint lies on the side facing the camera, slightly to the right of the cat’s nose and whiskers, framed by the surrounding purple feather boa that drapes across the cat’s chest and lower body. This placement marks a distinct point on the paw’s contour, highlighted by the sharp outline against the sunlit backdrop.

Human

The point is located on the hand of the cat in the image, on the same horizontal line that crosses the middle of the cat's nose, on the middle part of the paw.

OMG-LLaVA

The window is clear.

GT

The point is located on the rightmost dog in the image, which is close to the right and bottom edges of the picture, and it is positioned on the dog's nose, specifically slightly above and to the right of the center of the nose, with the region around the keypoint being a small, dark, and somewhat circular shape, likely the nose itself.

TP (Ours)

The point is located on the dog that is positioned closer to the right and bottom edges of the image, and it is the second dog from the left, on its eye, which is situated slightly to the left and above the center of the dog's head, and within the eye, the point is slightly to the left and below the center, in a region that features a small, dark, circular area with a lighter center, possibly a pupil.

ChatGPT-5

The keypoint is located on the smaller brown puppy lying in the snow, specifically at the center of its head where the snout meets the forehead, just above the nose. It sits directly opposite the larger light-colored dog, which is bending its head downward as if sniffing toward the smaller one. The keypoint is positioned slightly right of the image's center, on the puppy whose body is stretched out horizontally on the snow. Surrounding cues include the textured snow surface beneath both dogs, the large blue metal gate in the background, and the clear contrast between the smaller puppy's reddish fur and the white snow. This placement highlights the midpoint of the puppy's face, precisely where its head is directed toward the approaching dog.

Human

the point is located on the tip of the nose of the right (and smaller) dog

OMG-LLaVA

The dog is brown.

BibTeX

@article{rusanovsky2025talking,
  title={Talking Points: Describing and Localizing Pixels},
  author={Rusanovsky, Matan and Malnick, Shimon and Avidan, Shai},
  journal={arXiv preprint arXiv:2510.14583},
  year={2025}
}