ReNikud improves Hebrew grapheme-to-phoneme conversion by leveraging weak audio supervision from unlabeled speech and a pseudo-vocalization architecture with character-level alignment as an inductive bias.
NAMESAKES is a benchmark of over 1,000 public figures with a black-box behavioral probe for detecting whether text-to-image models memorize and reproduce individuals' likenesses, requiring no reference photos or model internals.
CWoMP (Contrastive Word-Morpheme Pretraining) treats morphemes as atomic units, contrastively learning morpheme representations to automate interlinear glossed text (IGT) production for language documentation.
Phonikud is an open-source Hebrew grapheme-to-phoneme system outputting fully-specified IPA transcriptions including stress. We also introduce ILSpeech, a Hebrew audio-text-IPA corpus, a G2P benchmark, and audio-to-IPA models for TTS evaluation.
ConlangCrafter leverages LLMs to generate constructed languages through a multi-hop pipeline that decomposes language design into modular stages, with components encouraging consistency and typological diversity.
WildCAT3D generates novel views of scenes learned from diverse 2D scene image data captured in the wild, enabling appearance-controlled novel view synthesis from a single image.
WAFFLE is a multimodal floorplan understanding dataset of nearly 20K diverse floorplan images and metadata, enabling progress on new building understanding tasks.
A framework for addressing open-vocabulary hallucinations in image captioning models, including a new benchmark and a reinforcement learning-based method to reduce such hallucinations.
We show that foundation VLMs like CLIP model visual-semantic hierarchies, proposing the Radial Embedding framework for probing and optimizing this knowledge, along with the HierarCaps dataset of ground-truth image caption hierarchies.
We quantify image caption concreteness using information loss in foundation vision-language models, and use this score to filter web-scale multimodal datasets.
By generating images from prompts containing pseudowords (nonsense words) and analyzing the shapes in those images, we show that AI image generation models exhibit sound-shape associations similar to those known from human psychology.
We model human-human interaction understanding in images as free text generation, provide a new benchmark, and show how to learn this task with weak supervision from Internet image captions.