1). Large Language and Speech Model - proposes a large language and speech model with cross-modal conversational abilities, capable of following speech-and-language instructions and enabling more natural interactions with AI systems. (paper | tweet)
2). SAM-Med2D - applies the Segment Anything Model (SAM) to 2D medical images; collects 4.6M images and 19.7M masks to construct a large-scale medical image segmentation dataset spanning different modalities and objects; SAM is fine-tuned on this dataset and evaluated on medical image segmentation across various modalities, anatomical structures, and organs. (paper | tweet)
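For context on what promptable segmentation looks like in practice, here is a minimal point-prompt sketch using the original `segment_anything` package; SAM-Med2D fine-tunes this kind of model on the medical dataset above, so the checkpoint and input file names below are placeholder assumptions rather than the paper's released artifacts.

```python
# Point-prompted inference with vanilla SAM (segment_anything package).
# The checkpoint and image paths are placeholders, not SAM-Med2D weights.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # downloaded separately
predictor = SamPredictor(sam)

# A 2D scan slice exported as an RGB image (hypothetical file).
image = np.array(Image.open("ct_slice.png").convert("RGB"))
predictor.set_image(image)

# Prompt with a single foreground point on the structure of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]      # boolean HxW mask
```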
3). Vector Search with OpenAI Embeddings - suggests that “from a cost–benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern ‘AI stack’ for search since such applications have already received substantial investments in existing, widely deployed infrastructure.” (paper | tweet)
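The sketch below is a simplified illustration of the premise: vector search over OpenAI embeddings needs little more than an embedding call plus a nearest-neighbour lookup. It assumes the pre-1.0 `openai` Python SDK and an API key in the environment, and uses brute-force cosine similarity in NumPy; the paper itself makes its case with existing search infrastructure (Lucene's HNSW indexes) rather than NumPy.

```python
# Minimal vector search over OpenAI embeddings without a dedicated vector store.
import numpy as np
import openai  # pre-1.0 SDK assumed; OPENAI_API_KEY set in the environment

docs = ["Lucene supports HNSW indexes.", "Dedicated vector stores add operational cost."]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]], dtype=np.float32)

doc_vecs = embed(docs)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # normalize once

def search(query, k=2):
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = doc_vecs @ q                  # cosine similarity
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

print(search("do we need a dedicated vector database?"))
```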
4). Graph of Thoughts - presents a prompting approach that models the text generated by an LLM as an arbitrary graph; it enables combining arbitrary "thoughts" and enhancing them using feedback loops; the core idea is to enhance LLM capabilities through "network reasoning" without any model updates; this can be seen as a generalization of the now-popular Chain-of-Thought and Tree-of-Thoughts approaches. (paper | tweet)
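A minimal sketch of the idea follows: thoughts are graph nodes, edges record which thoughts a new thought was derived from, and aggregation merges several branches into a single node (the operation that plain chains and trees cannot express). The `llm` function is a hypothetical stand-in for any text-completion call; the framework released with the paper is considerably richer.

```python
# Sketch of graph-of-thoughts operations: generate (branch), aggregate (merge),
# refine (feedback loop). `llm` is a placeholder, not the paper's API.
from dataclasses import dataclass, field
from typing import List

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM completion API here")

@dataclass
class Thought:
    text: str
    parents: List["Thought"] = field(default_factory=list)

def generate(parent: Thought, n: int = 3) -> List[Thought]:
    # Branch: expand one thought into several candidate continuations.
    return [Thought(llm(f"Extend or improve:\n{parent.text}"), [parent]) for _ in range(n)]

def aggregate(thoughts: List[Thought]) -> Thought:
    # Merge: combine several partial solutions into a single node.
    joined = "\n---\n".join(t.text for t in thoughts)
    return Thought(llm(f"Merge these partial solutions into one:\n{joined}"), list(thoughts))

def refine(thought: Thought) -> Thought:
    # Feedback loop: a node can point to an improved version of itself.
    return Thought(llm(f"Critique and rewrite:\n{thought.text}"), [thought])
```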
5). MVDream - a multi-view diffusion model that can generate geometrically consistent multi-view images given a text prompt; it leverages pre-trained diffusion models and a multi-view dataset rendered from 3D assets, combining the generalizability of 2D diffusion models with the consistency of 3D data. (paper | tweet)
6). Nougat - proposes an approach for neural optical understanding of academic documents; it extracts text, equations, and tables from academic PDFs, i.e., converts PDFs into markdown/LaTeX. (paper | tweet)
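A minimal usage sketch, assuming the Hugging Face port of the model (the facebook/nougat-base checkpoint) and a PDF page already rasterized to an image; the tool released with the paper also provides a CLI that works on PDFs directly.

```python
# Convert one rasterized PDF page to markdown with the Hugging Face Nougat port.
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

page = Image.open("paper_page1.png").convert("RGB")  # hypothetical page image
pixel_values = processor(page, return_tensors="pt").pixel_values
outputs = model.generate(pixel_values, max_new_tokens=1024)

markdown = processor.batch_decode(outputs, skip_special_tokens=True)[0]
markdown = processor.post_process_generation(markdown, fix_markdown=False)
print(markdown)
```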
7). Factuality Detection in LLMs - proposes a tool called FacTool to detect factual errors in texts generated by LLMs; outlines the necessary components and the types of tools to integrate with LLMs for better detection of factual errors. (paper | tweet)
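To make those components concrete, the schematic below shows the general shape of such a tool-augmented verification loop (claim extraction, query generation, evidence retrieval, verification); every name here is a placeholder for illustration, not FacTool's actual API.

```python
# Schematic factuality check: extract claims, gather evidence, verify each claim.
from typing import Callable, List

def check_factuality(
    generated_text: str,
    llm: Callable[[str], str],            # any text-completion call
    search: Callable[[str], List[str]],   # any evidence-retrieval tool
) -> List[dict]:
    claims = [c.strip() for c in llm(
        f"List the atomic factual claims in the text, one per line:\n{generated_text}"
    ).splitlines() if c.strip()]

    report = []
    for claim in claims:
        query = llm(f"Write a search query that could verify this claim: {claim}")
        evidence = search(query)
        verdict = llm(
            "Based only on the evidence, answer SUPPORTED or REFUTED.\n"
            f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
        )
        report.append({"claim": claim, "verdict": verdict, "evidence": evidence})
    return report
```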
8). AnomalyGPT - an approach for industrial anomaly detection based on large vision-language models; it simulates anomalous images and their textual descriptions to generate training data; employs an image decoder and prompt learner to detect anomalies; it shows few-shot in-context learning capabilities and achieves state-of-the-art performance on benchmark datasets. (paper | tweet)
9). FaceChain - a personalized portrait generation framework combining customized image-generation models and face-related perceptual understanding models to generate truthful personalized portraits; it works with a handful of portrait images as input. (paper)
10). Qwen-VL - introduces a set of large-scale vision-language models demonstrating strong performance in tasks like image captioning, question answering, visual localization, and flexible interaction. (paper | tweet)
Reach out to hello@dair.ai if you would like to sponsor the next issue of the newsletter. We can help promote your AI tool, research, or company to ~10K AI researchers and practitioners.