Large language models (LLMs) have matured to the point where practical application development is now possible across a wide variety of tasks and domains. Nevertheless, many open questions remain that require original investigation.
Chroma’s applied research team is investigating these questions in a practical, product-oriented way. We aim to design and execute experiments, prototype and iterate on new approaches as well as those presented in the literature, evaluate and integrate the results into the core Chroma product, and share our findings with the broader AI community.
<aside> 💡 Interested in working with us? Email [email protected]
</aside>
These are some of the research directions we’ve identified as important for our users and for the further development of retrieval for LLMs. We are always interested in learning about other promising directions.
There are several competing embedding model vendors, as well as open-source models, for most data modalities, especially text.
Which embedding model should be used for a particular task and dataset? What are the main sources of variation? How should users evaluate the different models for their use-case? Can a model be selected automatically?
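As a concrete illustration of the kind of evaluation involved, here is a minimal sketch that compares two candidate models on a tiny labeled retrieval set using recall@k. The model names, corpus, queries, and relevance labels are placeholders; a real evaluation would use a held-out set drawn from the actual task and dataset, and might also report metrics such as MRR or nDCG.

```python
# A minimal sketch, assuming sentence-transformers models as candidates.
# The corpus, queries, and labels below are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

CANDIDATE_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # placeholder model names

corpus = {
    "d1": "Chroma is an open-source embedding database.",
    "d2": "Chunking splits documents to fit a model's context window.",
    "d3": "Re-ranking reorders retrieved results by estimated relevance.",
}
queries = {"q1": "what is an embedding database", "q2": "why split documents into chunks"}
relevant = {"q1": "d1", "q2": "d2"}  # ground-truth relevance labels

def recall_at_k(model_name: str, k: int = 1) -> float:
    """Fraction of queries whose labeled document appears in the top-k results."""
    model = SentenceTransformer(model_name)
    doc_ids = list(corpus)
    doc_vecs = model.encode([corpus[d] for d in doc_ids], normalize_embeddings=True)
    hits = 0
    for qid, text in queries.items():
        q_vec = model.encode([text], normalize_embeddings=True)[0]
        top_k = np.argsort(doc_vecs @ q_vec)[::-1][:k]  # cosine similarity via dot product
        hits += relevant[qid] in {doc_ids[i] for i in top_k}
    return hits / len(queries)

for name in CANDIDATE_MODELS:
    print(f"{name}: recall@1 = {recall_at_k(name, k=1):.2f}")
```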
Chunking is often necessary in retrieval augmented systems, to ensure that data is within the limits of the context windows of the embedding model and target LLM, as well as to ensure only relevant information is retrieved.
Each chunk must preserve enough context for its embedding to be meaningful, and the associated document must carry enough information to give context to the target LLM, without contaminating the context window with irrelevant information.
What is an optimal chunking strategy given a task, a dataset, and a set of constraints? How can the inherent structure of certain documents be used to augment chunking? Is it possible to find ‘semantic boundaries’ in text using token prediction / perplexity?
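As a baseline to build on, here is a minimal sketch of a fixed-size chunking strategy that packs whole sentences up to a token budget, carrying a sentence of overlap between chunks. The whitespace token count and the default sizes are stand-ins for the target model’s real tokenizer and limits; structure-aware or ‘semantic boundary’ chunking would replace the simple sentence split.

```python
# A minimal sketch of sentence-aware chunking with overlap. The whitespace
# token count and the default sizes are placeholders for the target model's
# real tokenizer and context limits.
import re

def chunk_text(text: str, max_tokens: int = 200, overlap_sentences: int = 1) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def n_tokens(s: str) -> int:
        return len(s.split())  # crude stand-in for a real tokenizer

    chunks, current = [], []
    for sentence in sentences:
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and sum(n_tokens(s) for s in current) + n_tokens(sentence) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so neighboring chunks share context.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```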
For retrieval to be robust, it is necessary to understand the degree to which a given result is relevant to the query. For example, although any query vector always has ‘nearest neighbors’ in embedding space, it’s possible that the dataset contains no relevant information at all. What ‘relevance’ means may also differ between datasets, tasks, and even users.
Is it possible to algorithmically compute a useful relevance ‘score’? Can the embedding space be conditioned on the fly, according to the query, user, and task context? Can an LLM in the loop be used to evaluate relevance, or perform re-ranking? How can we incorporate human feedback on relevance to improve retrieval?
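The simplest baseline, sketched below, is to filter nearest-neighbor results by a fixed distance threshold. The threshold value here is a placeholder that depends on the embedding space and distance metric; calibrating it per dataset, or replacing it with an LLM-based re-ranking step, is precisely the open question above.

```python
# A minimal sketch, assuming Chroma's Python client and its default embedding
# function. The distance threshold is a placeholder and depends on the
# embedding space and distance metric in use.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="docs")
collection.add(
    ids=["d1", "d2"],
    documents=[
        "Chroma is an open-source embedding database.",
        "Bananas are rich in potassium.",
    ],
)

results = collection.query(query_texts=["how do I store embeddings?"], n_results=2)

DISTANCE_THRESHOLD = 0.8  # placeholder; must be calibrated per dataset and metric
relevant_docs = [
    doc
    for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist < DISTANCE_THRESHOLD
]
if not relevant_docs:
    # Nearest neighbors always exist, but none of them may actually be relevant.
    print("No sufficiently relevant results for this query.")
```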
Much of the world’s information is stored as structured data, for example as tables in relational databases. We would like to interact with this type of data with the same flexibility offered by retrieval-augmented generation, but embedding models and LLMs currently do not handle structured data as effectively as natural language.
Several strategies for handling structured data in an embeddings context have been proposed, including structured query generation and natural language summarization. We are interested in evaluating these strategies, and developing new ones, in order to make structured data accessible for retrieval.
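As one illustration of the natural language summarization strategy, the sketch below verbalizes table rows into sentences before adding them to a collection, while keeping the raw values as metadata so a structured filter can still be applied at query time. The table, verbalization template, and collection name are illustrative only.

```python
# A minimal sketch of row verbalization, assuming Chroma's Python client.
# The table, template, and collection name are illustrative.
import chromadb

rows = [
    {"id": "1", "product": "standing desk", "price": 499, "in_stock": True},
    {"id": "2", "product": "ergonomic chair", "price": 349, "in_stock": False},
]

def verbalize(row: dict) -> str:
    """Render a table row as a natural-language sentence for embedding."""
    stock = "is in stock" if row["in_stock"] else "is out of stock"
    return f"The {row['product']} costs ${row['price']} and {stock}."

client = chromadb.Client()
collection = client.create_collection(name="catalog")
collection.add(
    ids=[row["id"] for row in rows],
    documents=[verbalize(row) for row in rows],
    # Keep the raw values so structured filters can still be applied.
    metadatas=[{"price": row["price"], "in_stock": row["in_stock"]} for row in rows],
)

# Hybrid query: semantic search over the verbalized rows plus a structured filter.
results = collection.query(
    query_texts=["an affordable desk I can buy right now"],
    where={"in_stock": True},
    n_results=1,
)
print(results["documents"][0])
```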