GraphStuff.FM: The Neo4j Graph Database Developer Podcast

GenAI solutions with LangChain: Lance Martin on LLMs, agents, evals, and more!

Episode Summary

Welcome back, graph enthusiasts, to GraphStuff.fm, a podcast all about graphs and graph-related technologies! I'm your host, Jennifer Reif, and I am joined today by Andreas Kollegger... and we have a great guest today. Joining us is Lance Martin, a software engineer at LangChain working on the open source library.

Episode Notes

Episode Transcription

Jennifer Reif: Welcome back, graph enthusiasts, to GraphStuff.fm, a podcast all about graphs and graph-related technologies. I'm your host, Jennifer Reif, and I am joined today by Andreas Kollegger and we have a great guest today. Joining us is Lance Martin. Lance is a software engineer at LangChain working on the open source library. He's done a lot of work on RAG and open source LLMs. Before LangChain, he spent several years working on computer vision for self-driving cars and got a PhD from Stanford. So welcome, Lance.

Lance Martin: Great to be here. Thank you so much.

Jennifer Reif: Yeah, thrilled to have you on today. Could you just maybe, just to get us rolling, I guess, tell us a bit about your backstory, maybe how you ended up at LangChain?

Lance Martin: Yeah, absolutely. Yeah, so kind of like you had mentioned, I spent quite a few years working on self-driving cars. A lot of folks who were kind of doing machine learning in the late 2010s, early 2020s, computer vision and self-driving was a really interesting place to be. I think with the advent of kind of ChatGPT in the fall of 2022, it kind of changed a lot of people's views of LLMs and what they can do. And it certainly got me excited, it got many people excited. And in early 2023, I started just hacking on the side. And I encountered a really cool app called Paper QA, which is a RAG app for scientific literature. And I said, "Oh, this is pretty neat." It allowed you to do question answering over your papers. And I read a lot of papers, so I thought it was pretty cool. And it was written with LangChain, so I said, "Now, what's this LangChain thing?"

So I kind of picked it up through that and started playing with LangChain and ended up building a few apps myself, which I just put out there and hadn't been on Twitter in many years and kind of spun my old Twitter account back up and put out this thing called Lex GPT, which is a RAG app over all Lex Fridman's podcasts. So I used Whisper to transcribe them to text, and then I embedded them. And it was a lot of fun and that got some fun attention. And I basically built a number of these kind of RAG apps. I did like Lex GPT, and then I did the Tim Ferriss podcast.

And then I got to know Harrison through the course of building these apps just via Twitter. And we had some nice conversations. I met him at a few hackathons and then he finally said, "Would you like to join LangChain?" So I joined last spring, as one of the earlier folks. At the time, we were still in a venture capitalist office. Then we moved to our space, we hired up, we now have about 20 people, but I was kind of there from those very early days and it's been a lot of fun. And I still work on RAG, so RAG hasn't gone away, but it has evolved, but it's been a fun ride for sure.

Andreas Kollegger: That sounds amazing. And the journey to describe there, I feel like it's kind of similar to maybe everybody's exposure to RAG. You kind of start with, okay, you discover something that somebody's done that's interesting and like, okay, how do they do that? And it's the classic curious engineer mindset like, "Okay, let me dig into the details and kind of unpack this a little bit."

Lance Martin: Yeah, that's right. Yeah, that's right. And I think one of the reasons why RAG kind of hasn't quite gone away is, I think Andrej Karpathy, he had a really nice Twitter post talking about how LLMs are kind of like the kernel of a new kind of operating system, as opposed to just thinking about them as a thing you pass text into and you get text out. It's more like a... it's kind of like a kernel or a general compute engine that can be connected to lots of external resources like tools, like sensors.

And one of the big ones, of course, is databases. And so just in the same way we have a kernel and we have disk, we have LLMs and data sources, graph DBs being one big one. And so RAG is, it's kind of a very central component. If you think about these things as kernels of a new kind of operating system, you're always going to want to connect that kernel to external data sources, and that's what RAG is.

So it's a very central idea. And so I think that's why RAG wasn't just a flash in the pan. It's a very central component of these systems that of course will do many things beyond RAG, but it's kind of like a centerpiece. In the same way every operating system has the kernel connected to disk, every LLM app is very likely going to connect to some external data sources to get information that's current or otherwise not present in the pre-trained weights. So that's kind of it. And so it's become a very powerful general idea.

Andreas Kollegger: I love that framing that you're describing there. And it's a great way to think about it that it's... because sometimes in a sort of fear of extinction kind of perspective, people think, "Oh my gosh, AGI is coming. It's going to do everything and we'll just be sitting there asking it to do things," or something. But the practical reality is there's still going to be a bunch of systems that exist and this thing has got to be integrated with those systems. You still have to build the systems.

So all the patterns you're describing are a great way of thinking about it. It's not like you're going to skip building a UI and just ask the LLM, "Hey, what's the state of this toggle switch?"

Lance Martin: Right.

Andreas Kollegger: Right? You're still going to have memory and you're still going to have disk, you're still going to have databases, and you've got to make all these things fit together, right?

Lance Martin: Right. Exactly. And that's why I think it's also the case that LLMs are going to get increasingly more powerful without question. But it's also true that it'll never... the pre-trained models will never encompass all the data that you want them to encompass. Of course, they won't have your private data, they won't have recent data. And so this is also essentially why RAG and the ability to connect it to external data sources is always going to be of high interest and importance, I think.

Andreas Kollegger: Right.

Jennifer Reif: Yeah. For some users who maybe aren't super familiar, could you maybe just step back a second and explain kind of what LangChain is and why it's there?

Lance Martin: Yeah, absolutely. Yeah. I think the way I think about LangChain in simple terms, it's basically an application development framework for LLMs. So for example, right now there's hundreds of different LLMs, there's dozens, maybe even hundreds at this point, of different types of databases and vector stores. Dozens, maybe hundreds of different embedding models. And LangChain is a very general application development framework that lets you compose these things very easily into chains.

So the most simple abstraction or idea in LangChain is that you have a chain. So, a RAG app is just a chain that's composed of a prompt, a retriever, like a vector store, or a graph, and an LLM. So that's the core of the app. So you basically have this retriever. And again, that can be numerous different types of databases, but say it's a retriever that retrieves documents. You have a prompt that you push those documents into.

You have an LLM that that prompt, with those documents, is then fed to. You also have embedding models that you use to index your documents as well. But LangChain makes it very easy to compose these things. And I think one of the reasons why LangChain got really popular is there is kind of this paradox of choice. There's so many different LLMs, there's so many different vector stores. LangChain kind of created a common abstraction on top of all of them.

So if you're building an application, and I do this a lot, so say I have a RAG app that's using Pinecone for a vector store and OpenAI as my LLM, and I want to switch that to use a graph DB as my data store. And I want to use a different LLM, say a local LLM. Those are very simple plug-and-play things you can just swap within LangChain.
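To make that concrete, here is a rough sketch of what such a chain looks like when composed in LangChain. The model names, the FAISS vector store, the sample texts, and the prompt wording are all placeholders chosen for illustration; any retriever (including a Neo4j-backed one) or LLM could be swapped in, and exact import paths may differ between LangChain versions.

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Index a couple of placeholder texts and expose them as a retriever.
vectorstore = FAISS.from_texts(
    ["LangChain composes LLM apps out of chains.", "Neo4j is a graph database."],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4")  # swap in a local model, or a graph-backed retriever above

# The chain: fetch context, fill the prompt, call the LLM, parse the output to a string.
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What does LangChain do?"))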

And so I think that composability and swappability are things that became really popular in terms of the open source library. I will mention LangChain is not just that, because we kind of have this open source library, but if we get into the nuts and bolts, our actual business is that we sell observability. So we actually have this thing called LangSmith, which is an observability system that sits really on top of LangChain. Now, it doesn't need to be used with LangChain, but it's a very general system that has prompt management, it has tracing and evaluation for chains. So it's more of an observability tool. And that lives on top of the open source library, but it's independent.

And that's actually kind of our product out there today. And we're building new things on top of them. Indeed, we have something called LangServe, which is basically a way to very easily deploy chains. So if you have a chain built in LangChain, LangSmith gives you observability, like monitoring and evaluations over that chain. LangServe gives you the ability to deploy it as a web app very easily. So that same code you use for prototyping, you can basically in very, very simple steps, turn that into an endpoint that's exposed to the web. And so then these chains become like microservices that can connect with other chains. And that's kind of the idea. So those are the two big things that sit on top of our open source library.

Jennifer Reif: Okay.

Andreas Kollegger: So then as developers, you don't have to worry about thinking about, I need to build a whole application around this stuff. It's like, "Okay, just let me worry about the chain." And then that gets deployed straight to the service.

Lance Martin: That's the key point. That's exactly the point. And actually I'll share in the show notes a video on this that I recently did. But the idea is very simple. We have this expression language called LangChain expression language, a very easy way to compose these elements. So you can kind of compose a RAG chain using LCEL, is what we kind of abbreviate it as. So basically, it's kind of like bash, you're basically defining a pipe, right? You have your retriever, your prompt, your LLM, and then that's a little pipe, right?

Now every chain you define in LangChain in this way has a common set of invocation methods. Stream, batch, invoke. And so what happens is, if you've defined this chain, you can run it locally in a notebook, you can stream, you can batch, you can invoke, no problem. You can use LangServe with very few steps to turn that into a web app such that those invocation methods, stream, batch, invoke, just become HTTP endpoints.

So then you have an endpoint that is stream, an endpoint that is batch, an endpoint that is invoke, that then you can call, and we actually with that, we ship a little playground that's just like a UI. So it's like the endpoints of your chain, the invocation methods of your chain just become endpoints in a web service. And that's it. Super simple. And then you can interact with that via many different ways. With a playground, you can curl it. So it's like the same code. Your chain then becomes like a service and that's it. It's pretty simple.
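As a rough sketch of what that looks like in practice, assuming a chain like the rag_chain defined above and noting that exact function signatures may have shifted between LangServe versions:

from fastapi import FastAPI
from langserve import add_routes

app = FastAPI(title="RAG chain as a service")

# Exposes POST /rag/invoke, /rag/batch, and /rag/stream, plus a /rag/playground UI.
add_routes(app, rag_chain, path="/rag")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)

Hitting /rag/invoke then runs the chain once, /rag/batch runs it over a list of inputs, and /rag/stream streams tokens back, which is the one-to-one mapping between invocation methods and endpoints being described here.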

Andreas Kollegger: That's really awesome. That's really cool. That's such a great thing. You can then just focus on doing the actual chains themselves and not like all the, "I've got to build the app and then I've got to put it into a container, figure out how do I deploy to Amazon and Google, whatever." It's going to be hosted by you guys.

Lance Martin: That's right.

Andreas Kollegger: And I guess there's options then for scaling it up as you need to and you got all those kinds of opportunities?

Lance Martin: Yeah. There's actually two things to highlight on it that's kind of nice. So one is, we have a whole bunch of preexisting templates that you can just use. So in fact, with Neo4j, we have a few of them. So you guys have actually done a lot of really nice ones. And actually I did a video, I don't know if I mentioned this, but I have a video highlighting one of your templates and how to turn it into a web app that just works out of the box.

And so what's pretty cool is we have this library of existing templates you can just use. So, basically chains for different functions that just work out of the box. It has graph DB RAG, it has regular RAG, private RAG, summarization, extraction, a bunch of templates. We have a template library. You can very easily just take those, basically download them, have this template, modify it locally if you want, and then just turn it into a LangServe app and run that.

And like you were saying, so maybe point one is we have this template library of different templates, but point two, we also have what we call Hosted LangServe. So basically, you can spin up this LangServe app, it can run locally. If you want to host it, we can just do that for you very simply. And actually the video I have with Neo4j goes through exactly that process. It's like two clicks basically, inside the LangSmith, basically-

Andreas Kollegger: That's so nice.

Lance Martin: ... page. Yeah. You can just click deployments, pick the repo, which has your LangServe app in it, and then it just is deployed and we host it for you.

Andreas Kollegger: So, so nice.

Lance Martin: So actually, the Neo4j semantic template is a hosted LangServe app. I'll share in the show notes.

Jennifer Reif: Cool.

Andreas Kollegger: Jen, I'm going to expose our kind of Java background here a little bit. I mean, the tooling you're describing sounds so awesome. And it makes me think back to things like Apache Camel and even the Spring world a little bit where those are frameworks that were built around doing data integrations on the Spring side. It was very OGM-ish, kind of based on that. And then it had things that came out around it.

And it was a great step forward for Java land, but there was still so much heavy lifting to be done. It was just a lot to get through. I love how all the things you guys have done, it's been a year, right?

Lance Martin: Yeah.

Andreas Kollegger: And that you guys have covered so much ground. Just the developer experience out of the box is like, you've been thinking about it, you are already kind of ahead of where people are going to need things, which is fantastic.

Lance Martin: Yeah. I mean, we've been working pretty hard and I think there are still a lot of opportunities and places we need to do better. I'll give one example and what we're trying to do to make it easier. I think that... Well, we saw one big pain point was this idea of, how do I go to production? So, I'm in LangChain, I built my little prototype in a notebook, how do I productionize it? And that's kind of what motivated, okay, cool, LangChain expression language is like, we kind of standardize the way of building chains. And then if you do it that way, you can turn it into production really easily with LangServe. So that whole arc of work was motivated by the need of like, how do you use the same code for prototyping and production? So that was one big pain point. We saw that. We tried to fix that.

But I will say LangChain expression language is not the easiest to pick up for everyone. For simple stuff, it's pretty easy, right? But then if you get into weird exotic chains, it's not obvious. And I get confused sometimes. And so actually one thing I've been working on is, it's kind of like a code assistant for LangChain expression language. Basically text. It's actually related to your world. We have text to Cypher, right? Because Cypher can be confusing. Text to LCEL basically. Right? It's the same idea.

Yeah, I mean I'm working on it. We hope to put it out next week. And of course, we'll keep iterating on it, but that's an example of like, you solve one problem, the problem of build a standard way of architecting these chains, which has a lot of advantages, then the invocation methods are common. You can easily support production, but it comes at a cost of, well now people have to ramp up on this new way of building things.

And we got rid of a lot of our old methods, which were these nasty abstract things like some class like vector DB QA chain. And we've moved all that to LangChain expression language just laid out as this chain using LCEL. But people need to learn that. And so we're trying to solve that problem with LCEL kind of coding assistant. And it's similar to what you guys have had to deal with maybe with Cypher and it can be hard to ramp up on Cypher. So what are ways to onboard people to Cypher? Domain specific languages are tricky.

Jennifer Reif: Yeah.

Andreas Kollegger: Totally.

Lance Martin: Right? I would say kind of like, there's trade-offs. You push to solve one problem, you can introduce new problems and you have to go fix those. So, we still have a lot of work to do, I would say.

Jennifer Reif: The reason why technology is never boring, right?

Lance Martin: It's never boring. It's never boring. Yeah.

Andreas Kollegger: The goal posts always move away.

Lance Martin: The goalposts always shift. Exactly. So I spend all this time, okay, let's solve the production problem. Now we have the problem that LCEL is tricky and so we need to solve that problem. And so it keeps it interesting for sure. And I actually really like working on code assistants. In fact, I don't know if you saw, there's a really interesting paper, AlphaCodium, that came out recently.

Andreas Kollegger: I haven't seen it.

Lance Martin: It's like a really interesting idea for building a coding assistant that incorporates basically feedback in the loop where you ask a coding question, it produces a few answers, it auto generates tests for itself, it evaluates those tests, and it feeds back on itself, then it maybe retries. And so this idea of building assistants that are more than chains that have some degree of feedback, that brings up agentic workflows and agents.

And we have a new thing, LangGraph, that is meant to support cyclic workflows with agents where you can enumerate every step in very precise terms that you want your agent to take. And so that might be one big idea we explore with this coding assistant. For example, not just like, one shot, generate one answer, but look at the answer, evaluate it, go and try to generate a better answer. So, kind of run the agent or the assistant in a loop rather than just as a one shot chain. So I think that's just another thread that I'll highlight that we think is pretty interesting.
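To sketch what that kind of cyclic, agentic workflow looks like in LangGraph: the state fields, node bodies, and retry limit below are made up for illustration, and a real coding assistant would plug an LLM call and auto-generated tests into the generate and check nodes.

from typing import TypedDict

from langgraph.graph import END, StateGraph


class State(TypedDict):
    question: str
    code: str
    attempts: int
    passed: bool


def generate(state: State) -> dict:
    # Placeholder: call an LLM here to draft code for state["question"].
    return {"code": "... generated code ...", "attempts": state["attempts"] + 1}


def check(state: State) -> dict:
    # Placeholder: run auto-generated tests or try executing the code here.
    return {"passed": False}


def decide(state: State) -> str:
    # Loop back to generate until the checks pass or we give up.
    if state["passed"] or state["attempts"] >= 3:
        return "end"
    return "retry"


workflow = StateGraph(State)
workflow.add_node("generate", generate)
workflow.add_node("check", check)
workflow.set_entry_point("generate")
workflow.add_edge("generate", "check")
workflow.add_conditional_edges("check", decide, {"retry": "generate", "end": END})

app = workflow.compile()
result = app.invoke({"question": "Write an LCEL chain", "code": "", "attempts": 0, "passed": False})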

Andreas Kollegger: I'd love your feedback about a thought here that you're provoking a little bit. There's kind of an evolution of the expressiveness that you're kind of trying to work through here. And it reminds me of, so with Neo4j, back in the dark ages, it was just a Java library. And so the only way you could do stuff with the graph was literally writing Java code to do stuff. And kind of the step past that was there was more of a DSL approach, which was kind of done in a language called Groovy, which still fundamentally runs on the JVM. It's kind of Java extended in some ways.

And that was nicer than just straight to Java because it was more succinct, it had better composability for doing graph stuff because it was purpose-built as an API. And that got us a little bit further. And then we realized, okay, actually we really want a purpose-built language that will be either interpreted or compiled. And that got us all the way to Cypher and bringing in elements of SQL, bringing in the elements of functional programming. How far along that continuum do you think LCEL... Is there an evolution here that you could imagine that actually at some point, it's going to be a language itself that is not a Python-based language, it's a language?

Lance Martin: Yeah, that's an interesting point. I'm actually not sure we want it to go that far because I think that we really want it to be just a way to compose building blocks together in a hopefully as simple as possible way. And I think that with template apps and with the LCEL teacher, the coding assistant, I think we want to make the load on users very minimal. I think we want to try to move away from having anyone need to learn a DSL for LangChain. Because I completely understand the need for a DSL for a database, like SQL database. Of course graph as well, because there's many different ways you'd want to access data from such data sources and you need a very expressive language to capture all of that.

Whereas with LangChain, we're just an app development framework. We basically want to give you easy ways to glue Lego pieces together. I don't think we want to build something that has a very, very high degree of complexity. I think we want to manage that complexity both with maybe tools to just automate the process for you, if you need to compose a really complex chain, or templates that just have pre-baked chains you can just use out of the box. So I think we want to move in the direction probably of making it easier, would be my-

Andreas Kollegger: Yeah. So, sort of the out-of-the-box experience will be enough, that's there. It's a bit of a Pareto principle, like you'll end up with what you've got. It'll cover 80% of what people need.

Lance Martin: I think so.

Andreas Kollegger: And then maybe there's some effort for customization, and some extreme cases, but kind of the out-of-the-box experiences, everything's there for you. It's not that many variations. Even across different data sources, you'll be able to identify the problems. So, somebody new will be able to come in, get up to speed pretty quickly by just focusing on the tooling you've got and the LCEL language, right?

Lance Martin: I think so. I think so. Yeah, I should think about that more. I mean, my sense is the expressiveness you need in a DSL for a query database like SQL or Cypher is very high. So that's very high in the expressiveness spectrum. For something like what we're trying to establish, I don't have great comps. That'd be an interesting thing to think about, but we want... It's really just, it's like a means of composing chains. There's certainly a lot of creativity in the way people will lay out these chains, but there's probably a pretty common set of design patterns that people will... I'll give an example.

A lot of people want to build a RAG chain and I think a lot of the diversity will come from like, they want to connect with this database, they want to connect with this LLM or this embedding model. So you get a lot of diversity in terms of each component, but the layout of the chain is the same for everyone, if you see what I'm saying.

Andreas Kollegger: Yeah.

Lance Martin: I think the diversity more comes from the blocks rather than the nodes or connections between them. And we could get into the different ways of laying out RAG. There is some variation there, but I think a lot of it is just swapping in different blocks rather than different exotic connections between them, if you see what I'm saying.

Andreas Kollegger: Yeah, that does make sense. Right. And so actually, if you could, thinking back to Jennifer's earlier prompting on this, because I keep forgetting as well. So if somebody's new to this stuff and is just getting started, okay, what is RAG, kind of understanding the general flow of that, the first step is always a little QA chain. That's the kind of first step for people. Do you think there's a natural progression from there? What do you recommend, like, okay, you've done that. If you want to get a little more sophisticated, what's a good next step for people and where do they elaborate?

Lance Martin: Yeah, definitely. I can give you an overview of the various themes in RAG that we see people work on. So when you think about RAG, you start on one end and you say, "Okay, I have a question that I'm trying to ask this system." So the first stop on this journey is, is the question well-posed? And you can have all sorts of issues with users asking bad questions, they have some intent behind the question, but it's poorly worded, or the question can require lots of maybe sub-questions first to build up to the final idea. And so there's a bunch of papers on this topic of what we call query transformation. So it's taking a question and either rewriting it to be a better question, breaking it up into a bunch of sub-questions, defining a step back question that's like, "Okay, wait, what's the bigger concept here? Answer that first."

So that's like, bucket one. Step one on this journey is query transformation. Can I modify my question to make it easier or better aligned with my documents? That's step one. And then step two on this journey is like, "Okay, I have a better question. Routing. So where does it need to go?" You mentioned previously, like in the real world, you don't just have one database, you might have several. A great example is you have a graph database, a relational database, and a vector store. So you need to route the questions to the right place. So routing is another whole big area.
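As a loose sketch of those first two steps, query transformation and routing, expressed as small LCEL chains. The prompts, model choice, and the three route labels here are invented for illustration; LangChain also offers more structured routing utilities than this bare-bones version.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")

# Step 1: query transformation, rewrite the raw question into a cleaner one.
rewrite = (
    ChatPromptTemplate.from_template(
        "Rewrite this question so it is clear and self-contained: {question}"
    )
    | llm
    | StrOutputParser()
)

# Step 2: routing, ask the LLM which data source should handle the question.
route = (
    ChatPromptTemplate.from_template(
        "Which data source best answers this question: graph, sql, or vector?\n"
        "Question: {question}\nAnswer with exactly one of those words."
    )
    | llm
    | StrOutputParser()
)

question = "umm how are the actors connected to that one movie?"
better_question = rewrite.invoke({"question": question})
destination = route.invoke({"question": better_question}).strip().lower()
# destination then selects which retriever to call: graph DB, SQL DB, or vector store.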

We have query transformation, then routing, then you have query construction. So you have a natural language question, you need to map that to the DSL of each data source. That's your text to SQL. Text to Cypher. Or even with vector stores you have metadata filters, so it's text to metadata filter. So I think of our flow, we have query transformation, we have routing. Then for every data source, we have query construction, like writing it in the DSL that's necessary. So then that's it. Then there is indexing. So, how have you built your data source and which have you chosen? There's a bunch of tricks in indexing.

One that's really popular that we see a lot of is what I kind of call multi-representation indexing. There's a bunch of cool papers on this, but basically if you have a document, typically you think I just chunk that document and embed it, but the document can have lots of redundant things that aren't essential to the meaning. So one trick we've seen is, take a document and just summarize it, embed the summary, use the summary of the document for retrieval, but then return the raw document to the LLM. So decouple what you embed for retrieval versus what you ultimately pass to the LLM. That's a really nice trick and it works really well-

Andreas Kollegger: That's cool.

Lance Martin: Yeah. For lots of different types of content. Actually, I've used it a lot for images. And the idea there is, if I have an image, I can actually use a multimodal LLM to summarize it and embed the image summary. And it's easier then because then I'm just using text embeddings to look up an image summary and then I retrieve the right image, pass that to my multimodal LLM, as opposed to needing multimodal embeddings, if you see what I'm saying. So that's one good example of this idea.
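A rough sketch of that multi-representation indexing trick using LangChain's MultiVectorRetriever is below. The document content, model choices, and the Chroma vector store are placeholders, and the same pattern works with image summaries generated by a multimodal model.

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

docs = [Document(page_content="... a long raw document ...")]  # placeholder content
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Summarize each document; the summary is what gets embedded for retrieval.
summarize = (
    ChatPromptTemplate.from_template("Summarize this document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)
summaries = [
    Document(page_content=summarize.invoke({"doc": d.page_content}), metadata={"doc_id": i})
    for d, i in zip(docs, doc_ids)
]

retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    id_key="doc_id",
)
retriever.vectorstore.add_documents(summaries)      # search happens over the summaries
retriever.docstore.mset(list(zip(doc_ids, docs)))   # but the raw documents are what come back

results = retriever.invoke("What is this document about?")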

Andreas Kollegger: That's awesome.

Lance Martin: So then we have query transformation, routing, query construction, index construction, ideas like multi-representation indexing. Post-processing, so then you have a bunch of documents back from retrieval. There's a bunch of cool things out there, re-ranking, document compression, so maybe you can summarize them. And then the final thing is generation. So there's transformations, routing, construction, building your index, post-processing what you get back, generation. Pass those to the LLM. And that's the whole RAG flow. In terms of entry points, the thing we see a lot of success with is designing your index in a smart way. A lot of people start with just chunk optimization.

So think about how you're chunking your documents. And with a lot of vector stores, and graphs as well, I'm sure, you have to think about the minimum stored unit that you're using for retrieval. And that's where there's a lot of interesting ideas related to how you chunk and split your documents that people play with. And we could dig into that. There's some really good videos on that that I'll link in the show notes.

And this multi-representation indexing trick that I mentioned previously is another good one that we have some nice templates out of the box for that I can share. So anyway, that was a mouthful, but the whole RAG flow is like transformations, routing, construction, indexing, post-processing, generation, and then in each of those buckets you can try different things.

Andreas Kollegger: Right. And so this is what's so great about anyone who's coming into this fresh is that you've kind of walked these paths before. These are well-trodden paths. Each time you kind of walk down the path is a bit more well defined. And so even if people are a bit intimidated, like okay, all the things you just talked about for people who are new to it, that seems like, "Oh my god, this is a lot of complexity." Within the LangChain setup of this stuff, you don't have to do it all at once. You can start with one bit and add on, right?

Lance Martin: Yeah, of course. And that's a very important point. We actually have existing templates that work out of the box for a lot of these things individually. We actually don't... And in fact, we don't even have templates that do them all. And I don't think... It depends on your application, it's not necessary to do all these things, but we have templates just for query transformation, just for some interesting indexing trick, just post-processing. And then you can easily, if you like one of those techniques, you can really easily compose them. You can combine them, yeah, you think about laying out your chain, it's just like another block in your chain you just add. So you say, "Okay, I want post-processing. I like that." So, definitely we have a lot of templates that let you test each one of these things individually, which is probably the right way to do it.

And then of course, you can really easily compose them all together if you want. I will flag, though, one of the most important things, like this is one... And I've actually just hit this doing that LangChain expression language teacher project. Running evals is really one of the most important things you can do in the layout of any of these applications. And so it can be as simple as, if you're building a RAG app, just define five question-answer pairs, and you don't have to use LangSmith to do this. We do support evaluations and I would encourage you to use LangSmith just because I work at LangChain, but of course you can try other frameworks as well. But running an evaluation is extremely helpful because I've actually found, even very recently with this coding assistant, the simplest method actually can sometimes work the best.

And I'll give you that, in particular for that case. What I'm seeing right now is actually retrieval altogether does not do better than just taking... And so maybe I'll walk back. It's a coding assistant for LangChain expression language, which is this syntax we have for building chains, right? We have a bunch of documentation on it online, of course. And so the idea was there were two different possibilities. One, just scrape all the documentation, feed it into the prompt with the LLM as just part of the context, no retrieval, just stuff it all in there and then let it do QA.

The second option was build a RAG app. We already had out of the box, we had an index of all our documents. So, build a RAG app with our indexed documents and do retrieval. And you would think the fancy RAG app's going to do better, it's more sophisticated, versus just take the docs, stuff them into the prompt, let GPT-4 128k do the rest. That works better. And why is that? And then you get into the details. I've done a lot of... I ran evals to show that and it's like, well, retrieval is not always trivial.

Retrieval has precision and recall problems. You introduce a lot more surface area for mistakes. And in that case, the docs were small enough that they would all just fit in context. 128k is a big context window. That's hundreds of pages, right? So you can stuff them all in. And actually, GPT-4 is very performant at reasoning over a very large context. So that's a good example of this point, when you're building the applications, I mean, start simple because actually we found that RAG in that particular case did not outperform the most trivial thing possible, which is context stuffing.

And I also found-

Andreas Kollegger: Right.

Lance Martin: Yeah, another fun one there is I found that we tried a more sophisticated RAG strategy still, which was like, I don't want to bore you, but basically generate a bunch of sub-questions, do retrieval, answer each one, and then pass those QA pairs into an LLM and let it do the final thing. The point is, performance was not good and it was because through all those little retrieval and LLM calls, you just blow up the surface area for making mistakes and I had a lot more hallucinations. And you think about, well, you're making God knows how many LLM calls in that chain. It's just higher complexity, more surface area for mistakes, more opportunity for hallucinations. Context stuffing? Simple, clean, push the context in.

That's all you do, no retrieval. And it is performing notably better than the retrieval-based methods. So I understand that does not work for all applications. In many cases, your data far exceeds what you can put in context. But again, it's a good example of build your eval set, run your evals, think about your problem from first principles. Sometimes the simplest methods actually work better. And in that case, we found that the simplest was actually the most performant. We have pretty good reasons as to why. Retrieval introduces lots of new surface area for mistakes. And I believe the many different LLM calls in the multi-query approach just introduced lots more opportunity for hallucination.
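For contrast, the context-stuffing baseline described here is about as simple as it gets: read the docs, concatenate them, and pass everything plus the question to a long-context model. The directory path, prompt text, and model name below are placeholders.

from pathlib import Path

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Concatenate every doc into one big context string (no retrieval at all).
docs_dir = Path("docs/")  # placeholder location of the scraped documentation
context = "\n\n".join(p.read_text() for p in sorted(docs_dir.glob("*.md")))

chain = (
    ChatPromptTemplate.from_template(
        "You answer questions about LCEL using only this documentation:\n\n{context}\n\nQuestion: {question}"
    )
    | ChatOpenAI(model="gpt-4-1106-preview")  # a 128k-context model
    | StrOutputParser()
)

answer = chain.invoke({"context": context, "question": "How do I batch a chain?"})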

Andreas Kollegger: So it's like there's some sort of a... You're almost describing like you should start with the minimal viable chain.

Lance Martin: Minimal viable chain-

Andreas Kollegger: Do the least amount possible, right?

Lance Martin: The least amount possible. I think that honestly, that is really a harder lesson. I think it's even step zero, build a little eval set. Five question-answer pairs if you're building a RAG app. Super simple. Build an eval set, build a minimum viable chain, an MVC, and evaluate it. And then ramp the complexity, evaluate that... And sometimes indeed, you need complexity. In this case, less was more. And if I hadn't run the evals, I would've... In fact, our intuition was to jump to the more sophisticated multi-query thing. And I actually built that first and then I kind of went back, went back to the MVC and found, "Oh, actually, that's the one."

Jennifer Reif: I mean, I think that's a very common thing for developers to do in general, though. They want to do fancy stuff. They want to-

Lance Martin: That's right.

Jennifer Reif: ... pull all these complex pieces together and overcomplicate things. And in reality, you don't always need that. Now, yes, some of those features are super nice and necessary for certain things, but I think we add more work and stress onto ourselves sometimes by trying to overcomplicate.

Lance Martin: Yeah. I've fallen into that trap way too many times myself.

Jennifer Reif: Same.

Lance Martin: Exactly. I think what happens is a developer-

Andreas Kollegger: Guilty.

Lance Martin: Yeah, you start with intuitions, good or bad, about what you think will work or how projects are going to go. And this is why evaluation is just so important. It's obvious. I don't know if people recognize that, but you just need an evaluation to cut through those intuitions and biases you may have. And it can lead you to non-obvious results like, oh, in this case, less was more.

Andreas Kollegger: I think people probably picked up on this through the conversation here, but let's take one step back on like, so for software developers, when you hear evaluation, when you hear people, Jenn and I talking about eval, we're talking about basically running tests on how this stuff is working. So if you're used to unit testing or integration testing, it's kind of, you end up with the same levels of things that you can test out, see if it's getting the answers you need. That's a fair way to think about it, right?

Lance Martin: That's right. And so for LangSmith and RAG for example, it's super simple. Build a CSV of question answer pairs for your domain. For me it was code, so it was like, coding question. Answer. That was it. Say, take five. Upload that CSV into, in our case LangSmith, and then you have an eval set. And here's where it's a little bit interesting. You have your chain laid out and what you do is, you basically call your chain. We have some boilerplate code that lets you say, define basically some eval config that says, "Here's my chain, here's my eval set, run it." So, whatever. It's a couple lines of code.

But what's interesting is you can specify the evaluator you want. And so this is where it's a little bit non-obvious. So what happens is, you take the question, you send it to your chain, the chain spits out an answer, and then you have your ground truth answer, right? What we'd use is, we use different... We actually use an LLM as the evaluator itself. So an LLM looks at your generated answer and the ground truth answer and reasons about the similarity and correctness.

And there's a whole bunch of work on this, and OpenAI actually put a bunch of stuff out on this, Anthropic as well, on kind of LLM-guided evals. It's a deep topic and it's really interesting. The high-level intuition is that LLMs are pretty good discriminators, so given a pair like, "Here's the student answer, here's the ground truth answer," it's pretty good at reasoning about it. And so that's what we do. And then we have a nice dashboard, we can go back and look through your results and you can look at like, here's all my questions, here's how the LLM grader graded it, good or bad, and this is all in LangSmith, it's really easy to use.

But I will say that LLM-mediated evaluation requires attention. You should manually review the results to convince yourself that it's correct or not. I wouldn't just blindly accept what the LLM grader says, but at least it gets you really far down this road of a systematic way to evaluate the outputs of these things, which is otherwise non-trivial, right? Because a lot of times, you're evaluating language. It's even hard to build eval sets a lot of times, like what is the correct answer?

And so that's why these LLM-based semantic evaluators aren't bad. I've even found, honestly for code, it's not bad. I manually went through... I built a large eval set. I manually went through and tested code execution on every single answer and compared that to just the semantic grader and it's pretty good. So it's not perfect, but it's directionally right. It gets you probably 80% of the way there and then you should manually review. So anyway, that's kind of how these things work. It's still a very active area of research.
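A minimal sketch of that eval loop with the LangSmith SDK might look roughly like this, assuming a chain defined elsewhere and noting that the dataset name, the single example, and the exact helper signatures are illustrative and may have changed across versions.

from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()  # needs a LangSmith API key in the environment

# A tiny hand-written eval set; five question-answer pairs like this would do.
dataset = client.create_dataset("lcel-qa-pairs")  # dataset name is just an example
client.create_examples(
    inputs=[{"question": "How do I stream a chain?"}],
    outputs=[{"answer": "Call chain.stream(...) on any runnable."}],
    dataset_id=dataset.id,
)

# Grade the chain's answers against the ground truth with an LLM-as-judge ("qa") evaluator.
run_on_dataset(
    client=client,
    dataset_name="lcel-qa-pairs",
    llm_or_chain_factory=lambda: chain,  # the RAG chain defined earlier
    evaluation=RunEvalConfig(evaluators=["qa"]),
)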

Andreas Kollegger: Do you have some intuition about, I guess, how much you need to eval? Is it proportional to the size of the data set? If you've got a hundred gigs worth of PDFs, do you need one gig to test, or what does your intuition say there?

Lance Martin: That's actually kind of a tough one. Yeah. You know what? There's a lot of different views on this. In classical ML, you have train-test splits and there's kind of good heuristics in those domains as to how big you want your train versus test splits to be. But in this domain we're talking about like, I have this chain that I've constructed. I'm just trying to benchmark it on a bunch of pairs of examples. Honestly, I don't want to say a number because I don't have a great sense for precisely what. I can give you, in the case of a lot of things we've built, on the order of 25 to 50 questions is kind of what we've used. That's not systematic though, and it's not particularly rigorous. It's almost what's kind of practical. And this is actually an interesting topic. It gets into, you can also have LLMs generate your eval sets, like auto-generation of eval sets.

I think maybe the key point I would highlight is like, you want something that you can actually manually interpret. I'll say it like this. Let's say we had an LLM build a thousand-question eval set for us. You could do that feasibly. You run your eval, you get a score of 75%. How do you actually validate that? You can't really, you have to go through. It would be incredibly tedious, right? So I actually kind of like to say, just even if it's underpowered in terms of number, build a small eval set that I can actually manually inspect, like 20. Inspecting 20 answers took me two or three hours last night. So it's feasible, right? It's tedious, but completely doable, and at least I have real confidence as to what's going on. So I think that's kind of the practical thing here. You could build an arbitrarily large eval set, but can you actually believe the results?

Can you actually interpret it? Can you go through it? And look, I argue it would be worse to build a huge eval set that you can't actually... You don't have the bandwidth to actually go review it and you're just believing it, blindly believing it, versus a small eval set where you actually look at every single answer yourself and manually verify, which is better. Again, it's kind of a less-is-more thing. You know what I mean? And also I would encourage, build your own eval set, build it yourself. Don't use an auto-generated thing.

So then you have a few really highly curated examples that you actually believe, that you can actually manually review, and you learn a lot more. For the coding example, that's how I really learned like, oh, hallucination is a real problem when you do the multi-query thing. And there's lots of reasons why that might be. Oh, retrieval is a real issue for the base-case RAG, precision and recall issues in retrieval. And then, oh, the context stuffing is actually pretty good. It makes a few small mistakes here and there. So I was able to actually completely understand what's happening with a smaller eval set that I actually curated myself. Yeah. So that's probably the way I would encourage.

Andreas Kollegger: Awesome.

Lance Martin: Less is more.

Andreas Kollegger: That's a lot of good thoughts, a lot of good things to keep in mind.

Jennifer Reif: Oh yeah. And this is something we could probably continue on for hours more.

Lance Martin: Yeah, I know. I know we're tight on time.

Andreas Kollegger: I'm resisting that. There's a bunch of things I want to follow up on.

Jennifer Reif: I know. It's fascinating.

Lance Martin: Yeah. Absolutely. I'll leave all this in the show notes. I'll put all this in the show notes.

Jennifer Reif: Great. For those of you maybe not watching, it's getting really dark outside where I am. I'm losing my daylight. Really fast here.

Lance Martin: You're losing light. Yep, absolutely.

Jennifer Reif: So we'll just kind of segue, I guess, now into the tools of the month. And we like to highlight things that each of us have been working on and are really enjoying playing with, and it's not necessarily graph related or even LangChain related, but just everyday tools. So does anybody want to go first?

Lance Martin: Go ahead. Andreas? Or Jennifer, yeah.

Andreas Kollegger: This is new to me because I'm not natively a Pythonista, and so I'm learning a lot about the Python ecosystem. And so if I was going to build UIs in the past, I would just pick up TypeScript and do some React stuff and that'd be my life. But I've come to appreciate this library called Streamlit that you can use within Python apps to just build really nice-looking UIs pretty simply. And it's pretty nice.

I don't know enough about it to be able to compare it versus building a React app, or where the trade-offs are, but out of the box, you can just get something put together that's nice, and that has been really lovely. So if you're like me and you're new to Python but you still need to build some UIs, definitely go check out Streamlit. I think at this point there's a bunch of other something-lits out there. There's almost like a family of something-lit UI-building tools within Python. That's what I've come to see. So that's my recommendation for this month.

Jennifer Reif: Yeah. And I can actually back up Andreas' tool of the month here. I also am not a Pythonista, nor do I have very much experience building UIs. And I've also been kind of playing a little bit with some of the Streamlit stuff. And Andreas is right, there's some things that are pretty nice and it's... Once you get into it, it's much less daunting than what you anticipate, as a non-UI person here.

Lance Martin: Yeah. I'll add another plus one for Streamlit. I'll throw one other out there for you. If you guys are interested at all in local LLMs, like running models on your computer, on your laptop, for example, there's a tool called Ollama, which I'm a big fan of. It is very easy. One click to download the app itself and then to grab a large number of open source models and just run them. It's like a one-liner, like ollama pull and the model name. You can get all sorts of models, like Llama 2. And a lot of the newer models are in Ollama and then you can just... There's a new SDK for it. You can just interact with it on your computer. It works really well with Macs, although other environments as well. Highly recommend Ollama. Really cool for local models.
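Assuming the Ollama app is installed and a model has already been pulled (for example with ollama pull llama2 on the command line), calling it from LangChain is only a couple of lines; the model name here is just an example.

from langchain_community.llms import Ollama

# Talks to the locally running Ollama server; no cloud API involved.
llm = Ollama(model="llama2")
print(llm.invoke("In one sentence, what is a knowledge graph?"))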

Andreas Kollegger: I love that. That sounds fantastic. I want to go play with that a bit more. This is, again, for people who are new to this area like, okay, large language models are large, but actually you can run some of this stuff on your laptop. You don't need to have Google's resources or Microsoft's resources to do this kind of work. To build it you need the heavier hardware, but to run this stuff, you can kind of get away with this.

Lance Martin: Absolutely.

Jennifer Reif: Yeah. And mine is just going kind of back to basics. So I'm choosing Git for this month. Of course, like most developers, you interact with some sort of version control, and Git is a very, very popular one, whether you're using GitHub or not. But I've been playing a lot with GitHub and actually working on a collaboration project with several people with GitHub. And just a lot of the commands to fork and merge branches and collaborate on code projects.

You forget sometimes how cool it really is and how powerful it can really be. And just pulling in everybody's changes and merging that with your code and pushing stuff up for people to look at, creating PRs for people to review before you push code into a critical branch or something like that. It has kind of reminded me and refreshed me on a lot of the commands that I don't use every day, as well as just how really cool and powerful and collaborative this tool really is. So if you're not using it, check it out or maybe just sit back and take some appreciation for some of the stuff.

Andreas Kollegger: I'm going to give a plus one for that, even though it seems like okay, Git is kind of everywhere, but I think it's so everywhere that people don't maybe appreciate when you weren't using Git. If you're using, I don't know, Subversion or something, even older models of doing version control, Git is so nice and forgiving. There's so many commands you can use. And I find whenever I screw up, Git's super helpful in helping me be like, "Hey, you probably meant this."

Jennifer Reif: Yep.

Andreas Kollegger: Yes. Thank you, Git. Thank you for-

Jennifer Reif: Guiding me in that right direction. Yeah.

Andreas Kollegger: Yeah, totally.

Jennifer Reif: All right, well this is the part of the episode where we allow our guests to kind of exit. And as we're recording late on a Friday, well late for me, I suppose. And if Andreas were in his normal time zone, it'd be really late for him. But yeah, so thank you so much, Lance, for coming on and chatting and all of the excellent information you provided. I will be sure to ping you about the show notes, links, and resources that you want to provide and those will all be included for anyone interested.

Lance Martin: Absolutely. Yeah, it was a lot of fun. Thanks so much. Sorry I probably talked too much, so sorry about that.

Jennifer Reif: Never.

Andreas Kollegger: You were super-

Lance Martin: Yeah, you kind of got me going. You got me going on some stuff.

Jennifer Reif: That's okay.

Lance Martin: Yeah, it was good.

Jennifer Reif: That's what this platform is for.

Lance Martin: Yeah, for sure. All right.

Andreas Kollegger: All right.

Jennifer Reif: Thank you. Have a good rest of your weekend.

Lance Martin: Yep. All righty.

Andreas Kollegger: Time for Neo4j News?

Jennifer Reif: Yes. So just a couple of things I'm going to highlight. We won't keep everyone on here for too much longer, but there are a couple of articles that are sitting out there. I know it's January, so not a ton of progress yet. That'll be later. You'll get hit with the fire hose, maybe. But a lot of things out there actually on RAG and LLM stuff, which is still quite popular. So the first one is Why Vector Search Didn't Work for Your RAG Solution. This is out on the Neo4j developer blog and I worked through that. I read it. It's a really good article that kind of takes the movies data set and pings different questions with the LLM and shows why a question worked really well, some hallucinations LLMs might have, trying to negate or add negative data inside the data set and seeing how the LLM responds to that.

So it was really cool just to kind of get an idea of different, I guess, kind of variations of things with an LLM, when it works really well and when maybe it hallucinates or tends to do worse. The next one was taking YouTube transcripts into a knowledge graph for RAG applications. Another really cool one, just kind of showing you how to ingest data. I think it used GCP to push the YouTube videos up and pull the transcripts out into chunks and then looked at that inside a graph.

NeoDash did some graph dashboard stuff. There's some new things going on there. So if you're in the NeoDash space, definitely check that out. There's a new article on Medium for that, talking about a graph and LLM-powered RAG application from PDFs. So, taking a lot of articles that this author has written before, a lot of conceptual information, and actually building an application and showing how this works in practice was really cool.

And then the last but not least was unlocking DAGs in Neo4j, which is directed acyclic graphs. So there's some new GDS procedures and showing just basically how those operate. As always, there's some new stuff coming out in the NODES 2023 playlist. So if you're more of a video type of person, be checking on that. I know there's at least one or two that have been added this month. Also keep an eye out for that.

And then just a couple of events, just to wrap up the whole segment here. JFokus in Stockholm, Sweden is on February 5th. Just a quick shout out from my personal past is I've been to this conference a couple of different times. It's a really awesome event, so if you have an opportunity to go, definitely check that out. There's a meetup on February 8th in Kansas City, Missouri. Actually, I'll be presenting that one. It's on improved results with vector search and knowledge graphs.

On February 13th, there's also a meetup in Brisbane, Australia. And on the 21st there's a virtual meetup that I'm presenting: Hallucination-Free Zone, LLMs and Graph Databases Got Your Back. February 22nd has a Graph Summit London conference and a Developer Week conference in San Francisco. So you've got London and San Fran on February 22nd. There's a meetup on the 23rd in Bangalore, India with a hands-on workshop on the GenAI stack. And wrapping us up, we have a conference in Amsterdam, Netherlands called DevWorld 2024, and that's on the 28th. So anything else, Andreas?

Andreas Kollegger: That was plenty. I don't think we need much more. I was a little disappointed that DAG wasn't something to do with LLMs. DAGs and RAGs, I don't know.

Jennifer Reif: That's true, that's true. Yeah. Yeah, initially when I pulled up the article, that's what I thought of as well, but it was not. It's GDS procedures. All right, well thank you everyone for joining us on this session and we will talk to you in the next month.

Andreas Kollegger: Goodbye, y'all.

Jennifer Reif: Bye.