GraphStuff.FM: The Neo4j Graph Database Developer Podcast

RAG Databases with Johannes Jolkkonen: When to Choose a Graph Database vs Alternative Vector or Relational Stores

Episode Summary

Welcome back, graph enthusiasts, to GraphStuff.FM, a podcast all about graphs and graph-related technologies! I'm your host, Jennifer Reif, and I am joined by Alison Cossette and guest Johannes Jolkkonen. Johannes is an independent consultant, using his background in data engineering to help companies build their RAG applications, and he creates tutorials on YouTube around RAG, knowledge graphs, and LLMs.

Episode Notes

Speaker Resources:

Tools of the Month:

Community Projects:

Articles:

Videos:

Events:

Episode Transcription

Jennifer Reif: ...back, graph enthusiasts, to GraphStuff.fm, a podcast all about graphs and graph-related technologies. I'm your host, Jennifer Reif, and today I am joined by Alison Cossette. And I'm really looking forward to chatting with our guest today. I hope you are as well. Joining us is Johannes Jolkkonen. Johannes is an independent consultant using his background in data engineering to help companies build their RAG applications. He also creates tutorials on YouTube around RAG, knowledge graphs, and LLMs. Welcome Johannes.

Johannes Jolkkonen: Thank you, Jenn. Thank you. It's great to be here.

Jennifer Reif: So I did just a tiny bit of research, but I was hoping you would kind of fill in for the audience. I saw you have done some things on Azure, as well as some LLM applications. How did you get into that mix?

Johannes Jolkkonen: Yeah, it's quite a journey, as it often is. Basically I started my career in IT, in tech, back in business intelligence, very front-facing business intelligence: dashboarding, Power BI, this kind of stuff. But then I pretty quickly started to climb my way up the data pipeline into more and more data engineering type of work: integrations and databases and data modeling, that kind of stuff. And that pretty easily led into more platform engineering work, working with cloud platforms, mainly Azure has been my go-to, infrastructure as code, deployment pipelines.

And that's basically been the journey, that's what I've mostly been doing up until last summer. Mostly doing consulting, so a lot of variety: different projects for different types of enterprise clients.

And the turning point, in terms of the shift toward a focus on language model applications, which has since become the main thing, happened last summer, 2023. Sure, at that point I had done some playing around with these language models that had taken the world by storm, but it was more of a sideshow at that point, trying to figure out if they could help me do my work better. My main focus was still on the more conventional data engineering stuff that I had been doing.

But then I was talking with a friend of mine, and we came up with an idea for a startup that we got pretty excited about. The short version is that it was related to working with travel information, especially administrative things like visa processes and becoming a resident in a new country, this kind of stuff. Working with that kind of information to help people navigate it. And a big part of the idea was using language models and something like RAG to make that kind of information more accessible.

And we really got excited about that idea, and that's really when I first got serious about building something hands-on and getting into all of this. That was something we worked on pretty intensely for several months. Eventually my friend and I decided we had other priorities and put that project on ice, but the end result for me was still that I got hooked, basically, on LLMs and RAG, and I decided to shift my whole focus to doing more of that.

So now I'm still doing independent consulting, like you said. But instead of the pure data engineering work, now it's figuring out how to best create data applications, but with the LLM twist.

Jennifer Reif: That's pretty cool, actually. Because we've talked to some people in the past, and so far a lot of the people we've talked to have been building startups around LLM platforms or technologies. We haven't really talked to many people who are using RAG and LLM applications in industries outside of tech. Things like travel, which you mentioned your startup was focused around, and maybe other general business use cases that you would see outside of just the tech sphere, if you will. So that's really interesting.

And I also saw you have a YouTube channel that seems to be well populated or has a lot of content there. How did you get into that?

Johannes Jolkkonen: Yes, it is getting there, becoming more and more populated. That I also started doing last summer. And the main motivation was personal learning, because, as you probably understand as content creators yourselves, it's just such a good way to learn. When you sit down and decide, okay, I'm going to create a blog post or a YouTube video about this topic, you really have to learn something very deeply and do a lot of research. For a 10-minute video, if you want it to be good, you need to do enough research for a 200-minute crappy video. And then you set yourself some kind of schedule: okay, I'm actually going to put this out there every two weeks, or once a month even, and you do that regularly every time.

Jennifer Reif: Right. That's actually my favorite way to learn.

Johannes Jolkkonen: Yeah, it really adds up. I mean, we tend to be learning a lot constantly as developers, but it can often be superficial and ad hoc. You figure out something just enough to get the code working, but you don't really go deep in the same way as you do when you create content.

So that was the main motivation. I liked the video format because you also learn to speak better, which is an underrated skill, definitely.

Well, the last thing I'll say, which wasn't part of the initial motivation but is more of a thing that's surprised me afterwards, is the amount of community engagement that's been coming my way as a result. The channel is still pretty small, and I still have a lot of work to do to publish more regularly. But even then, a lot of people have been reaching out: people building startups wanting to share ideas and talk about their projects, and obviously getting to talk with you on occasions like this. That's been a ton of fun, and it's really been quite surprising as well.

Jennifer Reif: So this started as basically a way to learn the stuff, the technologies or the topics.

Johannes Jolkkonen: Oh, 100%.

Jennifer Reif: Okay, cool. And then it's just kind of propagated, as it so often does, to other people then picking up and learning from you and then kind of engaging back with you, which is really cool. That's actually one of my favorite things about advocacy, is the ability to learn something and then to share what you've learned with other people, hoping that it helps somebody or somebody can avoid the pitfalls you hit along the way or so on. So yeah, I love that about content and I love that about the technology industry in general.

Are there specific topics in your videos you like to focus on? Do you tend to go deeper, or do you cover the basics?

Johannes Jolkkonen: I would say I do like to go quite deep and do something actually non-trivial, let's say. And I think the hands-on aspect is the biggest thing I try to keep up, because there are a lot of people doing, let's say, more newsy types of videos. "Google just announced this. OpenAI just announced that." That can attract a larger audience, but what a lot of people have given me as feedback on my content is that it's really helpful to also have a very step-by-step, code-with-me type of tutorial to bring it into practice.

So that's what I enjoy doing, and that's how I learn the most myself. But it also seems to be useful for a lot of people.

Jennifer Reif: Yeah. That's really great. I love that.

Alison Cossette: Yeah, I will say, the first time I saw one of your videos, what I was taken with is the fact that it is non-trivial. I'm a longtime teacher in data science, and I really like the way you make it very achievable. It's a very welcoming vibe, if that makes sense, in the way that you communicate with people. And it's just really empowering, I think, the way that you present the material. People will watch and say, "Oh, I can follow along with that. I can do that." So that's what I appreciate about your work, for sure.

Johannes Jolkkonen: Beautiful. Beautiful. Thank you.

Jennifer Reif: Yeah. So for those of you listening, if you have not checked out Johannes's videos, I will put a link in the show notes and send you on over so you can check them out and code along with him, if you so choose.

As far as the core topic that I'm super excited to talk about today: you've done a lot, as you mentioned, with RAG, especially recently in the last few months with your startup and some other applications and consulting. And what you wanted to focus on today is when it makes sense to choose graph databases for RAG, which is short for retrieval augmented generation, versus alternatives like a traditional vector database. I say traditional, but vector databases aren't that old. And then the even more traditional relational databases.

So if we could talk a little bit more about your thoughts and your opinions and what you've seen.

Johannes Jolkkonen: Yeah, absolutely. Well, I first got into graphs actually also as a result of this startup. Because in our case, it was pretty clear in the early stages of the development and exploration that a simple vector store wasn't going to do the trick, because we clearly needed more structure in our data, more links and relations between different categories. So that's how I got into graphs as well. I stumbled on some great material from Tomaz Bratanic from Neo4j. He's been putting out great articles and still does.

Jennifer Reif: Yep. He's amazing.

Johannes Jolkkonen: Yeah, he's amazing.

But I'll give you an example of what I have also been seeing, especially recently. Not too long ago I was talking with a guy who reached out, and he wanted to do a RAG-type application with legal data. The data he wanted to work with was mostly historical legal cases: case documents, law texts, acts, this kind of stuff.

Jennifer Reif: Which probably can't be public, right?

Johannes Jolkkonen: Actually, I'm not sure about that. I think they didn't have the personally identifiable information. But still, a lot of these cases, I do think they are public, minus the personal details.

But the main thing about this data was that it did have a lot of links included as part of the metadata. Things like, okay, when there's a legal case, there's a final decision that's made. What's the legal basis in law that that decision rests on? So you have some links to the related laws that were used in making the decision. You have links between laws and other laws. If you have a law set at the national level, that might be related to some European Union directive, or, if you're in the States, federal legislation might trickle down into some state-level law. So there was interesting linkage available in the data.

So what he wanted to talk to me about was the possibility of creating a knowledge graph out of this data to help people navigate the information. Which makes sense at first glance, given that that data is available. But when we actually talked about the use case he was trying to enable for users of this application, it really boiled down to mostly lawyers looking for precedent cases. They're involved in an upcoming case and want to do some research; they want to find past cases that are similar, which they can then use in preparing for the new case. And this guy thought that would probably be 90% of the usage for the users of the application.

And that boils down to a similarity search problem, really. You can most likely achieve something like that just by embedding text and comparing those embeddings, and you'll find similar cases just like that. And if you also then want to retrieve the related laws for those documents, that's still only one hop away. So you can bring in a simple relational database schema, something like Postgres with the pgvector extension, which adds vector store support to Postgres. Then you can do the similarity search and the simple relational search in the same database.
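For a sense of what that looks like in practice, here is a minimal sketch of the Postgres-plus-pgvector pattern: a similarity search over embedded case documents, then a one-hop join to the laws each case cites. The table and column names (cases, case_laws, laws, embedding) are hypothetical, not the actual project's schema.

    import psycopg

    # One query: vector similarity search first, then a one-hop join to cited laws.
    query = """
    WITH similar AS (
        SELECT id, title
        FROM cases
        ORDER BY embedding <=> %(q)s::vector  -- pgvector cosine distance
        LIMIT 10
    )
    SELECT s.id, s.title, l.name AS related_law
    FROM similar s
    JOIN case_laws cl ON cl.case_id = s.id
    JOIN laws l ON l.id = cl.law_id;
    """

    question_embedding = [0.0] * 1536  # placeholder; embed the lawyer's question in practice

    with psycopg.connect("postgresql://localhost/legal") as conn:
        similar_cases = conn.execute(query, {"q": question_embedding}).fetchall()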

And yeah, I gave him the advice that I didn't think it really made sense to go for the graph approach at that point. Although you can do the same things, you can do similarity search, you can do vector search in Neo4j, and you can of course do the one-hop searches in a graph as well, you do also add a lot of complexity, some very specific challenges that I think we can talk about a bit later on. And if you can avoid that, I think it's a good idea to start by building something with the simplest possible architecture you can.

Jennifer Reif: Yeah, absolutely. There are cases, and I think we have some content on this, but it's a tough topic to talk about: okay, graphs are great for certain things, but then there are other things they don't handle so well, or don't handle any better than other technologies out there.

And I think what you kind of recommended is: okay, it's lightly connected data, which means you don't have a lot of hops between different types of entities. You're not having to do a lot of lookups or joins as you would in a traditional relational database. If you only have one, maybe two joins, it doesn't make a whole lot of sense to put that into a graph. You're not going to get massive amounts of value over something else in that respect.

So I would tend to lean the same way you recommended there. And that's one of the interesting things about databases and data in general: there are all kinds of solutions out there, and how do you know when to pick something? That's a tough decision for developers and others to make.

Johannes Jolkkonen: Yeah, definitely. And obviously I do love graphs for a lot of stuff, otherwise I wouldn't be on the podcast, but the way I presented it to him was more as a short-term roadmap versus a long-term vision. Because what graphs do enable is a lot of, let's say, more advanced features, things like recommendation features in your system. Those often do involve more complex hops between different entities, doing things like collaborative filtering: looking at what similar users have liked or interacted with in the past and using that to recommend something to a different user.

But again, it's something that might be better placed somewhere further along down your roadmap, rather than the first thing: I'm going to build a RAG application and it's going to go from zero to graph. Which I think happens partially because it's such a gripping concept, even visually. The idea of knowledge graphs is very intuitive, which is one of its strengths. But that does, I think, blind people who are just getting into the field, so that they don't think about the trade-offs and the challenges until they actually get totally stumped at some point, when they could have thought those through in the, let's say, early stages.

Alison Cossette: Yeah, I think especially because everything around RAG and GenAI is still relatively new for most people, there's a lot of information and not a lot of experience on any given team. So having resources like this conversation, and what you cover on your YouTube channel, really matters. It's important to recognize that it's new for people, and developers don't necessarily know where to go. I come from the data science side, so for me it's much more obvious.

But now that we have developers really involved in building AI in newer ways, I think taking this very thoughtful approach to the data that's underneath, and to the database itself, is just so important right now. So I'm glad that you're here to share what you've learned.

Jennifer Reif: And just to take a step back for those who are maybe really, really new to RAG and LLMs in general. RAG, which is short for retrieval augmented generation, I mostly think of as using an LLM while accessing an external data source of some kind. Usually that's a database: a graph database, a relational database, or a vector database, I should say. And it's just using an external data source to add more information to what's being sent to the LLM, so that the LLM makes a more informed decision about what the response should look like.

So for instance, if I want a recommendation on where to travel, but I have this personal data store of all the places that I've been, an LLM would not have that information, and so it might recommend a gorgeous beach somewhere, but maybe I prefer mountains and maybe I've already been to two or three beaches and it might recommend someplace I've already been or someplace I don't really want to go.

So that's where you could potentially add the data source of all the places that I like and I've already been, and send that over to the LLM. And the LLM will go, "Oh, don't use these things. Pick something else instead." And so then it can make a more informed decision and give you back a better recommendation for that. So just to take a quick segue out.
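In code, that retrieve-then-generate loop is short. Here's a bare-bones sketch, where fetch_visited_places is a hypothetical stand-in for whatever database query backs the personal data store:

    from openai import OpenAI

    client = OpenAI()

    def fetch_visited_places(user_id: str) -> list[str]:
        # Hypothetical lookup against the user's personal data store.
        return ["Cancun", "Bali", "Phuket"]

    def recommend_destination(user_id: str, question: str) -> str:
        visited = fetch_visited_places(user_id)
        context = "Places this user has already been: " + ", ".join(visited)
        # The retrieved context travels with the question, so the model can
        # avoid recommending places the user has already visited.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "You recommend travel destinations. Do not suggest "
                            "places listed in the provided context."},
                {"role": "user", "content": f"{context}\n\n{question}"},
            ],
        )
        return response.choices[0].message.content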

But maybe, Johannes, you can talk about your thoughts on when a graph database works well. Maybe an example you have where a graph database was really successful, or where others might not have worked well. And then maybe the same for your thoughts on vector stores as well.

Johannes Jolkkonen: Yeah. Yeah, absolutely. On the first count, I think I can use my own example of the travel data we were working with. There was a very clear hierarchy as well: we were structuring the data by region, first of all. So geographical information. You have a country, a country has cities, and each city has its specific types of information, its points of interest, but also its regulatory specifics.

So navigating down that hierarchy is something that's difficult to do if you're just looking at, let's say, vector similarity. Because with a certain type of restaurant, for example, what makes those kinds of restaurants a group can be shared across all cities in the world, so there's no differentiation there.

Whereas if you do have this graph, then you can navigate down the graph to the correct level of detail in a much better way. And especially if you want to do things related to geographical distance, graphs are also great for that.
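A sketch of what that hierarchy traversal could look like against Neo4j; the Country/City/PointOfInterest labels and relationship types are assumptions for illustration, not the startup's actual data model:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Walk Country -> City -> PointOfInterest to scope the search to one city.
    query = """
    MATCH (:Country {name: $country})-[:HAS_CITY]->(city:City {name: $city})
    MATCH (city)-[:HAS_POI]->(poi:PointOfInterest {category: $category})
    RETURN poi.name AS name, poi.summary AS summary
    """

    with driver.session() as session:
        for row in session.run(query, country="Japan", city="Tokyo", category="restaurant"):
            print(row["name"], "-", row["summary"])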

And I think a recommendation system is a really good, more established use case for graphs as well. Because again, you are looking up connections from user behavior to all kinds of different content they have interacted with.

And then again, what I mentioned earlier about social recommendations: that really depends on what other users you are interacting with, and what those users then do on the platform can trickle down to your recommendations or the way the system is customized for you. So anything that has to do with social connections, and that could also be social connections in a professional network.

Another proof-of-concept prototype I have been working on was related to finding experts in a company. So finding experts based not only on their skill set, which might be something you can do in a simple vector store with some metadata: you can have each expert as their own entity with some tags for the skills they have. But as well as the skills, we also wanted to find them based on the projects they had been involved in in the past. So to see, okay, there's a new project coming up. First we want to find some similar projects, and then we want to find the experts that were connected to those projects.
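That two-step flow, a vector search for similar projects followed by a hop to the people behind them, might look roughly like this; the index name 'project_embeddings' and the schema are illustrative guesses, not the prototype's actual code:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Step 1: vector search for the 5 most similar projects.
    # Step 2: one hop to the experts who worked on those projects.
    query = """
    CALL db.index.vector.queryNodes('project_embeddings', 5, $embedding)
    YIELD node AS project, score
    MATCH (expert:Person)-[:WORKED_ON]->(project)
    RETURN expert.name AS expert, project.name AS project, score
    ORDER BY score DESC
    """

    new_project_embedding = [0.0] * 1536  # placeholder; embed the new project's description

    with driver.session() as session:
        for row in session.run(query, embedding=new_project_embedding):
            print(row["expert"], "via", row["project"])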

I think those might be the examples I can think of off the top of my head.

Jennifer Reif: So that's almost combining a social and a recommendation combination there.

Johannes Jolkkonen: Absolutely. Absolutely. And in general, anytime you are combining different types of searches, it makes sense to go for a graph approach.

Jennifer Reif: Okay. And then I guess maybe briefly on just a vector store, what usually makes sense there? Is it just when you're looking at similarity searches and you're not trying to find a lot of extra data with that?

Johannes Jolkkonen: Yeah, I would say so. If you're looking for isolated pieces of information, that's the way I would put it. Whether that's by similarity or keyword or category; you can also apply things like geographical filters if you have the geographical information.

But the bottom line is that these individual points don't relate to each other in a way that's important for your search. That's really what a simple vector store is: it's flat, basically. There's no hierarchy, there are no links, no relations. But oftentimes that's enough, because you don't need links, but you can still have useful metadata, like these tags I mentioned. If we were just looking for experts based on skills, we could do that in a vector store, because you could just have that as a metadata field and filter by it.

But the moment you start to want to have this interconnected search where the results depend on what the people did with other entities or other people in the past, then it starts to reach its limits.
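A tiny sketch of that flat-store setup, here using Chroma as one example of a simple vector store; the field names are illustrative assumptions:

    import chromadb

    client = chromadb.Client()
    experts = client.create_collection("experts")

    # Each expert is an isolated point: an embedding plus flat metadata, no links.
    experts.add(
        ids=["e1", "e2"],
        embeddings=[[0.1, 0.3, 0.5], [0.2, 0.1, 0.9]],  # toy vectors; use real embeddings
        metadatas=[
            {"name": "Ada", "primary_skill": "graph databases"},
            {"name": "Bo", "primary_skill": "data engineering"},
        ],
    )

    # Similarity search plus a simple metadata filter, no relationships involved.
    results = experts.query(
        query_embeddings=[[0.1, 0.3, 0.4]],          # the embedded search text
        where={"primary_skill": "graph databases"},  # flat metadata filter
        n_results=5,
    )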

Jennifer Reif: And you also had a note here looking at entity resolution and deduplication when you're looking at a knowledge graph, and maybe what kind of challenges that involves.

Johannes Jolkkonen: Yeah, that ties into this discussion of, okay, what exactly are the challenges in a graph? What's the reason not to always go for it? Because it's, I would say, the most powerful option: it can basically do what a vector store can do, and it can do what a relational database can do, for the most part.

But there are basically two main challenges involved, and one of them is this entity resolution question. What that means is just the problem of determining whether or not the records you get from different data sources are actually referring to the same entity. And that's relevant here because what a lot of people are exploring now is extracting entity and graph information from unstructured or semi-structured data, making the implicit relations in some documents more explicit. And then you can build your graph out of that. It's a very promising application, and these language models are actually really good at doing something like that.

But the challenge there is that if I were going to extract some data related to cloud platforms, for example, then my extracted graph data would probably contain things like aws in lowercase, AWS in uppercase, and Amazon Web Services. But all three of these are obviously referring to the same thing.

So what that means is that essentially every time you are doing this unstructured-to-graph data extraction, you basically need to add another step there, which is this entity resolution step. Figuring out, okay, which of our raw extracted entities are actually the same node? Which ones do we need to merge and combine?

And that can be very challenging. There are some good methods there, starting from simple business rules. You might have something like, okay, if two user nodes have the same phone number, then that means they're the same user and we're going to combine them. But in this AWS example, there is no clear identifier, no universal identifier you can use.
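That phone-number rule could be expressed as a Cypher merge, for example with APOC's apoc.refactor.mergeNodes. This sketch assumes the APOC plugin is installed and uses a hypothetical User schema:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Group User nodes by phone number and merge each group of duplicates
    # into a single node, combining properties and relationships.
    merge_duplicates = """
    MATCH (u:User)
    WHERE u.phone IS NOT NULL
    WITH u.phone AS phone, collect(u) AS dupes
    WHERE size(dupes) > 1
    CALL apoc.refactor.mergeNodes(dupes, {properties: 'combine', mergeRels: true})
    YIELD node
    RETURN phone, node
    """

    with driver.session() as session:
        session.run(merge_duplicates).consume()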

In that case, one thing I've been looking at for this purpose is the similarity algorithms in Neo4j, which I think have a lot of promise for that. And I'm actually going to plug my upcoming demo on Neo4j Live in about two weeks, where I'm going to do a walkthrough of how to do this in practice.

Jennifer Reif: Great. I'll put a link to that in the show notes too.

Johannes Jolkkonen: Excellent. Excellent. But the short and sweet of it is that node similarity is a similar concept to something like semantic similarity or vector similarity, but instead of only being able to look at similarity based on text, these Neo4j algorithms allow you to compute the similarity of nodes based on their relationships as well.

So if we imagine our three different versions of AWS, all of them are probably going to have similar relationships: the parent company is Amazon, the platform has services like Lambda, Amazon S3, EC2. Then, looking at those relationships, we can compute a similarity score that's probably going to be pretty high between those nodes, and we can perhaps tag them and then merge them into one.
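With the Graph Data Science library, that idea might be sketched like this; the projection, labels, and 0.8 threshold are illustrative choices, not a recipe from the episode. It assumes the GDS plugin is installed:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Project the entity graph into GDS's in-memory catalog.
    project = """
    CALL gds.graph.project('entities',
        ['Platform', 'Company', 'Service'],
        ['HAS_PARENT', 'HAS_SERVICE'])
    """

    # Node Similarity scores pairs by their shared neighbours (Jaccard), so the
    # three AWS variants should surface as a high-similarity cluster.
    stream_similarity = """
    CALL gds.nodeSimilarity.stream('entities')
    YIELD node1, node2, similarity
    WHERE similarity > 0.8
    RETURN gds.util.asNode(node1).name AS a, gds.util.asNode(node2).name AS b, similarity
    ORDER BY similarity DESC
    """

    with driver.session() as session:
        session.run(project).consume()
        for row in session.run(stream_similarity):
            print(row["a"], "<->", row["b"], round(row["similarity"], 2))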

But there's no one-size-fits-all solution for that, and you'll probably need a lot of different processes to make sure that all your different duplication problems can be solved. Because if you are going to have these kinds of different versions of the same entities, then obviously the consistency of your graph is going to suffer quite a lot. All your data related to AWS is going to be scattered across different regions of the graph, and then you aren't going to be able to get the information as reliably.

Alison Cossette: Yeah. I mean, for me, this idea of entity resolution being an integral part of the pre-processing when we're working on graph problems of all different kinds, we definitely see that. And anytime you're working on machine learning of any kind, you really need to get that data as tidy and clean as possible.

But especially in graph, to your point, Johannes, the power of graph is in the structure. And that structure stops reflecting reality when you miss out on that entity resolution step. So it's really important to have that be part of the pre-processing. I love that you're pointing that out. Thanks.

Jennifer Reif: The other challenge you had mentioned is text to Cypher. So doing a knowledge graph with retrieval augmented generation, and fine-tuning the LLM to generate the Cypher code to run against the database itself and get back the results you need.

Johannes Jolkkonen: Yep. Yeah, absolutely. So that's the second issue. One has to do with actually constructing the graph, as we just discussed. But then once you have the graph, you're also going to want to query it to make some use out of it. And often the obvious RAG approach is to just take natural language questions from the users and then convert them into the appropriate Cypher statements that you can run against the database.

Jennifer Reif: Right. Then the user doesn't have to learn Cypher, right? Or specifically, yeah.

Johannes Jolkkonen: Yeah, exactly. Not ideal if you're using it for a customer service use case or something.

Jennifer Reif: Right.

Johannes Jolkkonen: And again, these language models, especially up to GPT-4, do have a pretty good generic understanding of Cypher. They can do simple queries pretty well. But if you're going to do more complex ones, which is often the reason you want to use a graph in the first place, then they do start to struggle pretty fast and pretty hard.

So one solution to this, I think the necessary solution if you want to take this kind of text-to-Cypher use case into any kind of real-life scenario, is fine-tuning. Just using some examples of queries paired with the correct Cypher statements, and fine-tuning an open-source model to be better at the Cypher generation. And I've seen some indication that you can definitely make them reliable enough for a production use case.

But the challenge there is the data, of course. Because, well, first of all, you can't just use a generic, publicly available training dataset of these kinds of good examples, for two reasons. One being that one doesn't exist right now. I know that Tomaz, again, is working on a crowdsourced initiative to build a training set like this, which I think is awesome, and I think that's going to help a lot of people do this and improve the performance.

But I think there are also a lot of use cases where these generic Cypher examples aren't going to be enough, and you're actually going to need examples directly from your graph. If your specific use case has a lot of time-based queries, for example, and maybe you have a lot of different date fields in your graph, you have users with registration dates and birth dates, and you have transaction dates and delivery dates, then these are things that might confuse the model into not knowing exactly which field to filter which query on. So that brings in the importance of having some examples that are very specific to you. But getting those, generating and creating those examples, isn't easy, because you ideally need thousands to make a significant impact. So that would be the other of the two biggest challenges that I see.

I will add, though, that that really only applies to situations where you are having the language model generate the Cypher from scratch every time. And that's not always the case. You might be doing something very complicated, for example, again, a recommendation engine. You might have a very, very monstrous Cypher query to look at relationships between users and their history and all kinds of things, but that recommendation algorithm might actually always be the same, with the only things changing being maybe the user ID that the recommendation is specific to, and, if you're recommending movies, a genre parameter. The rest of it you can just have hardcoded in your application, and then all you need to do is substitute the two parameters. And suddenly this is a much easier problem that you don't necessarily need fine-tuning for.

So there's a difference between having a system that can generate arbitrary Cypher statements, flexibly handling any kind of query your graph could potentially be asked, versus a predetermined set. If you want to support the former, that's difficult; you're probably going to need fine-tuning. But again, it's important to realize that, okay, maybe in reality we only need to support three to five different query types, and we can have those predetermined. So instead of having the language model do 100% of the complicated query work, you are only leaving the last 2 to 5% to the language model, and then it can actually perform well.
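A sketch of that template idea, with a hypothetical movie-recommendation schema: the fixed Cypher lives in the application, and the language model only supplies the two parameters.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # A collaborative-filtering-style query, hardcoded. Only $user_id and
    # $genre ever change between calls.
    RECOMMEND_MOVIES = """
    MATCH (me:User {id: $user_id})-[:RATED]->(:Movie)<-[:RATED]-(peer:User)
    MATCH (peer)-[:RATED]->(rec:Movie {genre: $genre})
    WHERE NOT (me)-[:RATED]->(rec)
    RETURN rec.title AS title, count(peer) AS peers
    ORDER BY peers DESC
    LIMIT 10
    """

    def recommend(user_id: str, genre: str) -> list[dict]:
        # The LLM's whole job here is mapping, say, "something funny" to
        # genre="Comedy"; the Cypher itself never changes.
        with driver.session() as session:
            return session.run(RECOMMEND_MOVIES, user_id=user_id, genre=genre).data()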

Jennifer Reif: Okay. Yeah, that makes sense. And that's super helpful, I think, for those kind of looking at this. It sometimes seems like a black box: okay, you have a knowledge graph, you have an LLM, you have RAG. How do all these pieces fit together? And I think we've covered quite a bit of ground on when you might choose one thing or the other, and the challenges that might come along as you make those decisions.

I think this kind of wraps us up pretty well for this section, and we'll kind of jump right into the tools of the month. Alison, do you want to go first?

Alison Cossette: Sure. So my tool of the month is actually the PDF bot chunker script that is part of the GenAI Stack that Neo4j and Docker released last fall. So if you go into Docker and look for the GenAI Stack, you'll see it's one of the scripts in there. And I know a number of different people who have actually used it as a way to break up their PDFs without having to start from scratch. So that's my tool of the month.

Jennifer Reif: Awesome. That sounds great.

Alison Cossette: What's yours, Jen?

Jennifer Reif: So mine is Spring AI, which I've been focusing on lately, working on building some demos and applications, just exploring what Spring AI has to offer. I've done some other things with the Spring framework and some of the other tools they provide, like Spring Data, which has a Neo4j integration. But Spring AI also has an integration with Neo4j, using Neo4j as a vector store. And so I've done some things recently on that.

And as with most AI tools and things going on in that space, there's still a lot being learned and updated and made easier for developers as we move along. But there's been some great response: some immediate updates made in the Spring AI library and ecosystem, and the documentation keeps improving. So there are some really neat things there.

And just the application that I've built, if you're already familiar with the Spring stuff, it's pretty easy to get started and just kind of do some really cool things and introduce LLM and RAG without going too far out of your comfort zone, I'll say. So Spring AI is my tool of the month.

Johannes, do you have one?

Johannes Jolkkonen: I do have one. It's an easy choice, actually. It's a library called Instructor, developed by a guy called Jason Liu. It's basically a thin layer on top of the OpenAI API, and you can also use it with open-source models. What it allows you to do is define your desired response in the format of a Pydantic class. What that means is that your response has some predetermined structure and attributes, so it's not just a string you're getting back, but an actual Python object.

It's a very simple library, actually, surprisingly little code. But it's more of a mindset shift, really. Because, as Jason, the guy who developed the library, puts it, it makes interacting with the language models feel more like coding and less like trying to emotionally blackmail the language model into giving you valid JSON, for example, by writing that into the prompt so many times.

But it actually allows you to work with these objects that are more familiar to any programmer, in a way that's more readable. You can see exactly what you are supposed to get back. And if you don't get back what you're expecting, for whatever reason, you can catch that quickly, because these Pydantic classes have validators you can run to check, for example, that an age is greater than zero, or that a full name has a space in the middle, something like this. And if those validators fail, you can raise an error, or you can retry again.
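A small sketch of that pattern, using Instructor's from_openai entry point and Pydantic validators for the two checks he mentions; the Person model is illustrative, and the exact API can vary by Instructor version:

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel, field_validator

    class Person(BaseModel):
        full_name: str
        age: int

        @field_validator("age")
        @classmethod
        def age_positive(cls, v: int) -> int:
            if v <= 0:
                raise ValueError("age must be greater than zero")
            return v

        @field_validator("full_name")
        @classmethod
        def has_space(cls, v: str) -> str:
            if " " not in v:
                raise ValueError("expected a first and last name")
            return v

    client = instructor.from_openai(OpenAI())

    person = client.chat.completions.create(
        model="gpt-4",
        response_model=Person,  # a validated Python object comes back, not a string
        max_retries=2,          # re-ask the model if validation fails
        messages=[{"role": "user", "content": "Extract: Jason Liu is 31 years old."}],
    )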

I'm only getting started using it more. But it's a really wonderful, I think, mindset shift above all. So I highly recommend Instructor. And the guy who developed it, Jason Liu, also runs a blog and has some really great ideas in general about RAG and how to work with LLMs.

Jennifer Reif: That's really cool. I'll definitely put a link to that in the show notes as well, so anybody interested in the tool can check it out.

So Johannes, this is the segment where we close out and let you take your exit, if you would so choose. But first, just thank you so much for joining us today and providing your wisdom and your experience on all these very complex topics. I know I've really learned from it and been excited about it, and I hope that others will as well. So thank you.

Johannes Jolkkonen: No, beautiful. Thank you.

Alison Cossette: Yes. Thank you, Johannes, so much.

Johannes Jolkkonen: It was a lot of fun. It was a lot of fun.

Jennifer Reif: All right. Well, have an excellent rest of your week.

Johannes Jolkkonen: You as well. Ciao ciao.

Jennifer Reif: Bye.

Okay, Alison. Do you want to talk about the community project for this month?

Alison Cossette: Sure. So for this month's community project, I was lucky enough to also host the webinar that we did on it recently. It was called Bridging the Gap: Neo4j and Knowledge Graphs for Social Scientists.

And in this particular project, the researcher was looking to find influential people and documents in public policy formation. I think there's a blog post that's going to be coming out soon, but we can definitely connect you to that webinar.

So the idea is: how can you, as a novice, take something as simple as a CSV, put that into Neo4j, also use the PDF bot chunker for any of the links you have to PDFs, and run some very basic analysis on all of it. The data science was actually run in Bloom, and so there was no coding. It's almost no-code leveraging of Neo4j to find influential people and documents in public policy. So there's really interesting stuff there.

Jennifer Reif: Oh, that's really cool. So even if you're not necessarily familiar with social science or not in that industry, sounds like there's something in it for everybody here.

Alison Cossette: Absolutely. I mean, I think that's one of the things for me as a developer advocate, is how can we get this concept of knowledge graph into the hands of people in ways that are really low code, no code? And we have so many opportunities to do that.

So again, going back to my tool of the month, the GenAI Stack. The GenAI Stack is a really easy way for folks to go in, use the PDF uploader, load in their documents, and talk to their documents directly. There's very little code that you need, just a basic understanding of Docker Desktop, and you get the ability to play with RAG and to understand how all of that works, everything Johannes was talking about today. So for me, all roads lead back to the GenAI Stack these days.

Jennifer Reif: Well, cool. So shifting from community projects to articles. Things are starting to pick up for the year of 2024. I've noticed it was relatively quiet in the article and video and just general happenings space for the months of January and February, but we're starting to see some pickup now.

So over the course of February, there have been a few things going on. In the article space on the Neo4j developer blog, we have a piece on updating the GraphAcademy LLM courses for the new LangChain v0.1. Martin, who's the author and on the GraphAcademy team, talks about how the new LangChain update is backward compatible, but you're getting some really nice features now. For things like imports... or the imports changed, excuse me, so there were some things there you have to update or fix. Invoking things is a little bit different. And then there are some changes to the agent as well.

So if you want to learn how Martin updated our GraphAcademy course with LLMs, or if you just want to see what the new LangChain version has in store if you've been using previous versions, feel free to check out that article.

There was also a really excellent article on a GenAI-powered song finder in four lines of code. Basically, the author wrote a Cypher statement to run vector similarity search and pull a related album, to search for a song that talked about X topic, or a song that had a given theme. And it was really cool. Four lines of code, four lines of Cypher, and it would pull back really excellent results. Now, he just looked at one particular artist and all of their albums over the course of history, but you could pull in larger libraries or song groupings if you'd like to do a broader search. Just a really neat, short-and-sweet example, but really powerful.
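The general shape of such a query, as a guess rather than the article's actual code (the index and property names here are made up), could be:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Vector search over embedded songs, plus one hop to each song's album.
    song_query = """
    CALL db.index.vector.queryNodes('song_embeddings', 5, $embedding)
    YIELD node AS song, score
    MATCH (song)-[:ON_ALBUM]->(album:Album)
    RETURN song.title AS song, album.title AS album, score
    """

    theme_embedding = [0.0] * 1536  # placeholder; embed the theme you're searching for

    with driver.session() as session:
        print(session.run(song_query, embedding=theme_embedding).data())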

Alison Cossette: The power of tiny code.

Jennifer Reif: Yes, exactly. The power of Cypher, specifically.

Alison Cossette: Yes.

Jennifer Reif: At least in this case.

So there was also an article on the object mapping changes for the Neo4j driver in .NET, which make mapping a little bit easier. It's a new preview feature, so be aware of that, but you now have things like custom mapping. And there's a feedback channel as well: if you check out that preview feature and have some positive or negative things you'd like to pass back to our engineering department, feel free to leave that feedback and play around with it.

Adam Cowley wrote an article on slow Cypher statements and how to fix them. Cypher is one of those things that can be a very challenging puzzle to me, but I find it super interesting and cool. Adam was debugging a GraphAcademy course where some people were running into slow Cypher statements. So he talks a little bit about how to understand the Cypher planner, how to deal with dense nodes and route around them, and then how to utilize an index in your query so you can improve the speed there.

There's another article on using LangChain to process YouTube playlists and perform a Q&A flow. It basically pulls YouTube transcripts and shows you how to chunk them up and create a graph out of them. So again, a LangChain and LLM focus there.

And then there's an article on Py2neo, a user-friendly Python library for Neo4j. That article is up on the Neo4j developer blog if you're interested in more Python-related work with Neo4j.

As far as videos go, not too much happening in the space, but if you missed NODES 2023 and you want to catch up on that, there's still a few videos trickling out, so definitely refer back to that and keep an eye on that playlist.

Just in closing, the event section for this episode could have gone on absolutely forever. I probably could spend 20 to 30 minutes just on events this month. It's insane. But I'm just going to highlight a few things, and I'll put everything in the show notes so you guys won't miss anything. So let me point out a few key things going on this month.

So on March 5th, if you've been following the Going Meta series with Dr. Jesus Barrasa, there's another installment of that occurring. It's a series on graphs, semantics, and knowledge, and episode 26 is the one we'll be getting this month. So feel free to check out that live stream.

On March 6th, there are a couple of meetups going on, actually, both virtual. One is Exploring Graphs and Generative AI: Unlocking New Possibilities. And the second one is Pass or Play: What does GenAI mean for the Java developer? That second one, actually, I'm the presenter on, so check that out. Both of those events are March 6th, which is next Wednesday.

Okay. On March 7th, there's a meetup in Bangkok, Thailand: the Graph DB Bangkok meetup, together with the GraphQL Bangkok meetup. So check that out if you're in the area.

On March 11th, there are trainings. We have a few different training series going on, all virtual. The first one is the Knowledge Graphs & Large Language Models Bootcamp. And then, also on March 11th, there's a workshop in Bengaluru, India, on Neo4j and GCP generative AI. There are a couple of installments of that workshop occurring; the first one happens in India on March 11th.

March 14th, there's another virtual training, an intro to Neo4j. So if you're just getting started and want to know the basics across the board, feel free to check that out on March 14th.

On the 15th, there are a couple of meetups, both in India, one in Delhi and one in Bengaluru. The one in Delhi is Pythonistas and Graphistas: Navigating the World of Graph Databases with Python. And the one in Bengaluru is Graph Genesis: Building Tomorrow's Insights Today.

On the 18th, you get another Knowledge Graphs & Large Language Models Bootcamp. That's a virtual training there.

Workshop on the 19th, another virtual one, talking about Tame Your Graph with Liquibase for Neo4j. So if you've worked with Liquibase or are curious about it, feel free to check that one out.

March 20th, there's a meetup in Melbourne for their March Madness event. So if you're in the area, check that one out.

March 21st, another virtual training, this one also provided by us: Mastering Neo4j Deployment for High-Performance RAG Applications. This one sounds kind of unique and very interesting to me, so hopefully you all get a chance to check it out.

On the 26th, there's a meetup in Sydney, Australia; Unraveling Connections is the topic there. There's also a workshop that same day that's virtual: Large-Scale Geospatial Analytics With Graphs And The PyData Ecosystem. So if you're interested in those topics, that's March 26th, a virtual workshop.

And that should do it for this month. Again, there's lots more events, but I will link all of those in the show notes. And don't forget to check out our Neo4j events page as well for the full listing.

So Alison, any final thoughts?

Alison Cossette: No. To your point, there's so much going on in the space, it's really exciting. I know a number of folks from our team are also going to be in Las Vegas late this month, around the 25th to the 28th, at the Microsoft Fabric conference. So if you want to come by and say hello in person, we'll be out there as well.

But things are cooking and we're really excited to have so many opportunities to connect with the community, virtually and in person. So definitely stay tuned, and we look forward to seeing you out there.

Jennifer Reif: Sounds good. Thanks, Alison.

Alison Cossette: Bye, all.

Jennifer Reif: Bye.