GraphStuff.FM: The Neo4j Graph Database Developer Podcast

Starting Your Knowledge Graph Journey With Special Guest Dr. Julian Grümmer

Episode Summary

Welcome to GraphStuff.FM, your audio guide into the interconnected world of graph databases. We’ll take you on a journey to unravel the mysteries of complex data and showcase the power of relationships. In this episode, special guest Dr. Julian Grümmer joins our hosts to explore knowledge graphs, highlighting their semantic nature, applications, challenges, and implementation tools, as well as the latest graph news from last month and upcoming graph technology.

Episode Notes

Episode Transcription

Alison Cossette (00:00):

Welcome graph curious minds, to another exciting episode of GraphStuff.FM. I'm Alison Cossette and I'm sharing hosting duties today with my fellow Neo4jers, Jennifer Reif, Jason Koo, and Will Lyon.

 

Will Lyon (00:13):

Hello.

 

Jennifer Reif (00:14):

Hello.

 

Alison Cossette (00:16):

Today, we're diving deep into the fascinating world of knowledge graphs. Knowledge graphs are more than just a data structure. While most folks listening likely have experience with graphs of some kind, knowledge graphs are an interesting type of graph. Rather than, say, a transactional network graph, such as a social network or a graph of banking transactions, these graphs contain semantic knowledge and their relationships and attributes. Their structure is driven by the meanings in the language they contain. So joining us today on the semantic journey is Dr. Julian Grümmer. Dr. Grümmer is a tech innovator and researcher in accounting and auditing at Friedrich-Alexander University in Germany. His work focuses on personal networks and their impact on the performance of companies. He was also a featured speaker at the Nodes Conference last year, among various other graph and tech conferences. Dr. Grümmer, welcome. Thank you so much for joining us today.

 

Dr. Julian Grümmer  (01:12):

Hi Alison. And hi everyone else. Thanks for having me today.

 

Alison Cossette (01:15):

Do you prefer that we call you Dr. Grümmer, Julian?

 

Dr. Julian Grümmer  (01:19):

No, Julian is fine.

 

Alison Cossette (01:21):

Excellent, Julian, thank you so much for making time in your busy schedule for us. We're really happy to have you here today.

 

Dr. Julian Grümmer  (01:27):

It's a pleasure to be here.

 

Alison Cossette (01:29):

Why don't you kick us off by giving us how you define knowledge graphs?

 

Dr. Julian Grümmer  (01:34):

How do I define knowledge graphs? Probably by the t-shirt I really love from Neo4j, in a relationship, and the t-shirt shows that some things are, like, clustered together, are related to each other, and that's what I think about when I think about the knowledge graph.

 

Jason Koo (01:54):

So I've got a question. What was the story of your introduction to knowledge graphs?

 

Dr. Julian Grümmer  (02:00):

It's from a practical approach because I used to work at the chair of accounting and auditing. Right now I'm not anymore working at the university, I already have my PhD. And now I'm an analyst at Proventis Partners, which is an M&A consulting company-

 

Jason Koo (02:14):

Correct.

 

Dr. Julian Grümmer  (02:15):

But I'm still thinking about knowledge graphs, but more on that maybe later. And to be honest, I didn't know about knowledge graphs, I just had a problem. The problem was that we had a lot of different types of data, information that was mainly about companies and their supervisory board, the management board, so about people, and we wanted to know... In Germany we call it the 'Deutschland [German 00:02:41]', which means that all of the people are related to each other in the business world. I think you have something similar in other languages too, and we wanted to show that. And how else can you do that than with a knowledge graph? I didn't know about knowledge graphs, but what we did, we were brainstorming and we had all the names on a chart and then we just picked the people and tried to show who knows each other. That's how I got to know knowledge graphs.

 

Jason Koo (03:15):

And so when you were working with this data, did you try to use a different system prior or did one of your colleagues introduce you to graph databases?

 

Dr. Julian Grümmer  (03:25):

Not at all. We were totally lost. We didn't know what to do. But then I was pretty lucky because just at university I met Professor Andreas Harth, he's the chair of technical information systems at the Friedrich-Alexander University. He's also working on knowledge graphs and he already had all those problems long before. And he said, okay, I might have a solution for that. And it was pretty lucky for me because he said, okay, I know about knowledge graphs, but I don't have the idea of what to do with knowledge graphs and he needed use cases. And that's how we came together. We said, okay, I am the expert in terms of finance about supervisory board, management board, and all the information we need to know. And he was the technical expert.

 

Jason Koo (04:18):

Nice. And so in the process of implementing the knowledge graph that you guys used, what was the biggest challenge that you had to overcome in getting that working?

 

Dr. Julian Grümmer  (04:31):

Talking the same language I guess. Not in terms of German or English, but in terms of they were talking about RDF, about Neo4j, about graph nodes, edges. I really had no clue what it is, but then we figured out, okay, they don't know about all the aspects from the business side and I don't know about the IT side, so we have to work together and set up some competency questions, and that's what we did. We defined some questions, what do I want to answer, and then I showed all the questions and they were like, oh, we can never do that. But in the end, we managed to do that pretty well.

 

Jason Koo (05:11):

Cool.

 

Dr. Julian Grümmer  (05:11):

And to make the knowledge graph available for other people as well or the data or the query language or whatever because it's nice to have a graph or some kind of database, but somehow you need to do the analyses.

 

Jason Koo (05:27):

Were there any resources that your team used to get up to speed very quickly?

 

Dr. Julian Grümmer  (05:31):

Yeah. We were pretty lucky because Professor Harth, he gave us a lot of student power, but in the end it was a total mess because we had a lot of data, a lot of different kind of data, a lot of no idea what to do. So we figured out we had one pretty good student and he just wrote the exam about linked data and then he said, okay, I will probably work at the chair for some time and I will try my best to make the project working. And it's great because he put a lot of effort into it and he never knew about knowledge graphs before, but right now I think I can mention that he's working at Adidas and he's working with knowledge graph doing some pretty interesting stuff with fraud detection and things like that.

 

Jason Koo (06:21):

Wow, good for him. How long was that process do you think, from him going from not knowing anything about knowledge graphs to feeling pretty competent and being able to deliver stuff for you?

 

Dr. Julian Grümmer  (06:33):

He already passed the exam so he knows about the theory, which is pretty good, but in the end it turned out to be completely different. I think the process was three to four months until he knows a little bit what to do with all the information and setting up stuff. But in the end, I think creating a knowledge graph is a never ending story because that's the main... A big advantage of graphs, you can always add information, add data. And when you start thinking about what can we use for data, then you always think about, okay, what can we do next?

 

Jason Koo (07:07):

Yes. So if you had to do this all over again, or if somebody else is trying to implement a knowledge graph, what one piece of advice would you give them?

 

Dr. Julian Grümmer  (07:16):

I think creating competency questions might be pretty good. What do you want to answer in the end? Because knowing the big picture is great or also some kind of data you might have, but what kind of questions do you want to answer? This is mainly the point because we had a lot of ideas, a lot of stuff we could use, a lot of analyses, but in the end it was basically two or three questions we wanted to answer. Like, who knows each other, who went to the same university, who's working together? That's mainly what we wanted to answer.
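
Competency questions like the three Julian lists can be written down as small queries before any database is chosen. The following is a minimal sketch in plain Python, not the project's actual implementation: the people, companies, universities, and relationship names are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, relationship, target) triples; none of this is real data.
TRIPLES = [
    ("Anna", "SITS_ON_BOARD", "Company A"),
    ("Ben", "SITS_ON_BOARD", "Company A"),
    ("Ben", "SITS_ON_BOARD", "Company B"),
    ("Clara", "SITS_ON_BOARD", "Company B"),
    ("Anna", "STUDIED_AT", "University X"),
    ("Clara", "STUDIED_AT", "University X"),
]


def people_sharing(relationship):
    """Return {target: sorted list of people} for targets shared by 2+ people."""
    members = defaultdict(set)
    for person, rel, target in TRIPLES:
        if rel == relationship:
            members[target].add(person)
    return {t: sorted(p) for t, p in members.items() if len(p) > 1}


def pairs_connected_by(relationship):
    """Return the set of person pairs linked through a shared target."""
    pairs = set()
    for people in people_sharing(relationship).values():
        pairs.update(combinations(people, 2))
    return pairs


# Competency question: who is working together (shares a board)?
print(pairs_connected_by("SITS_ON_BOARD"))
# Competency question: who went to the same university?
print(pairs_connected_by("STUDIED_AT"))
```

Writing the questions first, as Julian suggests, makes it clear which relationship types the graph actually needs; here two are enough to answer all three questions.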

 

Will Lyon (07:52):

Julian, you mentioned Adidas just a minute ago, which got me thinking back to your Nodes talk from last year where you were using some Adidas supply chain data to look at building a knowledge graph of the supply chain and specifically looking at ESG, so the environmental, social, and governance aspects, and specifically focusing on the supply chain. And this got me thinking, hey, there is public data available in this area, this is a great area to dig into. And to the point you were just making of knowing the questions you want to ask of the data and looking at connections between people, and this kind of got me thinking, is there kind of an intersection of use cases here where knowledge graphs really lend themselves to being useful? Is it the combination of the availability of public structured data and having interesting questions to ask of it? What are the things that we look for when we're trying to figure out would this be a good domain for a knowledge graph or is it applicable to any domain really?

 

Dr. Julian Grümmer  (09:06):

That's pretty technical I guess, but I will try to answer the question. I think the use case of a knowledge graph is always relevant when you want to connect different data sources. And when looking at the ESG knowledge graph, the talk from the Nodes Conference last year, it was pretty obvious to use a knowledge graph because we had some information from the companies, which was pretty good already. But then we knew, okay, this doesn't make sense to just combine these data, but we want to know the real big picture and we want to know, okay, what is going on in China, what kind of information can we get about companies in China.

 

(09:51):

The easiest use case in the beginning was, okay, we have the data set from Adidas, from Nike, and H&M because all of them provide information about their suppliers, and we wanted to see, okay, are there any kind of suppliers which are suppliers for all of them or for Adidas and Nike for example. And therefore, you could use an Excel spreadsheet, which is pretty boring I guess. But also in the end you can use a graph which makes it more interesting and more appealing.

 

(10:21):

And that's maybe also one point which is good about knowledge graphs because if you have something appealing to show to people, because mostly it's just about numbers, numbers are good to see, but in the end nobody will create a story about that. But if you have a graph and you can show how those companies, how the people are connected, how the countries, how the environmental effects connect to maybe the factory, then it's interesting, and then you can create a story about that. Even air pollution in some parts of where Adidas sneakers are produced, and then you can say, "Okay, look at the graph, you can see it here," and then you can show some pictures and then it's more appealing to people in the audience to look at it.
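
The shared-supplier question Julian describes (which suppliers serve Adidas and Nike, or all three brands) comes down to intersecting supplier lists. Here is a small sketch of that idea; the factory names are invented and the real published supplier lists are far larger.

```python
# Hypothetical supplier lists; the factory names are invented for illustration.
SUPPLIERS = {
    "Adidas": {"Factory 1", "Factory 2", "Factory 3"},
    "Nike": {"Factory 2", "Factory 3", "Factory 4"},
    "H&M": {"Factory 3", "Factory 5"},
}


def shared_suppliers(*brands):
    """Return the suppliers that serve every brand named in `brands`."""
    supplier_sets = [SUPPLIERS[brand] for brand in brands]
    return set.intersection(*supplier_sets)


def supplier_to_brands():
    """Invert the mapping: supplier -> sorted list of brands it serves.

    These pairs are the edges a graph view would draw between brands and factories.
    """
    inverted = {}
    for brand, suppliers in SUPPLIERS.items():
        for supplier in suppliers:
            inverted.setdefault(supplier, []).append(brand)
    return {s: sorted(b) for s, b in inverted.items()}


print(shared_suppliers("Adidas", "Nike"))
print(shared_suppliers("Adidas", "Nike", "H&M"))
```

The inverted mapping is the part a spreadsheet hides and a graph shows: each supplier becomes a node whose connections to several brands are visible at a glance.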

 

Will Lyon (11:13):

Totally. That's sort of that aspect of combining data sets, querying across them, and then giving that visual representation is super powerful. I noticed in your Nodes talk that visualization was definitely a key driver for how you were presenting the data in the end.

 

Dr. Julian Grümmer  (11:31):

Yeah, and not just because it's the most powerful graph or anything, even if you just have a few nodes and edges, it's interesting for the audience and people might be like... There are so many tools out there who try to create a nice PowerPoint or presentation, some slides for you, but in the end it just looks similar because it's text and some pictures. Most of the time some iStock pictures because everybody tries to use that. But in the end, if you try to create something new, something different, people will remember it better I guess.

 

Alison Cossette (12:08):

Julian, I have a question for you. Well, two questions. The first question is when you look at what you've done with the graph itself, do you have a story maybe about something that became really obvious when you used graph that you would not have been able to see otherwise that really jumped out at you?

 

Dr. Julian Grümmer  (12:29):

That's a good question. Yeah. In the end, how connected everything is. We knew, for example, for the people at... If we go back to the use case of supervisory board and management board members, we were looking how connected they are. And we knew that there are some connections, but in the end, the graph, it got so big and so many connections to each other that there was no chance to analyze it by just looking at it. That's something that you might not see if you have an Excel spreadsheet for example, because then you see some numbers, some names, some text, whatever, but you cannot see what is going on, how the connections are and how many connections you have.

 

Alison Cossette (13:15):

I love it. It's always interesting to see whether it was one large graph or it was a number of smaller segments. So the fact that everything is interconnected sounds even more interesting to me personally. The other question I had for you, right now, graph is finding its way into a more common... It's a more common topic, right? I know one piece of advice you had for other researchers was around coming up with your basic questions of what is it that you want to know. How might a researcher know that graph might be something they even want to pursue?

 

Dr. Julian Grümmer  (13:59):

I think whenever you have some kind of databases, some kind of different databases and you want to look at it and you don't know how to start, a graph is good. Because whenever we had some use cases, we were thinking for example about the sustainability ESG graph, we didn't think about a graph in the beginning, we just were thinking about, okay, we have to look at the companies and say, "Okay, there are a lot of ESG ratings for example, and they are all mostly different and you never know where the numbers come from and stuff like that." And that's why we said, "Okay, we want to have a look at it on ourselves, just explore it." So we checked information, what can we get, what kind of free information like... Working at university doesn't bring you a lot of money and resources are limited, so you have to look for free available data and then you will find some databases and then you pretty fast understand, okay, you have some databases, but how can you link the databases?

 

(15:09):

You don't want to link the databases in the beginning, but you want to know, okay, what can I find in what database? That's even a problem we have right now at the company. We have so many databases, but we sometimes don't know in which one to look at because they have so many kind of information, there's so much going on that you don't know where to look at. That's one thing I think which is pretty cool about graphs because you can just try to combine everything you can find. Yeah, this year I started working at the consultancy here and what I learned is that we have really a lot of different types of information about databases and whatever, and we probably use most of the databases because all of them have some kind of... They are meaningful to us. Some have information about the shareholders of a company, others about ESG, others about whatever you need. But the problem is we don't know where to look at and it's a long process to go through all the data to create some kind of company profiles and stuff like that.

 

(16:18):

And that's what we want to make easier in the first step to find the right information, then to process the information because it takes a long time for us to put all the information together. And then the goal, the ultimate goal, would be to have something pretty fast going through where you just type in what you need and then you will find it directly, some kind of cool graph. And also they're pretty easy use cases right now because when you think about a company or some kind of a sector, for example, companies who are trying to... supply chain tech companies, for example, who try to use some kind of AI or whatever to make supply chains better, then in the end you don't know where to look at these companies. There's a lot of data in the databases, but the classification is a huge problem for us because we can't find the right companies, or we can find but it takes a long time. So even analyzing the business description of a company is really, really difficult and a lot of pain and a lot of manpower.

 

Jason Koo (17:31):

Nice. Thank you. Cool. Yeah, so for everyone listening, until Nodes happens later this October, you can definitely check out Julian's talk from last year's Nodes. We'll put a link in the description. And if you want to dive deeper into knowledge graphs, there are two pieces of content that came out recently that would be of interest. We have a knowledge graph 101 article that came out and also a book, Building Knowledge Graphs: A Practitioner's Guide, which goes into a bit more detail. Alison, did you have a chance to read that book already?

 

Alison Cossette (18:07):

I have spent a little time with that book, not all of it, but I spent a little bit of time with it. My particular interest is always in the area of unstructured data, so in the area of knowledge graphs, how can we bring the semantic knowledge to the knowledge graphs? So there's a little bit in there on some NLP, so if you're into that, we've got a little bit of insight in there as well.

 

Jason Koo (18:30):

Nice. And also the last month and a half, actually probably two or three months, Tomaz Bratanic has produced a lot of knowledge graph and LLM related content. The two most recent articles were on multi-hop question answering. Actually, let me take a step back. So if you go to the early part of his series, he did a LangChain and Cypher Search tips and tricks article. And in that article, he was using the Twitch dataset and showed how to use LangChain with OpenAI and create a system that introspects a Neo4j database and looks at its data schema so that it's able to generate Cypher queries from a prompt. And he goes on to show how you can improve the accuracy by giving some Cypher examples, explicit examples, and then also showed how GPT-4 compared with GPT-3.
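
The approach Jason summarizes (show the model the database schema plus a few example queries, then ask for Cypher) can be sketched without any LLM library. The code below only builds the prompt text; it is not LangChain's API or the article's code, and the schema, labels, and example question are assumptions made up for illustration.

```python
# Hypothetical schema summary; a real system would read this from the database.
SCHEMA = {
    "nodes": {"Stream": ["name", "followers"], "Game": ["name"]},
    "relationships": [("Stream", "PLAYS", "Game")],
}

# Hand-written example pairs, the "explicit Cypher examples" idea from the episode.
EXAMPLES = [
    (
        "Which streams play chess?",
        "MATCH (s:Stream)-[:PLAYS]->(g:Game {name: 'chess'}) RETURN s.name",
    ),
]


def describe_schema(schema):
    """Render the schema dictionary as plain text for the prompt."""
    lines = ["Node labels and properties:"]
    for label, props in schema["nodes"].items():
        lines.append(f"  {label}: {', '.join(props)}")
    lines.append("Relationships:")
    for start, rel_type, end in schema["relationships"]:
        lines.append(f"  (:{start})-[:{rel_type}]->(:{end})")
    return "\n".join(lines)


def build_prompt(question, schema=SCHEMA, examples=EXAMPLES):
    """Combine schema, few-shot examples, and the user's question into one prompt."""
    parts = [
        "Translate the question into a Cypher query.",
        "Use only the labels, properties, and relationships listed below.",
        describe_schema(schema),
    ]
    for example_question, example_cypher in examples:
        parts.append(f"Question: {example_question}\nCypher: {example_cypher}")
    parts.append(f"Question: {question}\nCypher:")
    return "\n\n".join(parts)


print(build_prompt("Which stream has the most followers?"))
```

The resulting string would be sent to whichever model is in use; grounding it in the schema is what keeps the generated query aligned with labels and properties that actually exist.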

 

(19:29):

So from that article, he's gone on and added to that with another one called Knowledge Graphs and LLMs, where he talks about some of the limitations with existing large language models and how you can fine-tune them and how you can enhance them by using ChatGPT plugins to get current information from the internet to bridge the gap from its training date and beyond, right? Because when you ask it a question that's past its training date, you can get either hallucinations or incorrect data, but using this plugin you can kind of fill in that gap.

 

(20:06):

And then that leads to his most recent article, Knowledge Graphs and LLMs: Multi-Hop Question Answering, which basically looks at PDF documents and breaking up the text into chunks to extract information for putting into a knowledge graph. So this will be of interest to you, Alison, he talks about mixing both structured and unstructured data. And then lastly he talks about using LLM agents, the sort of chain-of-thought systems, which can go through several iterations of questions and prompts. Oh, talking about structured and unstructured data, Julian, in your earlier project, was it mostly structured data or did you also have to deal with unstructured data?

 

Dr. Julian Grümmer  (20:51):

Information about people seemed to be very structured at first sight, but it turned out it's not because every CV and resume is different. And if you look at all the companies' websites, that's like never the same. You can find some databases which are pretty cool, there are also some databases from the European Union or stuff like that which are pretty much structured, some of them also in really good quality, but most of the time everything was unstructured.

 

Jason Koo (21:23):

Since we're on the topic of GDS plugins, Alison, did any new updates pop up with GDS?

 

Alison Cossette (21:33):

Funny you ask, Jason. Yes, I recently did a blog post called Hot Algo Summer, and it was about the four recent algorithms that were released by Neo4j in GDS 2.4. And I was pretty excited about this release, specifically because there were a couple of problems that got addressed. One of them I'm going to talk about a little bit later, but I also really enjoyed, we have something called the Common Neighbor Aware Random Walk algorithm. And what it allows you to do is it allows you to take samples of your graph. So oftentimes we have very large graphs and sometimes you want to work with a sample of the graph, but the challenge can be how do we know that within the little subgraphs or the more modular aspects that we aren't continuing to sample from that same corner of the graph.

 

(22:31):

So in the common neighbor aware random walk, what it does is it takes into account the neighbors of the current node and whether it has seen those neighbors before, and therefore it will generally give you a higher likelihood that you'll sample across the different subgraphs of your large graph. So that was one thing that I really enjoyed, but we have a number of new ones coming out, so you can definitely check out Hot Algo Summer. It's pretty easy to find on Google. What about you, Jen? Do you have anything in the Java space that's happening right now?
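
The sampling idea Alison describes, a random walk that is less likely to step toward nodes sharing many neighbors with the current one, can be illustrated with a toy version. This is a simplified sketch, not the GDS implementation: the graph, the function name, and the exact acceptance rule are assumptions chosen to show the principle.

```python
import random


def cnarw_sample(graph, start, sample_size, seed=0, max_steps=10_000):
    """Collect up to `sample_size` distinct nodes with a common-neighbor-aware walk.

    `graph` maps each node to the set of its neighbors (undirected, no self-loops).
    """
    rng = random.Random(seed)
    current = start
    visited = {start}
    for _ in range(max_steps):
        if len(visited) >= sample_size:
            break
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end: nothing to walk to
        candidate = rng.choice(neighbors)
        common = len(graph[current] & graph[candidate])
        smaller_degree = min(len(graph[current]), len(graph[candidate]))
        # Candidates that share many neighbors with the current node are
        # accepted less often, which nudges the walk out of dense clusters.
        accept_probability = 1 - common / smaller_degree
        if rng.random() < accept_probability:
            current = candidate
            visited.add(current)
    return visited


# Hypothetical graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

print(cnarw_sample(GRAPH, "a", sample_size=4))
```

In this toy graph the bridge from `c` to `d` is always accepted because the two nodes share no neighbors, while steps inside a triangle are accepted only half the time, so the walk tends to leave its starting cluster sooner than a uniform random walk would.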

 

Jennifer Reif (23:07):

Yeah, there's a couple of things going on. First up is the release of Java 21, version 21 coming in September. So I think they just kind of closed changes and things going into the Java version. And so I'm sure they're doing testing and preparation for the release coming up in September. So that's going on. There's a few articles I've seen out there talking about features that are coming and things they're working on in the release. So for those who aren't familiar, Java is on a six month release cycle. So we had one back in March that got released, Java 21 that's coming in September is an LTS (long-term support) version, so of course that's going to see a lot more kind of coverage because that'll be something that people want to migrate towards. So that's kind of been coming up. I've seen a few things pop up around that.

 

(23:55):

On the Spring side of things, which I'm kind of keeping my foot in a little bit, there's the SpringOne conference coming in August in Las Vegas, Nevada in the US. So that's coming up here reasonably shortly. I've seen a lot of advertisements if you're out on the Spring Initializr even creating project templates and things to pull down, they've got a little banner ad going for it. So that's a big event. I think it's the first one in person since 2019 is what I've seen. So that's going to be a big deal. I'm not sure about Neo's presence there, but the Spring community will be there for sure.

 

(24:30):

And then the Spring Framework 6.1, they just released an M3 version of that. So in the Spring world, most of your Spring releases come in the fall. So usually the first one is the framework release and so that probably will be coming up sometime soon and so they're prepping with some pre-releases for that as well. So that's everything kind of going on that I've seen crop up. I'll keep you posted on that.

 

(24:57):

But on the Neo4j side of things, there's some blog posts and things going on too in preparation for Nodes. Back when the CFP was still open for a couple more weeks, I released a blog post about the Java sessions from 2022's Nodes Conference. So I have kind of a listing and overview of everything that was covered in the Java space at last year's Nodes Conference. So if you want to take a look at that and kind of review and maybe get some hopeful sneak peeks into what might be coming this year as well. We'll see what, with the CFP now closed, ends up coming on the schedule for the 2023 one.

 

(25:35):

I also wrote a blog post for building Java applications with Neo4j. That just got released on Java Code Geeks I think at the end of last week, so that is available there. Feel free to check that out. And then I saw a blog post by a colleague, Gerrit Meier, who worked on a Spring and GraphQL post using Spring Data Neo4j. So that one I found really interesting because he walks through a little bit just looking at the Spring Data Neo4j side and then using Spring GraphQL, and then towards the end of the post he actually uses the joint integration to pull them both in and kind of shows you what you get with that. So really interesting post if you're curious about gaining the benefits from both Spring Data as well as GraphQL, that's a good one to check out. It kind of incorporates all those things. I didn't know if, Will, you had anything else to add there?

 

Will Lyon (26:26):

Yeah, I read that one. That was a good one to give me a bit more insight into GraphQL tooling from the Java side. I've worked a lot with our JavaScript based GraphQL integration, and one of the questions we get a lot is if I'm not interested in building and maintaining a Node.js GraphQL API app, if I want to do it in Java, what does this look like? So yeah, it was great seeing how sophisticated the Spring ecosystem is, how you can use lots of the different Spring integrations together, which was a great insight from Gerrit on that one.

 

Jennifer Reif (26:59):

And I think that's it for me. Jason, do you have some updates?

 

Jason Koo (27:03):

Yes. So a lot of things in the Python world. So our Python driver 5.11 was just released earlier today actually. It contains mostly minor fixes from 5.10, especially improving handling of sub-millisecond transaction timeouts. So prior, if you chose a really low timeout, it ended up getting rounded down to zero, which of course is probably not what you wanted, so that's been fixed. Also, some things addressing some lazy imports so that any sort of interpreter shutdown crashes will be reduced. Then also asyncio.Lock acquire had some errors that were being caught by the library and not passed along. So that's been changed so that those exceptions will pass all the way through and the developer can react to that.
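
The rounding problem Jason mentions is easy to reproduce in general terms. The sketch below shows how converting a tiny timeout from seconds to whole milliseconds can silently produce zero, and one way to avoid it. It illustrates the failure mode only; the function names are hypothetical and this is not the driver's actual code.

```python
import math


def to_millis_truncating(timeout_seconds):
    """Convert seconds to whole milliseconds by dropping the fraction."""
    return int(timeout_seconds * 1000)


def to_millis_rounding_up(timeout_seconds):
    """Convert seconds to whole milliseconds, never rounding a positive value to 0."""
    return math.ceil(timeout_seconds * 1000)


timeout = 0.0004  # 0.4 ms, a deliberately tiny timeout
print(to_millis_truncating(timeout))   # 0
print(to_millis_rounding_up(timeout))  # 1
```

A zero is risky because a system may interpret it as a special value rather than as "time out almost immediately", which is why a positive timeout should stay positive after conversion.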

 

(27:57):

And then with that also, Neomodel was updated. So Neomodel is the community Python OGM package, the object graph mapper. So it abstracts some of the functions of the Python driver for easier use in other applications like say Django. So 5.1.0 was released about two weeks ago and at that time it bumped up the Python driver version, the underlying driver up to 5.10 so that will probably get updated fairly soon to use 5.11. In that release, there was a breaking change, which is a breaking change going from four to five anyways, but if you have Cypher queries that call the id() function, those need to be moved over to call the elementId() function. So just be aware that if you haven't already made those changes or if you're using it brand new, just remember to use the elementId() function instead.
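
To make the id() to elementId() change concrete, here is a small sketch showing a before-and-after query and a naive check for leftover id() calls. The `Person` label, the parameter names, and the helper function are hypothetical, and the regular expression is a rough heuristic (it does not parse Cypher and could be fooled by string literals).

```python
import re

# id() returns an integer; elementId() returns a string identifier.
OLD_QUERY = "MATCH (p:Person) WHERE id(p) = $node_id RETURN p"
NEW_QUERY = "MATCH (p:Person) WHERE elementId(p) = $element_id RETURN p"

# Matches a call to id(...) but not elementId(...), because the lookbehind
# rejects any identifier character directly before "id".
LEGACY_ID_CALL = re.compile(r"(?<![A-Za-z0-9_])id\s*\(", re.IGNORECASE)


def uses_legacy_id(query):
    """Return True if the Cypher text appears to call the legacy id() function."""
    return LEGACY_ID_CALL.search(query) is not None


print(uses_legacy_id(OLD_QUERY))
print(uses_legacy_id(NEW_QUERY))
```

Because the two functions return different types, any application code that stores or compares these identifiers needs updating along with the query text.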

 

(28:56):

Also in the Python space, my mock graph data generator, which I've been working on and off throughout the year, I've finally taken the core component of it and pulled it out as its own package and uploaded it to PyPI just over the weekend. So the initial version is out and if you want to build an app or just use that mock generator component inside another application, you can go to pypi.org and pull it from there.

 

(29:22):

And last thing in the Python space is one of the developer advocates over at Balena. So Balena is an over-the-air IoT fleet management software and service, I guess platform. So the developer advocate over there, Mark, he created a Neo4j Balena integration. So now, if you're using Balena to manage a fleet of IoT devices, you can deploy Neo4j right onto any x86 device and use that as part of whatever workflow or process that you're using IoT devices for. So that was pretty cool.

 

(30:03):

And not Python specific, but NeoDash over the last couple weeks came up with a great new release. So 2.3 was released and it contains updated look, ability to customize styling, a whole host of minor fixes. But the big thing is it now includes an NLP query extension. So if you have an OpenAI key, you can... In creating the widgets in the past, you would put in your Cypher command to basically power the widget. Now you can switch over and just put in just regular English and using that extension, it will convert your question into the Cypher statement that needs to match with that underlying database that you connect with, which allows you to very quickly create custom widgets to see data that you're interested in. So say someone on your team isn't very familiar with Cypher, you can have them go this route and put it in.

 

(31:10):

Now for those who have used NeoDash, sorry, if you've used NeoDash prior, the way to implement this is, there is an extensions button on the left. So you'll click on that, scroll down, you turn on the NLP query extension, and then when you're building the widgets, there's a toggle to go from Cypher to English. So those three steps will get you up and running for NLP. Alison, have you had a chance, I know you were a big supporter of this feature when it was very early on. Did you get a chance to play with this yet?

 

Alison Cossette (31:46):

Yes, I was an early tester. As you were introducing it, it was reminding me of the position Julian was in with his project when I'm always really interested to see how is the technology that we're working on making things accessible and allowing researchers and folks to move things forward. So to have this capability in NeoDash really does a lot to create that opportunity to ask those kinds of questions that might be coming up and explore what's in the graph, everything that you've done, all that hard work to get that knowledge graph built and now to have a much more fluid access to what's in it when you are not primarily a technologist, but a researcher or an inquisitor of any kind. So yeah, I love it. Early fan, very excited.

 

Jason Koo (32:46):

Nice. So I was asking my test database a few questions and what I really like about it is, it doesn't even attempt to hallucinate, right? If it is not sure it can answer your question, it's just going to be like, sorry, you'll have to ask it a different way or give Cypher a go. So Julian, were you aware of NeoDash? Did your team get a chance to use NeoDash in the process?

 

Dr. Julian Grümmer  (33:12):

Because it was the time I already quit university, but I was at the graph tour here in Munich and they presented everything about NeoDash.

 

Jason Koo (33:23):

Oh, okay. Yeah, so you're aware of the tool.

 

Dr. Julian Grümmer  (33:26):

Which is pretty cool because I think-

 

Jason Koo (33:28):

Cool. Awesome. Yeah.

 

Dr. Julian Grümmer  (33:30):

Working with RDF data and whatever is fine, but that's something you don't want to do, and it's a lot easier and makes it a lot more appealing to people who are not IT experts to use something which is easy to handle.

 

Jason Koo (33:46):

Yeah. So did your team end up using Bloom a bit? Because Bloom is-

 

Dr. Julian Grümmer  (33:50):

Yeah.

 

Jason Koo (33:51):

Also doesn't require Cypher. Nice. Cool. If anybody's interested in learning more about NeoDash, we'll of course have a link to the repo. But also earlier this month before the NLP extension was added, a blog was written talking about using NeoDash to explore a lot of chemical information as a full text graph. So if you're interested in that space, definitely check that out. And also, switching gears a little bit, if you are familiar with our sandboxes, we've had some updates done to that recently. Jen, could you tell us a little more about that?

 

Jennifer Reif (34:30):

Yeah, so our colleague, Max, released a couple of blog posts earlier this month covering the new feature in Sandbox, which is single sign-on. So if you've been creating your own login or something there, we now have support for single sign-on in Sandbox. The one blog post is an announcement of the feature and kind of walks you through how to use it to get into Sandbox. And then the other post that he wrote was actually the implementation of how he did that with the Neo4j Sandbox. So he walks you through how they put it together and how they got it in there, using JSON Web Tokens, or JWTs, as the tokens to identify someone. So he walks through all of that, nice screenshots, and good stuff there. So if you're curious about how that got implemented and what we did to do that, kind of the behind the scenes, Max's blog posts are there and a good place to start.
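
For listeners who have not worked with JSON Web Tokens, the sketch below shows the general shape of one: a header and a set of claims, each base64url-encoded, followed by a signature over both. It is not the Sandbox implementation, which Max's posts describe. The HS256 shared-secret signing, the function names `sign_token` and `verify_token`, and the claim values are all assumptions chosen so the example runs on the Python standard library alone.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the padding that was stripped."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if algorithm, signature, and expiry check out; else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token must have three dot-separated parts")
    header_b64, payload_b64, signature_b64 = parts
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected signing algorithm")
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    current_time = time.time() if now is None else now
    if claims.get("exp", float("inf")) <= current_time:
        raise ValueError("token expired")
    return claims


if __name__ == "__main__":
    # Hypothetical secret and claims, for illustration only.
    demo_secret = b"demo-secret"
    demo_token = sign_token({"sub": "user-123", "exp": time.time() + 60}, demo_secret)
    print(verify_token(demo_token, demo_secret)["sub"])
```

A production single sign-on flow would more likely rely on an identity provider and a maintained JWT library, often with asymmetric signatures, rather than hand-rolled code like this.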

 

Will Lyon (35:26):

That's a great feature to have. It's nice to just do that one click through to get me directly into writing Cypher instead of trying to track down my password. That was always something that tripped me up in the past.

 

Jennifer Reif (35:38):

Yes, agreed.

 

Will Lyon (35:40):

The one blog post I wanted to highlight this month comes from the folks at Margin Research, and they wrote a blog post titled Entity Resolution in Reagent. So Margin Research is a cybersecurity research and consulting firm. They have a tool called Reagent that they describe as bridging the gap between code and people by focusing on the social networks that power open source, which really resonated with me because I think one of the key technical value propositions that's really clear with graphs in Neo4j is this ability to combine data sets and query across them to find insight. And so anywhere that you're bridging gaps I think is a good key that there may be multiple data sets you're looking to combine to answer some question.

 

(36:30):

So what they look at is basically things like in the open source world we have package managers, these are things like npm and PyPI, and we know the dependencies between packages so there's a graph there. But then we also know the open source contributions, so these are things like on GitHub, developers checking in pull requests and commits to files that are connected to some package that's a dependency of other packages and so on. So really getting at not just talking about software package dependency, but also what are the actions among developers to build that open source software in the first place. So this is a really neat area to get into, but what I think was really great about this post is they walked through a complete end-to-end of how they added one specific feature and what they were looking at was how to address entity resolution.
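
To make the idea of combining the two data sets concrete, here is a minimal sketch with one mapping for package dependencies and one for developer commits, plus a query that crosses both. The package names, developer names, and the helpers `transitive_dependencies` and `upstream_contributors` are all hypothetical, and plain Python dictionaries stand in for what would be nodes and relationships in Neo4j.

```python
from collections import deque

# Hypothetical data: package -> packages it depends on directly.
DEPENDS_ON = {
    "web-app": ["http-client", "json-parser"],
    "http-client": ["tls-lib"],
    "json-parser": [],
    "tls-lib": [],
}

# Hypothetical data: developer -> packages they have committed to.
COMMITTED_TO = {
    "alice": ["http-client"],
    "bob": ["tls-lib", "json-parser"],
    "carol": ["web-app"],
}


def transitive_dependencies(package):
    """Breadth-first walk over DEPENDS_ON, collecting every reachable package."""
    seen = set()
    queue = deque(DEPENDS_ON.get(package, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(DEPENDS_ON.get(current, []))
    return seen


def upstream_contributors(package):
    """Developers who committed to the package or to anything it depends on."""
    relevant = transitive_dependencies(package) | {package}
    return {
        developer
        for developer, packages in COMMITTED_TO.items()
        if relevant.intersection(packages)
    }


if __name__ == "__main__":
    print(sorted(upstream_contributors("http-client")))
```

Neither mapping alone can say whose work a package ultimately rests on; the answer only appears once the dependency edges and the contribution edges are traversed together.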

 

(37:34):

And so that's the case where they had multiple users that they think may be the same user. So you've made a Git contribution to multiple projects. We think this is the same user based on maybe something like an email address or some information from the Git repository. And they talked about, well, first of all how they use knowledge graphs or how they're using Neo4j to represent and store this information. But then they talk about how they use LLMs to classify or basically to add topics to a repository so they can parse information in things like the commit messages, the READMEs, these sorts of things, and leave it up to the LLM to come up with topics or tags for these.

 

(38:23):

And then the way that they do entity resolution is basically they approach this from a link prediction framing where they've identified potentially similar users, and the link prediction that they're trying to get at is the case where, if these are the same user, we would have a "same user as" relationship that they're trying to predict. And they combine both string embeddings, so where we've generated embeddings from node properties, but also graph embeddings, where you're generating embeddings of the graph structure, and how that feeds into their AI pipeline for entity resolution.
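
The link prediction framing can be sketched in a few lines: score every pair of candidate identities on a text signal and a structure signal, and predict a "same user as" link when the combined score clears a threshold. This is not Margin Research's pipeline. Character trigram counts stand in for string embeddings, Jaccard overlap of repositories stands in for graph embeddings, and the identities, weights, threshold, and function names are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical Git identities: display text plus the repositories they committed to.
USERS = {
    "u1": {"text": "Jane Doe <jane.doe@example.com>", "repos": {"parser", "cli", "docs"}},
    "u2": {"text": "jane doe <jdoe@example.com>", "repos": {"parser", "cli"}},
    "u3": {"text": "Sam Lee <sam@example.org>", "repos": {"kernel"}},
}


def trigram_vector(text):
    """Stand-in for a string embedding: counts of lowercase character trigrams."""
    cleaned = text.lower()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Stand-in for graph-embedding similarity: overlap of neighbouring repositories."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_user_score(left, right, text_weight=0.5):
    """Weighted blend of the text signal and the structure signal."""
    text_score = cosine(trigram_vector(left["text"]), trigram_vector(right["text"]))
    graph_score = jaccard(left["repos"], right["repos"])
    return text_weight * text_score + (1.0 - text_weight) * graph_score


def predict_same_user_links(users, threshold=0.5):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold."""
    links = []
    for id_a, id_b in combinations(sorted(users), 2):
        score = same_user_score(users[id_a], users[id_b])
        if score >= threshold:
            links.append((id_a, id_b, score))
    return links


if __name__ == "__main__":
    for id_a, id_b, score in predict_same_user_links(USERS):
        print(id_a, "SAME_USER_AS", id_b, round(score, 3))
```

Comparing every pair grows quadratically with the number of users, so a real system would usually narrow the candidates first and learn the combination from labeled examples rather than fix a weight by hand.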

 

(39:03):

So I thought this was a great blog post because it starts at kind of the higher level view of what's all the data that we're working with, what's the problem we're trying to solve here, and then gets down into the nitty gritty to show us how we can actually use knowledge graphs, LLMs, and an AI pipeline all together. So I really liked that one, we'll link it in the show notes. It's called Entity Resolution in Reagent. Well, let's talk about all of our favorite tools of the month. This is the section of the podcast we do each episode where we reflect back a little bit on tools that we've used this month and what did we find that gave us joy, what did we find useful so that we can share with others. So I'll go first. My favorite tool of the month this month was the Neo4j GraphQL toolbox.

 

(39:59):

We mentioned earlier some of the JavaScript ecosystem Neo4j GraphQL tooling that's out there. I've been putting some updates on the GraphAcademy GraphQL training that's out there. The previous version of that training used the CodeSandbox environment to give you a sort of hosted cloud environment where you can make tweaks to the code and run a couple of different projects in the context of GraphAcademy. And I think there were some challenges with that. I think it was a little bit confusing to keep track of which CodeSandbox environment you were working in, what was the beginning of the lesson, what was the end? And so we've restructured that to use just the Neo4j GraphQL toolbox, which is a sort of in-browser IDE for building GraphQL APIs using the Neo4j GraphQL library. But it abstracts away a lot of the basic sort of setup code that you would need to write to build and maintain a Node.js API project and just embeds all of that for you.

 

(41:08):

This all just runs in the browser, so there isn't actually any Node code that you need to think about, and everything is just driven from GraphQL type definitions, which you can generate from the database. You can see how, when you tweak some of the GraphQL type definitions, that changes the generated API. You can write queries in the browser that are sent directly to Neo4j. So it just gives you a much faster experience for working with some of the GraphQL tooling that's available with Neo4j if you're just sort of in that building, testing phase of a GraphQL project. And this is public, it's available open source, free for anyone to use, so we'll link that in the show notes, the Neo4j GraphQL toolbox. How about you, Julian? Do you have a favorite tool of the month to share?

 

Dr. Julian Grümmer  (42:00):

Yeah, to be honest, this was a tricky question for me. I knew the question before from Jason and I was thinking about it all the time. What can I say and what kind of super fancy tool can I mention? But to be honest, everybody in Germany is right now on vacation or we go on vacation and everybody tells you about the vacation. And you just want to work and concentrate and focus on your work. So my tool of the month is probably the Pomodoro timer. I don't know if you know the Pomodoro technique, but I just try to tell everybody who just jumps into my room to go out until my 25 minutes are over. It's probably not the most fancy AI driven or whatever, but everything changes and a lot of new tools come out every month. I have a lot of newsletters and everything is super, super fast changing and sometimes you have to go back to the good old work which is in your head and just think about what you do.

 

Jennifer Reif (43:11):

Well, diving into mine, slightly different tactic here, but I've kind of been working... Hope there are no secrets broken here. I'm kind of working on a joint project with Jason over the next couple of quarters and one thing that we kind of started down is working on some import stuff. And I had this import, I'm trying to pull in some data from a note app, a note taking app, and a lot of the examples were provided in various programming languages, et cetera. And of course the Java one looked pretty cool, but when I pulled it in it was kind of vanilla Java and I had to pull down the GitHub library and all that and I thought, you know what? I wonder if I could take this and convert this over to a Spring Boot app. Actually, I expected it to take a whole lot longer and it did not.

 

(44:00):

So just kind of extra praise to the reduction of boilerplate that Spring Boot provides, plus all the versioning stuff, dependency management that it handles, super awesome and easy to get started. So I was able to take a kind of vanilla Java package with some samples, pull that into Spring Boot, kind of port the code over, and get it to run from the main Spring Boot class, get it to run other things kind of on startup. So I hope to have a blog post or two coming out detailing a little bit more what I did there, but just super easy to get started and really nice tool to help you transition from something that's vanilla Java or even just a short Java app or whatever. You can bring that into this fully supported dependency management and reduction of boilerplate in Spring Boot. So super awesome if you guys have a chance to check that out.

 

Jason Koo (44:54):

Nice. So my favorite tools of this month are also related to work, increasing productivity. So in preparing the graph data generator package for PyPI submission, I needed to kind of un-spaghetti a lot of my prototyping code. And so I wanted to add a lot of tests very quickly. So I used an add-on, CodiumAI, C-O-D-I-U-M. And so Codium is a suite of basically GPT powered testing suggestions and code analysis recommendations. So I used this to very quickly create all the top level tests and even write a lot of the test code for me. Now, the code that's involved with creating the mock generator is kind of complex, so it couldn't guess all the time how the really high level functions should be tested. But for the lower level functions, it did a really good job of preparing those tests.

 

(45:57):

And then also to kind of support that, I've been trying out cody.ai, which I learned about a few weeks ago. And so Cody is kind of like GitHub's Copilot but with extra features. So its integration with Visual Studio Code is quite good. You can go to a piece of code and you can start asking questions about that code in that scope. It's also got a feature where you can ask for just generally identifying code smells and another view for just generic questions about your overall code architecture. So the combination of the two really accelerated my ability to take the original prototype code and turn it into a cleaner package that I could upload to PyPI. So those are my two tools of the month. Alison, what was your favorite tool?

 

Alison Cossette (46:54):

So many tools, so many things I love. I talked a little bit before about the new algorithms from the GDS 2.4 release and one of them is the Bellman-Ford Shortest Path. So if you've been using GDS or if you do a lot of work with shortest paths, one of the things that you can sometimes come up against is negative weights or wanting to use negative weights. So what Bellman-Ford will allow you to do is it'll allow you to use the negative weights. So for example, let's just say you're trying to get from one part of the Star Wars galaxy to another and there's a very dangerous path versus a not so dangerous path, you can actually add different kinds of risk metrics to certain pathways. So I thought that was super interesting.
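
As a rough illustration of why negative weights need a different algorithm, here is a minimal pure-Python Bellman-Ford. Dijkstra's algorithm assumes weights are non-negative, whereas Bellman-Ford relaxes every edge up to one fewer times than there are nodes, and one extra pass reveals a negative cycle. This is a sketch, not the GDS implementation; the planet names, edge weights, and the `bellman_ford` function are assumptions made for the example.

```python
import math


def bellman_ford(nodes, edges, source):
    """Return shortest distances from source over directed, weighted edges.

    Raises ValueError if a negative cycle is reachable from the source,
    because shortest distances are not well defined in that case.
    """
    distance = {node: math.inf for node in nodes}
    distance[source] = 0.0
    # A shortest path visits each node at most once, so len(nodes) - 1 rounds suffice.
    for _ in range(len(nodes) - 1):
        changed = False
        for start, end, weight in edges:
            if distance[start] + weight < distance[end]:
                distance[end] = distance[start] + weight
                changed = True
        if not changed:
            break  # Nothing improved this round, so the distances are final.
    # If any edge can still be relaxed, some cycle keeps lowering the cost.
    for start, end, weight in edges:
        if distance[start] + weight < distance[end]:
            raise ValueError("negative cycle reachable from the source")
    return distance


# Hypothetical routes: (from, to, cost). The negative cost models a leg that
# lowers the overall risk score, for example a well-patrolled lane.
PLANETS = ["Tatooine", "Naboo", "Hoth", "Endor"]
ROUTES = [
    ("Tatooine", "Hoth", 4.0),
    ("Tatooine", "Naboo", 2.0),
    ("Naboo", "Hoth", -1.0),
    ("Hoth", "Endor", 3.0),
    ("Naboo", "Endor", 6.0),
]

if __name__ == "__main__":
    for planet, cost in bellman_ford(PLANETS, ROUTES, "Tatooine").items():
        print(planet, cost)
```

With these made-up routes, the detour through Naboo reaches Hoth more cheaply than the direct leg, which is exactly the kind of result a negative weight makes possible.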

 

(47:44):

But honestly the thing that has just brought me the most joy this month is actually arXiv.org. For those of you who aren't familiar, arXiv.org, it's research. And one of the things, obviously I'm your data science person here, generative AI as we all know over the last year has exploded. We've talked about it multiple times today, even in Jason's most recent tool, and as a data scientist, it is very hard to keep up with how quickly things are moving. Even just the number of tokens and the research on tokens and chunking and everything's moving so fast. So it's been super fun for me to just stay in arXiv and read what people are putting out and see how you can leverage that moving forward. So the joy and the fun has come for me from arXiv and Bellman-Ford Shortest Path.

 

Jennifer Reif (48:43):

Okay, so to close out, we're going to look at the upcoming events for August. First up, on August 7th, or 8th if you're in APAC regions, is the Graphversations Knowledge Graph for Personal Medicine with Sixing. He's written several blog posts that are really amazing and so there's going to be a webinar discussion with him. Again, that'll be August 8th if you're in the APAC region and then August 7th if you're on the other side of the globe. So check that out if you're interested to hear that.

 

(49:15):

On August 8th, we also have the Kansas City Graph Database meetup. That's actually a meetup I've been out to speak at once or twice now. It's a really great group of individuals. So if you're in the Kansas City, Missouri area, feel free to check that out. And that'll be on August 8th.

 

(49:30):

On August 10th, Neo4j will be at the National Institutes of Health, or NIH, Tech Day. That is all online. So if you're going to be involved in that in any way, you don't have to commute anywhere for that one. So check out the Neo4j presence there.

 

(49:47):

We have a couple of DataEngBytes events coming up: August 24th in Sydney, August 28th in Brisbane, and then August 30th in Melbourne. So there will be some Neo4j presence there, a booth as well as, it looks like, a session or two. So if you're in those areas, feel free to check that out.

 

(50:07):

On August 29th, we're also going to have a presence at Google Next in San Francisco, so opposite side of the globe there, if you're going to be around in the Bay Area. And then also that same day on the opposite side of the pond, Neo4j will be at the Copenhagen Developer Festival. So lots of things going on no matter where you happen to be in the world. So if you happen to be local to one of those, feel free to check those out.

 

(50:27):

And finally, we have a Graph Stuff channel on Discord. So if you're listening to the podcast and you have some feedback or comments or even some suggestions for things you'd like to see, we'd love to hear from you and kind of incorporate some of that. We want to present content and provide content that is what you guys want to hear. So feel free to post in our Discord Graph Stuff channel for anything like that.

 

(50:58):

And lastly, I would like to extend a very warm thank you to Julian for joining us today on this podcast session and for filling us in on all of his knowledge graph experience. We appreciate your insight.

 

Will Lyon (51:12):

Yeah, thank you, Julian.

 

Alison Cossette (51:12):

Yes, thank you, Julian.

 

Jennifer Reif (51:13):

All right, we will catch everyone in the next podcast episode for next month.

 

Dr. Julian Grümmer  (51:17):

Thank you. Bye.

 

Alison Cossette (51:18):

Take care all.