GraphStuff.FM: The Neo4j Graph Database Developer Podcast

May The Graph Be With You

Episode Summary

Today we're talking about things that happened in the Neo4j ecosystem during the month of April 2023 including some takeaways from PyCon, our upcoming NODES online conference, what the Neo4j community has been up to this month, and of course, more examples of using LLMs with graphs.

Episode Notes

Submit your audio question for the GraphStuff.FM podcast:

Episode Transcription

Will (00:00):

Welcome to GraphStuff.FM, your audio guide to the graph technology galaxy for developers and data scientists with a focus on Neo4j.



My name is Will and in the studio today we have our resident data scientist, Alison. Hi, Alison.


Alison (00:19):



Will (00:19):

And our Python expert, Jason.


Jason (00:22):

Hi, Will. Hi, everybody.


Will (00:24):

Today we're going to be talking about things that happened in the Neo4j ecosystem during the month of April 2023. We'll talk about some takeaways from PyCon, our upcoming NODES online conference, what the Neo4j community has been up to this month, and of course more examples of using LLMs with graphs.



But first, let's take a listener question.


Jason (00:51):

So, this listener question came in, actually, from a LinkedIn post versus an audio question, and it comes from Karen Dalton. So I got to meet Karen Dalton briefly last year, and she is the organizer of BAyPIGgies, which is a Python group up in San Jose.



And she asked "What Star Wars character would best be the equivalent of Kevin Bacon in the Six Degrees of Kevin Bacon game?"



And very promptly after I brought this in-house, Will actually found a data set that would answer this. Will, could you talk about what you did very quickly?


Will (01:26):

Yeah. A few years ago, Dr. Evelina Gabasova did some interesting work around building the Star Wars social network. She wrote a few blog posts about this and also created a GraphGist, which is an embedded Neo4j document for sharing graph ideas.



So, I remembered that and so I went looking for her material on this to answer this question. It's really quite neat. I'll link a conference talk that Evelina gave, but basically what she did is she took the Star Wars movie scripts, which are all in this very common format that specifies the beginning of a scene, who the speaker is, who the characters are in the scene, and so on.



And she basically parsed these scripts so that anytime a character spoke to another character, or they're in the same scene together speaking, that's an interaction between these two characters in the graph.
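The co-occurrence idea described here can be sketched in a few lines of Python. This is only an illustration, not Evelina's actual code: it assumes the scripts have already been parsed into a list of scenes, each scene being the list of speakers in order, and the `scenes` data and the `interaction_counts` helper are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, pre-parsed input: each scene is the list of speakers in order.
# Real screenplay parsing (scene headings, speaker cues) is not shown here.
scenes = [
    ["LUKE", "C-3PO", "LUKE"],
    ["LUKE", "OBI-WAN"],
    ["HAN", "LUKE", "OBI-WAN"],
]


def interaction_counts(scenes):
    """Count one interaction per pair of characters who speak in the same scene."""
    counts = Counter()
    for speakers in scenes:
        # sorted(set(...)) drops repeated speakers and gives each pair a stable order
        for a, b in combinations(sorted(set(speakers)), 2):
            counts[(a, b)] += 1
    return counts


edges = interaction_counts(scenes)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Each pair becomes a weighted relationship between two character nodes. Characters who never get a speaker cue would be missed by a pass like this one.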



And she said this was fairly easy to do, because these movie scripts use this common format, but when she started to look at the data, there were some problems. So, like, R2-D2 is not in the data set, Chewbacca is not in there. And so when she was looking closer, she realized that R2-D2 doesn't speak in the script. He frantically beeps, and Chewbacca emphatically barks.



So she talks a little bit about how she had to adjust her process to pick up these sorts of interactions as well. But once this data's been parsed from the scripts and these character interactions are built up, she can then analyze the Star Wars social network, which is really several social networks, one for each movie, looking at the prequels versus the original trilogy, these sorts of things.



And so I think to go back to the question from Karen here, which is what character would be six degrees of Kevin Bacon, I think you would want to look for the character that has the highest betweenness centrality.



So there are lots of different ways to measure importance in a social network or in a graph in general. And these are called centrality measures. The simplest is degree centrality, which is just the number of relationships that a node has. In this case it would be the number of other characters that a character interacted with, which you would think might be a good indication of who would be the best Kevin Bacon, who's connected to everyone else in the network.



But actually that doesn't take into account interacting with different clusters, interacting with different groups within the network. And that's what betweenness centrality measures: how connected is a node between different clusters, different groups in the network.
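To make the difference between the two measures concrete, here is a small pure-Python sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary, and it computes betweenness with Brandes' algorithm (counting how often a node lies on shortest paths between other nodes). The toy graph is invented: two tight triangles joined by a single "bridge" node. It is not the Star Wars data, and in Neo4j you would normally use the Graph Data Science library rather than hand-written code.

```python
from collections import deque

# Invented toy graph: two triangles joined through one bridge node.
graph = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a3"],
    "a3": ["a1", "a2", "bridge"],
    "bridge": ["a3", "b1"],
    "b1": ["bridge", "b2", "b3"],
    "b2": ["b1", "b3"],
    "b3": ["b1", "b2"],
}


def degree_centrality(graph):
    """Number of relationships per node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every undirected pair was counted once from each endpoint
    return {v: c / 2 for v, c in score.items()}


degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print(max(degree, key=degree.get), max(betweenness, key=betweenness.get))
```

In this toy graph the bridge has the fewest connections of the three middle nodes, yet every path between the two triangles runs through it, so it comes out on top for betweenness.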



And so with Evelina's analysis in the prequel network, she found that Obi-Wan had the highest betweenness centrality, Padme was a close second, and using the measure of just degree centrality, Anakin was a clear leader.



So you can definitely see how Obi-Wan and Padme are interacting with different groups, which boosts their betweenness, versus Anakin, who's interacting maybe with a smaller group of characters.



In the original trilogy, it was Luke and C-3PO who had the highest betweenness centrality. Evelina published her dataset to GitHub. So I wanted to look at the entire social network, since I didn't see Evelina's analysis published on that. And so I took the data, which is published as JSON documents on GitHub, and loaded it into Neo4j using Cypher and APOC's load JSON procedure. Then I went to Neo4j Bloom, which is Neo4j's data visualization tool, which has an integration with the Neo4j Graph Data Science library. So you can do really neat pointing and clicking to run graph algorithms on your data, and see the results visually.
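A rough sketch of what such an import could look like follows. Everything specific here is an assumption rather than Will's actual code: the URL is a placeholder, the document shape (a `nodes` list plus `links` that refer to nodes by position) is assumed, and the `Character` label and `INTERACTS_WITH` relationship type are invented. The `apoc.load.json` procedure itself is part of APOC. The Python helper just shows the same index-to-name resolution locally.

```python
# Placeholder location; substitute the real URL of the published JSON file.
DATA_URL = "https://example.com/starwars-interactions.json"

# Assumed document shape:
# {"nodes": [{"name": ...}, ...], "links": [{"source": i, "target": j, "value": n}, ...]}
IMPORT_QUERY = """
CALL apoc.load.json($url) YIELD value
UNWIND value.links AS link
WITH value.nodes[link.source] AS src, value.nodes[link.target] AS dst, link
MERGE (a:Character {name: src.name})
MERGE (b:Character {name: dst.name})
MERGE (a)-[r:INTERACTS_WITH]-(b)
SET r.weight = link.value
"""
# With the official Neo4j Python driver, this could be run inside a session as
# session.run(IMPORT_QUERY, url=DATA_URL); the driver is not imported here.


def resolve_links(doc):
    """Turn index-based links into (source name, target name, weight) tuples."""
    names = [node["name"] for node in doc["nodes"]]
    return [
        (names[link["source"]], names[link["target"]], link["value"])
        for link in doc["links"]
    ]


sample = {
    "nodes": [{"name": "A"}, {"name": "B"}, {"name": "C"}],
    "links": [
        {"source": 0, "target": 1, "value": 3},
        {"source": 1, "target": 2, "value": 1},
    ],
}
print(resolve_links(sample))
```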



And so I did that. And the character with the highest betweenness centrality was Bail Organa, who I think is a senator that worked with Padme, I believe, which would make sense: a politician, he's working with lots of different groups within the Star Wars universe. A close second was Jabba, which also I guess makes sense. But I'll put my code to do the import and some instructions to do this in Bloom up in maybe a GitHub gist. And we'll link that.



And we'll also be sure to link Evelina's talk that she gave at a conference. I found one of those. And then also her blog posts and the data set itself. So you can play along with that. That's a pretty fun one.



We did something similar around this for the Game of Thrones, looking at character interactions in the Game of Thrones. We called it the Graph of Thrones. And I think there's a Neo4j sandbox example using this data set to do a similar type of analysis on who are the most important characters, finding clusters of characters, this sort of thing. So maybe we'll link that one as well.



Since we're talking about Star Wars.



Jason, is there any other interesting graph Star Wars-related content we should highlight?


Jason (06:36):

Oh, plenty. So again, I couldn't help myself. I decided to ask ChatGPT for some good jokes that combine the Star Wars universe and graph databases. And after sifting through many answers, there were two that I thought were pretty good.



The first one is the dark side joke. So the joke is why did Darth Vader use a graph database? The answer is to find the most efficient path to the dark side. Wo-wow.



And the next one, on the lighter side of the Force: why did Yoda switch to a graph database? Because he realized that size matters not when it comes to databases. It's all about relationships.


Will (07:21):



Jason (07:21):

Also an accurate statement. So an interesting thing when I was asking ChatGPT many times for many jokes was it gave a large number of jokes that were related to visualization, which was interesting, right? Because when we talk about graph databases, oftentimes we use a graph visualization with it because visualizing a graph is very... It's very easy for people to consume, and it shows how often graphs, visual graphs were used to showcase graph databases, even though, right, the power of graph databases is not limited to visualization tools, because you can visualize data from all sorts of data sets.



But yeah, I thought that was an interesting connection that we could talk about more later.



But anyways, I just did want to highlight that you can get a lot of value out of graph databases that is separate from the visualization of it.



Other Star Wars-related content is, I like to remind people of a video that JT, Jonathan Theen, had done last year where he had gone to Kaggle and just pulled up a relatively simple Star Wars data set and just showed how to import it into a Neo4j instance, and to just start playing with it.



So I'll put a link in there for you to take a look at. But also, Alison and me had done a presentation at PyCon last week and we did it with a Star Wars theme. Alison, would you like to tell our listeners a little bit about our presentation?


Alison (09:00):

Sure. Personally, I enjoyed our presentation quite a bit. So, our intention at PyCon was to show developers how they can leverage graph database in some interesting and unique ways.



And so what we did was we made two separate Star Wars-themed apps.



One was the... I'm trying to... what was the exact title on it? It was the Star Wars Developer Portal, I think it was called.



Yes. So basically we went through and built Waze for the Star Wars galaxy. So you have the ability to plot yourself from any planet to any other planet in the Star Wars system, and you could leverage hyperdrive lanes. And we actually added some interesting abilities to avoid certain planets if need be.



Perhaps there's too much of an Empire presence. So do you want to tell us a little bit about how that got built, Jason?


Jason (10:03):

Yeah, so the first thing we did was look for Star Wars Galaxy data. And fortunately there's quite a bit of information out there, between the Wikipedia API, and a wonderful Star Wars Galaxy Map that was created by a community member Henry.



So, to develop these apps, we had to pull together some information from a couple sources. The first one was the Wikipedia API, and then the other was a spreadsheet that was made available from the



So, the information available was great. It had coordinate data and interesting data on basically all the planets and systems inside the Star Wars galaxy.



The only thing we didn't have for creating a Waze mapper was the connections between systems.



So what systems were connected to other systems on particular hyperspace lanes. We did have some of that data in the Wikipedia API, but not all of them, especially like which hyperspace lanes may cross another one.



So what I had done was I had gone to swgalaxymap, and visually looked at the map and saw which connections were available between systems, and I created another doc that basically notated which systems were connected to which ones.



We took that data and we integrated it with all the other data we had available from these other sources, and put it into a Neo4j instance.



Once it was available inside a graph database, we could do a relatively simple query. We just put in the starting system and the end system, and there is a convenience function called shortestPath, and you just put in the Cypher pattern that you're looking for, which in our case is basically the start system, connected to or near other systems, and then to the end system.



So the connections that we wrote down were twofold. One was for which systems were connected to which ones on hyperspace lanes, which in the Star Wars universe is the fastest way to get from A to B.



And then there were certain systems that were not on hyperspace lanes. So for those, Alison created a script that basically found the nearest system, based on the coordinates data, which Hyperspace system was closest to that non-connected system.
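The nearest-system step can be sketched like this. It is not Alison's script: the system names and two-dimensional coordinates are invented for the example, and straight-line (Euclidean) distance is an assumption about how "closest" was measured.

```python
import math

# Invented coordinates: systems that sit on a hyperspace lane...
lane_systems = {"Alpha": (0.0, 0.0), "Beta": (10.0, 0.0)}
# ...and systems that do not.
off_lane_systems = {"Gamma": (2.0, 1.0), "Delta": (9.0, -3.0)}


def nearest_lane_system(position, lane_systems):
    """Name of the lane system closest to `position` by straight-line distance."""
    return min(lane_systems, key=lambda name: math.dist(position, lane_systems[name]))


# Each pair here would become one "near" relationship in the graph.
near = {
    name: nearest_lane_system(position, lane_systems)
    for name, position in off_lane_systems.items()
}
print(near)
```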



So we had two kinds of relationships. We had "Connected to" and "Near."



Once that was available for all the systems, the Shortest Path plot worked on every system inside the database.
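Here is a hedged sketch of the routing idea, including the "avoid certain planets" option. The Cypher string uses the built-in `shortestPath` function, but the `System` label, the `CONNECTED_TO` and `NEAR` relationship types, and the parameter names are assumptions, not the app's real schema. The Python function does the same thing on an invented in-memory map with a breadth-first search, so it can run without a database.

```python
from collections import deque

# Assumed schema; only shortestPath() itself is known Cypher.
ROUTE_QUERY = """
MATCH (a:System {name: $start}), (b:System {name: $end})
MATCH p = shortestPath((a)-[:CONNECTED_TO|NEAR*]-(b))
WHERE none(n IN nodes(p) WHERE n.name IN $avoid)
RETURN [n IN nodes(p) | n.name] AS route
"""

# Invented map: each system lists the systems it is linked to.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B", "E"],
    "E": ["C", "D"],
}


def shortest_route(links, start, end, avoid=()):
    """Fewest-hops route from start to end that skips every system in `avoid`."""
    blocked = set(avoid)
    if start in blocked or end in blocked:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            route = []
            while current is not None:  # walk back to the start
                route.append(current)
                current = previous[current]
            return route[::-1]
        for neighbor in links.get(current, ()):
            if neighbor not in previous and neighbor not in blocked:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


print(shortest_route(links, "A", "D"))
print(shortest_route(links, "A", "D", avoid=["B"]))
```

Avoiding a system is just one more filter on the path, which is the kind of change that would otherwise mean another branch of hand-written conditions.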



So under the hood, that's what powered the Hyperspace Navigator app.



Alison, so when we did our presentation, I had a lot of fun and we had chosen a banter-style approach. Do you want to tell the listeners a little bit more about the benefits of doing a presentation like this?


Alison (12:54):

Sure. So the way we approached it was a little bit of role-playing. So in our scenario, Jason was playing the somewhat graph-naive developer, and I was playing the seasoned graph data scientist.



And the idea was that we walked folks through what is a traditional approach, versus what a graph approach would look like.



And so we set a scenario where Jason, you would go ahead and say, "This data structure might be how I would tackle it." And then I would say, "You could do that. However, another option would be..."



And so in the actual presentation, we were showing how, in-graph, the developer side would be often these very long if-then statements. And then what did the graph end up showing you, Jason?


Jason (13:45):

So the graph showed me that this graph approach to solving the questions we presented was much faster. It was much less work than chaining together a lot of binary functions.


Alison (13:57):

Yes. So we helped you escape Darth If-then, so you didn't have to settle into those long if-then statements. Because one of the things that, especially from a developer's perspective that's really helpful about graph is this idea that you have an ability to manage the many-to-many relationship that could be otherwise challenging when you're trying to hard code all the possible edge cases.



So one of the things I like to say is instead of having to make your code really explore the entirety of the structure of your data and your possibilities, in this case, it's actually driven by the structure of the graph itself. And that those ways to search and to retrieve from the graph are often much more elegant and truly Pythonic in their simplicity.


Jason (14:50):

And so going back to the banter, almost improv-style way of presenting, I definitely recommend this sort of style. Doing that made talking about this fairly complex subject matter very easy, I think, for the audience to consume. And it was just a lot of fun to do.



Shifting gears to the booth experience. So we did sponsor PyCon, so we had a booth in the sponsor hall. In the booth, it was also a great experience. We had a lot of people come by the booth, a lot more than we were expecting, because we did not choose the best booth location.



But still there were three of us, and we were pretty much standing and talking to different folks the entire time. Alison, were there any highlights from the booth experience?


Alison (15:38):

The highlight of the booth experience for me was really having the ability to talk to people about what we had covered in the talk that we gave. We had a lot of people who came to visit us who had seen some challenges that they had and were really curious about graph. And then we also had the ability to talk to folks who were already using graphs somewhat.


Jason (16:03):

While we were at the booth, we had the fortune of having a microphone on set, and we asked a few folks about their experience at PyCon, related to either our talk or with graph databases.



So have a listen.



We're at PyCon 2023 in Salt Lake City. My name is Jason, I'm with Gianni here at the Neo4j booth, and we love everything graph, graph databases. Yeah. Gianni, for you in your life, what are you tackling right now that [inaudible 00:16:34] graphing problem?


Gianni (16:35):

I mean, I'm measuring a lot of performance metrics when it comes to new updates being pushed out. So tracking individual users, or we'll do high-level performance over many users. And so there's just so many box plots measuring data performance, or just grasping connectivity between different users.



And this user does one thing and it's connected to this user, because they're a part of the same family and all kinds of stuff like that. So I'm graphing all the time.


Jason (17:01):

Okay. So I got a question. Box plots. What are box plots?


Gianni (17:05):

Box plots measure... So they can have outliers, but then they have these little... It's literally a rectangle with a line that shows the median data. And then you have the 75th percentile quartile, which is the third quartile, 75th percentile.



And then you have the first quartile, which is the 25th percentile. And the two end nodes, I believe are the fifth percentile [inaudible 00:17:26] range of data, and then I usually mix those in with violin plots or sorry, not a violin. Oh, I love, this is a mix between a kernel density plot and box plots, and it's just perfect to show, okay, even though we have some outliers up here, the majority of people are falling within this range.


Alison (17:40):

Yeah, exactly. So basically what it does is it allows you to plot categoricals against a continuous variable. So your categoricals are across the bottom, your continuous variable is your Y-axis, and so it shows you that spread for each of those subsets of categoricals.
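For readers who want to see the numbers behind a box plot, here is a minimal sketch using only Python's standard library. The data is made up, and the outlier rule used here (points more than 1.5 times the interquartile range beyond the box) is one common convention and an assumption for this example; plotting tools differ on where exactly the whiskers end.

```python
import statistics


def box_plot_summary(values):
    """Quartiles plus outliers under the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up measurements with one clear outlier.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```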


Jason (17:57):

Okay. Next question. What is a violin plot?


Alison (18:00):

So, a violin plot is like a box plot, but it has shape and it basically takes the histogram of what that data is and makes it the shape. And it's even on both sides. So it looks like a violin.


Jason (18:14):



Alison (18:15):

So it's a combination of the categorical, the continuous variable, and the distribution, the histogram of that continuous variable.


Jason (18:25):



Gianni (18:26):

Yeah. All right.


Jason (18:27):



Alison (18:28):

The data science nerds are convening here at the PyCon booth.


Gianni (18:32):

Oh, that's awesome. Love violin plots.


Jason (18:33):

Hello. Welcome to our booth here at PyCon US 2023 in Salt Lake City. Thank you for joining our presentation earlier in the week. So I had a question. How did you, the presentation, was it useful?


Marcello (18:46):

It was fun. First of all, it was an interesting topic and I like the way that you demoed the systems and how all... The relationship between them. I specifically liked how some of the plain English queries were written in the Cypher and how you were able to say, "I want to find this data, but just avoid this other data."



And that was very cool to see. And that's something that I had never seen before. So this is a new endeavor for me. So your talk opened my eyes to a world that I just had never seen before, which is cool.


Jason (19:17):

Cool. Exciting. We've been talking about data modeling. This seemed like quite a complex problem. What is your current graph problem that you're trying to solve?


Marcello (19:27):

Well, the problem is that we have applications that are being deployed and they have dependencies. And we're not talking library dependencies, we're talking external dependencies, like routes from one system to another system. We're talking about database access.



So when the application developers are adding their applications to a server, they also have to follow up with other installation methods for setting up those routes, for setting up those database access handles. And we would like to get to a world where just simply deploying the application by virtue of having those relationships predefined in the database, that the application deployment would also trigger the installation of the dependencies as well.


Jason (20:05):



Marcello (20:06):

So that's how I see this, because it would also help us visualize the relationship between all the applications in our company, as well as everything that they depend on.



So I could also say that if I have a database that I'm looking to delete or deprecate, I can immediately see who is using this database, without having to dig into the database itself and finding the queries and looking for fingerprints or looking for TCP connections, or sockets, or what have you. So having these pre-built and guaranteed deployment relationships would allow us to ultimately simplify everybody's life by orders of magnitude.
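The "who is using this database?" question is a reverse traversal over dependency relationships. A minimal sketch of that lookup, with invented application and resource names and a plain dictionary standing in for the graph, might look like this:

```python
from collections import defaultdict

# Invented deployment data: each application lists what it depends on.
depends_on = {
    "billing-app": ["orders-db", "payments-route"],
    "reports-app": ["orders-db"],
    "portal-app": ["billing-app"],
}


def dependents_of(resource, depends_on):
    """Everything that depends on `resource`, directly or through another app."""
    reverse = defaultdict(set)
    for app, dependencies in depends_on.items():
        for dependency in dependencies:
            reverse[dependency].add(app)
    found, stack = set(), [resource]
    while stack:  # follow the reversed edges until nothing new turns up
        for app in reverse[stack.pop()]:
            if app not in found:
                found.add(app)
                stack.append(app)
    return found


print(sorted(dependents_of("orders-db", depends_on)))
```

In a graph database the same question would be a single variable-length pattern match instead of a hand-written traversal.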


Jason (20:39):

Nice. Very cool. I love that.


Alison (20:42):

Like "Oh, we know how to solve that problem for you. Step on up to our data modeling workshop."


Jason (20:50):



Alison (20:50):

Oh my gosh, thank you so much.


Marcello (20:51):



Alison (20:51):

It was so nice to meet you.


Marcello (20:51):

Thank you very much.


Jason (20:53):

So a special thank you to both Marcello and Gianni. Thank you for lending us your time and giving us your experience. And while we're talking about data modeling, I did want to mention something that actually came from Alison's input during the booth experience.



So Alison, you had asked while we were making all these arrows data models, is there a way to take the arrows data model and put it into a data importer instance? Now, the data importer tool is a tool for folks to drop in a lot of CSVs or TSVs, and then to model data just like in arrows, but to hook up that data from the CSVs to the data model, so that once everything has been mapped out, you can click run, import, and ingest all that data in the form of that data model right into a Neo4j instance.



Now, arrows and data importer, though the UI experience is the same, they have slightly different purposes.



So Arrows can output images for you to share, but it also exports a JSON file so that you could re-import it into someone else's Arrows instance. But that JSON file cannot be imported into the Data Importer tool because its JSON spec is slightly different, right? Because it's been set up to connect with the CSV data.



After Alison asked, I spent the weekend after we got back from PyCon, building out a quick Streamlit app that just takes an Arrows app json and makes it importable into the data importer.



So we'll put a link to that in the show notes as well in case you are wanting to quickly mock up something, share it, but then also be able to use it as part of an import process.


Alison (22:41):

And Jason, the reason why I asked you if there was a way to do it was because you had previously built something that was very similar. So do you want to talk about that a bit?


Jason (22:51):

Oh yes. The mock data generator app. So this app I've been working on, on and off, for the last few months, and it allows you to take an Arrows data model and notate it in a special way so that, when ingested into the app that I've created, it will generate synthetic data from that graph data model.



For example, say you build a quick data model with users who work for... Or person who works for another person who works at a company located at XYZ, for each node and each relationship, you can notate it for how many of those nodes or how many of those relationships you want created and what properties you would like.



So if I want a name, then you pick what kind of mock generator you want to attach to that property. So for names, you'll probably want to select a first name generator, or you can have a company name generator.



You could generate random integers. So I have two dozen functions that generate random data and that's just basically assigned by the special notations. And once the app is done, it gives you a zip file that you can upload into data importer, and you can adjust things there or just run import.
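To give a feel for the idea, here is a toy sketch of model-driven mock data. It is not Jason's app and does not use its notation: the `model` dictionary format, the generator names, and the sample values are all invented for illustration.

```python
import random

# Invented generator registry: name -> function that produces one value.
GENERATORS = {
    "first_name": lambda rng: rng.choice(["Ana", "Ben", "Chen", "Dee"]),
    "company_name": lambda rng: rng.choice(["Acme", "Globex", "Initech"]),
    "int": lambda rng: rng.randint(1, 100),
}

# Invented notation: how many nodes per label, and which generator fills each property.
model = {
    "Person": {"count": 5, "properties": {"name": "first_name", "age": "int"}},
    "Company": {"count": 2, "properties": {"name": "company_name"}},
}


def generate_nodes(model, seed=0):
    """Create synthetic node rows for every label in the model."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    rows = []
    for label, spec in model.items():
        for i in range(spec["count"]):
            props = {
                key: GENERATORS[generator](rng)
                for key, generator in spec["properties"].items()
            }
            rows.append({"label": label, "id": f"{label}-{i}", **props})
    return rows


nodes = generate_nodes(model)
print(len(nodes), nodes[0])
```

Rows like these could then be written out as CSV files for an import tool to pick up.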



And that entire synthetic dataset gets pushed into the Neo4j database that you've assigned. So as part of that process embedded in it was the ability to convert arrows to data importer-friendly format. So I just extracted that and made the arrows to data importer app.


Alison (24:21):

I love it. I love it. I have to say it's such a useful tool and it's so straightforward how to use it. So a big thumbs up from me on that one, Jason, thank you for making that.


Jason (24:34):

Cool, cool. So circling back to talking about talks, we hold an annual developer conference at the end of each year called NODES, and the CFP for it will be open in just a few weeks, I believe. Will, do you know anything about the CFP process?


Will (24:52):

Yeah, so a CFP is something that most conferences do when they are planning out the agenda, which is to put out the CFP, the call for proposals. And so this is typically open to the public: write up a little proposal for something you would like to talk about, and submit it to the conference organizers. And they typically have some process to determine what talks to select for the agenda. So that's basically what a CFP is.



The Nodes CFP, like you said, Jason, I think is not quite open yet, but we know it's coming. So we wanted to mention this just to get the idea out for folks starting to think about. So if you have just a little idea, a little voice in the back of your head saying, "Hey, this project I worked on with Neo4j was super cool. I'd love to share the outcomes, what I learned," maybe teach some technical aspect that you learned along the way. If you have any sort of idea around that, start thinking about how you might share that in a conference talk, and the CFP or writing the CFP, writing your proposal is really the first step to preparing your conference talk.



Maybe we should do a series in these episodes of the path of creating a conference talk, perhaps. And right now we're at the proposal stage.


Jason (26:15):

Nice. Yes, that's a very good idea. Yeah, that's great, Will. Do you have any advice on what people should think about when they're creating a CFP?


Will (26:26):

I do. Yeah. So I have helped to review some of the Nodes CFPs, the Nodes proposals in the past. And I think the ones that really stand out that are clear that this is going to be a really interesting compelling conference talk, they're able to sort of embed the story arc into the description.



And this is difficult because in a CFP, you don't have a whole lot of space, a whole lot of room to convey all the ideas for your talk. It's typically like a title, an abstract, some description of your talk, maybe an outline, your bio, and then sometimes you have some private information you can send that goes just to the conference organizers and isn't published with the program. But I think if you're able to convey the story arc, and this, I think, is an important component for lots of conference talks, but what is the challenge?



So maybe I wanted to analyze some sort of data with Neo4j, or I wanted to build an integration with Neo4j and some other technology. That's a challenge, and how does our hero rise to the occasion? What technical skills did you need? What data analysis techniques were you actually able to leverage to accomplish this? And then what were the outcomes?



So being able to see that story arc I think is really important. And then of course, also making sure that the topic that you're proposing, and especially the specific angle that you're taking on the talk is relevant for the attendees of the conference. So, if you're submitting to a data science conference or an application development conference, you should think in terms of who are the attendees, what are they going to be interested in?



And then I think the third thing that makes a compelling conference proposal, is being able to show your passion for the topic. Why is this something that is important to you? Why do you want to share this story with conference attendees beyond just building the thing that you built?



So these are the things that I would think about as I'm trying to put together a proposal.


Jason (28:36):

Okay, so this last month, there was a ton of content that was produced and a good chunk of it was from Tomaz Bratanic. And the first one I'd like to talk about is his graph-aware GPT-4 article.



So actually he's got a thread that's been going on for the whole month, which has been to teach ChatGPT or LLMs in general, how to create Cypher statements, and then to take the response, and present it to a user in a human-friendly format.



So the general workflow that he's been working with is to have someone ask a question of ChatGPT or another LLM, take that question, and convert it to a Cypher statement that is then run against a Neo4j instance and gets a response, and then ChatGPT takes the response and converts it to a more human-friendly format.
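The workflow described here boils down to three steps, which can be sketched with plain functions. This is an outline under stated assumptions, not Tomaz's code: `llm` and `run_cypher` are stand-ins you would replace with a real model call and a real database call, and the stubs below return canned values so the flow can be run end to end.

```python
def answer_question(question, llm, run_cypher):
    """Question -> Cypher statement -> database rows -> plain-language answer."""
    cypher = llm(f"Write one Cypher query that answers: {question}")
    rows = run_cypher(cypher)
    return llm(f"Question: {question}\nRows: {rows}\nAnswer in plain language.")


def fake_llm(prompt):
    """Stub standing in for a real LLM call."""
    if prompt.startswith("Write one Cypher query"):
        return "MATCH (m:Movie) RETURN count(m) AS movies"
    return "There are 3 movies in the graph."


def fake_run_cypher(query):
    """Stub standing in for a query against a Neo4j instance."""
    return [{"movies": 3}]


answer = answer_question("How many movies are there?", fake_llm, fake_run_cypher)
print(answer)
```

Because the final answer is built from the rows the database returned, the model has less room to invent facts.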



So in this graph-aware GPT-4 article that he produced, he was talking about how you can... You're basically doing this process, right? You're basically backing an LLM with a knowledge graph to prevent hallucinations, right? Because you're narrowing its ability to just source random bits of data and to just use the answers from the knowledge graph to give you a more accurate answer.



So he did mention in his experiments a significant difference between GPT-3.5 Turbo and GPT-4. So apparently 3.5 has this... He called it a Canadian leaning: it always wants to explain things and apologize if it can't give you exactly the answer. Whereas GPT-4, when given instructions to not explain things and to just focus on its role, does a much better job adhering to those instructions.



So, long story short, he talked in this article about ways of using prompt engineering to train and to give examples, and how to get GPT into this mode of being a Cypher query interpreter and instruction creator.



And one of the interesting finds that he had was that even though he did all the training, and all the information is in English, he tried asking it questions in different languages and it was able to produce the same level of answers in those languages despite not having done any foreign language training. So that was interesting.



The last of Tomaz's articles that I want to talk about is on generating Cypher queries with GPT-4 on any graph schema. So this is the culmination of his other GPT-related docs. In this one, instead of using prompt engineering and giving GPT examples of Cypher statements that it should use, he gave GPT just the graph schema, right?



So the nodes, the relationships, the type of properties that they have, and then went straight to asking questions and seeing what type of answers GPT gave back.
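A schema-only prompt of this kind could be assembled roughly as follows. The schema dictionary format, the wording of the instructions, and the movie-themed example are all assumptions made for illustration; the article's actual prompt may differ.

```python
# Invented schema description: node labels with properties, plus relationship patterns.
schema = {
    "nodes": {"Person": ["name", "born"], "Movie": ["title", "released"]},
    "relationships": [("Person", "ACTED_IN", "Movie"), ("Person", "DIRECTED", "Movie")],
}


def schema_prompt(schema, question):
    """Build a prompt that gives the model only the graph schema and the question."""
    node_lines = [
        f"(:{label} {{{', '.join(props)}}})" for label, props in schema["nodes"].items()
    ]
    rel_lines = [
        f"(:{start})-[:{rel_type}]->(:{end})"
        for start, rel_type, end in schema["relationships"]
    ]
    return "\n".join(
        ["Use only this graph schema to write a Cypher query.", "Nodes:"]
        + node_lines
        + ["Relationships:"]
        + rel_lines
        + ["Return only the Cypher statement.", f"Question: {question}"]
    )


prompt = schema_prompt(schema, "Who directed the most movies?")
print(prompt)
```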



And so this approach of course is much more scalable, much faster to set up than creating a huge list of example questions and Cypher queries, and the results actually turned out, according to him, quite good. So one more article from Tomaz was on LangChain and Neo4j.



Alison, could you tell us a little bit about that article?


Alison (32:03):

Sure. For those who may not know what LangChain is, LangChain is a way that you can use LLMs, and it's pretty much the primary way that people are doing that right now. But basically what happens in the LangChain agent is that the user asks the question and the question is sent to the LLM along with the agent prompt, and then the LLM responds with further instructions to either immediately answer, or to get additional information, and then it gets sent back to the LLM again.
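The loop described here can be sketched without any library at all. To be clear, this does not use LangChain's API: `run_agent`, the `decision` dictionary format, and the stubbed model and tool are hypothetical names chosen only to show the control flow of "answer now, or fetch more context and ask again".

```python
def run_agent(question, llm, tools, max_steps=3):
    """Ask the model; either return its answer or call a tool and ask again."""
    context = []
    for _ in range(max_steps):
        decision = llm(question, context)
        if decision["action"] == "answer":
            return decision["text"]
        # The model asked for more information: run the named tool, keep the result.
        context.append(tools[decision["action"]](decision["input"]))
    return None  # gave up after max_steps


def stub_llm(question, context):
    """Stub model: asks for a graph search first, then answers from the context."""
    if not context:
        return {"action": "graph_search", "input": question}
    return {"action": "answer", "text": f"Answer using: {context[-1]}"}


tools = {"graph_search": lambda text: f"graph facts about '{text}'"}
result = run_agent("Who directed The Hobbit?", stub_llm, tools)
print(result)
```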



And so what Tomaz was looking at, he had come across... I'm trying to remember who it was, Ibis Prevedello, who was using graph search to enhance the LLM by providing additional external context. So when Ibis was using this, Ibis was using NetworkX.



And so Tomaz wanted to say, how could we develop this with Neo4j? So in his GitHub repository, you'll find langchain2neo4j, which is integrating Neo4j into the LangChain ecosystem.



So really what goes on is that he went through and was able to come up with three different modes that are currently implemented.



One is generating Cypher statements to query the database.



So, following up on the work that you just mentioned, Jason, being able to take the input and the user question from that LLM and to create a Cypher statement to query the database.



He also implemented full text keyword searches of relevant entities. So from the response finding the three main entities that could be used for additional context, and then also vector similarity searches. So the way that it goes through, there's one example that he used in Vector similarity that I thought was really interesting, is that the question actually came in about The Hobbit and some of the additional information that was provided to it, was on The Lord of the Rings.



So it was because of the vector similarity of each of those movies, because he was using movies throughout the article, that it was coming through.
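
To make the vector similarity idea concrete, here is a small Python sketch using cosine similarity. The three-dimensional vectors are invented toy values, not real embeddings, and the function names are hypothetical; a real setup would use embeddings from a model and would typically run the search in the database.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings" (made up for illustration): the two Tolkien films point
# in nearly the same direction, the third film does not.
MOVIES: Dict[str, List[float]] = {
    "The Hobbit: An Unexpected Journey": [0.9, 0.8, 0.1],
    "The Lord of the Rings: The Fellowship of the Ring": [0.85, 0.9, 0.15],
    "The Matrix": [0.1, 0.2, 0.95],
}


def most_similar(query_vector: List[float], k: int = 2) -> List[str]:
    """Return the k movie titles whose vectors are closest to the query."""
    ranked = sorted(
        MOVIES.items(),
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


if __name__ == "__main__":
    # A question "about The Hobbit" also pulls in The Lord of the Rings.
    print(most_similar(MOVIES["The Hobbit: An Unexpected Journey"]))
```

With these toy values, a query vector near The Hobbit ranks The Lord of the Rings second, which is the kind of neighbor that ends up as extra context.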



So obviously everyone's very curious about leveraging LLMs right now, and this was just an easy and interesting way of taking that graph ability and putting it inside the actual application.



So, instead of the user responding to that first question to supply the additional context, now the agent is actually going to be programmed to go in and get the additional information. So I just thought it was really interesting the way that he followed up on what Ibis was doing and showed the integration with Neo4j.



A couple of things to note: while LangChain does take in various models, right now langchain2neo4j only uses the OpenAI model. And the other interesting thing is that when you're building it, you actually have to build in some of the prompts themselves, so it's not just code. Because you're programming that second user question, you'll have to actually bring some prompt art into that as well. It could be an interesting way to leverage the capabilities of graph in your next LangChain project, so it's definitely worth looking into as an option.


Jason (35:35):

Cool, thank you. The last Medium article I'd like to highlight is one from a new author on Medium, Hareen, and they write about Twitch graph network analysis using Neo4j. It's a fairly short article, but it's a great introduction to both the APOC and the GDS packages that are available, and they use them just to import some Twitch data and to analyze it. And I thought it was interesting because they were writing about kind of a real-world use case of comparing LOAD CSV versus the neo4j-admin import CLI tool for ingesting all this data. So whereas LOAD CSV took quite some time, on the order of minutes, the admin import tool took seconds for them.



Okay, moving away from articles and going into videos, there were quite a few videos this month as well.



One I'd like to highlight was an interview that Max Andersson did with Donovan Bergen of J.B. Hunt Transport Services.



So it was on using a tool that he had created, Schema Smith, for data governance.



So, in a nutshell, what Donovan had created was a YAML-based configuration system for running scripts that will either build a new database or basically correct an existing database, by defining all the nodes and relationships in a YAML file.



The scripts will go in and create all the indexes, constraints, node properties, and relationship properties to prep a new database for all the incoming data that will go in, or it will take an existing database, and if there are typos and extra nodes or relationships that really shouldn't be there, it, I believe, will basically trim or correct those nodes and relationships.
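
As a rough illustration of the config-driven approach, the Python sketch below turns a declarative schema into Cypher statements and flags labels that are not in the config. It is not Schema Smith's actual format or code: the layout of `SCHEMA` (a plain dict standing in for a parsed YAML file), the function names, and the naming of constraints and indexes are all assumptions.

```python
from typing import Dict, Iterable, List

# Hypothetical schema definition; in a YAML-based tool this would be
# loaded from a file rather than written inline.
SCHEMA: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "nodes": {
        "Person": {"unique": ["name"], "indexed": ["born"]},
        "Movie": {"unique": ["title"], "indexed": []},
    }
}


def schema_statements(schema: dict) -> List[str]:
    """Build idempotent Neo4j 5 constraint and index statements."""
    statements: List[str] = []
    for label, spec in sorted(schema["nodes"].items()):
        for prop in spec.get("unique", []):
            statements.append(
                f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
                f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
            )
        for prop in spec.get("indexed", []):
            statements.append(
                f"CREATE INDEX {label.lower()}_{prop}_index IF NOT EXISTS "
                f"FOR (n:{label}) ON (n.{prop})"
            )
    return statements


def unexpected_labels(schema: dict, labels_in_db: Iterable[str]) -> List[str]:
    """Labels found in the database that the schema does not define."""
    return sorted(set(labels_in_db) - set(schema["nodes"]))


if __name__ == "__main__":
    for statement in schema_statements(SCHEMA):
        print(statement)
    # A typo such as "Moive" shows up as a label the schema does not know.
    print(unexpected_labels(SCHEMA, ["Person", "Movie", "Moive"]))
```

Because every statement uses `IF NOT EXISTS`, the same script can be run against a fresh database or an existing one without failing on what is already there.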



So yeah, it was a good tool. It was advertised as a tooling set that will get you not all the way set up with the database, but like 60, 70% of the way there, right? And he had built this to help with some internal processes that they were having at J.B. Hunt to fully set up dev, staging, and production environments.



Take a look at the video. He walks through in great detail how to set this up in a GitHub repo and configure all the GitHub Actions to make it run.


Will (37:57):

So I noticed a lot of questions in the last few weeks about APOC and specifically folks not able to find certain APOC functions and procedures in the latest release of APOC. So I thought it would be good to talk a little bit about what's going on there.



So, first of all, what is APOC? APOC is the standard library for Cypher. So APOC is a collection of procedures and functions that extend the functionality of Cypher to give you the ability to do things like import data from different formats, functions for working with collections in different ways, different types of data structures, exposing a handful of graph algorithms, enabling things like triggers to define logic that happens during a transaction.



So lots of things in APOC that extend the functionality of Cypher that are super useful. To give you a little bit of history, and some folks may remember this, but procedures were introduced in Neo4j 3.0 as a way to extend functionality of Cypher.



These were an improvement on the unmanaged extensions that we used in Neo4j at the time.



So both of these, both procedures and unmanaged extensions, are written in Java and deployed to the database. Procedures we can then call and use the results of in a Cypher statement. With unmanaged extensions, there was a new REST API endpoint that we would hit, and that was disconnected from Cypher.



So the introduction of Cypher procedures really enabled doing all this workflow within Cypher. And as I said, this was in Neo4j 3.0, and we're in the 5 series now, so this was a few years ago. But at the time, the Neo4j Labs team immediately began building APOC as this library to bundle common functionality in procedures, and actually migrated into APOC a lot of things we may have done previously in unmanaged extensions.



And this grew over the years. There were a number of community contributions as well, and it became something that everyone was using and became the standard library for Cypher, but it was a bit awkward for Neo4j engineering to officially support this simply because of the breadth of what's in APOC.



And so the solution was to come up with two different distributions of APOC, APOC Core and APOC Extended, and this change was released with version 5 of Neo4j and APOC. The Core distribution of APOC is available by default in Neo4j Aura and Neo4j Sandbox, there's a one-click install in Desktop, and to add it to the Docker image, there's an environment variable to set. So this is a default distribution of APOC that you get, which bundles the most commonly used functions. And again, this is officially supported by Neo4j engineering.



Now APOC Extended, this has some additional functionality, things like importing from other databases. It bundles a Mongo integration, JDBC drivers, these sorts of things that are useful but maybe less commonly used, or that need to load large external dependencies, and so it doesn't make sense for them to be in Core.



If you are upgrading to version five of Neo4j, and you see that some APOC functions you were using previously are not available, check the documentation to see if they're in Core APOC or Extended APOC. And we will link a blog post also from Tomaz talking about how do you install the extended APOC distribution in different flavors of Neo4j.
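
One way to check an upgrade before digging through the documentation is to compare the procedures your queries rely on with what the database reports as installed. The Python sketch below is a minimal version of that check. The helper name and the example procedure names are only illustrations, and the helper assumes it is handed a session object with a `run()` method, such as one from the official Neo4j Python driver.

```python
from typing import Iterable, List

# Lists the APOC procedures installed on a Neo4j 5 database.
SHOW_APOC_PROCEDURES = (
    "SHOW PROCEDURES YIELD name "
    "WHERE name STARTS WITH 'apoc.' "
    "RETURN name"
)


def missing_procedures(installed: Iterable[str], needed: Iterable[str]) -> List[str]:
    """Procedures the application calls that the database does not offer."""
    return sorted(set(needed) - set(installed))


def installed_apoc_procedures(session) -> List[str]:
    """Run the listing query on a driver session (assumed to expose run())."""
    return [record["name"] for record in session.run(SHOW_APOC_PROCEDURES)]


if __name__ == "__main__":
    # Example values only; on a real system use installed_apoc_procedures().
    installed = ["apoc.load.json", "apoc.periodic.iterate"]
    needed = ["apoc.load.json", "apoc.load.jdbc"]
    print(missing_procedures(installed, needed))
```

Anything the check reports as missing is a candidate to look up in the Core and Extended documentation.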



Also, in the last few weeks, the BigQuery data connector for Neo4j was announced. If you're not familiar with BigQuery, BigQuery is Google Cloud's serverless data warehouse. Data warehouses are used by enterprises to store all kinds of data, things like powering BI or dashboard tooling, enabling data for machine learning pipelines, analytics, and also publishing public data sets is a common use case for BigQuery, I realize. So things like the GDELT global news dataset, there are also a lot of earth observation and weather public data that is published through BigQuery.



Now, the Neo4j BigQuery data connector allows you to pull data into Neo4j from BigQuery, represent it as a graph, perform some graph data science analytics using Neo4j, then either write the results back to BigQuery or continue on with your analysis in Neo4j.



So, some examples of why you might want to do this: maybe you want to cluster your customers based on their purchase history using a community detection algorithm.



Maybe you want to flag potential fraudulent transactions, based on some graph pattern that's known to be suspect.



Lots of different ways that you could use graph to work with data from BigQuery. So this is a pretty exciting announcement.



We'll link a site that has a demo video and a bit more information. And if you want to try this out now it's available in early access through Google Cloud.



Another interesting blog post that I saw this month was from Michael Simons on the Neo4j engineering team called Describing a Property Graph Data Model.



And in this blog post, Michael talks about some of the work that he's experimenting with to build a JSON representation of a property graph model, of a property graph schema.



And this issue comes up, I think, typically with tooling around the database. Neo4j, we like to say, is schema optional. So unlike a relational database, where you need to define a schema, what is the data type of every attribute, these kinds of things, you don't need to do that in Neo4j.



You can start importing data, connecting nodes and do that without defining the shape of the data upfront. You can define some optional constraints. So we can say "This property must exist, this property must be unique," these sorts of things, but we don't really have the concept of a full schema that defines and enforces the data model, but there always is a schema. It's either enforced by your database or it's enforced by your application.



So this concept exists, and what Michael is trying to do is address the question of how do we represent, how do we serialize this schema? And so Michael's created as a Neo4j procedure, a proof of concept that will inspect the data in Neo4j and generate a version of this data model, version of this schema using a JSON schema. It's a bit of an overloaded word there, but using JSON with a common structure to define this.
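
To give a feel for what serializing a graph schema as JSON can look like, here is a small Python sketch that inspects sample data and emits a JSON description. It does not reproduce the format from Michael's proof of concept: the JSON layout, the input shape, and the function name are assumptions for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, List


def infer_schema(nodes: List[dict], relationships: List[dict]) -> dict:
    """Derive labels, property types, and relationship types from data."""
    node_props: Dict[str, Dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for node in nodes:
        for label in node["labels"]:
            props = node_props[label]  # Register the label even with no properties.
            for key, value in node["properties"].items():
                props[key].add(type(value).__name__)

    rel_patterns = {
        (rel["start_label"], rel["type"], rel["end_label"]) for rel in relationships
    }

    return {
        "nodes": {
            label: {key: sorted(types) for key, types in sorted(props.items())}
            for label, props in sorted(node_props.items())
        },
        "relationships": [
            {"from": start, "type": rel_type, "to": end}
            for start, rel_type, end in sorted(rel_patterns)
        ],
    }


# Invented sample data standing in for what a database inspection returns.
SAMPLE_NODES = [
    {"labels": ["Person"], "properties": {"name": "Keanu Reeves", "born": 1964}},
    {"labels": ["Movie"], "properties": {"title": "The Matrix", "released": 1999}},
]
SAMPLE_RELATIONSHIPS = [
    {"start_label": "Person", "type": "ACTED_IN", "end_label": "Movie"},
]

if __name__ == "__main__":
    print(json.dumps(infer_schema(SAMPLE_NODES, SAMPLE_RELATIONSHIPS), indent=2))
```

A tool could load a document like this to know, for example, which properties exist on a given node label.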



And in the blog post he lays out some examples of where this might be useful, and it's largely in tooling built around the database, or integrations for the database. So for example, in the Neo4j GraphQL library, there's a version of this. In Workspace, we may use a representation of the data model to enable the data importer and also to enable the query tab in Workspace for things like being able to give you Cypher autocomplete. If we know what the schema is and you're typing a pattern with a node label, we know what properties exist on that node label.



We can power autocomplete this way. And I think there's a few versions of this. I know I implemented my own kind of version of this years ago in one of the data import projects that I worked on, just having some way to represent what is the schema, so you can then use that throughout your application. Anyway, we'll link this in the show notes. I thought this was a great step forward to have a unified way of representing the graph schema, and this is available for anyone to try.



We'll also link the code on GitHub if you want to give that a try and let us know any feedback that you have.



Cool. Well, over to Alison. I think you're going to tell us about some updates to Graph Academy.


Alison (46:59):

Yes. New updates in Graph Academy include a new course on Cypher aggregations, where you will be leveraging collect or count. There are a number of aggregation functions: sum, min, max, standard deviation, et cetera.



But one of the most interesting ones for me is pattern comprehensions. So when you send the query, it will return you a list of all the patterns that satisfy a particular match query that you've put in. So I think that's really interesting in that course.
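
For listeners who want to see the difference, the Python sketch below holds one query that uses aggregation functions and one that uses a pattern comprehension. The `Person`, `Movie`, and `ACTED_IN` names follow the familiar movies example and are assumptions here, as is the helper function, which expects a session from the official Neo4j Python driver.

```python
from typing import List

# Aggregation functions: one row per actor, summarizing their movies.
AGGREGATION_QUERY = """
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name AS actor,
       count(m) AS movie_count,
       collect(m.title) AS titles,
       min(m.released) AS first_release,
       max(m.released) AS latest_release
"""

# Pattern comprehension: the list is built from every match of the
# pattern inside the square brackets, with an optional WHERE filter.
PATTERN_COMPREHENSION_QUERY = """
MATCH (p:Person {name: $name})
RETURN [(p)-[:ACTED_IN]->(m:Movie) WHERE m.released > $year | m.title] AS titles
"""


def titles_after(session, name: str, year: int) -> List[str]:
    """Titles the person acted in after the given year (empty if no match)."""
    result = session.run(PATTERN_COMPREHENSION_QUERY, {"name": name, "year": year})
    record = result.single()
    return list(record["titles"]) if record else []
```

With a real session, `titles_after(session, "Keanu Reeves", 2000)` would return a list of titles, assuming the movies dataset is loaded.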



Additionally, there's been an update to the Import CSV challenge. So that will now be using Workspace, which is the platform that includes Bloom, Query, and the data importer.



There is also an update to the Building Neo4j Applications with Go course, so it'll be using the new Go driver. The course itself is actually a little bit shorter than the previous one and runs in Gitpod.



And the other thing is they've improved the course recommendations as well. So this should be a little bit easier to figure out which of the recommendations of the courses would be appropriate for your next step. So lots of exciting things happening over at Graph Academy.


Will (48:14):

Great. Let's wrap things up as we usually do by talking about our favorite tools of the month. What tools have we used that have brought us joy this month? I'll go first.



So my tool of the month is Cypher subqueries. So we've talked about these a few times in previous episodes, but a Cypher subquery is basically an independent Cypher statement that is embedded as part of a larger Cypher statement.



Previously, I think we've talked about the EXISTS subquery, where we're able to use these as a predicate in a WHERE clause to check to see if a complex graph pattern exists. There's also the COUNT subquery, where we're getting the count of some pattern that we're able to match on in the graph and then return the count of nodes that we encountered, or something like this. There's the... we'll call it CALL IN TRANSACTIONS. This allows us to batch an operation across multiple transactions.



So this allows us to be more memory-efficient, as we are maybe processing lots of rows or iterating over a large file.



And the newest subquery is the collect subquery, which I think was introduced in Neo4j 5.7, which came out maybe a week ago.



But the idea here is this allows us to process and return a collection, so a list of things, and we may want to be able to do some post-processing, where maybe we have multiple lists that we want to concatenate and then sort, or we have a more complex Cypher statement to build up our list that we want to break out from our overall larger Cypher statement.
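
The four subquery forms Will mentions can be put side by side. The Python sketch below stores one example of each as a Cypher string and runs it through a session from the official Neo4j Python driver. The labels, properties, CSV file name, and helper name are assumptions for illustration.

```python
from typing import Dict, List

SUBQUERY_EXAMPLES: Dict[str, str] = {
    # EXISTS: a predicate that is true when the inner pattern matches.
    "exists": """
        MATCH (p:Person)
        WHERE EXISTS {
          MATCH (p)-[:ACTED_IN]->(m:Movie)
          WHERE m.released > 2000
        }
        RETURN p.name AS name
    """,
    # COUNT: the number of matches of the inner pattern.
    "count": """
        MATCH (p:Person)
        RETURN p.name AS name,
               COUNT { (p)-[:ACTED_IN]->(:Movie) } AS movies
    """,
    # CALL ... IN TRANSACTIONS: commits in batches to limit memory use.
    # It must run as an auto-commit query, which session.run() provides.
    "call_in_transactions": """
        LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
        CALL {
          WITH row
          MERGE (p:Person {name: row.name})
        } IN TRANSACTIONS OF 1000 ROWS
    """,
    # COLLECT: returns the inner query's single column as a list.
    "collect": """
        MATCH (p:Person)
        RETURN p.name AS name,
               COLLECT {
                 MATCH (p)-[:ACTED_IN]->(m:Movie)
                 RETURN m.title ORDER BY m.title
               } AS titles
    """,
}


def run_example(session, name: str) -> List[dict]:
    """Run one of the example queries and return its rows as dictionaries."""
    result = session.run(SUBQUERY_EXAMPLES[name])
    return [record.data() for record in result]
```

The COLLECT form is where the post-processing Will describes fits in: the inner query can sort or filter before the list is handed back to the outer statement.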



So, definitely these things were useful for me this month on a few projects. I was just using CALL IN TRANSACTIONS to load a big dataset earlier today. So, definitely a useful feature.



We'll link the documentation page that goes into a lot more detail about these. Alison, what was your favorite tool this month?


Alison (50:39):

I have a favorite tool this month, and I'm not sure if it's considered cheating, since it has been mentioned on today's podcast already, but it is definitely Jason's Arrows to Data Importer app. As he mentioned when we were live data modeling, being able to share those with people and have them bring them right into the system, I think, really helps with that enablement stream.



And so I'm super excited about that one this month.


Jason (51:08):

Okay, so my favorite tool of the month was probably the Arrows app. We used it extensively at PyCon and it's just a solid workhorse application to quickly do some data models, and the sharing function is really great and super convenient.


Will (51:22):

Great. I think that is all that we have for you this month. I'll mention just a couple events that are coming up.



In Sydney, we have an in-person Graph Summit event coming up on May 2nd. Then on the eighth we'll be in Melbourne for Graph Summit Melbourne, and at the end of the month, we have Graph Academy Live covering Cypher Fundamentals. That is listed on our meetup page, so we'll link that as well.



And just a reminder for folks, we opened the episode today with a listener question. Those are my favorite sections of the podcast. I love hearing from the community. What are things you're interested in, what would you like us to talk about on the podcast? And so anyone can submit an audio question, just go to, look for the submit your question button and you can record that and send to us from your web browser, or on your phone, whatever works for you.



If there's something you'd like us to answer, some challenge that you're having, maybe how would you model a certain data set? We love to talk through these because if it's something that you're thinking of, there are definitely other folks listening to the podcast who have the same question, or are maybe facing a similar challenge.



And maybe also, while we're talking about conferences for the next few episodes, maybe tell us what's your best and worst conference experience, and we can share those as well.



Great. Thanks for joining everyone, and we will see you next time.


Jason (53:04):

All right, thank you, Will. Thank you, Alison. Thank you, everyone else. Have a great rest of your week, and rest of the month.


Will (53:09):

Thanks, all. We're looking forward to hearing your questions. Have a great month and may the graph be with you.