GraphStuff.FM: The Neo4j Graph Database Developer Podcast

What Is Graph Data Science?

Episode Summary

Welcome to GraphStuff.FM, your audio guide into the interconnected world of graph databases. We'll take you on a journey to unravel the mysteries of complex data and showcase the power of relationships. Today we will explore the complex world of data science and add context to AI and generative AI topics. Is it actually as complex as it seems? We will find out!

Episode Notes

Episode Transcription

Jen (00:00):

Welcome to GraphStuff.FM, your audio guide into the interconnected world of graph databases. We'll take you on a journey to unravel the mysteries of complex data and showcase the power of relationships. Today, we will explore the complex world of data science and add context to AI and generative AI topics. Is it actually as complex as it seems? We will find out. I'm Jen and I'm here in the studio with my colleagues Alison, ABK, and Will. Hello.


Alison (00:26):

Hello, hello.


ABK (00:27):



Will (00:27):

Howdy, howdy.


Jen (00:28):

We're going to be talking about graph data science. Let's dive right in.


Will (00:32):

To kick us off, maybe we can kind of synthesize some common questions that I've seen in the community that come up a lot around graph data science, and especially right now there's a lot of, I feel like renewed interest around AI, especially around the generative AI things like ChatGPT, large language models, these sorts of things. In some of our previous episodes, we've talked around these sorts of things.  



We've talked about how you can use an LLM to generate Cypher queries. Tomaz has done a lot of work around this, but I think we haven't really done a deep dive into what is actually graph data science. How does it relate to some of these larger artificial intelligence machine learning topics?  



To kick off the discussion here, maybe let's frame the question as something like, I've heard about graph data science, there's a lot of interest around generative AI, what is graph data science? How does it relate to AI? How is Neo4j relevant in this context? Maybe that's a good framing to start off. Maybe just to kick off the discussion, let me take a stab at answering this. I'll point out I am not a data scientist. I come at this from the context of an application developer, like a full stack application developer, who's trying to keep track of a lot of different pieces of my architecture of my app, and I'm interested in adding features is what I care about.



With that context, let me try to take a stab at answering just the broader question of what is graph data science. The way I think about this is thinking about this from the context of access patterns to the database, graph access patterns. I would put this sort of on a spectrum between graph local traversals and graph global operations. Typically, in a transactional graph database where let's say, I have a, maybe it's a movie recommendation application that I'm building, people can write reviews of movies and then I suggest movies they like, something like that.  



Typically, the answer to my question is a traversal through the graph. I have a well-defined starting point. I know, let's say, the currently logged in user, I'm going to start with that node, and I want to know what are all the videos this user has watched, what are all the comments they've left? Something like that.  



I traverse out the graph starting at that user following relationships to movies they've watched, comments they've left, things like this. That's the answer to my question. That's how I can render a view in my application. Graph databases are optimized for these traversals. There's a specific reason I've chosen the graph database to build my application.
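As an aside, the graph-local access pattern Will describes can be sketched in plain Python rather than Cypher. The tiny in-memory edge list, the relationship names WATCHED and COMMENTED, and the `neighbors` helper are all illustrative assumptions, not Neo4j APIs.

```python
# Toy property graph: each edge is (start_node, relationship_type, end_node).
# All names here are hypothetical, chosen only for illustration.
EDGES = [
    ("user:alice", "WATCHED", "movie:Matrix"),
    ("user:alice", "WATCHED", "movie:Heat"),
    ("user:alice", "COMMENTED", "comment:1"),
    ("user:bob", "WATCHED", "movie:Heat"),
]


def neighbors(start, rel_type, edges=EDGES):
    """Follow outgoing relationships of one type from a known starting node."""
    return [end for (src, rel, end) in edges if src == start and rel == rel_type]


# Start at the logged-in user and expand outward. Only that user's
# neighborhood matters to the answer. (A real graph database would follow
# stored adjacency instead of scanning an edge list.)
watched = neighbors("user:alice", "WATCHED")
comments = neighbors("user:alice", "COMMENTED")
print(watched, comments)
```

The point of the sketch is the shape of the query: a well-defined starting node and a bounded expansion from it, with the rest of the graph irrelevant to the result.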



Now, contrast that with graph global operations, maybe I want to know who's the most influential user in my network. To do this, I would use something like PageRank, which is an iterative global graph algorithm where the answer to the question touches every node in the database, goes through this iterative operation to figure out who has the highest PageRank score. These are wrapped up in graph algorithms. With the graph local traversals, I'm writing a Cypher query. I've defined a pattern that I'm looking for in the graph with Cypher.



With a graph global, graph algorithm, these are typically in the context of Neo4j packaged in the graph data science plugin that extends the functionality of Cypher. I have procedures and functions that I'm calling. I'm not necessarily writing a Cypher pattern. That's how I think about the differences between these two things. Graph global, graph local, graph data science falls on the global end of this spectrum.  



Now, as an application developer, I can still use these things together to add functionality to my application. Maybe finding the user with the highest PageRank score is part of a recommendation query or something like that. I'm leveraging that in a transactional use case. That's, I guess, my attempt to answer the question of what is graph data science, I guess from the context of an application developer. Let's check in with Alison, who's our resident data scientist. You can tell me how did I do? Is that a good sort of approximation for what is graph data science?


Alison (04:59):

I think you did a good job of showing what the use case of graph data science is. When I think of graph data science, I think of it in a couple of different layers. Coming from the data science background, I see the parallel with traditional data science. To your point, this idea of, you were talking about PageRank when you're trying to talk about influence, what's the most influential, I think that's the classic way that people think about graph data science. They think about the algorithm itself.



What are algorithms that can only be run on graph structures? We have certain types of algorithms that are unique to certain data types. For example, in natural language processing, we have particular algorithms that are specific to words. For example, sentiment analysis or word embeddings, but in general, we have certain algorithms that align to certain data types. When we think about graph algorithms, we think about these.  



You mentioned one of our centrality algorithms or an influence algorithm, which is PageRank and as you said in PageRank, it calculates what is the importance of the node based on how connected it is to other nodes as well as how important the nodes are that it's connected to. We have other kinds of graph algorithms as well. You also mentioned pathfinding. Pathfinding is a common algorithm as well, the traversal you were discussing. That can be one as well.  
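To make the PageRank description concrete, here is a minimal power-iteration sketch in plain Python. It is not the GDS implementation; the damping factor of 0.85, the fixed iteration count, the dangling-node handling, and the toy "follows" graph are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}.

    Every link target must also appear as a key in out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # A node passes its rank on, split evenly across its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical "follows" graph: three users point at carol, carol points at alice.
graph = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"], "dave": ["carol"]}
scores = pagerank(graph)
most_influential = max(scores, key=scores.get)
```

In this toy graph alice has only one incoming link, but it comes from the highest-ranked node, so she ends up scoring well above bob and dave. That is the "how important the nodes are that it's connected to" part of the description.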



We have a number of other types of algorithms, but when we think of graph data science, we have the algorithms themselves, but we also want to take into account the other aspects of data science such as statistical information. What do we know, how many relationships does that node have? How many second hop relationships does that node have? You can have those more traditional types of statistical information about the nodes as well, which would come into data science.  
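The "how many relationships" and "how many second hop relationships" statistics can be sketched as follows. This is a plain-Python illustration under assumed data; the adjacency dictionary and the `hop_counts` helper are hypothetical.

```python
def hop_counts(adjacency, node):
    """Return (number of direct neighbors, number of nodes exactly two hops away)."""
    first_hop = set(adjacency.get(node, ()))
    second_hop = set()
    for neighbor in first_hop:
        second_hop.update(adjacency.get(neighbor, ()))
    # Exclude the node itself and anything already reachable in one hop.
    second_hop -= first_hop | {node}
    return len(first_hop), len(second_hop)


# Hypothetical undirected graph stored as adjacency lists.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d", "e"],
    "d": ["b", "c"],
    "e": ["c"],
}
degree, second_hop = hop_counts(adjacency, "a")
```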



Then, something in between where it's sometimes called graph analytics where it's kind of a combination of the two. Maybe you want to see the overall structure of the graph. You could run an algorithm on it to do community detection, but maybe you just want to see is the entire graph connected? Is it a series of subgraphs? Some of that is just in the structure itself. When we think of graph data science, what we want to begin to look at is what additional information or signal can I bring to my problem solving project that can only be encapsulated when we look at the data in this connected structure?
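The structural question "is the entire graph connected, or is it a series of subgraphs?" can be answered with a connected-components pass. The sketch below uses a breadth-first search in plain Python over an assumed toy edge list; it illustrates the idea and is not the GDS procedure.

```python
from collections import deque


def connected_components(edges):
    """Group nodes of an undirected graph into connected components via BFS."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            # Visit every neighbor not reached yet.
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical edge list with two separate islands.
edges = [("a", "b"), ("b", "c"), ("x", "y")]
components = connected_components(edges)
is_fully_connected = len(components) == 1
```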


ABK (07:44):

That makes sense to me. I guess the question about some of the, I know there's different sets of algorithms that we have, and I guess you touched on this, some of those algorithms are analytical in nature to tell me the characteristics of this entire graph and then some of them, maybe PageRank is one of these that we'll start with, you can run just by itself and get an interesting outcome. I think those kinds of algorithms, is it true that those turn around and enrich the graph with those? After you're done with the algorithm, the output is that you actually write back to the graph. Now, the graph has got new information that is based on the algorithm. Is that right?


Alison (08:23):

The use cases can vary. For example, it could be that you are a data scientist who currently has some type of machine learning pipeline and what you want to be able to do is leverage that graph structure, and that PageRank algorithm to have that value that you're saying you write back and you extract that and it goes into your traditional 2D matrix of what you're running, say, I don't know, a linear regression on or some type of classification algorithm, a traditional 2D matrix type, or to your point, it can then be an additional piece that's written and stays in the graph itself.



Perhaps, you run the PageRank, you write PageRank to the graph, and then you want to leverage PageRank for some kind of similarity algorithm. That value itself of PageRank becomes its own feature. Whether it's a feature within a graph algorithm or a feature that you then feed into a 2D algorithm, it becomes that new feature. It could be any number of ways that we participate in feature engineering from the data that we have.
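The two paths Alison describes, writing a score back to the graph and extracting it into a 2D matrix, can be sketched in a few lines of plain Python. The customer data and referral edges are made up, and in-degree stands in for PageRank to keep the example short.

```python
# Hypothetical node table: one row of "traditional" features per customer.
customers = {
    "alice": {"age": 34, "orders": 12},
    "bob": {"age": 51, "orders": 3},
    "carol": {"age": 29, "orders": 7},
}

# Hypothetical relationships between those customers (for example, referrals).
referrals = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]

# Step 1: compute a graph-derived score. In-degree stands in for PageRank here.
in_degree = {name: 0 for name in customers}
for _source, target in referrals:
    in_degree[target] += 1

# Step 2 ("write back"): store the score on each node as a new property.
for name, score in in_degree.items():
    customers[name]["in_degree"] = score

# Step 3 ("extract"): flatten the nodes into a 2D matrix for a tabular model.
columns = ["age", "orders", "in_degree"]
feature_matrix = [[customers[name][col] for col in columns] for name in sorted(customers)]
```

After step 2 the score lives on the node and can feed another graph algorithm; after step 3 the same score is just one more column for a regression or classifier.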


ABK (09:38):

That makes sense to me. I've got a sense now of as you first set this up, Will, of course now with the graph local queries, that's the classic Cypher pattern matching stuff, graph global algorithms either for analysis or enrichment, that's graph data science, now the third piece of the diagram here that we're trying to paint with words is, and I don't quite get this, how does that all fit in with generative AI and all the cool stuff that the world is turning onto these days? How does GDS relate to that?


Alison (10:11):

Sure, that's a great question. Let's just start by clarifying what is generative AI and in our context, we're particularly talking about large language models. We can talk about those a little bit. Roughly, what happens in a large language model at its most basic is that the algorithm, leveraging deep neural networks, will take the language it has access to, that it's trained on, tokenize it and break that down, and then create probabilistic sentence outcomes, shall we say, that take into account not just the rules of how those words were used in the past, but also the context and the meaning of those words, to create new sentences that it may not have seen before.



The question then becomes in this age of generative AI, how does this relate to GDS? There's a couple of different ways. I know in our previous podcast, we had talked about leveraging LLMs to write Cypher. One of the things about this availability of ChatGPT and these other LLMs is that it allows the common man, the non-technologist to interact in a way that they may not have been able to before. As we said, you say, "Show me a graph," and that runs some Cypher query underneath.



The other thing that we can look at is, and where I find what we have at Neo4j and where the graph becomes more interesting is how leveraging graphs can actually improve your large language model. For example, there was one that we had talked about regarding LangChain. In LangChain, you have the initial query and then a subsequent query. Leveraging the knowledge graph of your, say, your company's data to produce a better output or training your large language model on your internal customer service book to help run your chatbot.



There's a number of different ways that we can look at generative AI as it relates to GDS. Generally, they're going to be either in the way that you are consuming the graph by querying that graph or it could be the way you are improving things in the backend based on that graph.  



One of my personal favorites is this idea, what we're working on now, is leveraging the large language model to help create the graph. Taking the large language model, taking that, putting it on top of a piece of unstructured data and having it return the knowledge graph to you. You run that iteratively over a corpus of works and you then begin to build the knowledge graph of all of those structured pieces, many layers to it. Some folks were talking about leveraging, they called it GraphGPT, that ability to take unstructured data and turn it into a graph, to actually create summaries of works.



If we have a number of documents that all feed into a single knowledge graph, that gives you the overview of that body of work. If you're looking at subsets of the corpus, you can look at those kinds of things as well. There's many ways that the large language models and graphs work together in this particular instance. It's, like I said, in the front, in the back and in the middle as well.


Will (13:53):

These are really three complementary technologies then? It's not one is a replacement for the next, even though like GDS does pathfinding, you don't want to reach for that necessarily because you can do some pathfinding just in Cypher, these are very complementary solutions?


Alison (14:09):

Exactly. We talked last month about the right graph for the right problem. In this case, what I really appreciate about the tool set that we have is we can reach for what we need as we need it. Shortest path I think is already built into Cypher. We have some that are just within the plugin. We've got some other new ones coming out, and it's not one or the other. It's about, to your point, ABK, how all the tools fit together in order to find the optimal solution for whatever your pain point is or whatever your goal is. Will was talking about I want to make this recommendation. How do I put this into my application? How do I use GDS to improve what I'm delivering in this app?


Will (14:55):

One of the challenges that I've had when using ChatGPT and other LLMs, I think are what are known as hallucinations. This is the case where the LLM tells you something that is not factually accurate. LLMs are great for this sort of creative brainstorming process, but if you don't live in that world, if you live in the factual world of, there's the great example of the personal injury lawyer who used ChatGPT to create a brief or some motion that he filed in court that cited, I think, three precedent court cases that simply did not exist, and he of course, got in trouble for that.  



In my world, it's more like, help me write some code to do something to solve some problem. There've been several times where ChatGPT tells me to use a Python package that doesn't exist. You use the Neo4j Python package for whatever it was. There was some sort of geospatial thing that simply doesn't exist, and that's concerning to me because that's not helpful.



One of the things I'm most excited about this idea of combining graphs with large language models is, as Alison was saying, to improve the results of your model, to improve the model as almost the secondary process where you're using your internal knowledge graph, whether that's your enterprise data or maybe your personal data, whatever the context is, but you're basically using that to verify the results from the large language models.  



In this context of the lawyer generating the false motions, it would be, do these court cases exist? Does this relationship between these cases and the precedent law actually exist in my legal knowledge graph? If you have that step almost as a post-processing step to verify this in your knowledge graph, that can really address this problem of hallucinations and enhance the value that you're seeing.  
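The post-processing step Will describes can be sketched as a simple lookup against a knowledge graph. Everything below is assumed for illustration: the case names are invented, the graph is a pair of Python sets rather than a database, and the "LLM output" is a hard-coded list.

```python
# Hypothetical legal knowledge graph: known cases and CITES relationships.
KNOWN_CASES = {"Smith v. Jones", "Doe v. Acme"}
CITES = {("Doe v. Acme", "Smith v. Jones")}


def verify_citations(claimed_cases, known_cases=KNOWN_CASES):
    """Split model-claimed case names into verified and unverified lists."""
    verified = [case for case in claimed_cases if case in known_cases]
    unverified = [case for case in claimed_cases if case not in known_cases]
    return verified, unverified


def relationship_exists(citing, cited, cites=CITES):
    """Check that a claimed citation edge is actually present in the graph."""
    return (citing, cited) in cites


# Pretend this list came back from a language model.
llm_output = ["Smith v. Jones", "Imaginary v. Nonexistent"]
verified, unverified = verify_citations(llm_output)
```

Anything that lands in `unverified` would be flagged for a human to check before it goes anywhere near a court filing.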



Another sort of maybe related, and maybe Alison, you can tell me if this is related or if this is a different idea, I hear folks talking about this idea of explainability in AI. The model is a black box. The model generates something, tells me something, but I don't necessarily understand what are the inputs, what is the data that it's using to generate that? Is there a place for graphs to help with this idea of explainability in AI?


Alison (17:29):

Yeah, that's an excellent point. One of the challenges that is inherent in these multilayer deep learning models is this black box problem that we don't necessarily have that explainability. What becomes possible when you're leveraging something like GDS or knowledge graphs as it's related to it, you could actually then go through and say, "Where did this come from?" Depending on how you connect them, obviously, if you have a fact you can say, "Show me in the graph where this fact exists," and then you can go through and see where that's coming from.  



In that case, you absolutely can use the graph for that kind of explainability. What I like about many of the GDS algorithms is that we can actually see, for example, in PageRank, when we're considering what is the most influential, it's mathematical. You can actually look at any node in the graph and understand how did it come to that score for that node? It's because it's connected to these nodes, therefore, it has this amount of influence and you can track the math that produces that.



Graph definitely has this ability to be more explainable. Obviously, if you're running deep learning, then, you're going to start to venture into some of those black boxes again, but for many of the algorithms that we're using, the math is very clear and that explainability is also very apparent. I know that some of the folks on our responsible AI team right now are looking at graph for understanding bias. Can we see bias in the structure of the graph itself?  



Things that maybe might be missed because it doesn't hit a particular P value in those statistical ways that we're looking for bias, we can now see it in structure by running say, community detection or other kinds of graph analysis so that we can see bias in a new way. With graph, any time that we are bringing context, we are going to have an ability to have more access to other layers of information that are tied to the context itself.


Jen (19:53):

It's been a little bit since I've looked at GDS, to be honest. When I played with it, I was playing with the graph algorithms library back before the GDS library really existed, at least in official context, but what I really found was interesting, we played a lot with, at the time Game of Thrones was really big, that kind of places you in the year in time this was, but we did a lot with the Game of Thrones data set.



What I always found really interesting is that you could utilize the pathways between the nodes, between the entities, and kind of see these relationships forming over time. As you went through either the series or the books, you could see that structure of the graph changing over time and how new alliances were formed or how events caused certain alliances to break apart and fracture.



I really found that interesting that yes, you can run these algorithms and you can list off all sorts of fancy names for these, but when you actually look at the structure of the graph and these algorithms applied to it, you're really understanding what those algorithms are trying to do. From an application developer point of view, that makes the world of data science so much more accessible because now I can see what these algorithms are trying to accomplish and I can verify those results because I can see that structure changing right before my eyes, visually.


Will (21:13):

I feel like visualization is an interesting complement to graph data science and graph algorithms, especially the idea of using the results of your algorithms to style the graph. Alison was talking about using LLMs to point an LLM at a Wikipedia page and there's a technique for generating a knowledge graph as a summary of that Wikipedia page or whatever it is, some body of text, and you can really see the value there when you have the graph visualization, you're basically doing, it's like some entity extraction process to model the entities and then figure out how they're connected, essentially is what's going on.



When you see this in action, it's so apparent to look at just this giant wall of text versus a graph of nodes and how those entities are connected, you can, at a glance, understand what's going on, what this body of text is talking about. Then, to go one step further, if you then apply graph algorithms like some centrality metrics, some community detection, and use that to style, say, the size of the nodes are sized relative to their PageRank score, for example, so that more important nodes are bigger and we color them by the results of some community detection algorithm. It's clear that if you're a blue node, you belong in the blue cluster. If you're red, you belong in the red one, whatever.
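The styling rule Will describes, size by centrality score and color by community, amounts to a small mapping function. The sketch below is plain Python with an assumed palette and size range; it is not how Bloom implements rule-based styling.

```python
PALETTE = ["blue", "red", "green", "orange"]  # hypothetical color palette


def style_nodes(scores, communities, min_size=10.0, max_size=50.0):
    """Map centrality scores to node sizes and community ids to colors."""
    low, high = min(scores.values()), max(scores.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all scores match
    styles = {}
    for node, score in scores.items():
        # Linearly rescale the score into the [min_size, max_size] range.
        size = min_size + (score - low) / span * (max_size - min_size)
        color = PALETTE[communities[node] % len(PALETTE)]
        styles[node] = {"size": size, "color": color}
    return styles


# Hypothetical algorithm results: a score and a community id per node.
scores = {"a": 0.1, "b": 0.5, "c": 0.3}
communities = {"a": 0, "b": 0, "c": 1}
styles = style_nodes(scores, communities)
```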



To me, that's super powerful in visualization tools and Neo4j Bloom supports this. I can do this sort of point and click now to not only use the results of the algorithms, but to actually set up and run the algorithms in Bloom. If you haven't done this before, this is amazingly powerful to just point and click and see the power of using graph algorithms and graph data science in the context of your visualization.



It just adds so much more visual information to potentially tens of thousands of nodes that you're visualizing at the same time. I think this is an extremely powerful combination in that not only can graph data science be useful for all the things we've talked about, but even just for powering visualizations it's really quite magical.


Jen (23:34):

It's not magical, Will. It's math, but sometimes math is magical.  


Alison (23:42):

No, but you're right. That's what's exciting to me about this particular time in AI. For those of us who have been working in it for a long time, we are always the nerds in the back of the room. To be in this position now where it's so readily available and to have these no code opportunities to interact with AI, that accessibility, I think is just a really powerful time.  



We were talking earlier and I was saying I know that the future is here because our Sunday nights we always do dinner with my parents and my children. My parents are now in their 70s and they were talking to me about ChatGPT. Then, the children were explaining that to them, how it worked and how it had been recently banned at school. It's just an interesting time to be able to, like you said, bring GDS to Bloom, bring GDS to things that people can clearly understand and interact with, and have some power within to leverage.


Will (24:43):

We've talked about some use cases and some ideas here with some practical things sprinkled in, but I guess, for someone who wants to get started, if I want to get hands on with graph data science and start doing some of these things, we mentioned that the Neo4j graph data science plugin is a plugin for Neo4j that extends Cypher, but what's the best way to get started? Does it matter if I'm a data scientist or if I'm an application developer? Are there different pathways to get started with this graph data science technology?


Alison (25:20):

It's a great question, Will, and in some ways the on-ramp will change depending on whether you're a data scientist or an application developer or just curious. In some ways they'll be very much the same. We have a lot of content available at, but more importantly, we have something called Graph Academy. At Graph Academy, we have courses that will walk you through how the technology works and you can increasingly participate in that.



One of my favorite tools, honestly, is we have something called Sandbox. In Sandbox, we have datasets that are automatically loaded for you and we have playbooks that walk you through a variety of data science use cases. The good news is, you don't have to write any of the code. We show you all of that code so that you can just walk through and see how it works.



Then, the third option if you're interested, is you can actually set up a Sandbox and then work with a product we have called Workspace. Workspace is the platform so that you can actually load your own data into it. You can use these Bloom capabilities, Bloom with GDS within the Sandbox. The Sandbox has a time limit on it. I know for data scientists, we like to get in, get out, nobody gets hurt. For us, that's usually a pretty good tool.



For folks who are interested in more of the analytics aspects and less of the deep algorithms, leveraging AuraDS, which is a free database instance with some code, will allow you to interact with that as well. Depending on your area of how you want to use it and what your level of experience is with Neo4j or with graph in general, we have an on-ramp for everyone. We'll definitely include in the show notes a couple of those different options and paths, but as I said, if you come to, we can point you in all the directions, but definitely check out Graph Academy.


Will (27:20):

I think it's also important to point out that Neo4j integrates with lots of existing machine learning platforms and tooling. For example, as a data scientist, you are probably using Python all the time. There's a Python client specifically for GDS that's available. There's also an interesting announcement a couple of weeks ago from Google talking about the Neo4j integration with Vertex AI, which is one of Google's machine learning platforms, talking about how to use embeddings in the context of Vertex AI.



Just want to point out that there are lots of interesting integrations with Neo4j and other tooling, which, of course, is really powerful because we're not only using this technology in a vacuum, you, of course, need to be able to integrate with different systems.


Alison (28:14):

The only other thing that I wanted to drop in more for the developers who may not think that GDS has anything of interest for them, sometimes what you'll find, and I don't know if you folks want to speak to this a little bit as well, is that sometimes, you don't know that a GDS tool will be helpful. For example, if you're trying to do this traversal, you're trying to get from point A to point B, maybe you weren't even aware that a shortest path is a possibility.  



I didn't know if there were ways that you had used GDS in a way that previously, you might have done it a different way that was much more laborious and that leveraging GDS enabled you to do something more efficiently as a developer in a way that was easier to maintain. Does anybody have any examples of those experiences?


ABK (29:13):

I still have a very clear split in my head. I think it's a great question, Alison, in that for me, I haven't been able to find that middle ground where there's some things I would do just with Cypher and I just do it with Cypher. Doesn't occur to me that there might be another way in the GDS world. On the other hand, there's things I know I can't do with Cypher, that I'm like, "I'm going to go do community detection," that's a GDS thing, I'm going to completely switch.  



I had the very hard kind of choice where it's either this or that. I know that there's an overlap and sometimes, it might be better to lean rather than go all the way over, but I haven't had occasion for those yet. That's a good thing to bring up though, I should keep in my mind, is be like, let me look for those opportunities and it doesn't occur to me.


Will (29:58):

One of the more interesting aspects, for me, in the context of graph data science is this idea of graph projections. The idea here is that oftentimes, the graph that you want to run your algorithm on is not the same graph that you're storing in the database. We talked on the last episode a bit about data modeling and I think a lot of that is in the context of application development. As application developer, that's oftentimes what we're optimizing our graph data model for.  



Now, with graph data science, the typical pattern is you take the data that's stored in the database and you define how to project that into a new graph. You're inferring relationships, maybe it's a subgraph or maybe you have these multi-hop inferred relationships and you're projecting that into a simplified version of that graph.



This is really interesting because that allows you, I think, more easily to be able to combine this idea of using graphs in Neo4j for application development with the powers of graph data science. I don't need to change my graph model that I'm working with in Neo4j to use GDS and graph data science because the first step is projecting the graph to build out this inferred graph. I'll give a couple of examples of where this came up.



A few years ago, I was working on a project with NBC News around analyzing Russian Twitter trolls in the context of the 2016 election. NBC News had obtained this dataset of deleted tweets from Twitter. There was basically some process that an investigation had found these troll accounts, Twitter removed them. The problem there was researchers and data journalists didn't know what these trolls were doing. They weren't able to analyze the data because it had been removed from the Twitter API. NBC News obtained this deleted data and used Neo4j to help analyze that. I was a technical consultant, I guess, with the NBC News folks.



We had this data of users, most of them were troll accounts, posting a tweet. One of the things that the NBC News folks wanted to look at was how are these troll accounts amplifying other troll accounts? That actually ended up being the majority of what these accounts were doing were simply amplifying the message of other troll accounts. We were looking at retweets.  



The model that we had was something like, a user posts a tweet, and then there's a retweet, which is a new node connected to that tweet, because a retweet is posting a new tweet, and then that's connected to another account. We had three or four relationships to get at this user, retweeted or amplified another user. We projected out this inferred user to user relationship because that's where we wanted to run our graph algorithms, what users are amplifying the message of another. We projected that out from a much more complex model. That, I thought, was a really powerful pattern.
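The projection Will describes, collapsing a multi-hop retweet pattern into a direct user-to-user relationship, can be sketched in plain Python. The two dictionaries below are a simplified, hypothetical stand-in for the stored model, and the account and tweet names are made up; the actual project used Neo4j and GDS projections rather than this code.

```python
from collections import Counter

# Stored model (simplified, hypothetical): who posted which tweet, and which
# tweet is a retweet of which original tweet.
POSTED = {"t1": "troll_a", "t2": "troll_b", "t3": "troll_c", "t4": "troll_b"}
RETWEET_OF = {"t2": "t1", "t3": "t1", "t4": "t3"}


def project_amplified(posted, retweet_of):
    """Collapse user -> retweet -> original tweet -> user into weighted user pairs."""
    weights = Counter()
    for retweet, original in retweet_of.items():
        amplifier = posted[retweet]   # the account that retweeted
        amplified = posted[original]  # the account whose tweet was retweeted
        weights[(amplifier, amplified)] += 1
    return dict(weights)


# Each key is an inferred (amplifier, amplified) edge; the value is how many
# retweets support it.
amplified = project_amplified(POSTED, RETWEET_OF)
```

The resulting user-to-user graph is the one an algorithm like PageRank would run on, while the stored tweet-level model stays unchanged.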



This also came up in another project I was working on that was building a routing application for a specific use case. We had data from New York City, basically the road network. We had some specific aspects of this routing application as like predicates. What we did was to project out a simplified version of the graph. We had a much more complex graph to measure the physical road network, but that wasn't helpful for the routing aspect of it.



We project out the routing graph on top of the physical structure to give us this intersection to intersection route. Then, the GDS plugin has pathfinding algorithms, things like A*, Dijkstra's algorithm, these sorts of things. We used that for building this routing application on top of a simplified projected graph.  
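To illustrate pathfinding over a simplified intersection-to-intersection graph, here is a minimal Dijkstra sketch using Python's standard `heapq` module. The intersections and distances are invented, and this is a textbook implementation rather than the GDS procedure used in the project.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none."""
    queue = [(0.0, source, [source])]
    best = {source: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical projected routing graph: intersection -> {intersection: metres}.
routing_graph = {
    "A": {"B": 100.0, "C": 250.0},
    "B": {"C": 100.0, "D": 400.0},
    "C": {"D": 150.0},
    "D": {},
}
cost, path = dijkstra(routing_graph, "A", "D")
```

Routing predicates like the ones Will mentions could be applied when building `routing_graph`, by leaving out or re-weighting edges before the search runs.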



I just wanted to highlight that idea, I guess, one that this idea of projecting the graph out is super powerful and allows you to have a much more complex graph that you may have optimized for another use case, not data science related, application development related perhaps. I think that aspect really makes graph data science and these concepts a lot more accessible for application developers. I just want to point out that this idea of graph data science is not just for data scientists. It allows application developers to add really powerful features to your applications.


ABK (34:52):

Totally makes sense.  


Jen (34:53):

Welcome to the other side you guys.  


ABK (34:55):

A lot to digest there, a lot of good thoughts, a lot of good from the data science side and obviously from the developer perspective on graph data science, all the fun stuff happening with generative AI. Alison, thanks for some of the hints about where to get started and how people can get going with this stuff. There's one way to get started that people will have to wait a little bit for, but of course you mentioned GraphAcademy.



There's lots of great blogs you can read and, if you wait a little bit or register now, you can join us at NODES 2023 on October 26th, where there's going to be an amazing track just dedicated to AI and ML talks. We're just scratching the surface of this stuff here. At NODES, you're going to have lots of people, lots of great ideas. It should be a pretty fantastic experience for really getting a deep dive on these topics.



Then, we want to touch on some of the, as I mentioned, blog posts. There's been an amazing amount of blog posts being put out around this stuff. Will, you've mentioned this cloud Vertex thing.


Will (35:57):

Exactly. Google's Vertex AI. I think we'll link some of these blog posts in the show notes. They touch on a lot of the topics that we've talked about here. There's a blog from Google Cloud, talking about exactly how you can use graph embeddings in the context of Google's Vertex AI platform. There's been some interesting posts that have come out from some folks internal to Neo4j who are looking at using LLMs with graphs and Neo4j. We'll link those as well.



I won't try to summarize them here. I think we've touched on some of the main ideas in our discussion above that these touch on, but one blog post I do want to talk a little bit about is actually one that I wrote, which is, I guess, a blog post and also a two-page colorful PDF, which is what I call the Spatial Cypher Cheat Sheet.



I've been working on a lot of geospatial projects recently and I wanted to put together a document that talks about and covers a lot of the different geospatial things that we can do with Neo4j in one place. I really like this cheat sheet format, which, if you remember in school, sometimes you're allowed one index card and so you write as small as you possibly can to have everything on there. I really like that format.



I went through and I put together this two-page cheat sheet that talks about a lot of the different ways we can work with geospatial data in Neo4j. I have a hard copy of it here. You can listen to the crinkle of the paper there. Basically, what it is, the first page talks a bit about some of the spatial functionality built into the core database.



A lot of this is built around the 2D and 3D geographic point type that we have available in the database, covering how we can do things like distance searches, bounding boxes, geocoding, importing data that has geospatial components, and how we can do things like some of the pathfinding with A* and Dijkstra that we talked about earlier.
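
For a feel of what a distance search and a bounding-box filter amount to, here is a hedged Python sketch using a plain great-circle (haversine) approximation. The airport coordinates are approximate, the helper names are invented for this example, and none of this is the database's own implementation of its point functions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (an approximation)

# Approximate (latitude, longitude) pairs, used purely as sample data.
airports = {
    "JFK": (40.6413, -73.7781),
    "LGA": (40.7769, -73.8740),
    "SFO": (37.6213, -122.3790),
}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_bounding_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box given by its south-west and north-east corners."""
    return south <= lat <= north and west <= lon <= east


# Distance search: airports within 25 km of JFK (JFK itself is at distance 0).
center = airports["JFK"]
nearby = [name for name, (lat, lon) in airports.items()
          if haversine_m(center[0], center[1], lat, lon) <= 25_000]

# Bounding-box search: airports inside a rough box around New York City.
in_box = [name for name, (lat, lon) in airports.items()
          if in_bounding_box(lat, lon, south=40.4, west=-74.3, north=41.0, east=-73.6)]
```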



Then, on the back of the page, one section is covering using the Neo4j Python driver for working with geospatial data. We're using a flights example dataset for most of this. Assuming we have data in Neo4j, how do I create a GeoDataFrame from Neo4j? If you're familiar with Python and pandas, there's a data structure called a DataFrame. A GeoDataFrame is built on top of a pandas DataFrame to add some geospatial functionality. Basically, there's a column that has a geometry, which then gives you the ability to do some geospatial operations. Anyway, we show how to build one of those.



Then, there's a section on working with OpenStreetMap data and Neo4j in Python. We look at how to pull that into Neo4j and then, we of course cover a little bit on how to build these beautiful visualizations of road networks by, again, applying graph algorithms in Bloom for community detection. In this case, I found that betweenness centrality on a road network can show you intersections that connect other sort of chunks of the road network. I found that to be most interesting. Anyway, the Spatial Cypher Cheat Sheet is out now.
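
To show why betweenness centrality picks out the intersections that connect chunks of a road network, here is a minimal Python sketch of Brandes' algorithm on a tiny made-up graph: two triangles of intersections joined through a single connecting intersection `d`. The graph and function are assumptions for illustration (unweighted, undirected), not the GDS implementation.

```python
from collections import deque

# Hypothetical road network: two tightly connected clusters joined only through "d".
roads = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"],
    "f": ["e", "g"],
    "g": ["e", "f"],
}


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph given as adjacency lists."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from the source.
        stack = []
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        paths[source] = 1
        depth = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            stack.append(node)
            for neighbor in graph[node]:
                if neighbor not in depth:
                    depth[neighbor] = depth[node] + 1
                    queue.append(neighbor)
                if depth[neighbor] == depth[node] + 1:
                    paths[neighbor] += paths[node]
                    predecessors[neighbor].append(node)
        # Accumulate dependencies from the farthest nodes back towards the source.
        dependency = {node: 0.0 for node in graph}
        while stack:
            node = stack.pop()
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every undirected pair was counted once from each end, so halve the totals.
    return {node: value / 2 for node, value in score.items()}


centrality = betweenness(roads)
busiest = max(centrality, key=centrality.get)
```

In this toy network the connecting intersection scores highest because every shortest route between the two clusters has to pass through it.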



I also wrote a blog post version of it because I found that easier for copying and pasting code, but I like the paper version as well. That's available. Alex and I did a live stream also just diving into some of these different aspects of it that's linked in the blog post as well. Of course, we'll list these and some other interesting blog posts from the community in the show notes.


ABK (39:52):

Fantastic. Good stuff, Will, as always. Will, why don't you keep the stage for a minute here. This is our section that we do every month that we love. What's your favorite tool of this month?


Will (40:03):

I love the favorite tool of the month category. It's fun because it forces me to think back and reflect on the project I worked on this month and what was I actually doing, what did I actually find useful? My favorite tool of the month, and I feel like I may have chosen this one previously, but if I did, that's okay. It's quite powerful and that is APOC Load JSON.  



We've talked about APOC before. It's the standard library for Cypher that gives us some additional procedures and functions within Cypher. One of those is APOC Load JSON, which allows us to parse JSON files either locally or remote. I can also use this to hit, like, JSON APIs with parameters, which is really cool. Then, I get back an object in Cypher that I can then, with Cypher, define how I want to process that JSON object to either create data, which is my typical use case, or just iterate over a JSON structure.



In the context of the geospatial functionality that I was talking about earlier, there's this data format called GeoJSON, which is basically adding the idea of features and geometry to a JSON document. This is in the cheat sheet. If you want to see how this works, just check that out. You can see some examples. It's really powerful because it allows us to load GeoJSON using Cypher and then define in the graph how I want to store and model complex geometries. Super useful. I use this all the time. It's in a similar vein to Load CSV, if you've seen that, it's a similar idea, but that is my favorite tool of this month, APOC Load JSON. How about you, ABK? Tell us your favorite tool of the month.
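
As a rough picture of the "features and geometry" structure mentioned above, here is a small Python sketch that walks a GeoJSON document and turns each Point feature into a flat property map, the kind of row you might then write to the graph. The sample document and the `features_to_rows` helper are invented for this example; it mirrors the idea only and is not APOC or Cypher syntax.

```python
import json

# Hypothetical GeoJSON document with one Point feature and one LineString feature.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Cafe"},
     "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]}},
    {"type": "Feature",
     "properties": {"name": "Trail"},
     "geometry": {"type": "LineString",
                  "coordinates": [[-73.99, 40.73], [-73.98, 40.74]]}}
  ]
}
"""


def features_to_rows(doc):
    """Flatten each Point feature into one dict of properties plus latitude/longitude."""
    rows = []
    for feature in doc["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Point":
            continue  # more complex geometries need their own modelling decision
        lon, lat = geometry["coordinates"][:2]  # GeoJSON order is [longitude, latitude]
        properties = feature.get("properties") or {}
        rows.append({**properties, "latitude": lat, "longitude": lon})
    return rows


rows = features_to_rows(json.loads(geojson_text))
```

The LineString is skipped here on purpose: how to store lines and polygons in the graph is exactly the modelling choice left to whoever writes the import.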


ABK (41:49):

Sure, mine is a little less practical than yours in some ways. I think it's amazing the amount of utility you get out of just a single APOC function or procedure. APOC has hundreds of them and each one of them you could probably spend an episode on talking about all the cool things you can do with just that bit. APOC is amazing.  



I have, in the last month, the last two months now, become obsessed after having naively adopted the fantastic Arrows tool. The body of work is not mine, but I've inherited the body of work and I'm reviving it, modernizing it, carrying it forward, and I'm falling in love with it all over again. It is a marvelous tool. If you haven't used it, you should go out there today. Check it out. Quickly put together a graph of your own choosing, either for data modeling purposes or just for doing some sample data. It's a lovely tool. A joy to work with.



I can't think of anything else other than Arrows these days. I'm hoping to get to the point where doing the coding part of it is behind me a little bit and I can get back to actually just doing things with Arrows and then maybe I'll shift my attention to something else. Maybe Arrows plus APOC, that's going to be my goal is to figure out what's the perfect marriage of APOC and Arrows. That'll be for another episode. How about you Alison? What's your favorite tool this month?


Alison (43:08):

My favorite tool this month, not surprisingly, is a GDS algorithm. We have something coming out at the beginning of July called the Bellman-Ford shortest path. We were talking about pathfinding earlier. Bellman-Ford does a couple of things. One of the biggest things that it does is it allows you to use negative weights. Sometimes, when we're calculating the shortest path, you can think of it like when you're using Waze, do I want the shortest distance or do I want the fastest route? In each of those, each segment of the route is going to have a weight.  



Now, you may be in a situation where you want to have a negative weight on a particular connection or relationship. Bellman-Ford is going to allow you to compute shortest paths leveraging negative weights. Additionally, it also employs sampling and parallelization. It actually significantly reduces computation time as well. If you've been struggling with computation time and trying to figure out how to manage negative weights in your shortest path, take a look at Bellman-Ford.
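
For readers who want to see how negative weights are handled, here is a minimal textbook-style Python sketch of Bellman-Ford on a made-up graph. It is a plain sequential version written for illustration: the node names and weights are assumptions, and the sampling and parallelization mentioned above are not reproduced here.

```python
def bellman_ford(nodes, edges, source):
    """Single-source shortest distances; edges are (start, end, weight) and may be negative."""
    distance = {node: float("inf") for node in nodes}
    distance[source] = 0.0
    # Relax every edge up to len(nodes) - 1 times, stopping early once nothing changes.
    for _ in range(len(nodes) - 1):
        changed = False
        for start, end, weight in edges:
            if distance[start] + weight < distance[end]:
                distance[end] = distance[start] + weight
                changed = True
        if not changed:
            break
    # If any edge can still be relaxed, a negative cycle is reachable from the source.
    for start, end, weight in edges:
        if distance[start] + weight < distance[end]:
            raise ValueError("graph contains a negative cycle reachable from the source")
    return distance


# Hypothetical weighted graph with one negative relationship (C -> B).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 4.0), ("A", "C", 5.0), ("C", "B", -3.0), ("B", "D", 2.0)]
distances = bellman_ford(nodes, edges, "A")
```

Here the cheapest way to reach B goes through C because of the negative weight, something Dijkstra's algorithm does not guarantee to handle correctly.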


ABK (44:11):

Alison, that now sounds like one of the answers for the earlier question of when to use Cypher, when to use GDS. This might be one of those occasions where with this particular short [inaudible 00:44:21], I don't want to use that in Cypher. I do want to switch over.


Alison (44:24):

Yeah, that's a really good point. I love that you're seeing yourself in GDS already, ABK.


ABK (44:30):

It's unavoidable.


Alison (44:32):

Jen, how about you? What's your tool of the month?


Jen (44:34):

I'm actually going to highlight a little bit of ABK's with a practical example. I used Arrows, I think it's been in the last week or two. I had a community member, I've been popping out to the community a little bit here and there, and someone had asked me a question: "Here's the structure of my graph, here are my nodes, and this is what I'm trying to do to update the graph." I was trying to follow the text description of the graph. What I ended up doing is just jumping out to Arrows and drawing it. I know I have this entity. I know there's a relationship to this other entity, and here's the structure I ended up with.



Then, I was actually able to send that back to the community member and say, "Hey, this is what I see your structure as from the way you explained it. Does this look accurate? If this is the structure, this is what you need to do in order to go about updating it to plug it in correctly." I will highlight Arrows just to say that I've used it recently and it has been super helpful.  


ABK (45:27):



Jen (45:27):

Then, tagging on my favorite tool of the month, one I've spent quite a bit of time in this month: it's Spring Data Neo4j. If you haven't explored that, feel free to check that out. If you're familiar with Java, with the Spring Framework, Spring Boot, and so on, Spring Data Neo4j is the kind of data plugin to the Neo4j database. They have lots of other integrations with stores such as MongoDB, lots of relational ones by using Spring Data JPA, and so on, but this is the Neo4j integration.



I've spent a lot of time here in the last month especially exploring some features that I never thought I would really use, the more niche features if you will. It's been really interesting and lots of learning paths for me that I didn't have before. Lots of things I'm turning into blog posts or to content hoping that maybe others won't trip over the same things I have in the past. That's my tool of the month. Super useful for writing applications and there's lots of good kind of features in there that maybe aren't highlighted enough on a regular basis.


ABK (46:26):

I love that you're bringing up Spring Data Neo4j. If you think about it, Jennifer, it's been with us almost as long as we've had Neo4j. Spring Data Neo4j has just been there.  


Jen (46:35):



ABK (46:35):

We sometimes forget we have this magical, mature, refined fantastic library.


Jen (46:42):

Yeah, and Spring is often referred to as being magic, but when you start diving into it, you start seeing it's not necessarily that. As we were talking about with GDS, it's not necessarily the black box magic. You're seeing what the designers put into this in order to help you build tools more efficiently and faster and better.


ABK (47:00):

Awesome. Will, do you want to bring us home?


Will (47:02):

Absolutely. We will close things out here as we usually do by touching on a few upcoming events in the next month. For July, first of all, on the Neo4j live stream, every Monday, Michael and Alex do a stream on exploring Aura free by looking at a different dataset. These are really fun and unscripted. My favorite part is watching all the different real world issues that come up and watching Michael and Alex talk through them and figuring out how to get around them. These are always quite interesting. We'll link the different live stream sessions. I don't think we know, I don't think Michael and Alex even know until maybe the day of what datasets they're going to be looking at. That's always fun.  



On July 12th, we have a Neo4j Live graph data science demo talking about how to build a 360-degree view of your customers. Then, there are a few more in a similar format, I believe, the sort of live demo with a graph data science approach: on July 14th, touching on graph data science and machine learning, with times for both Europe and US time zones, as well as on the 19th, looking at fraud detection with graph data science. On July 21st, there's a session talking about the pros and cons of native versus non-native, finding the right graph database.



Great. Thanks so much everyone for listening. As always, check out the show notes to find links to various things that we have talked about. If you don't subscribe, please feel free to subscribe to GraphStuff.FM. We will see you next time. Bye-bye.  


ABK (48:39):



Alison (49:00):

[inaudible 00:49:00].  


ABK (49:01):

Bye all.