In this episode, we cover the latest news and releases announced at NODES 2021 including the Neo4j Database 4.3 release, the availability of Neo4j Aura Free Tier, improved MLOps with Neo4j's Graph Data Science Library, Neo4j's $325M funding round, and the Trillion Relationship Graph demo from the NODES 2021 keynote.
William Lyon (00:00):
Welcome to GraphStuff.FM, the place to get your graph-related news updates. My name is Will Lyon. I'm joined by our cohost Lju Lazarevic. And today we are going to be recapping a lot of the announcements and product releases from the Neo4j Online Developer Expo and Summit, also known as NODES, and its keynote, which was just on June 17th. So we'll be recapping some of the announcements and new features released in the areas of Neo4j database, Neo4j Aura in the cloud, graph data science and the larger graph ecosystem. So a lot of what we'll talk about today, you can also discover by watching the NODES Keynote, which we will link in the show notes. But with that, let's get into it.
Lju Lazarevic (00:55):
Yes. So first of all, the fundraising, and yeah, let's get the biggest piece of news out of the way: Neo4j has become the database company with the largest ever investment. So in our latest fundraising round, we raised $325 million, also receiving a valuation of over $2 billion. And to just give you a bit of a comparison for those of you thinking about it, Neo4j has now raised over $500 million, which is similar to what MongoDB raised prior to their IPO. So quite an impressive number. And some of you may be thinking, well, hang on a minute, this is a developer podcast, why are we talking about funding? So the big thing here, the context for developers, is that the time you invest learning about graph databases, learning about Neo4j, all of the training that you do, learning about the products, that's not in vain.
Lju Lazarevic (02:06):
There is a lot of value in putting that time in, and what this fundraising is showing is that there is a lot of value in Neo4j, it's a key technology, and it's going to have a key place within the database space, just in case you weren't already aware of that. That's the confirmation that you're getting from this funding. And also, it's highlighting the investment in the Neo4j platform, and it's worth seeing, on the back of this, what's new and how this funding is being used now and in the future.
William Lyon (02:42):
On that note, let's talk about some of the new releases that investments like this are enabling. And the first thing we want to talk about is the 4.3 core database. So the 4.3 version of Neo4j database was released at NODES on June 17th. So let's talk about some of the big new features in Neo4j 4.3. There's a lot in here, and we won't talk about everything, but we'll touch on some of what we think anyway are the more interesting and relevant features for you as developers and data scientists. And the first feature that I think is interesting to take a look at is relationship property indexing. So in Neo4j 4.3, you can now create indexes on relationship properties. Previously, you could create indexes on properties on nodes, but not relationships, and this had some impact on how you had to think about designing your graph model.
William Lyon (03:44):
Maybe before we dig into this too much, let's take a step back and talk about why we use indexes in Neo4j in the first place. So you've probably heard this idea of index-free adjacency, which means that we can traverse from one node to any other that we're connected to without using an index lookup. This means it's more of a constant time operation, not impacted by the overall size of the data, just the number of relationships we're traversing. So this is index-free adjacency, this is a really important concept in graph databases like Neo4j, but that doesn't mean that there is no place for an index in the database at all. Instead of using an index when we're traversing the graph, like the equivalent of a join, we're using an index often to find the starting point of our traversal.
William Lyon (04:36):
So you might have a customer ID, an order ID, something like that, or maybe you're looking up a customer by name, or doing, I don't know, a text search to find the name of a blog post, something like this. You use an index to quickly and efficiently find the starting point for your traversal. And so previously, we've done that on nodes, but now with this new feature of relationship property indexing, we can create indexes on relationship properties. So this is really useful, for example, where we may have datasets of, let's say, flight information. So we have a graph where nodes are the airports, and we have relationships connecting them that represent the flight from one airport to another.
William Lyon (05:21):
Now, if we want to look for all of the flights on a certain date, we can now leverage an index to very quickly find the relationships that represent the flights on the certain dates that we are looking for. And so this, I think will be a really important feature that changes how we think about modeling now that we have this in our toolbox. So I think this will be a feature that we'll probably see a lot of examples digging into a lot further in the future.
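Editor's sketch: the flight example above can be illustrated with a toy model in plain Python. This is not how Neo4j stores or indexes data; the flight tuples, function names and index name are all made up for illustration. It only shows the difference between scanning every relationship and looking relationships up by a property value. The Cypher string at the end follows the 4.3 relationship index syntax as I understand it, with an assumed `FLIGHT` relationship type and `date` property.

```python
from collections import defaultdict

# Hypothetical toy data: (origin, destination, date) tuples standing in for
# FLIGHT relationships between Airport nodes.
flights = [
    ("SFO", "JFK", "2021-06-17"),
    ("JFK", "LHR", "2021-06-17"),
    ("SFO", "SEA", "2021-06-18"),
    ("SEA", "JFK", "2021-06-18"),
    ("LHR", "ARN", "2021-06-19"),
]


def flights_on_scan(date):
    """Without an index: inspect every relationship and keep the matches."""
    return [flight for flight in flights if flight[2] == date]


# With an index: build a lookup from property value to relationships once.
date_index = defaultdict(list)
for flight in flights:
    date_index[flight[2]].append(flight)


def flights_on_indexed(date):
    """With an index: jump straight to the relationships for that date."""
    return date_index.get(date, [])


# Assumed names (flight_date, FLIGHT, date); a sketch of what the index
# definition might look like in Neo4j 4.3 Cypher.
CREATE_INDEX_CYPHER = "CREATE INDEX flight_date FOR ()-[r:FLIGHT]-() ON (r.date)"
```

Both functions return the same flights; the indexed version simply avoids touching relationships with other dates, which is the benefit described for finding the starting point of a traversal.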
Lju Lazarevic (05:49):
Another absolutely huge change in the 4.3 database release is how we deal with dense nodes. So very commonly, for those of you who have attended some of our data modeling classes, we'll talk about the Britney Spears model, and also, talking about the relationship indexes, there's a bit of a hat tip there to the modeling examples we do with airports. So you think of the problem: we're looking at Twitter and we want to model all of the people who are following Britney Spears. And you think, I think Instagram is the more relevant one these days. I think we've moved on social media platforms. So let's pick Instagram. So we're looking at all the people that are following Britney Spears or the Rock or a famous celebrity. And by virtue of that, you're looking at a dense node.
Lju Lazarevic (06:41):
And the challenge that you had there with regards to writes is, if you wanted to write a new fan who's now following, say, the Rock on Instagram, you had this situation with locking. So what would happen is you would lock the start node and the end node, you would do your change in the relationship and then carry on. And the problem with that is you have a bit of a performance hit, so effectively you are not able to touch the other relationships connected with that node. And effectively that's what's happening under the covers, what Neo4j does when you go past the engineering definition of a dense node. Now, everything has changed here. So previously, how you would deal with that is you would change your data model. So you change your graph data model to try and reduce the density of your nodes.
Lju Lazarevic (07:31):
Now, there has been a change under the hood in how that data is stored, so effectively, we use relationship chains, and this now removes the need to do the locking on the start node and the end node. So this massively improves the performance of concurrent writes and how you deal with these dense nodes.
William Lyon (07:54):
The next couple of features that we want to talk about are specific to deploying Neo4j in the cloud. Specifically, the first one we'll talk about is Helm charts for Neo4j. So Helm charts, these are like a recipe for deploying and managing services in Kubernetes. And the Helm charts for Neo4j have now graduated out of Neo4j Labs and are now officially supported by the product engineering team. Neo4j Labs, as a reminder, this is where we experiment with some extensions and integrations for Neo4j with the goal, really, of validating that these are useful pieces of technology, and then graduating them to supported core product engineering. This is where things like the APOC Standard Library, Graph Algorithms, the Neo4j GraphQL integrations, the Neo4j connectors for Apache Kafka and Spark, that's where things like those were initially created, incubated and then eventually graduated.
William Lyon (08:56):
So Helm charts for Neo4j have graduated. So you can now use those to deploy Neo4j with Kubernetes using the officially supported Helm charts for Neo4j. And the other cloud and deployment specific feature that we want to talk about is the introduction of server-side routing for Neo4j. And what this means is that it makes Neo4j easier to use in a clustered environment, put behind a load balancer or something like that, by exposing just a single IP address. Previously, clients of the cluster used the IP address of each database instance and the cluster routing was handled client side, but now with server-side routing, you can just use a single IP address to work with the Neo4j cluster.
Lju Lazarevic (09:40):
And let's have a look at some of the other features that we've got to mention. So not quite as big as some of the really exciting ones we've covered, but still key changes to the product. We have got plan optimizations for ORDER BY and LIMIT in your Cypher queries. We've got parallelized backup and restore. We have some security updates, including those to logging. And if you want to find out about these changes and get a bit more detail, as well as the many features that we haven't covered so far, do check out the release notes, and we'll have a link to those in the show notes.
William Lyon (10:16):
Great. So that was looking at some of the new features in the Neo4j 4.3 database release. Let's talk about some of the other tools and products in the Neo4j platform. First looking at some things from the user tools area. So firstly, in Neo4j Browser, you'll notice that there's a new layout for the guides that makes it easier to go back and forth between querying in Neo4j Browser, visualizing your results and working with the guides, the browser guides. You may be familiar with these from Neo4j Sandbox, for example, where you have a guided interface that includes text, images and embedded queries to help you explore data sets as you're learning. So there's a new layout for these, and there's also a way to discover existing guides in the sidebar in Browser. Another cool feature in Neo4j Browser is the introduction of the Monaco editor for the Cypher editor. Monaco, this is the editor that's used in VS Code. So if you're familiar with VS Code, you have a lot of familiar functionality, things like multiple cursors and a lot of the familiar keyboard shortcuts. Those are now available to you within Neo4j Browser.
Lju Lazarevic (11:40):
And we have some updates to Neo4j Desktop as well. So you can now easily spin up sample projects and import the files and data into Neo4j Desktop. So we have a repo which is called neo4j-graph-examples. And these are all copies of the sandbox examples that we all know and love. So we are talking the data sets, we are talking the browser guides, all of that good stuff, and those are available. So anyone can take those and get those up and running in their Neo4j instance. And that can be running on Aura, that can be running on a local machine, you could be running it as a console or a service and so forth, but the one thing that Desktop now allows you to do is very easily pull these sample projects and get them spinning in a local database instance. It's a nice, more slick experience, so for those of you who want to use the sandbox examples beyond the maximum 10 days, this is a great way to do that.
Lju Lazarevic (12:42):
And there is better multi database support, so you are able to work with the multiple databases from the user interface. So for those of you using Neo4j 4.X and above where we have the multi database support, you have a much nicer and slicker experience from Neo4j Desktop, so you don't have to do that via browser.
William Lyon (13:03):
Also, from the Neo4j user tools area of the graph platform, there was a lot of talk about the Neo4j GraphQL library in the NODES Keynote. This Neo4j GraphQL library is another example of a graduated Neo4j Labs project that's now been transitioned to officially supported product engineering at Neo4j. The Neo4j GraphQL library allows you to build GraphQL APIs backed by Neo4j. So we basically define some GraphQL type definitions, point them at Neo4j, and we have a fully functional GraphQL API that handles generating the database queries. We don't have to write GraphQL resolvers. And so if you're interested in building a GraphQL API, this is a really interesting tool to check out. This wasn't released at NODES, but released in April. But what was covered in NODES is that in the month leading up to NODES, there was a Neo4j GraphQL hackathon where projects were judged on things like the ingenuity of the project, the complexity and the functionality of what the teams were able to build.
William Lyon (14:08):
So the winners were announced during the NODES Keynote. We'll talk about the top three here, which I think were really neat projects. The winner was developers.z, which was a site for finding software developers in Zambia, based on their skillsets and the skills required by projects. That's a good graph problem there. Number two was Seed Fund Me, which was like a Kickstarter for seed funding projects. And then the third place winner was Help Me, just like Stack Overflow for social good, where you could ask for help, help others and earn these gratitude points. Again, a really good graph problem. And there are lots of other winners looking at things like industrial pollution analysis, disaster prevention and tracing patient zero, things like this. So there's a blog post that announces the winners with a lot more detail that we'll link in the show notes. And you can also check out the hackathon gallery. Each one of these projects has a short video that you can watch if you want to see some of the really cool things that these teams built.
Lju Lazarevic (15:12):
So another big set of announcements was around the Graph Data Science library. So we are now at version 1.6, and as a quick reminder, graph data science is all about analytics and machine learning with graph data. And a key feature of the 1.6 release of the Neo4j Graph Data Science library is focusing on enabling easier MLOps with your graph data. So let's have a look at what's included. So as we mentioned, the big feature is machine learning. And if we remind ourselves, you generally have two different groupings around machine learning. So you either have unsupervised machine learning or supervised machine learning. And let's start with what the GDS library does around unsupervised machine learning, so this is all about looking for patterns and trends in your data. So when we talk about unsupervised machine learning in the context of graph, we are talking about things such as community detection, centrality, pathfinding and so forth. And these were the starting points of the Graph Data Science library.
Lju Lazarevic (16:24):
And the exciting things in this space: we have some new algorithms, so we've got HITS, that's Hyperlink-Induced Topic Search, and influence maximization under centrality. We have got Speaker-Listener Label Propagation under community detection, and we've now got k-nearest neighbors under similarity. And also, we have got 20 algorithms that have graduated from the alpha tier to the supported production tier. So just as a quick reminder, we've got three tiers within GDS. We've got alpha, beta and production. So alpha are the experimental ones, there are no guarantees about what's going to happen to those algorithms. Beta, you've got some additional functionality, some degree of performance work done, and they typically are candidate algorithms for the production tier, and then the production tier are the supported algorithms. And the supported ones also have another degree of functionality, such as getting statistics and so forth.
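Editor's sketch: in the GDS 1.x releases, as far as I understand the library's naming convention, the tier of an algorithm shows up in the procedure namespace (`gds.alpha.`, `gds.beta.`, and plain `gds.` for production). The small helper below is hypothetical and not part of GDS; it only illustrates that convention. `gds.pageRank.stream` is a real production-tier procedure, while the `someAlgorithm` names are placeholders.

```python
def gds_tier(procedure_name: str) -> str:
    """Classify a GDS 1.x procedure name by tier, based on its namespace prefix.

    Assumes the convention gds.alpha.* / gds.beta.* / gds.* (production).
    """
    if procedure_name.startswith("gds.alpha."):
        return "alpha"
    if procedure_name.startswith("gds.beta."):
        return "beta"
    if procedure_name.startswith("gds."):
        return "production"
    raise ValueError(f"not a GDS procedure: {procedure_name}")


# Placeholder names for the alpha and beta examples; only the prefix matters.
examples = [
    "gds.alpha.someAlgorithm.stream",
    "gds.beta.someAlgorithm.stream",
    "gds.pageRank.stream",
]
tiers = [gds_tier(name) for name in examples]
```

Under that assumption, an algorithm graduating from alpha to production would be called through the shorter `gds.` namespace afterwards.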
Lju Lazarevic (17:28):
So we have had 20 algorithms graduate from the alpha tier. So these include pathfinding and search, centrality, community detection and so forth. And another thing to mention as well is they've got easy to use, consistent APIs, and some of the original alpha tier algorithms are now faster as well. So Dijkstra is 25% faster, PageRank is almost 90% faster and node2vec is 60% faster. So let's touch on the big new part of the GDS library. It is supervised machine learning. So this is all about getting data and effectively giving it some nudge as to what you're looking to predict. So some of the things that we've got within this space for the GDS library are predictions influenced by graph structure. So churn prediction, lead scoring, that kind of thing. So this is all sitting around node classification. So being able to predict what node grouping something should be in.
Lju Lazarevic (18:32):
We've got predictions around graph structure. So this is all about trying to find missing labels and how the graph might change in the future. And we also have link prediction sitting in there as well. And we also have in-graph machine learning, which is another new feature in 1.6, as well as using input graphs as labeled data. And all of this feeds into this idea of supporting MLOps. So we have got the expansion to the model catalog, and there are also more models available for those using the free tier of the GDS library, and this ability to train, store, publish, share, and execute models in the graph. And it's the only complete in-graph machine learning workflow.
William Lyon (19:18):
And of course, graph visualization goes hand in hand with graph data science, and one of the best tools for creating graph visualizations and exploring the graph visually without writing code or Cypher is Neo4j Bloom, and its latest release has some new features focused around graph pattern autocomplete. This is really handy for users that are not familiar with Cypher, but want to query and explore the graph visually. The way this works is you start typing maybe some node labels, relationship types, or just how you would express in natural language how the things you want to bring into the visualization are connected, and you see this really nice dropdown box of the different graph patterns that exist in the graph. You can select what you're looking for. So really handy for building up your graph visualizations for users that aren't comfortable using Cypher.
William Lyon (20:19):
Another feature in Neo4j Bloom is multiple layouts. So toggling back and forth between the force-directed layout and the hierarchical layout. Depending on the type of data you're working with, different layouts can be better for visualizing. I know I worked on a project recently that was looking at county and state connected data, and the hierarchical layout was actually a lot easier for making sense of that data. There's also new functionality for filtering as you are expanding relationships in Bloom and also filtering what's shown in the results table. So be sure to try out Bloom. It's available by default in Neo4j Desktop and on Neo4j Aura.
Lju Lazarevic (21:09):
Speaking of Neo4j Aura, we have some updates around that as well. So for those of you who were at the inaugural NODES, so that was NODES 2019, we launched the professional tier. So that's the self-serve database as a service. And between then and now we also saw the launch of the Aura enterprise tier. And this is all about large scale, mission critical, customized services aimed at the enterprise in the cloud. But something very exciting from NODES: we saw the official launch of the Aura free tier, and it's available now, it's free forever, and you can use it to build graphs with up to 50,000 nodes and 175,000 relationships. So if you haven't already gotten yourself access to Aura Free, do so. You'll find a link to getting your free instance in the show notes.
William Lyon (22:05):
So that was a bit of a whirlwind look at new releases in Neo4j database, in user tools from the Neo4j graph platform, graph data science and Neo4j Aura. Let's talk a little bit about scalability and performance in the world of graphs. So this is an interesting topic because graphs scale differently than other data structures. A graph database such as Neo4j will typically prefer a replicated architecture as much as possible. So in our Neo4j cluster, we have multiple machines, multiple Neo4j instances that we're running in our cluster. And what we typically want to do is replicate data, replicate the graph to those different machines, so that as we're querying the graph, as we're traversing it, we're not introducing the network latency we would get if we had split or cut the graph across different machines. So that's what we mean when we say we typically prefer a replicated architecture.
William Lyon (23:10):
So each machine in our cluster holds a copy of the graph, and this allows us to scale horizontally for reads. So we can add more machines to scale reads, and scale vertically for writes, increasing the size of our machine if we want to scale that way. But what happens now if your data set grows beyond what we can replicate to each machine in our cluster? Well, then we need to shard the data. We need to distribute the data across multiple machines in our cluster. Other databases, like from the NoSQL world, they distribute replicated pieces of the data set across machines, typically using some hashing function so that they know which machine in the cluster has which pieces of data. Then at query time, the database goes to the different machines with the relevant pieces of data, assembles them and returns the results. And this works great for unconnected data. But we said earlier, we don't want to introduce this network latency as we're traversing the graph. So this is not a great architecture for connected data, so for graph data.
William Lyon (24:17):
And again, this is because the primary way that we interact with the graph is by traversing it. And if we have relationships across shard boundaries, that means that we've introduced network latency that's slowing down our traversals. So this spray and pray approach, where we've just randomly sharded data throughout our cluster, this is not going to work for graph data. So what we have with graph native sharding, I guess the way you can think of this is that you invest a little bit in data modeling, so you think about which pieces of the graph to shard. So identify natural subsections in your graphs. Maybe we have customers and customer data, product and product data. Maybe we have regional splits. We're maybe not going to be traversing from our North America customer and product data to our European customer and product data, this sort of thing. It really depends on your domain. If you identify that upfront, invest in the data modeling, then with graph native sharding, you can have unbounded size, linear write throughput and constant read performance.
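Editor's sketch: the contrast between hash-based placement and a domain-aware split can be illustrated with a toy model. This is plain Python, not Neo4j Fabric; the node IDs, the two-shard setup and both placement functions are hypothetical. It counts how many relationships end up crossing a shard boundary, which is where a traversal would have to pay network latency.

```python
import zlib

# Hypothetical toy graph: two regional clusters of customers and products,
# with no relationships between the regions.
edges = [
    ("na_customer_1", "na_product_1"),
    ("na_customer_1", "na_product_2"),
    ("na_customer_2", "na_product_1"),
    ("eu_customer_1", "eu_product_1"),
    ("eu_customer_2", "eu_product_1"),
    ("eu_customer_2", "eu_product_2"),
]

NUM_SHARDS = 2


def hash_shard(node_id: str) -> int:
    """'Spray and pray': place each node by a hash of its ID.

    crc32 is used only because it is deterministic across runs.
    """
    return zlib.crc32(node_id.encode("utf-8")) % NUM_SHARDS


def domain_shard(node_id: str) -> int:
    """Domain-aware placement: one shard per region (assumed ID prefixes)."""
    return 0 if node_id.startswith("na_") else 1


def cross_shard_edges(assign) -> int:
    """Count relationships whose endpoints live on different shards."""
    return sum(1 for a, b in edges if assign(a) != assign(b))


hash_cut = cross_shard_edges(hash_shard)
domain_cut = cross_shard_edges(domain_shard)
```

With the regional split, no relationship crosses a shard boundary in this toy graph, so every traversal stays local. Hash placement ignores the graph structure, so on realistic data it will generally cut some relationships.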
Lju Lazarevic (25:27):
The really exciting part of the keynote was that demo, the trillion relationship graph demo. So let's have a look at what that demo comprised. We were looking at the social forum dataset. This is from the very well-regarded LDBC benchmark, and the benchmark for graph databases here is all about torturing the graph database. It's how can we have the most awkward data set with the most awkward queries to really put a graph database through its paces? So again, taking this thoughtful approach to thinking about how we're going to model this data, which consists of persons, forums, posts, comments, tags, city, country, and so forth. And the thoughtful modeling process suggests that what we should do is have one shard where we have all of the person data, and all of the person data on there was around 3 billion nodes. So this is equivalent to the number of accounts on Facebook.
Lju Lazarevic (26:40):
So this is a good equivalent in size of a very large graph consisting of person data. But for those of you who are familiar with Neo4j, 3 billion nodes, by Neo4j standards, is not going to make it sweat, that's not the problem. So our thoughtful modeling has said, actually, it makes sense to have all of these person nodes on one shard. And then where we're going to be doing the growing of the shards is to say, actually, what we should do is have our forum data, so all of the things around the posts, the comments, the tags, et cetera, around this specific forum, that's what we have on the separate shards. So the idea is one shard, one forum. And what we're going to have on there for performance is some reference data to the person shard as well. And the absolute beauty of this setup is that we're now completely independent as to how many forums we want to add. So we can just keep adding forums.
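Editor's sketch: the "one shard, one forum" layout can be modeled as a toy data structure. This is not Neo4j Fabric and not the demo's code; the class, its methods and the sample names are hypothetical. It shows one person shard, one shard per forum holding that forum's posts plus reference copies of the person IDs it needs, and that adding a forum leaves the existing forum shards untouched.

```python
class ShardedForumGraph:
    """Toy model: a single person shard plus one shard per forum."""

    def __init__(self, person_ids):
        # The single shard holding all person data.
        self.person_shard = set(person_ids)
        # forum name -> that forum's own shard.
        self.forum_shards = {}

    def add_forum(self, forum):
        # A new forum gets a fresh shard; existing shards are not modified.
        self.forum_shards[forum] = {"posts": [], "person_refs": set()}

    def add_post(self, forum, author_id, text):
        if author_id not in self.person_shard:
            raise KeyError(f"unknown person: {author_id}")
        shard = self.forum_shards[forum]
        # Keep a reference copy of the author on the forum shard so that
        # reading the forum does not need a trip to the person shard.
        shard["person_refs"].add(author_id)
        shard["posts"].append((author_id, text))

    def shard_count(self):
        return 1 + len(self.forum_shards)


graph = ShardedForumGraph(["alice", "bob"])
graph.add_forum("graphs")
graph.add_post("graphs", "alice", "Hello graphs")
graph.add_forum("databases")  # does not touch the "graphs" shard
```

Because each forum's data lives on its own shard, the number of shards is simply one plus the number of forums, and growth comes from adding forum shards rather than from rebalancing existing ones.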
Lju Lazarevic (27:48):
We've now been very thoughtful about our data model and we can now shard the graph without impacting performance. And for those of you who watched the demo, or for those of you who didn't, do watch the demo, it starts off with 10 forum shards. So don't forget we've got the one shard for our persons and we now have 10 forum shards. So we're using 10 forums, each forum is on a shard, and we are now looking at around 19 billion nodes and relationships on the person shard and around 11 billion nodes and relationships for the 10 forum shards. So we had queries on user feeds. So things such as posts tagged with topics they're following and so forth. This is coming back in under 20 milliseconds. So this is real time performance. This is the kind of thing where if you were looking at building a web app, the thing that's causing the performance issue is not the database. So this is super fast, sub 20 millisecond performance.
Lju Lazarevic (28:51):
Then the forum shards are grown. So we've gone from 10 to a hundred forum shards. So we're now looking at a hundred billion relationships. And again, same query, looking at the user feed queries, still coming back at less than 20 milliseconds. So we've still got the same latency, but we've increased the size of the graph tenfold. And then this is where the AWS bill gets insane, relatively insane. We're now going to a thousand forum shards. So we've got 1,001 shards in total. So we've got the one shard for the person nodes, and we've now got a thousand shards representing a thousand forums. So again, we've just gone up an order of magnitude again in the size of the data, we've got 1 trillion relationships. And again, we're looking at the user feed query. It's coming back in less than 20 milliseconds.
Lju Lazarevic (29:53):
So let's really test this. And this is what the demo then goes on to do, where you're looking at recent messages from friends of friends, again coming back at double digit millisecond response times. And this is huge. And again, this is on a dataset that is designed to torture graph databases, and Neo4j 4.3, using Fabric with thoughtful modeling, no sweat at all, real time performance on a graph of a size that, I suspect, we never normally see in production anywhere. Just to show you the scale of this. And for those of you who've got a load of AWS credit dollars you want to go and use, check out the show notes. We've got the link to the GitHub repo. So you can take that code and run the experiment yourself and build it out onto a thousand machines.
William Lyon (30:49):
So we just covered a ton of new features and announcements from the NODES Keynote. Definitely check out the NODES Keynote, the video's online. I will link that YouTube link in the show notes. Let us know on Twitter what your favorite is of what we talked about here, just use #neo4j to let us know what you think. We didn't even touch on any of the 50 or so other talks from NODES though. They go into deeper detail on a lot of the things that we just talked about, plus lots of talks from the community, sharing things that they've built, and really so much more. So definitely, if you missed NODES, we'll share a link to the videos in the show notes. So those are all online, available for you to watch. But I think that is more than enough for us to talk about today. So thanks so much for listening in and we will see you next time. Cheers.
Lju Lazarevic (31:48):
See you next time. Have an awesome day.