The Neo4j Connectors enable graphs to be integrated into larger systems using Apache Kafka, GraphQL, Apache Spark, and Business Intelligence tooling. In this episode David Allen, Global Lead Cloud Architect at Neo4j joins us to discuss the ways that the Neo4j Connectors can be used for extending the reach of graphs and what this means for developers and data scientists.
Lju Lazarevic (00:00):
Hello, and welcome to GraphStuff.FM, a podcast series devoted to explaining all those connections in your data. We're your hosts, Lju Lazarevic and Will Lyon.
William Lyon (00:10):
In this episode, we are joined by special guest, David Allen, a solution architect at Neo4j. David's going to give us a look at the Neo4j Connectors that enable using graphs and Neo4j with other technologies, and that enable different architectural patterns. Specifically, we're going to look at the Neo4j connectors for Apache Spark, Apache Kafka, business intelligence, and the Neo4j GraphQL library.
William Lyon (00:40):
A couple of interesting takeaways for me to think about, as we talked to David, is that all of these connectors enable integrations with Neo4j out of the box, but they also support these power ups to bring the unique benefits of graphs, often through Cypher, to your applications and supporting your use cases. Something that was definitely a takeaway for me as we were talking, the sort of light bulb moment that I had, I guess the general theme here is really to use the best tool for the job, this idea of polyglot architectures, where we're really interested in using different technologies to accomplish different tasks.
Lju Lazarevic (01:22):
And now, let's check out that interview with David Allen.
William Lyon (01:26):
So, we are joined today with David Allen, who is a partner architect at Neo4j. Thanks for joining us, David.
David Allen (01:36):
Hi Will. Thanks for having me here today.
William Lyon (01:39):
Of course. Yeah. Do you want to tell us a little bit about yourself and what you work on now at Neo4j?
David Allen (01:46):
Sure. I've been with the company for a couple of years. As you said, I'm an architect, and I work within the team at Neo4j that does strategic partnerships, and in particular, partnerships with the various cloud platforms. Most of my day deals with either building integrations or describing architecture patterns of how people use graph to build bigger and more complicated systems.
William Lyon (02:10):
Gotcha. That sounds really interesting. I know, before you joined Neo4j, you used Neo4j in a few projects. Could you tell us maybe a little bit about what brought you to graphs and Neo4j in the first place?
David Allen (02:24):
Sure. I used to work for this government research and development company called MITRE. And when I was there, I was working on a sponsored research project that dealt with data lineage. So, we had government agencies come to us, for example, and they wanted to make really big and important decisions on the basis of analysis reports that they might get from intelligence agencies or from regulatory sources. They didn't feel in a good position to make these really impactful decisions because they didn't know whether the information the decision was based on was trustworthy or not.
David Allen (02:59):
What you often found was this long chain of derivation kind of problem, where an analyst in the field somewhere writes a report, the report gets summarized, it gets included in something else and so on and so forth, 10 steps down the chain. Then eventually, some senior executive is being asked to make a decision and isn't sure whether the information is trustworthy or not. The way that I got involved in Neo4j is basically I implemented a system that worked with this kind of data lineage stuff on top of MySQL first and suffered all of the pains of trying to build a graph on top of a relational database.
David Allen (03:33):
And I discovered Neo4j when one of my colleagues introduced me to it, re-implemented the system on top of a graph database, and everything got a lot easier after that. Go figure.
William Lyon (03:44):
Neat. What we want to chat about today is the Neo4j connectors. In a previous episode, we covered sort of an overview of the Neo4j graph data platform, and we talked about a lot of the tooling that exists in the Neo4j ecosystem for Graph Data Science, for visualization with things like Neo4j Bloom and Neo4j drivers and Cypher for building applications. But I think these connectors are also an important piece of how we think about the Neo4j graph platform, really putting that in terms of working with other technologies.
William Lyon (04:19):
Because I think an important takeaway is Neo4j does not exist in a vacuum. You're not just using Neo4j, you're also ingesting data from other systems. You also want to integrate your architecture with other systems. I know you spend a lot of time thinking about that, but is that a good way to frame how we think of these Neo4j connectors?
David Allen (04:39):
Yeah, absolutely. People who've worked with graphs, they know how unique the structure is and how different it is from other things that are out there. I like to think about that uniqueness as kind of a double-edged sword. On one side, the fact that graphs are so unique, lets you do these really powerful things, and sometimes we go out and we teach people about how to use Cypher, and then their eyes start to glitter a little bit when they really realize what's possible. That's a really cool moment to facilitate, but the other edge of the blade, so to speak, is how different they are and how strange they can seem for people who are coming from different data formalisms, which are usually things like tables and documents.
David Allen (05:21):
Really, the intent behind the connectors is, not only to ease like the technical integration of moving bits from place to place, but it's basically to make it easy to include graphs in any kind of a larger system architecture so that you can take advantage of the capabilities where they apply. To your point, most people nowadays are using lots of different kinds of databases. We're long past the days of one Oracle database to rule them all. And although Neo4j is great, usually Neo4j isn't even going to be your only database. There's a lot of other things in the mix.
David Allen (05:58):
As an architect, I'm always thinking about how important it is for all of the technical components to be good citizens of the wider ecosystem. What I really want out of a good technical component, whether that's a database or a queuing system or an API, or any other thing that you might use, is for it to be a well-behaved Lego block and for it to be easily pluggable with other things, because ultimately what I'm trying to build is always that bigger thing. That's really the idea of the connectors in a nutshell, is make graphs and make Neo4j a better easier to plug Lego block.
William Lyon (06:39):
Gotcha. That makes a lot of sense. That Lego analogy I think is a really useful way of thinking about that. Cool. Let's dive in and talk about some of these connectors specifically. I think there's a handful of these that we want to talk about specifically and maybe how they work, how we think about them, and then also maybe, what do each of these enable you to build? Maybe talk about examples or something like that.
David Allen (07:02):
Sure.
William Lyon (07:02):
The first one we want to talk about is the Neo4j connector for Apache Kafka. Maybe first, if it's worth talking a little bit about, what is Kafka before we start talking about the connector.
David Allen (07:14):
Yeah. Kafka is an example of what I might think of as a queuing system, where basically, it allows you to create a set of topics. A topic is like a message queue and it allows producers to push a message onto the topic and for a consumer to pull a message off of that topic. Now, that might sound simple, but there's so many different uses of these types of queuing systems. A fundamental reason why people use things like Kafka in the first place is that we might say that they want to decouple producers and consumers. Let's say you have an application that produces JSON messages, and it just chugs along and it produces a thousand JSON messages.
David Allen (07:54):
Well, as your architecture evolves, there might start by being five downstream programs that want to use those messages, and then later on, there might be 10, and then later on after that, there might only be seven programs. You don't want the actual producer of the JSON messages to have to worry about, who are the consumers, how many of them are there, and what format do they want their data in? A system like Kafka, it gets used to decouple the producers of the data from the consumers. The producers simply create the JSON, squirt it out over the wire, and the queuing system in the middle handles who is subscribed to that topic and who receives delivery of those messages whenever they're produced.
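To make that decoupling concrete, here is a minimal sketch of a producer and a consumer using the kafka-python client. The broker address, topic name, and message fields are invented for illustration.

```python
# Minimal sketch of a decoupled producer and consumer (kafka-python).
# Broker address, topic name, and payload fields are illustrative only.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: emits JSON messages without knowing who will consume them.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)
producer.send("transactions", {"id": 42, "amount": 250.0, "currency": "USD"})
producer.flush()

# Consumer: subscribes to the topic and is handed new messages as they arrive.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # e.g. {'id': 42, 'amount': 250.0, 'currency': 'USD'}
```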
David Allen (08:34):
I think the second thing that Kafka does is it produces more of a push model. Normally, if we're using a RESTful API, for example, if you want information from the server, you have to go and ask for it. In Kafka, if you subscribe to a topic, Kafka will notify you and push messages to you whenever new data is available. That's really what people like about what the technology does. Now, when people are using Kafka with Neo4j, usually what they're trying to do is set up some sort of a polyglot persistence or a microservices type framework where they are replicating data from their MySQL database to their Neo4j instance on an ongoing basis, or maybe they are running some Cypher inside of Neo4j and they want to notify some other component in their architecture whenever that Cypher matches something, for example, a fraudulent financial transaction.
David Allen (09:25):
That's how all of this stuff fits together, is that the Kafka integration itself, and we can talk about how that integration works, but the integration itself is really handling, how do I take a message off of a Kafka topic and write it into Neo4j? How do I take something that's in Neo4j and then push it back out to Kafka as a message? That's all the integration does. But what's interesting about it is what it enables, which are these wider architectural patterns of polyglot persistence and things like microservices structures.
William Lyon (09:56):
This connector then is allowing us to use Neo4j as either a consumer of a topic or a publisher to a topic.
David Allen (10:08):
That's right.
William Lyon (10:09):
And you mentioned, earlier when you were talking about Kafka in general, this idea of not having each individual service worry or think about the format or the structure of these messages. I'm wondering, what role does Cypher play in this context? Because we can use Cypher as a way to manipulate the structure of our data. Do we use Cypher to define the type of messages that we want to publish in this context? How does that work?
David Allen (10:37):
You can use Cypher to define the format of the messages, but honestly, the more a customer is familiar with using Kafka, the more I suggest that they don't use Cypher for that purpose. Basically, in a wider architecture, whenever you're going to use like some combination of technologies, we always use this kind of toolbox analogy. You got a hammer, saw, and pliers, and you want to use each thing in its area of strength. What is Cypher better than anything else in the world at? It's matching and dealing with graph patterns. Okay? Now, Cypher can structure and reformat JSON documents, but Cypher is not compellingly better than other technologies at doing that.
David Allen (11:16):
Now, on the other hand, when you put a message onto the wire with Kafka, Kafka has this really rich set of structures for inspecting and transforming message payloads. One of the principles with Kafka is that sometimes information producers will produce in whatever format is meaningful to them, and you don't really want the information producer to think too much about the format of the message, because that will tend to couple it to the needs of the consumer. If you recall, like the whole purpose of using the message queue, is to decouple producers and consumers. Sometimes customers will use Cypher to try to reformat or do really complicated things with message structure. But honestly, I think sometimes it's better to publish a simple message and then do the transform within Kafka, not within Cypher. Does that make sense?
William Lyon (12:05):
Yeah. Then I guess, does that mean when we're consuming messages into Neo4j and we want to, say create some structures in the graph, create some nodes and relationships, is that then where we're using Cypher, where we're using these messages as an input into a Cypher statement that we're defining of what graph structures we want to create from those messages?
David Allen (12:26):
Yeah. That is a spot where Cypher does shine, where let's say that you have like a JSON document as an input. Normally, what's coming across the wire is going to be a JSON message or an Avro message in Kafka. Rarely CSV, but usually JSON and Avro. If you think about what Cypher does, it's very good at taking that and then transforming it into a graph pattern, like this field should be the ID of one node, this field should be the ID of another, and then we're going to write a relationship between them. Less so on the output side, where if you're taking stuff from Neo4j, you don't necessarily want to use Cypher to do really fancy JSON formatting. But if you're ingesting JSON into Neo4j, the use of Cypher to create that graph pattern is essential.
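As a rough illustration of that pattern (this is not the connector's actual configuration, just the idea), here is what turning one incoming JSON payload into a graph pattern with Cypher might look like, using the official Neo4j Python driver. The labels, property names, and credentials are invented.

```python
# Illustrative only: turning a JSON message into a graph pattern with Cypher.
# The Kafka connector does this with a sink Cypher template; here we use the
# official Neo4j Python driver directly. Labels, keys, and credentials are made up.
import json
from neo4j import GraphDatabase

message = json.loads(
    '{"personId": "p1", "name": "Alice", "companyId": "c9", "company": "Acme"}'
)

# One field becomes the ID of one node, another field the ID of the other,
# and we write a relationship between them.
cypher = """
MERGE (p:Person {id: $personId})
  SET p.name = $name
MERGE (c:Company {id: $companyId})
  SET c.name = $company
MERGE (p)-[:WORKS_FOR]->(c)
"""

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(cypher, **message)
driver.close()
```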
William Lyon (13:11):
Gotcha. Makes sense. Leverage Cypher for things that it is really good at, which is expressing graph patterns and working with graphs. Makes a lot of sense there.
David Allen (13:19):
Yeah, that's right, and that pattern is going to repeat in all of the connectors. It's like, with Cypher, you have basically something that you could use as a programming language if you want. The purpose isn't to say that Cypher is good or bad or to knock Cypher. It's just that, whenever we're using a whole bunch of technologies together, you got your hammer, your saw and your pliers, and you want to use each in its area of strength. You could do everything with the hammer, but you're not going to be happy if you do.
William Lyon (13:43):
Right. One thing I've heard in the Kafka ecosystem is this idea of the stream table duality, which is essentially, I think saying that, if you have a stream of messages, you can convert that into a database table, and if you have any database table, you can convert that into a stream of messages. That makes sense to me. I've heard you speak about introducing graphs into the stream table duality that sort of makes this a trinity with graphs and tables and streams. Could you maybe speak to that a little bit?
David Allen (14:17):
Okay. First, on the stream, table duality, when I first learned about this myself, I thought that was a really compelling concept, where Kafka is saying, look, you can look at your data either way depending on what's most convenient for you at a given point in time. You can either treat it as a stream of updates or you can treat it as a table of your state at any given time. When we did the Kafka integration, what we found was, okay, the Kafka integration lets you take any stream and write some Cypher and thereby manipulate that structure into a graph form. Now, if a stream is a table and a table is a stream, and the stream is a graph, then basically, we can freely move back and forth between any of these representations that we want, sort of like Fahrenheit versus Celsius.
David Allen (15:03):
They're different ways of looking at the same data. Now you might say, why would you want to do that? Why would you want to move between those things? Different formalisms give us different power and different insights. Imagine that you have a lot of financial records. If you want to find all of the transactions that were greater than $100, a relational database is a good choice for that because you apply a single column filter and then you're done. If you want to know which transactions were fraudulent, you might need to know how they fit into the wider set of relationships, and there, a graph might help you more.
David Allen (15:38):
When you think about stream, table, graph trinity, so to speak, basically, that's just flexibility for an architect to say, the same data that's flowing through my system, I can freely adapt whichever viewpoint helps me the most for my use case. It's not a statement that streams are always best or the graphs are always best. It's another tool in the toolbox, and that kind of leads us into this idea of building solutions, where you can adopt both perspectives at once, like the real-time fraud analytics setups that we've done with Kafka.
William Lyon (16:15):
Fundamentally, then what this is allowing you to do is to leverage the best database, the best tool for whatever specific use case that you're solving with that piece of your application, right?
David Allen (16:29):
That's right. A very common pattern I see with customers is that they have a Postgres database that stores all their customer records, and they replicate a lot of that over to Elasticsearch because, while Postgres can search text, it's not as good as Elasticsearch. And so they drive their website search feature with Elasticsearch, and Postgres is the single source of truth, right? When you have a pattern like that, there's no one database to rule them all. You use them each in their areas of strength, and Neo4j has really well-defined and compelling areas of strength.
David Allen (17:04):
When you get into Kafka and the stream, table, graph trinity, it just makes it easy to use it in its area of strength. That goes back to the point of be a good citizen of the wider technical ecosystem and make it easy to mix and match graphs into any architecture.
William Lyon (17:20):
If I'm interested in using this connector, if I'm already using Kafka and I want to introduce Neo4j and connect it into Kafka in my system, how do I get this connector? How do I install it? How does it work with Neo4j?
David Allen (17:34):
Sure. The code's downloadable on the Neo4j downloads page. It comes in two forms. You can run it as a connect worker within the Kafka framework, or you can run it as a Neo4j database plugin. At present, if you run it as a connect worker, you can sink records to Neo4j, you can't produce records out of Neo4j back to Kafka. If you use the database plugin, you can do it either way. Basically, on that downloads page, you can download the technical artifacts. You can get a link to the full product documentation. When you configure it as a plugin, all it really takes is including a jar in the right place, adding some fields to neo4j.conf, and then you're done.
David Allen (18:13):
When you use it as a Kafka connect worker, folks may be familiar with how those work, but there's a configuration interface and it's the same as any other connect worker.
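As a hedged sketch of what the Kafka Connect sink configuration might look like, here it is as a Python dict. The property names are recalled from the connector documentation and should be treated as assumptions to verify against the docs for your version; the topic name and Cypher template are invented.

```python
# Hypothetical sink configuration for the Neo4j Kafka Connect worker, expressed
# as a Python dict (it would normally be posted to the Connect REST API as JSON).
# Property names are recalled from the connector docs and may differ by version;
# verify against the current documentation. Topic and Cypher template are made up.
sink_config = {
    "name": "neo4j-transactions-sink",
    "config": {
        "connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
        "topics": "transactions",
        "neo4j.server.uri": "bolt://localhost:7687",
        "neo4j.authentication.basic.username": "neo4j",
        "neo4j.authentication.basic.password": "password",
        # A Cypher template applied to each message consumed from the topic.
        "neo4j.topic.cypher.transactions": (
            "MERGE (a:Account {id: event.accountId}) "
            "CREATE (a)-[:MADE]->(:Transaction {id: event.id, amount: event.amount})"
        ),
    },
}
```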
William Lyon (18:21):
Gotcha, and we'll be sure to link all of these documents and download pages that we're talking about in the show notes. That's the Neo4j connector for Apache Kafka and some of the things that it enables. Another connector that we want to talk about today is the Neo4j connector for business intelligence, or BI, which I imagine is something that enables me to maybe hook up Neo4j to what? Something like Tableau and look at some charts, that kind of thing? Is that what this is?
David Allen (18:53):
Yeah, that's right. That's how it started. Customers wanted to basically treat Neo4j as a store that could drive business intelligence tools, and Tableau was and is a big one. In Tableau, people are basically doing summary statistics and visual dashboards. A common thing that medium and big size companies do is that they have lots of databases owing to different business groupings. And they create these visual dashboards that go up to decision-makers, and so a decision-maker might want to know, how much money did we lose to fraudulent transactions last quarter?
David Allen (19:31):
That's not a single database query, but it might be a bunch, and you might put it into a pie chart. So, Tableau handles all of the data source querying and the graphics and the specification of all of that. The way we started here was basically trying to make it possible for Neo4j to contribute data to those platforms. But it was tricky because those platforms have no concept of graphs whatsoever. They're very table-oriented tools, and so they want to look at databases as things that you connect to over JDBC and things that you issue SQL queries against. The way that we did that is that we basically created a connector that exposes Neo4j as a JDBC connection and makes it look like a relational database.
David Allen (20:19):
If you connect using the BI connector, you will be told that Neo4j is full of tables. It's not actually, it's a graph database, but this is kind of like a virtualization layer, if you will, that says, here are the tables that you can query with SQL that exist in Neo4j. Because it works in that way, it fits nicely with the BI tooling and lets them do their thing without changing their assumptions about how data works.
William Lyon (20:42):
Is this leveraging this concept of, is it called labels for table?
David Allen (20:47):
Tables for labels. Actually, yeah, but so the idea is that every label in your Neo4j graph turns into a table. If you have a Person label in your graph and you have the properties name, date of birth, and salary, then basically, in the table catalog, you would see a table called person and it would have those three attributes. Relationships in turn also turn into tables. Let's say that you had a relationship, person has job, that connects a person and a job, then you would see another table in there that would look like a many-to-many join table.
David Allen (21:22):
It would be called person has job, and it would have two columns that are the ID of the source node and the ID of the target node, suitable for using with SQL joins if you wanted. In theory, you could navigate the graph structure by doing joins. In simple cases, people actually do that with the BI connector, and they join a bunch of tables together, and they build up more complex reports and so forth. There's another option though, which is you can use what's called a Cypher-backed view. What that is, you get to define your own Cypher query and whatever that Cypher query returns gets exposed as a table.
David Allen (21:59):
I could, for example, write a Cypher query that does an arbitrary-length path matching expression, and that returns, say five fields, and then have that exposed to the SQL layer as a table with five columns.
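To illustrate that idea, here is a hedged sketch of a Cypher query whose result is naturally tabular, which is the shape a Cypher-backed view exposes to the SQL layer. It is run here with the Neo4j Python driver just to show the rows; the labels, relationship types, and properties are invented.

```python
# Illustrative only: a Cypher query with a variable-length path whose result
# reads like a five-column table, which is the idea behind a Cypher-backed view.
# Labels, relationship types, properties, and credentials are invented.
from neo4j import GraphDatabase

cypher = """
MATCH (p:Person)-[:REPORTS_TO*1..5]->(boss:Person)
RETURN p.name       AS employee,
       p.department AS department,
       p.salary     AS salary,
       boss.name    AS manager,
       boss.level   AS manager_level
"""

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(cypher):
        # Each record is one row; a BI tool would see these as a table with
        # columns employee, department, salary, manager, manager_level.
        print(record.data())
driver.close()
```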
William Lyon (22:12):
Gotcha. It sounds like, out of the box, I am able to infer this relational model from the graph database model so that I can basically point Tableau or similar tools at this and integrate into that BI tooling. So, I can see tables that I can work with, and because these tools like Tableau are generating SQL on the backend, it works with Neo4j just like it would any other sort of data source.
David Allen (22:39):
Yeah. That's right. Back to the double-edged sword of graphs: graphs are great, but graphs can seem a little strange to other technologies that are out there. Sometimes I think of SQL as the English of the data world. That's not a statement that English is the best language, but it is the most common. It's the same with SQL. It's not necessarily the best data language, but it is the most common. If you want to be a good citizen of the wider data ecosystem, it might be a good idea to learn how to speak the most common language, and that's really what the BI connector does, is it teaches Neo4j how to speak SQL.
William Lyon (23:13):
Cool. Then, in the case where I want to customize this a bit, take advantage of maybe some complex graph projections or something like that, you mentioned being able to use Cypher to define these sorts of Cypher views, so that I can then leverage that power in the BI connector as well.
David Allen (23:31):
Yeah. That's right. One of the objections that sometimes comes up about the connector for BI is, well, hasn't Neo4j been saying for years that SQL isn't good for graphs? Yes. Yes, we definitely have been saying that. SQL is not good if what you're trying to do is like 13 way joins between all of these crazy tables. That's where the Cypher-backed views come in. If you can write a Cypher statement and you can return a list of columns, then you can turn that into a table, and that lets you basically take advantage of all of the graph parts of Neo4j and yet still have the result appear as a simple table. You can basically shortcut all of those joins if you need.
William Lyon (24:10):
How does this work in terms of installing this and configuring, is this another database plugin that I download and install in Neo4j?
David Allen (24:18):
This isn't a database plugin. This is just a single jar file at the end of the day. It is a JDBC driver. So, if developers are familiar with how you would install the JDBC driver for any other database, it's a similar process. In the case of Tableau, if you're using like Tableau Desktop, you download the connector, you copy a jar to a particular folder, and that's it. That's the install process. Depending on what other JDBC-compatible tool you're using, there might be a slightly different install process, but in general, they just want you to put the jar on the class path, copy it to a particular directory, and that's really all you have to do. There's no database plugin involved.
William Lyon (24:57):
Gotcha. So, it's not something I install in Neo4j, it's just a driver that I install into the BI tooling that I'm using.
David Allen (25:04):
Yes, or into the client code that wants to run the SQL query.
William Lyon (25:08):
Cool. Let's move on to talking about another connector, and that is the Neo4j connector for Apache Spark. Maybe we should talk a little bit about Spark first, I guess, before we talk about what the connector enables. Fundamentally, Spark is, how would you describe it? A distributed data processing platform.
David Allen (25:27):
Yeah, I think that's a pretty good description. Another way of thinking about it is, it's a cluster of machines. And if you wanted to process a really big dataset, what you might do is chop it up into pieces, give each of the machines in your cluster a piece of the dataset, and then have them all do the processing in parallel, and then gather up the results at the end. A number of years ago, there was this Hadoop idea of MapReduce programming. Honestly, one of the ways that I think about Spark is MapReduce programming grown up. So, it allows me to partition, and in parallel, execute some complicated computation on a big dataset, and then combine the results at the end.
David Allen (26:10):
People use this for large-scale data engineering and large-scale data analytics a lot. You have platforms out there like Databricks, which is really cool, where people will do analytics on just massive volumes of data, aided by the parallelism that Spark has built into the core.
William Lyon (26:27):
That's Spark in general. Then, in the context of the Neo4j connector for Apache Spark, I'm assuming that lets us then move data between Spark and Neo4j, either to or from Neo4j? What's going on there?
David Allen (26:42):
Yeah. It's bi-directional. Because Spark deals with so many different data sources, a common data engineering thing you need to do in Spark is maybe pull these records from Oracle and those records from Postgres and then join them, join data from two different databases. Spark has something called the DataSource API that basically lets you treat any data source in a consistent way. The Neo4j connector for Apache Spark basically implements that DataSource API so that you can run a Cypher query and pull all of the results into a Spark data frame, or you can take a data frame and you can write it into Neo4j with, for example, a Cypher query.
David Allen (27:22):
In exactly the same way as the Kafka connector lets us turn a stream into a graph: in Spark, usually everything is a data frame, and this connector will let you turn any data frame into a graph, and it will let you turn any graph into a data frame. Data frames are this abstraction within Spark that's very easily parallelizable and computable across the cluster. That in turn means that it is very useful for doing data engineering and analytics tasks. Examples would be things like customers pull data from three different databases, mash it up, and load it into a graph using Spark, or they might pull back millions and millions of nodes out of Neo4j and then do some analytic computation in parallel on Spark on that.
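Here is a hedged sketch of what reading from and writing to Neo4j through the Spark connector looks like in PySpark. The format string and option names follow the connector documentation as I recall it and should be verified against the version you install; the connection details, query, and label are invented.

```python
# Illustrative sketch of the Neo4j Connector for Apache Spark (DataSource API).
# Option names are as recalled from the connector docs; verify for your version.
# Assumes the connector jar is already on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-spark-example").getOrCreate()

# Read: run a Cypher query and pull the results into a Spark DataFrame.
people_df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "neo4j://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("query", "MATCH (p:Person) RETURN p.id AS id, p.name AS name")
    .load()
)

# Write: turn a DataFrame back into nodes in the graph.
(
    people_df.write.format("org.neo4j.spark.DataSource")
    .option("url", "neo4j://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", ":Person")
    .mode("Append")
    .save()
)
```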
William Lyon (28:06):
Gotcha. That makes a lot of sense. What's the best resource for getting started if we're interested in using Neo4j and Spark together?
David Allen (28:14):
All three of these connectors, they're all on the Neo4j downloads page. This particular one too, it has a GitHub repo. If you want to check it out, it's github.com/neo4j-contrib/neo4j-spark-connector, I believe. And you can download releases from there, build your own, contribute code if you wish. In the end, the way you install this into Spark is that, it's again, a set of jars. The Spark ecosystem is a little bit, I don't know how to say this, persnickety, I guess, in that there are so many different versions of it out there.
David Allen (28:51):
People use different versions of Scala, different versions of Spark, different runtimes for things like Databricks. So, we publish multiple releases that are targeted at different versions of Scala. If you want to install this, the first most important thing to keep in mind is make sure you know which version of Spark and Scala you're running, and then make sure that you download only one jar that's the right one for your distribution. When you get the jar, it's just a matter of including it on the class path for Spark and it's installed.
William Lyon (29:20):
Again, we'll link all of these resources in the show notes. The last connector we want to talk about today is maybe not really a connector in the same sense as these others, but I think it can facilitate the sort of integrations that are worth talking about here. That is the Neo4j GraphQL library, which allows you to build GraphQL APIs on top of Neo4j. This I think is really interesting, largely for, say, building web and mobile applications. I want to build some API and I want to use GraphQL to do that because maybe my front end developers are really seeing the benefits of GraphQL and want to leverage that.
William Lyon (30:07):
This Neo4j GraphQL library allows you to build GraphQL APIs on top of Neo4j really without writing a whole lot of code. The idea here is you use GraphQL type definitions, which define sort of the shape of the data in GraphQL terms, and then the Neo4j GraphQL library can take those type definitions and generate a full GraphQL schema with all of our CRUD operations for creating and updating data in the graph, but then also automatically translate any arbitrary GraphQL request to this endpoint into Cypher, and handle that database query.
William Lyon (30:45):
That means you don't need to actually implement what, in GraphQL terms, are called resolver functions, which oftentimes are a lot of boilerplate that just specifies how to go out to the data layer and fetch that data. That's kind of how folks think of GraphQL in terms of building applications, but I think we can also think of GraphQL as a connector here that sort of exposes data from Neo4j in a standard format, which is GraphQL. Do you think that's a fair comparison, or how do you see GraphQL fitting into this ecosystem?
David Allen (31:19):
It's definitely a fair comparison. What this is fundamentally doing is exposing a different API on top of Neo4j that works with a lot of other downstream tools and frameworks. From that perspective, it fits in perfectly with the idea of a set of integrations. I've been really impressed lately by seeing some of the stuff that's going on with Retool. Before I make another point about that, can you say anything further about Retool, Will?
William Lyon (31:45):
Yeah, so Retool is one of the low-code, no-code UI building platforms. It allows us to build web applications with kind of this drag-and-drop functionality. Jennifer, on the DevRel team, recently built a Twitter application using the Neo4j GraphQL library that basically allows you to see some analytics and dashboard functionality based on your Twitter network. You just point it at a GraphQL API that gets generated. In this case, it's coming from one of the Neo4j sandbox examples, which we'll link. But I think, in general, yeah, Retool is this no-code application building framework, and you point it at a data source.
William Lyon (32:29):
So, you point it at a database, or it has a GraphQL integration. So, in this case, you're pointing it at a GraphQL API, and then you can build an application to work with that data without having to write any client code or worry about how to work with that.
David Allen (32:42):
I, I saw that demo Jennifer did. What I really enjoyed about that concept is that, okay, so Retool, here, we've got a company that has built a generic way of getting data into their tool from any GraphQL API. What's really cool about that from the Neo4j perspective, as we were talking about the BI connector a little while ago, and about how we needed to be able to speak SQL. But GraphQL is a little bit more natively graphy in the sense that the idea of a graph or a path based query is not foreign in GraphQL. It's really positive to see other tools picking up graph oriented ways of doing data integration in the generic. Now, the other thing that I think is really architecturally interesting about the GraphQL as a point of integration is that a lot of us have been doing REST APIs for years.
David Allen (33:34):
All along that time, there were these purity arguments about, what's pure REST, or did you follow the semantics? Did you use POST versus PUT versus PATCH correctly and so on? And the APIs were never described. So, you never really knew how to use a REST API, even though you knew the general set of conventions that it followed. There was a lot of discomfort about that. Then along came things like Swagger and OpenAPI Spec to solve that problem, like to literally describe whatever on earth this REST API is that we just built. GraphQL has a better description layer, in that it tends to have type defs and queries that come with it.
David Allen (34:13):
It's almost as if the thing that OpenAPI or Swagger was doing for REST API was built in from the beginning in the GraphQL layer, and that makes it particularly interesting to me. I mean, Will, do you think that's a fair comparison?
William Lyon (34:28):
Oh yeah, for sure. I think there's an extreme amount of value in having this idea of type definitions that explicitly define, and this is typically the starting point, like there's this concept of GraphQL-first development, that you can start with these type definitions that define the data in the API, the entry points, what's available, what kind of operations you can do, and that becomes the specification and the documentation for the API. You can start to build your application even without having any data through some of the mocking functionality, right?
David Allen (35:03):
Yes. So, it's exactly that aspect of GraphQL that could make it useful as a generic integration interface, where a REST API could never do that. Because if somebody gave you a REST API, you would never really know quite how it worked unless you got the OpenAPI Spec, which sometimes you would have, and sometimes you wouldn't. Then, even when you did have it, a lot of times, it wouldn't give you enough detail about what kind of payload was going to come back. I feel like what Retool was trying to do with GraphQL wasn't really pragmatically possible with REST, but is with GraphQL. That in turn makes it interesting for the future as like a generic integration point, and in turn, why the GraphQL library does fit in with this idea of bringing graphs to the wider world and integrating graphs with other technologies.
William Lyon (35:59):
That's a really interesting point. I think there's two sides of this as well. What you're talking about is, once you have a GraphQL API to integrate with, you run what's called the introspection query. This is a powerful feature in GraphQL, this idea of introspection, where I can ask the API, hey, what types do you have? What fields do you have on those types? How are those types connected? The tooling in the GraphQL ecosystem basically allows you to generate documentation around the results of this introspection. This also means you have all kinds of type safety in the tooling that you use to work with a GraphQL API.
William Lyon (36:36):
That's from the consuming point, and what we're talking about here in the context of connectors and integrations really brings a lot of benefits there, that you know exactly what you can do, you know exactly what data's coming back.
David Allen (36:47):
Let me connect this to, for example, extraction, transformation, and load, or ETL tooling. They typically would operate over something like JDBC. The very first thing that they do is they go out and they suck the metadata out of the foreign database and they say, well, you've got these tables, you've got these columns, you've got these data types. Then you might have a designer-type GUI that allows you to connect like, I want to pull these fields and then push them over there. The only reason that entire pattern of integration is possible is because it is metadata driven. The central point here is that GraphQL is metadata driven because of this introspection set of capabilities that you're talking about, where the things that it is replacing were not.
David Allen (37:30):
Now, that's already pretty cool in my view; the fact that it is more heavily graph-oriented than REST was is almost icing on the cake.
William Lyon (37:38):
Yeah. That graph-oriented aspect of GraphQL, I think, is really important. What the Neo4j GraphQL library allows you to do is take your GraphQL type definitions that are defining the schema of the API and use that to drive the schema for the database as well. Neo4j itself doesn't really have this concept of a strict schema, but when you're using the Neo4j GraphQL library, you have the ability to impose that schema through GraphQL, and you have this sort of one-to-one mapping of the schema for your graph database and the property graph model and the schema for the GraphQL API that you're describing.
David Allen (38:14):
Yep. So, to summarize some of this, the integration that Jennifer did with Retool, I'm hopeful that we're going to see a lot more of that sort of thing, and I think that it's a great example of a positive pattern of how you can string together lots of different tools to do things even when they don't necessarily pre-agree on data structures.
William Lyon (38:34):
There's also a theme here that I noticed across these connectors, which is, if you don't know anything about Cypher, don't want to use Cypher, you get functionality out of the box. If I'm using the BI connector, I have my graph exposed through tables and a relational model. If I'm using Kafka, I don't have to use Cypher to express how I want those messages to be produced or how I want to handle them. And it's similar with GraphQL, where I can just define these GraphQL type definitions and I get this auto-generated GraphQL API with all of these CRUD operations out of the box.
William Lyon (39:08):
But if I have some custom logic that I want to express, it's important to understand GraphQL is not a graph database query language. It's very much an API query language. There are things I can't express in GraphQL, like variable length traversals, complex graph patterns, projections, these kinds of things that I can do in Cypher, and take the power of Cypher and use that in our GraphQL API. There's this really neat concept called the Cypher schema directive, which allows us to basically attach Cypher queries into the GraphQL schema, and that then allows us to expose the power of Cypher in GraphQL with our custom logic.
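As a rough sketch of what those type definitions and a Cypher schema directive can look like, here they are held in a Python string for readability. In practice they would be handed to the Neo4j GraphQL library (a JavaScript library); the types, fields, and Cypher are invented, and directive syntax can vary by library version.

```python
# Illustrative only: GraphQL type definitions with a @cypher schema directive,
# shown as a Python string. In a real project these are passed to the Neo4j
# GraphQL library (JavaScript), which generates the CRUD schema and translates
# requests into Cypher. Types, fields, and directive details are assumptions.
type_defs = """
type Person {
    name: String!
    follows: [Person!]! @relationship(type: "FOLLOWS", direction: OUT)

    # Custom logic that plain GraphQL cannot express, such as a
    # variable-length traversal, attached via a Cypher query.
    extendedNetworkSize: Int
        @cypher(
            statement: "MATCH (this)-[:FOLLOWS*1..3]->(other:Person) RETURN count(DISTINCT other)"
        )
}
"""
```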
William Lyon (39:48):
That's a theme I'm seeing here, right? Is that all of these connectors give you this out-of-the-box functionality, and that's fine. You don't have to know Cypher. You don't really have to know anything about graphs, but if you really want to leverage graphs and Cypher, they all expose the ability for you to do that.
David Allen (40:02):
That's absolutely right. In the same way as you can have a custom Cypher directive in the GraphQL library, this is almost an exact equivalent of what we were talking about with Cypher-backed views earlier in the BI connector, where it's like, okay, imagine you've got your left-hand formalism and your right-hand formalism. The right-hand formalism when integrating with Neo4j is always Neo4j and Cypher. The left might be JDBC and SQL, or it might be GraphQL. There's always going to be this impedance mismatch, where the left hand and the right hand are different and they have different capabilities.
David Allen (40:37):
What you start off by doing is making it easy, using the left-hand formalism if you will, and then you provide some escape hatches where people can use the underlying power, where you have some concept or idea that isn't translatable in the left-hand formalism. Not to get too philosophical, but this goes all the way back to like human language. English has words for all of these different things, and some words we just straight out steal from other languages, like souffle or kindergarten, right? When a concept is not expressible in English, we just borrow the way of saying it from another language. We don't necessarily even try to reinvent the wheel.
David Allen (41:14):
At the end of this path, when you think about these sorts of things, I forget who said it, maybe it was Larry Wall, the guy who invented Perl. But one of the things that always stuck with me is this idea of, good technology makes easy things easy and hard things possible. You can't ever make hard things easy, but what you're striving for is to basically make it simple and idiomatic in the technology that you're using. So, if that's SQL, you can write SQL queries. If that's GraphQL, you can write GraphQL queries. And when you reach the end of your rope and there's something that you just can't do with that formalism or that language, there's a way to go deeper and to do something harder.
David Allen (41:53):
That general design pattern is something that I see recur across all of the technology that I use, that I really enjoy using. It doesn't even have anything to do with graphs or connectors. It's just a general pattern in the world.
William Lyon (42:08):
In this context though, of graphs, when you want to go further and add that custom logic with Cypher, going back to the conversation of, do we use SQL for graph things? No, we should probably use Cypher because that's what it's built for. So, Cypher, in that case, is how we can go further and add our sort of custom functionality in the context of graphs with these connectors.
David Allen (42:31):
Cypher is the best thing in the world for doing certain things, not everything, but certain things. For sure, graph querying and graph pattern extraction, and so on. And it's all just back to that toolbox metaphor, is you've got lots of different tools and use each in their area of strength, and don't spend any time trying to saw boards in half with a hammer.
William Lyon (42:52):
We've talked about these connectors now and sort of how to use them, and in some cases, install them individually. But I guess, thinking at a higher level of abstraction, if we want to leverage these connectors with Neo4j, does this impact how we think about maybe how we want to deploy Neo4j, how we want to architect our applications, and maybe does this all work in the cloud as well? How do you typically think about these things?
David Allen (43:19):
A lot of this stuff already works in the cloud. The few exceptions where it doesn't are things that we are actively working on right now, because basically, as an architect, I think, whether it's now or later, whether your company is ready for it or not, a lot of things are going to be going in the direction of managed services in the cloud. The reason why it's going to go in the direction of managed services is pretty straightforward in my view: nobody really enjoys running databases or running queuing systems. It's complicated and costly.
David Allen (43:48):
Basically, when these tech companies find a way of giving you all the parts that you like about the technology without the operational and maintenance burden, people are generally going to adopt that over time, because what else can you say? It's like all of the juice without the squeeze, right? When I'm looking forward to the way that people are going to be building systems in the future, it starts to look more and more like strung-together Lego blocks, where each Lego block is an API or a managed service that is offered by some cloud provider. Now, you might say, isn't that going to facilitate horrible vendor lock-in? And yes, this future design pattern is not without its risks and drawbacks.
David Allen (44:29):
No such thing as a free lunch. I see, working with some of these cloud vendors, that they're extending their abstractions into the on-prem world. How crazy is this? You can run Google Kubernetes Engine on your servers inside of your company. You can do the same with Amazon Lambda. The use of a cloud-managed service abstraction doesn't have to mean that you're even running on a public cloud. Basically, the way that I think about this in terms of graph is that Neo4j Aura is going to be the managed service for graph in the future, and that these connectors are a very important piece because it's like the waffle pattern on the bottom of the Lego.
David Allen (45:07):
It's what's going to make it easy and convenient to snap Aura together with Elasticsearch, for searching large volumes of text, with BigQuery, which is where you're going to do all of your data warehousing analytics, with Spark, which is how you're going to get some of your graph data over to your machine learning pipeline, and so on and so forth.
William Lyon (45:25):
Yeah. So, going back to that Lego analogy, what we're talking about with these connectors is really just thinking about how to make all the Legos fit together.
David Allen (45:33):
Yep. That's it. Can you imagine how annoying it would be if half of your Legos, if that waffle pattern on the bottom of it, was just off by a couple of millimeters? It wouldn't seem like a big thing, but you wouldn't be able to build anything. By the way, this is not a hypothetical example. My son bought some knockoff Legos and this actually happened to me.
William Lyon (45:56):
I think also, to extend your Lego analogy a little bit, maybe the surface of where we're building with our Legos now more and more is in the cloud rather than on-prem, and so that's especially what we need to think about, right?
David Allen (46:10):
Yeah, and so there's also this kind of wider cultural trend that sometimes gets called by the very loose generic term of microservices, but there is also this general development trend of developing in smaller, more loosely coupled individual teams. If your company is building an application that has 20 parts and that decomposes into 20 different teams, and they're all using different programming languages, and they're all using different frameworks and APIs, and even databases in some cases, you can see that the days of one database to rule them all are over, and that it's extremely important that these Lego blocks fit together.
William Lyon (46:54):
Absolutely makes sense. We're a bit out of time here, but one question I do want to ask at the end here is largely, what's next? What do you see out there emerging in the graph, or maybe just even in the cloud or technology space in general that you're really excited about right now?
David Allen (47:10):
Okay. This might be a tad forward-leaning for some folks, but there's been a lot of discussion about machine learning approaches and so forth, and Neo4j, together with the Graph Data Science Library, is building this story around graph-assisted machine learning. In particular, the thing that I'm finding the most interesting right now is something we call graph-based feature engineering, where you use the algorithms in GDS to compute, for example, node embeddings, or a centrality score, or a community identifier for a group of nodes, and the basic idea is that you can take existing ML approaches, whether it's scikit-learn or anything else, or TensorFlow, PyTorch, any of that, and you can make them better.
David Allen (47:55):
Because unsurprisingly, if you feed the machine more signal, it's going to perform better. Taking all of these graph techniques and basically pouring them into and augmenting standard machine learning techniques is what I'm looking at the most lately.
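A hedged sketch of that workflow: compute a graph feature with a GDS algorithm via Cypher, then feed it to an ordinary scikit-learn model. The projected graph name, labels, properties, and the exact GDS procedure signature are assumptions to check against your GDS version.

```python
# Illustrative sketch of graph-based feature engineering.
# Assumes an in-memory GDS graph projection called 'payments' already exists;
# graph name, properties, and procedure details may differ by GDS version.
import pandas as pd
from neo4j import GraphDatabase
from sklearn.ensemble import RandomForestClassifier

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run("""
        CALL gds.pageRank.stream('payments')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).id      AS accountId,
               score                           AS pagerank,
               gds.util.asNode(nodeId).flagged AS is_fraud
    """)
    df = pd.DataFrame([record.data() for record in result])
driver.close()

# The graph feature (a centrality score here) becomes extra signal
# for a standard machine learning model.
model = RandomForestClassifier()
model.fit(df[["pagerank"]], df["is_fraud"])
```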
William Lyon (48:10):
Gotcha. Graph-based feature engineering. So, we're not running these machine learning models in the database. Instead, we're using the graph to project out features that then feed into other machine learning systems, right?
David Allen (48:25):
It can actually work both ways. Some kinds of models you can run inside of the graph, but it doesn't need to be that way. Over the last week, I've been working on some Databricks code, where basically, yeah, it's a payment data set. I'm engineering some graph features, and the actual model is being run within Databricks. So, there's flexibility to do it either way. Again, when we think about systems from an architectural perspective, it's good not to be too dogmatic and to basically think about it as this box of tools, and how do we best combine the tools to accomplish the goal?
David Allen (48:58):
We don't want to begin with the notion of running the model here or there, and then figure out how to justify that. I think the bottom line here is that graph has a lot to offer machine learning workflows because it's simply encoding interesting novel facts about your source data into the dataset that's being fed to whatever machine learning approach you're using. You can treat those things as separable concerns. You can say, how do I engineer my features versus what is the thing doing the statistics on those features? I expect, over time, graph is going to have something to say about both sides of it, but I'm just looking at the first part right now.
William Lyon (49:34):
Right. Gotcha. Awesome. Something we'll definitely try to keep the pulse on. Maybe we'll have you back on to give us more detail on that later, but that I think is more than enough for digging into today. So, thanks so much for joining us, and yeah, we'll have to have you back on the podcast again in the future.
David Allen (49:53):
All right. Thank you, Will. Had a great time.
William Lyon (49:55):
All right. Cheers everyone.