GraphStuff.FM: The Neo4j Graph Database Developer Podcast

Finding Data For Your Graph Adventures With Neo4j

Episode Summary

How to find interesting datasets for analysis and building applications with Neo4j. We cover how to identify a good graphy dataset, sources for graphy data including APIs, data portals, finding "unofficial APIs", and tools for modeling and importing data into Neo4j.

Episode Notes

Episode Transcription

William Lyon (00:00):

Welcome to GraphStuff.FM, the place to find your path to graph. My name is Will Lyon and we're joined by our co-host Lju Lazarevic today. In this episode, we are taking a look at finding data for your next graph adventure with Neo4j.

 

Lju Lazarevic (00:19):

A common theme we notice when we start talking to people about graph databases and how they work and where you use them is that you get a lot of enthusiasm from people. You get "This is great, graphs sound amazing. I can totally see their value. They're fantastic. They're great. I'll have a go at one as soon as I can think of a project where I can use one." And that's not unusual; it's quite a change in thinking about how you can use graphs. And if you are not currently looking at a graph-shaped problem that you're trying to solve, you may not necessarily see that direct leap to how you can go and have a play and just start immersing yourself in this idea of using a graph database.

 

Lju Lazarevic (01:07):

And it can be quite a big blocker. What we're very keen to do in this episode is to show you the ways and means of being able to discover a suitable data set and just have a play, just get started with using graph databases. First of all, when we think about finding a graph data set, it's probably a good idea to have a think about what does a graph data set look like?

 

William Lyon (01:37):

We've talked about this in previous episodes a little bit, but I think it is a common pattern to ask, "Okay, does this data set make sense for working with as a graph?" There are some common characteristics that data sets that work well in a graph will have. These are things like having lots of discrete entities: we have people and books and articles and topics that are connected to them, these sorts of things. If you see a data set like that, it's probably a good graph data set. And also, graphs are all about relationships; relationships are a first-class citizen in the property graph data model. There should be some interesting questions that you can ask of the data that take those relationships into account.

 

William Lyon (02:30):

And oftentimes, if you are starting to draw a graph in your head, if you're looking at a data set and you're thinking, "Oh, well, this person is connected to this review that they wrote of this business," and you're sort of mentally drawing this visual, well then chances are that's probably a pretty good graph data set. We talked about what kind of data makes sense for working with as a graph. What kind of data is not a good fit for graph? Well, if we don't have a lot of discrete entities. If maybe all we have are people and their names, that's going to be a very boring data set, but you get the idea: there's not a lot of discrete entities to work with, there's not a lot of ways to model relationships between them. That's perhaps not interesting. Perhaps if there's a lot of aggregations, sums, totals, aggregate data without a lot of other context, that's maybe not super useful. And then there are things that are just discrete values; a lot of times you'll have something like a historical weather forecast for a single location over time.

 

William Lyon (03:37):

Things like that might be a better fit for working with as a time series, or if I have lots of sensor data observations. Now, it's not to say that this sort of IoT sensor data and time series data cannot be useful for exploring in a graph, it's just that often the context, or maybe the structure around these sensors, is more useful than the actual observations. To give you an example, I worked with a project a while ago that was looking at oil and gas pipeline sensor data. And we had things like measuring the pressure of the pipe, looking at the temperature at different locations. And it turned out that actually what was more interesting to look at was the overall structure of this oil pipeline and how the sensors were arranged, which ones were next to each other, and downstream and upstream in the pipeline, this sort of thing. Anyway, that's just a look at things to think about when you are starting to ask yourself, is this a good data set to work with as a graph?

 

Lju Lazarevic (04:42):

We've talked about what a good data set for a graph might look like. We've talked about what a not-so-good data set for a graph might look like. Let's start thinking about where you can find this data set. If you've got an idea in your mind, something you want to explore, maybe you are wanting to look at some book and author data, maybe you want to look at some movie data, you've got an idea in your head of what you're looking for, then try searching for the dataset. You would be absolutely surprised what you can find sitting in public repositories, so just start that journey by having a search. Search for that topic and add keywords such as data or CSV or JSON, that kind of thing, and see what pops up. It could well be that there is a public data set there already, and you've also got dataset search areas.

 

Lju Lazarevic (05:39):

For example, Google has a dataset search area. That's datasetsearch.research.google.com. Have a look in there; something might pop up as well. Another thing to check out is to see if there is a public API available. Very popular applications and services such as Twitter will have an API available. Some great examples to check out would include the Open Movie Database, and meetup.com, where you can stream JSON. You've got the New York Times and The Washington Post, and they will allow you to pull information about top articles and book reviews. You also have the Twitter API, which we've mentioned. So there's lots of good fun projects to have around that. And also check out unofficial APIs.

 

William Lyon (06:27):

Unofficial API is a term that I like to use to describe web applications that make the data they use available publicly. For example, around election time in the US, sites like the New York Times and Politico put up these live election result dashboards. And if you check the network tab in your web browser, you'll find that a lot of these web apps are actually loading data into the client and then processing that for the dashboard on the client, and you'll see a JSON endpoint or something like that where the data is actually coming from, that you can then use to load data directly into Neo4j. I like to do this anytime I see an interesting web application.

 

William Lyon (07:08):

I'll just check the network tab and see if it's actually processing the data client-side and making that data available publicly. One of the more interesting sites I noticed doing this recently, and that I've been playing around with the data from, is called InciWeb, which in the US at least publishes data about active forest fires and air quality. That's the concept of the unofficial API, which can be another great resource for finding data to work with.

 

Lju Lazarevic (07:40):

If you're thinking about a data set that you're excited about, were you inspired by a blog post? If so, go check out that blog post, have a look at where they pulled their data from, and that could be a useful place as well to pull some data.

 

William Lyon (07:54):

APIs are a great way of working with current, up-to-date data because oftentimes those APIs are updating in real time. So with the New York Times and The Washington Post you're looking at that day's top articles, and the Meetup API is actually streaming JSON, so you're getting that in real time. And that can be useful for building web applications and doing some analytics in the graph, but it can be a challenge sometimes to work with changing data. There are lots of static data sets available as well, where we don't have to think about how to handle changing data values as they come in. Another sort of class of data sources, for lack of a better term, we can call popular data websites; that is maybe a way to think about these. These are things like Kaggle. Kaggle is the site that has things like machine learning competitions, where there's a data set released and people work to build machine learning implementations that make predictions or answer some question, but there are lots of really interesting data sets that are available from Kaggle.

 

William Lyon (09:04):

And oftentimes those are sponsored by real companies that release real-world data from those companies and businesses, which can be really interesting. Related to that are hackathon and challenge websites that may release some dataset as part of the challenge. Devpost is a great way to find these hackathons; we just hosted a Neo4j GraphQL hackathon on Devpost. For other challenge sites, there's the Yelp dataset challenge, where each year Yelp releases a subset of their data on businesses, users, and reviews and challenges the community to come up with interesting ways to use that data set. There's MovieLens, which is a data set of movie ratings that came out of a recommendation system challenge. That's a good one as well, and that one we actually use in the recommendations Neo4j Sandbox. Another category of these data websites are data portals. These are things like government data sets that are released to the public.

 

William Lyon (10:10):

These are available at the national level, things like data.gov in the US and data.gov.uk in the UK. Campaign finance data is really interesting: who's donating to what candidates, looking at their connections to companies, this sort of thing. In the UK, there's a data set of all property ownership, as well as company ownership, which is really interesting. And then at the city and the state level, there are often data portals. Maybe search for your city's data portal to see what's available. I know that New York, Philadelphia, and San Francisco in the US have really good data portals; we'll be sure to link all of these in the show notes. Kind of related to the government data portals are data sets that come from data journalism. This is often a type of investigative journalism that's using data to drive the story, to find interesting insights in some public dataset.

 

William Lyon (11:13):

And a lot of times these data journalism groups will release the data that they use for these investigations publicly. FiveThirtyEight does a really good job of this; I think they publish almost all of their data sets to GitHub in a really nice, clean CSV format. There's a really great wealth of information there. And the other thing that's great about these data journalism public data sets is that the data journalism team has often gone to great lengths to clean these data sets and dedupe them, which, as we'll talk about a little bit later, is a really valuable step to go through.

 

Lju Lazarevic (11:53):

Another really interesting place to get some data from would be, roughly under the category of, crowdsourced data sets. So a quick example, not quite crowdsourced, but do check out museum APIs. There are now hundreds of museums from around the world that have an API available for their collections, or you have, for example, the Met Museum of Art in New York, who give you a flat file to download. Some of you who took part in the Summer of Nodes challenge last year will have, in fact, worked with that data set. Really fun data set. And then there are more specific crowdsourced data sets, for example OpenStreetMap, where you can go and find very rich information about anything from roads and motorways, to intersections and junctions, to street furniture, benches, bins, public conveniences, through to trees. It is an absolutely amazing data set where everybody from around the world is contributing information about their local area.

 

Lju Lazarevic (13:06):

And we do have an OpenStreetMap importer for Neo4j, and as well, on our Sandbox you've got an example of working with a cut of data from OpenStreetMap. It's a great way to start exploring and having a look. And I'd also like to shout out Estelle Scifo. She's part of the Neo4j community, and she's done a lot of work around pulling data in from OpenStreetMap and working with it in Neo4j, as well as her NeoMap application that plots out some of that processed information. And also, you've done a lot of work around this space along with Craig Taverner. Those are definitely names to check out if you are interested in the OpenStreetMap dataset. Other big ones: Wikidata, Wikipedia, and DBpedia. These are all great examples of crowdsourced information, where people are working together to update those data sets.

 

Lju Lazarevic (14:04):

And some of you may notice that those are available in an RDF format, so you can use neosemantics to help you work with that data and pull it into Neo4j. You've got ConceptNet, which gives you concept hierarchies, and that's very useful for NLP work. You've got Data World, so that's data.world, and that's got a huge collection of open and crowdsourced data sets. You can check out Open Library; if you are looking to have a play with data around books and authors and snippets and titles and that kind of thing, that's a great place to go, and they've got an API. And another mention as well: you would be surprised what you can find on public Git repositories. Do go have a look, have a search within GitHub to see what comes up. You may find some interesting data or information about APIs on there.

 

Lju Lazarevic (15:01):

And definitely there's a shout out there to github.com/UnitedStates and github.com/openelections. And another good thing to mention about these various sites that we've covered is that we've used the assumption that you might know what you're looking for.

 

Lju Lazarevic (15:19):

But again, the nice thing about these collections, especially around Kaggle, around the hackathon websites and so forth, is you can just browse and have a look, and maybe there'll be a data set that catches your eye. And if all else fails, you can always try generating some data. There is a nice trick, which we like to call the Google Forms trick, where you can go off, create a Google Form for free, and create whatever questions you want to ask. Maybe you ask friends and family to fill it out; maybe you do a bit of crowdsourcing yourself and ask social media for people to fill it out. And then you can open that up as a spreadsheet and download it as a CSV. And you've got data generation websites such as Mockaroo, where you can specify what kind of fields you want. Do you want some kind of string field, do you want floats or integers? You can give it some information, and then it will go away and mock up some data for you to go off and import.

 

William Lyon (16:21):

Okay. Now we've found an interesting data set that we want to work with, and now it's time to start thinking about how we work with that data as a graph, how we work with it in Neo4j. Oftentimes when we are talking about these public data sets, we end up with these flat files, like a CSV-type format, and we then sort of need to think about, "Okay, how can I model this as a graph?" So we have the graph data modeling exercise to think about. But I think before we even get to the point where we're ready to form the graph model that we want to work with, it's important to have the questions that you want to ask of your data, at least have those in mind, at least have some general area of the type of questions you want to ask, because oftentimes this is going to inform the data model that you want to create and how you're going to import that data into Neo4j.

 

William Lyon (17:19):

So knowing the questions you want to ask can inform, for example, do I want to model this as a node label, as a property, as a relationship, some other type of structure? That sort of thing.

 

William Lyon (17:30):

So the questions you're going to ask are: what are the interesting insights that I can find if I know that, let's say, a person is connected to a company in the data set? Are there aggregations that I can do? If I'm looking at maybe campaign finance data, can I find everyone who works for Google and what candidates they're donating to? Would that be interesting? Can I find similar nodes in the network? For example, if I'm looking at businesses and user reviews, can I find users who have similar interests? These sorts of things. And one thing that is really interesting to think about is what happens when I start combining these data sets. So let's say, for example, I take the UK property records dataset and the UK company ownership dataset and combine those together; well, then I can start to look at people who own companies and what properties those companies own.

 

William Lyon (18:35):

I can start to traverse across these data sets and start to uncover some really interesting connections. So I think before we get to the point where we're drawing out our graph model, think of the questions you want to ask of the data.
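To make that concrete, here's a rough sketch of the kind of Cypher query such a combined dataset could support. The labels, relationship types, and properties here are illustrative assumptions, not the actual fields of those UK datasets.

```cypher
// Which properties are owned by companies that a given person directs?
// Hypothetical model: (:Person)-[:DIRECTOR_OF]->(:Company)-[:OWNS]->(:Property)
MATCH (p:Person {name: "Jane Smith"})-[:DIRECTOR_OF]->(c:Company)-[:OWNS]->(prop:Property)
RETURN c.name AS company, collect(prop.address) AS properties
```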

 

Lju Lazarevic (18:50):

So we've had a think about the questions, and now we can start to think about what the model looks like. Typically when approaching this, have a think about what is the center of the universe of your model, of your question. So picking up on the example there, talking about UK property records and UK company information: it's quite likely, if the kinds of questions we're looking to ask are which companies own which properties, and then we are looking at who is a director of that company, for example, then quite quickly we can start to imagine that actually the company is going to be the center of our universe, because all of the relationships are going to be coming off of the company. So we're going to have the addresses of those properties coming off of the company, we are going to have the registered address of that company coming off of the company, we are going to have the director coming off that company, and so forth.
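A minimal sketch of that company-centric shape, again with hypothetical labels, properties, and relationship types rather than the actual dataset fields, might look like this:

```cypher
// Company as the "center of the universe": everything hangs off the company node.
MERGE (c:Company {companyNumber: "01234567"})
MERGE (d:Person {name: "Jane Smith"})
MERGE (prop:Property {titleNumber: "ABC123"})
MERGE (a:Address {text: "1 High Street, London"})
MERGE (d)-[:DIRECTOR_OF]->(c)
MERGE (c)-[:OWNS]->(prop)
MERGE (c)-[:REGISTERED_AT]->(a)
```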

 

Lju Lazarevic (19:56):

In that scenario, if the questions we're looking to ask are which companies own which registered properties in the United Kingdom, that's what the model would look like. Now, if actually the question we were looking to ask was that we want to deduplicate or find similar addresses, so we're taking the registered address that we find in Companies House, and we're taking the address of a company that is in the property database, and what we're trying to do is find which ones are similar, then all of a sudden it's the address that becomes the center of the universe in our model. And this is quite important, because this is then going to start to drive the thinking about what becomes a node, what becomes a property, for example. And let's keep working with this example. If we are looking to try and map properties to a director, then we're probably assuming to some extent that the data is clean.

 

Lju Lazarevic (20:59):

So we're going to be doing links to addresses. Now, if we were trying to do something around similarity, then what we would probably do is start to, what we call, reify our node. So our node may have the properties of an address line and a postcode and maybe a country, whereas in a reified model we'd have a node that's an address, but then it would have three nodes coming off it: one node would be the address line, one node would be the postcode, one node would be the country. These are the sorts of things that start to feature based on what questions we're looking to ask. So it's looking at things like: is there a certain attribute or element commonly referred to in the query? Let's pick another example here. If we were looking at, for example, people and books, let's say we want to pull in some data about what books people have read.

 

Lju Lazarevic (21:56):

So let's say they've been posting it on social media and we've done some work to clean that data up, and now we've got usernames and we've got book titles, and then we pull in another data set. We're doing a mashup here where we pull in some data from the open books data set, and we then pull in the book title and the author and the genre and that kind of thing. We're starting to pull those things in together. Again, what's the common element that we're querying on? If we're always querying on title, then that's going to be a key value there. If actually we're always querying on, say, the author, and what we are interested in is having a look at the different books that an author has done, then maybe we are going to pull the author out.

 

Lju Lazarevic (22:38):

So there are lots of different trade-offs that you're going to be making based on the queries you're looking to ask and performance-wise. It's quite a difficult one to explain on a podcast, but the best thing to do is to think about what questions you're looking to ask. Are there common themes? Do common nouns keep popping out? If nouns keep popping out, they're probably going to be your node labels. Do you keep asking quantities about things? Those are probably going to move more towards being on a node rather than as a relationship, and so forth. And once you start thinking about that, if you start having a few more questions and you're not so comfortable with modeling in the graph, I would definitely recommend checking out our GraphAcademy course on modeling if you're a bit new to it.
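As a quick illustration of that trade-off, here's a sketch of the same address modeled two ways. All labels and property names are made up for the example.

```cypher
// Option 1: address parts kept as properties on a single node
// (fine when the address is just an attribute you return or filter on).
MERGE (a1:Address {line1: "1 High Street", postcode: "AB1 2CD", country: "UK"});

// Option 2: the address "reified" into separate nodes
// (useful when you want to compare or deduplicate on the individual parts).
MERGE (a2:Address {id: "addr-1"})
MERGE (line:AddressLine {value: "1 High Street"})
MERGE (pc:Postcode {value: "AB1 2CD"})
MERGE (country:Country {name: "UK"})
MERGE (a2)-[:HAS_LINE]->(line)
MERGE (a2)-[:HAS_POSTCODE]->(pc)
MERGE (a2)-[:IN_COUNTRY]->(country);
```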

 

William Lyon (23:29):

If you're looking for some inspiration in how we can go from dataset, to graph model, to working with that data in Neo4j, there are lots of resources, and we'll link some of these in the show notes. But one thing that can be helpful is looking at pre-built Neo4j graph examples, such as Neo4j Sandbox. Neo4j Sandbox is a hosted environment where we can spin up Neo4j instances that are populated with data sets that we choose from. I think there's, I don't know, maybe 12 or 15 out there at this point. And the nice thing about this is that someone has already done the finding of the dataset, the cleansing, in some cases combining some data sets, importing into Neo4j, and included some Cypher queries, some visuals, and things like that in an interactive Browser guide that's built within Neo4j Browser, so that we can go through and explore these data sets.

 

William Lyon (24:32):

Now, all these data sets that are available on Neo4j Sandbox are also available as Neo4j dump files that are published on GitHub. So the Neo4j graph examples GitHub organization, which again we'll link in the show notes, has a GitHub repo for each data set that you can find in Sandbox, with code samples, for example in the drivers, for working with the data sets. But it also has these dump files. So what is a Neo4j dump file? A Neo4j dump file is like a compressed version of the database that is portable from one Neo4j instance to another. What's really nice about this is that in Neo4j Aura, for example, you can just drag and drop a dump file into the web console app for Aura, and that will load your dataset into Neo4j Aura.

 

William Lyon (25:27):

You can do that with all the sandbox data sets, and you can work with those locally and import them, which is really nice. Some other good resources for inspiration are things like the Neo4j live stream. Lju and Alex are doing a series on working with data sets in Aura Free, so that's a really fun one to watch develop going forward; all of those are recorded and then go to the YouTube channel. That's a good resource. And then the Neo4j Developer Blog has a lot of examples of this kind of working with data sets, modeling, that sort of thing. Okay, that's some ideas for inspiration. So now we've talked through how to look for interesting data sets, how to evaluate those data sets to see if they're a good fit for working with as a graph or not. We've talked about how to think of interesting questions that we want to ask of the data. And we've talked about how to think about identifying our graph model for working with this data in the property graph model and Neo4j. So I think that only leaves one more step, and that is importing our data into Neo4j.

 

Lju Lazarevic (26:42):

Yes. Here we are: we've got our data set and we're about to go, we've got the model, we're all set, but you may need to clean your data first. As we mentioned earlier, you've got FiveThirtyEight, and they put a lot of effort into cleaning their data. And certainly if you pull some of your data from some of the museum APIs, that typically tends to be cleaned and standardized, but you are not always as fortunate to get this. So you will most likely have to do some kind of cleaning on your data. You've got a few choices as to how you want to go about doing this. One option may be that you want to use one of the numerous tools that are available to help you clean your data. For example, you can use things such as csvkit, and that's going to help you do some level of cleaning and standardization and removing of funny characters and that kind of thing on your data. Maybe you're going to use Dedupe, with dedupe.io, to help you do that.

 

Lju Lazarevic (27:47):

So that's a Python library, and you've got a hosted service associated with that as well, and it helps you to dedupe your data set and find synthetic IDs for entities and that kind of thing. These are things you apply to your data before you pull it in. You may have an option as well with the driver; we'll touch on that one in a bit. But another thing you can do, depending on how much data you are working with and how much time you want to put in, is some data cleaning at the time of load. So either as you're importing the data into the database you can do some cleaning, or, if the data is clean enough to import, you can do some post-process cleaning on your data.
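For example, a bit of post-process cleaning with built-in Cypher functions might look like the sketch below; the label and property name are hypothetical.

```cypher
// Tidy up a name property after import: trim whitespace and normalise case.
MATCH (p:Person)
SET p.name = trim(toLower(p.name));
```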

 

Lju Lazarevic (28:35):

So you can either use some of the Cypher functions that are available out of the box with Cypher. The other thing you can do is use APOC. A quick reminder of what APOC is: APOC is a plugin you can drop into Neo4j, and versions of APOC are available for the Enterprise Edition that you'd be running either as a service or through Neo4j Desktop. You've got it on Neo4j Aura, and you've also got it if you're running Neo4j Sandbox. You'll have some variety of those, and you've got a set of over 400 functions and procedures there to help you do various things. And one well-developed area within the APOC library is the various text functions and procedures. In there you have got things such as fuzzy text matching; you can use things such as Sørensen-Dice similarity or Levenshtein distance to help you with that.

 

Lju Lazarevic (29:31):

You've got cleaning functions that will strip out characters and weird spaces and that kind of thing, and trim, and that kind of good stuff. You've got phonetic functions to help you match words that sound similar. You can use regex functions as part of splits and regex groups, and so forth. So you have a very rich set of tools there to help you do some cleaning on your data.
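A few of those APOC text helpers in action, with made-up inputs:

```cypher
RETURN
  apoc.text.clean(" 1, High   Street! ") AS cleaned,                          // lowercases and strips non-alphanumeric characters
  apoc.text.levenshteinDistance("High Street", "Hgh Street") AS editDistance, // fuzzy matching by edit distance
  apoc.text.sorensenDiceSimilarity("High Street", "High St") AS similarity,   // similarity score between 0 and 1
  apoc.text.phonetic("Street") AS phoneticCode;                               // phonetic (soundex-style) encoding
```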

 

William Lyon (29:59):

Okay. So I guess I lied; I guess there wasn't just one more step for importing our data. I guess we do need to think about cleansing our data a little bit first. But, oh well, now we are really at the last step, which is to load and import our data. Let's talk through a few different ways to approach this; we'll go through a few different options, and the option that's best for you will depend a little bit on what format your data is in, and also maybe on your familiarity and comfort with using different tools. So the first option that we'll talk about is LOAD CSV; CSV, comma-separated values. Even though we're not talking just about comma-separated value files (this can be tab-delimited or fixed-width files; basically any sort of flat file works with LOAD CSV), LOAD CSV is functionality that is built into Cypher, the query language for Neo4j.

 

William Lyon (31:02):

So oftentimes we're writing these LOAD CSV scripts in Neo4j Browser, but this is functionality built into Cypher that will allow us to parse CSV files. And this can be either from a local file, or it can be from a remote hosted CSV file. So, for example, you can just point LOAD CSV at one of these hosted files from, say, one of the data portals, if they're hosting CSV files, that sort of thing. But the key idea here is we're just writing Cypher to load and parse the CSV file, and then we write Cypher to describe the graph pattern that we want to create. We're writing things like CREATE and MERGE statements in Cypher followed by some graph pattern. And this is really neat and really useful because it allows us to leverage this really powerful concept from Cypher of drawing these graph patterns.
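A minimal LOAD CSV sketch, pointing at a hypothetical hosted file, might look like this:

```cypher
// Parse each row of a remote CSV, then describe the graph pattern to create.
LOAD CSV WITH HEADERS FROM "https://example.com/books.csv" AS row
MERGE (b:Book {title: row.title})
MERGE (a:Author {name: row.author})
MERGE (a)-[:WROTE]->(b);
```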

 

William Lyon (32:00):

We just draw these ASCII-art graph patterns of how we want to create the data, and then we've loaded that into the graph. A distant cousin, perhaps, of LOAD CSV is the bulk import tool for Neo4j. This is the neo4j-admin import command. This is a command line tool, the bulk importer. This is useful when I have very large data sets. LOAD CSV works within the transactional nature of Neo4j; it's good for small to medium-sized data sets. Roughly, we're talking about working with up to maybe a few million lines of a CSV file. But if I have a larger dataset than that, if we're talking about tens of millions or more rows of a CSV file, then maybe I want to skip the transactional nature of Neo4j.

 

William Lyon (33:00):

And I just want to write directly to the file store, skipping a lot of steps. That means that this is going to be a lot faster. The downside of that is I can only use this for my initial data load. With LOAD CSV, I can add data to an existing Neo4j database; with the bulk importer, the neo4j-admin import tool, I can only do this when I'm first building my database. And I also need to think about how to structure my CSV files, how to clean them, how to dedupe them. With LOAD CSV, I can do some cleaning and deduping, like with APOC, for example, that Lju mentioned earlier, but with the neo4j-admin import tool, I need to have already done that and have my CSV files structured in a certain way. We'll say that the neo4j-admin import bulk importer tool is maybe a bit more for advanced use cases, but I just wanted to mention that it's out there, so when you're ready to graduate from LOAD CSV, that's sort of the next tool to think about.

 

Lju Lazarevic (34:06):

So the mighty APOC steps in once again in our quest to load data into Neo4j. As well as the various text functions and procedures that we covered earlier, there are also a number of tools available within APOC to help you load some data. We've got a JSON loader. In a similar vein to LOAD CSV, you can use that to load either local files with JSON in them, or you can link up to remote files, and you also have an option to work with APIs that are providing JSON output, and you can work with authorization headers as well if you need to do some level of authentication. You also have an XML parser available within APOC. This is great for things such as pulling podcast and RSS information. You'll also find a lot of websites and public repositories tend to have their information published as XML.
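Here's a sketch of the JSON loader against a hypothetical endpoint; the URL and the shape of the response are assumptions for illustration.

```cypher
// Fetch JSON from an API and turn each article into a node.
CALL apoc.load.json("https://example.com/api/articles.json") YIELD value
UNWIND value.articles AS article
MERGE (a:Article {title: article.title});
```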

 

Lju Lazarevic (35:12):

So again, that's another great place to be able to grab that information, work with it, and get it loaded. Another tool you have as well is load HTML, and this will allow you to do some level of web scraping. You provide a URL, and you'll also provide the elements that you want to pull the information out of, and it'll help you do some level of assisted web scraping. And last but not least, you also have a JDBC connector available within APOC. This will allow you to connect to most SQL databases from Cypher. So if the SQL database that you're looking to connect to has a JDBC connector, then you can use that route. And what's really powerful with this is you can use something like Neo4j Browser and effectively call your APOC procedure.

 

Lju Lazarevic (36:06):

You put in your connection string for your database, you write the SQL query that you want to push to that database within Neo4j Browser, and then when you pull that information back, you can do things with it and go off and save it into your Neo4j instance. You also have some other tools as well. You've got the Neo4j ETL tool, and this is a no-code data import option, again importing data from a SQL database. It is very much a click-based experience: you effectively give it a connection string, you let it know which tables you're looking to use, it'll go off and pull back information about the schema in your database, and it will suggest a model to you. You do a bit of tweaking, and away you go, you pull that into Neo4j. And there are going to be more ways you can pull your data in, but we're going to cover one last chunky one before we wrap up this journey, and that is with drivers.
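A sketch of that JDBC route is below; it assumes the relevant JDBC driver jar is available to Neo4j, and the connection string, table, and columns are all made up for illustration.

```cypher
// Run a SQL query against a relational database and merge the rows into the graph.
CALL apoc.load.jdbc(
  "jdbc:mysql://localhost:3306/shop?user=user&password=secret",
  "SELECT id, name FROM customers"
) YIELD row
MERGE (c:Customer {id: row.id})
SET c.name = row.name;
```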

 

Lju Lazarevic (37:06):

So we have a number of official Neo4j drivers to be able to work with your language of choice to connect to Neo4j, and we also have many more community drivers as well. And this is really useful for being able to do any specific data cleaning you're wanting to do with specific libraries. For example, there will be many data cleaning and processing libraries available in Python, JavaScript, .NET, etcetera.

 

Lju Lazarevic (37:39):

It's really good if you are doing an API mashup, if you're looking to bring in lots of different APIs and other data sources, and maybe you're bringing in some reference data too and doing some fine-tuning. Obviously it makes sense, if you're going to do any big amounts of work around there, to do that at a programmatic level, and then you use the driver to push that data into Neo4j. That's very useful for cleaning outside of Cypher. The process is that you do something such as passing a JSON object as a Cypher parameter; you can UNWIND over it, and then you can use Cypher to express graph patterns over it to create that data in there. Hopefully, on this very quick route, we've given you some inspiration about how to get started.
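For example, the Cypher you send from a driver might look like the sketch below, where $rows is assumed to be a list of maps (the cleaned-up records) supplied as a parameter by your application code.

```cypher
// Each element of $rows is a map like {username: "...", title: "..."}.
UNWIND $rows AS row
MERGE (p:Person {username: row.username})
MERGE (b:Book {title: row.title})
MERGE (p)-[:READ]->(b);
```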

 

William Lyon (38:29):

I'm sure we missed some really interesting data sets or ways of loading data into Neo4j, so please let us know what we missed. Let us know on Twitter, just #Neo4j, and share with us what interesting data sets you are working with that we should know about. But that's it for today. Thanks for joining, and we will see you next time. Cheers. See you later.