GraphStuff.FM: The Neo4j Graph Database Developer Podcast

Modeling Physical Systems Using Graphs: The Path To NODES 2021 With Mike Morley and Peter Tunkis

Episode Summary

Mike Morley and Peter Tunkis will be presenting "Modeling Physical Systems Using Graphs" at the NODES 2021 online conference on June 17th. They join this episode of GraphStuff.FM to give us a preview of their talk and discuss their journey working with graphs and geosciences data.

Episode Notes

Full agenda and free registration for the NODES 2021 online conference here: https://dev.neo4j.com/nodes

Episode Transcription

William Lyon (00:00):

Welcome to GraphStuff.FM, the place to figure out efficient ways to traverse your data.

Lju Lazarevic (00:06):

We're your hosts, Will Lyon and Lju Lazarevic. In the run-up to NODES, our online developer conference running on the 17th of June, we're taking the opportunity to interview some of the speakers.

William Lyon (00:18):

In these sessions, we dig in to learn how they first found out about graphs, what was their light bulb moment, what are they going to be covering in their sessions at NODES, and what are they looking forward to in the future around graphs. In this session, we're speaking with Mike Morley and Pete Tunkis.

William Lyon (00:35):

Don't forget to register for your place at NODES at neo4j.com/nodes, where you'll be able to check out Mike and Pete's talk on modeling physical systems using graphs in the visualization track. I'm joined today by Mike Morley and Peter Tunkis. Thanks so much for joining us today and welcome to the podcast.

Mike Morley (00:54):

Howdy.

William Lyon (00:55):

I guess to kick us off, maybe could you introduce yourselves and tell us a little bit about yourselves?

Mike Morley (01:02):

Sure. I guess I'll start, being the old guy in the room. Yeah so I'm Mike Morley, and in my current incarnation, I'm the director of machine learning and artificial intelligence technologies at Arcurve. A few years ago, I started a company called Menome and that was recently acquired by Arcurve, so we joined the Arcurve folks in November 2020.

Pete Tunkis (01:22):

Yep, and I am Pete Tunkis. I am Arcurve's inaugural data scientist. I joined Arcurve about the same time as when Mike here formally joined the organization as well, so we came in at the same time. My introduction to Graph is, I don't know, I think it's quirky because it also ties into how Mike and I started working together as well. Originally, I was your standard run-of-the-mill relational database data scientist at a Fortune 100 company in the States. Then we came across an opportunity to help out our fraud and special investigative unit, so that department in the company.

Pete Tunkis (02:02):

So we came up with a graphy approach, fairly similar to a lot of the white papers and that kind of thing that Neo4j has, so that served as inspiration. We shopped around, but Neo4j seemed to have the most user-friendly approach to being able to solve the problems or challenges that we were looking to solve. And so that's how I got started with Graph. I had a lot of good help from some of Neo4j's technical folks, both in the US and in Canada.

Pete Tunkis (02:30):

In fact, life happens and so, it turned out that my wife and I decided to move to Calgary, which is where I'm originally from. I reached out to actually one of the engineers at Neo4j with whom I had worked previously who is based in Ottawa. And I said, "Hey, do you know any people I could reach out to or whatever, in the industry across Canada for remote jobs or whatever?" And he said, "I don't really know anybody in Calgary like that, but I do know this one guy, Mike Morley, who you might be interested to talk to."

Pete Tunkis (03:00):

And lo and behold, we're now working together for the same company, and we're working on this really neat graph physical modeling project together. To me it's a neat story anyway.

William Lyon (03:10):

Yeah. That's a great story, the graph of relationships.

Pete Tunkis (03:13):

Exactly.

William Lyon (03:15):

I think fraud detection is a really common use case for graphs, right? Definitely something that I see a lot of folks digging into and an intuitive way to dig into that data. Cool, that's a good intro to Graph story. Maybe Mike, could you tell us how you first came into graphs in Neo4j?

Mike Morley (03:30):

Yeah, you bet. So before I started the Menome company, I was working at a company called Matrix Solutions, which is an environmental engineering firm. I started there in 2004, and through the course of setting up the organization's IT function and software development practice and whatnot, we started building a unified data mart to bring together project data and spatial data about the activities that people were working on, and then environmental data.

Mike Morley (03:59):

As the organization grew, so did the data sets. When I started there, I think it was 40 people, and it was about 800 by the time I left, and I kept running into the same problem. We went from a fairly simple data model using a SQL-type approach to the big Kimball star schema thing and OLAP cubes. But invariably you'd still hit this wall where things connected to other things.

Mike Morley (04:20):

As you can imagine, you're talking about people working on projects that are located at sites that have all sorts of different environmental characteristics associated with them. Then you have the drilling data that constitutes boreholes. You have wildlife observations, you have rare plants, contaminants, spills, all these sorts of data. Trying to actually get a unified picture of everything that's happened in the context of a project or a site, a spatial location where some activity has happened, becomes very complicated, and it was actually quite a bit past the types of databases and data models that were available at the time.

Mike Morley (04:57):

Around 2009, I was doing some research on how to try and figure out a way around all this, like this thing connected to the other thing, and then came across a post about Neo4j. And I was like, "Oh, this looks interesting." And then digging into it, sure enough, it was like, "Wow, this is mind blowing." I immediately started conducting some experiments with it in terms of being able to unify these different data sources together. We ran some experiments through the course of that process, and we got into using LDA topic modeling and a few other things to try and further augment the way you'd be connecting all the data and classifying things and whatnot. They were quite successful. And then that essentially led to starting Menome in 2016.

William Lyon (05:37):

Maybe you could tell us a little bit more about Menome, and you said now you're at Arcurve. Maybe you could talk a little bit more about that transition and some of the things you worked on at those two companies.

Mike Morley (05:47):

Yeah, so Menome was started, as I mentioned, to pursue the idea of bringing datasets together into the graph. We created a platform called Data Lake that allowed for continuous integration of streams of data by atomizing data from sources using the harvester messaging pattern, and that worked quite well. We were bootstrapping the company by using this platform and Neo4j to generate insight for various different types of data, ranging from project data and environmental data to school educational data, all kinds of interesting stuff. Essentially, the way we were funding the business was through this idea of advanced analytics: doing the analysis of different things, which would generate consulting revenue, which would then feed into the platform development piece.

Mike Morley (06:32):

We had been camping out in Arcurve's space, and that's a funny story too. One of the original Matrix employees, Konrad Aust, myself, and his friend [Mark 00:06:43] were on a call with a client. At that time we weren't at an office, we were at a house. This was in January in Calgary, and it can be quite cold here. So Konrad comes on the camera and he's got this light bulb over his head and a scarf, and you could see his breath. And he's like, "Dude, where are you?" "I'm in my parents' garage. They don't want me talking on this call."

Mike Morley (07:08):

And I thought, "Oh boy, I better do something about this." I contacted [inaudible], who's the CEO of Arcurve. And I said, "Hey, you mentioned you had some space." Basically he let us camp out there, and we got to know the Arcurve folks quite well. It came up last year that they were looking for an advanced analytics practice. By that point we'd built a good cadre, a strong team and whatnot. It made a lot of sense in terms of bringing those two entities together. And we've since been able to hire folks like Pete and bring a lot more data scientists into the team, and we are continuing that process. It's been really great.

William Lyon (07:43):

Gotcha, great. And when you say an advanced analytics practice, is that like a consulting engagement?

Mike Morley (07:51):

Yeah, yeah. We're primarily focused on delivering insight through the analysis of data and whatnot. We still do use obviously the platform and whatnot that we developed through the course of Menome. We have access to all of that [inaudible 00:08:05] packs, but we actually are now primarily focused on the analysis piece rather than just on the platform side of it.

Pete Tunkis (08:10):

One thing I would just add to that too, and this is one of the things that appealed to me as well, is that it struck me that it's not like a software as a service type of thing where you're paying for Arcurve version 1.0 or whatever it is. We build out solutions, whether they're analytical, insight-based, or whatever fashion they might happen to appear. We deliver that in a sustainable way, meaning that when we hand the product over, we hand a solution over to a client or a business partner, then they're the owners. In theory, they should be able to support it themselves, and then we're always just a phone call away type of deal. But it's their product, it's not our product per se.

William Lyon (08:54):

Great, well, we wanted to talk to you both today about your upcoming talk at the NODES conference, which is coming up in just a few days. The title of your talk is Modeling Physical Systems Using Graphs. Could you maybe give us a preview of some interesting things that the attendees will learn during your talk?

Mike Morley (09:15):

Yeah, so this, again, is something we've been plugging away on for a while, and it comes out of some work that I did back in the mining industry. I originally started out working at a place called [inaudible] in Montreal doing 3D modeling and blast design stuff. At that time it was basically... Relational databases had just become a thing, and it was marrying the relational database capabilities with the two dimensional capabilities that came out, I think [inaudible 00:09:42]. We basically wrote a system that would allow for the modeling of complex surfaces and being able to interact and segment those surfaces out and then run all sorts of analyses, using a unified model. That sort of paradigm shift, going from file-based databases, which we had used at the time, to a relational database, brought huge capabilities to the equation that were just not possible prior to that happening.

Mike Morley (10:05):

With the advent of Neo4j and the way graphs work, and the way you can structure and model things with graphs, it's quite revolutionary, particularly when you start thinking about the way you model physical systems. If you look at any kind of numerical modeling in particular, or even geospatial models and then temporal models, all of those things, on the analytical side of it, are actually composed of nodes connected to other things, nodes connected together with relationships. So you can actually model boreholes with linked lists in Neo4j. You can actually model buildings with mesh networks, and there's some great work by Will Reynolds from Hoare Lea, who's done some experiments with this on projecting building [inaudible 00:10:47] models into graphs, and really cool stuff. So yeah, the brilliant thing about Neo4j is you can actually construct a database representation of a physical thing using the graph structure.
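
Editor's note: the borehole-as-linked-list idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the speakers' actual model. The names Borehole, Interval, STARTS_WITH, and NEXT are hypothetical, and the nodes and relationships are kept in plain Python structures rather than written to a database.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One logged interval in a borehole; becomes one node in the graph."""
    top_m: float     # depth to the top of the interval, in metres
    base_m: float    # depth to the base of the interval, in metres
    lithology: str   # logged material, for example "clay"


def borehole_to_graph(borehole_id, intervals):
    """Return (nodes, relationships) for a borehole stored as a linked list.

    The labels and relationship types used here are hypothetical names
    chosen for this sketch.
    """
    ordered = sorted(intervals, key=lambda i: i.top_m)
    nodes = [{"id": borehole_id, "label": "Borehole"}]
    rels = []
    previous_id = borehole_id
    for n, interval in enumerate(ordered):
        node_id = f"{borehole_id}/{n}"
        nodes.append({
            "id": node_id,
            "label": "Interval",
            "top_m": interval.top_m,
            "base_m": interval.base_m,
            "lithology": interval.lithology,
        })
        # The borehole points at its first interval; each interval
        # points at the one directly below it.
        rel_type = "STARTS_WITH" if n == 0 else "NEXT"
        rels.append((previous_id, rel_type, node_id))
        previous_id = node_id
    return nodes, rels


def walk(borehole_id, nodes, rels):
    """Follow the list from the borehole down; return lithologies in depth order."""
    by_id = {node["id"]: node for node in nodes}
    outgoing = {src: dst for src, _, dst in rels}
    result = []
    current = outgoing.get(borehole_id)
    while current is not None:
        result.append(by_id[current]["lithology"])
        current = outgoing.get(current)
    return result
```

Because each interval has exactly one successor, walking the list from the borehole node recovers the log in depth order, which is the property the later "boreholes look like sentences" remark relies on.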

William Lyon (11:00):

Cool, and I know one other thing you mentioned earlier that really stood out to me is this idea of pulling in disparate data sets. So you have data on maybe some geospatial model, you have data from other data sources. I imagine that is a useful component of the analysis work as well. Is that something you're going to dig into a bit at NODES?

Mike Morley (11:20):

Yes, that's hugely useful and it's actually a core aspect of the... This has actually been a series of experiments we've been conducting for a while now, where we started off with, can we model street and surface level data? And yep, we can, with OpenStreetMap and some of the work that Craig Taverner is doing. And because Neo4j essentially supports both spatial and Cartesian coordinates, you can do really cool things, like having a spatial model interact with an accurate building model in the same graph structure. Then you can bring in all the adjacent data, saying, "Okay well, if you've got surface data, and you've got a building model on top of that, you can have all of the things inside of that building model, the mechanical, electrical, plumbing systems, then you can attach the things like the documents and all of the unstructured data to that same model."
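
Editor's note: Neo4j point values can be geographic (latitude and longitude) or Cartesian (x and y). The practical difference shows up as soon as you measure distance. The Python sketch below is an illustration only, with hypothetical function names, a spherical Earth approximation, and no database calls. It shows why a site-level building grid and a province-level map need different distance formulas even when they live in the same graph.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres; an approximation


def cartesian_distance(a, b):
    """Euclidean distance between two (x, y) points in a local grid."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def geographic_distance(a, b):
    """Great-circle (haversine) distance in metres between two
    (latitude, longitude) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))
```

For example, `cartesian_distance((0, 0), (3, 4))` measures across a building floor plan in whatever units the plan uses, while `geographic_distance((51.0447, -114.0719), (53.5461, -113.4938))` measures between two approximate city-centre coordinates in Alberta, in metres.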

Mike Morley (12:05):

If you're going into the building and you want to make it essentially a data twin or a digital twin of that entire facility, you can do that. It doesn't stop there because the limitations of a lot of this stuff, historically, have been, you'd only be able to really model one domain effectively. Surface data is not connected to subsurface data. Building data is not connected to the environment around it. By projecting all of this stuff in the graph, you can actually model all of those physical systems in the same graph, which I think opens up a whole new field of potential opportunities in terms of the unified model, particularly when you're talking about engineering systems and how they interact with the environment.

Mike Morley (12:38):

What we've been doing is conducting a series of simple experiments for the last year or so, focusing on trying to figure out each particular facet of this process of unifying these models. So the one we're going to be doing coming up here is taking data from the Alberta Energy Regulator, pulling those data in, and trying to do a simple site characterization model of the entire province of Alberta, from a gas well perspective.

William Lyon (13:05):

Awesome, that sounds really cool. I'm excited for this talk at NODES. Have to watch it live. One thing that I'm always interested in asking guests is... And since you're on the cutting edge here of doing some of this analysis, as you've watched the graph space evolve over the years, where do you see things going? What's something that you're excited for the graph space to move towards?

Pete Tunkis (13:30):

I'll take this one. For me, I think it depends, so is it what we think we see coming, or what we would like to see? Because I can definitely speak to the latter, which I think is already happening, and that is wrapping things up into... I hate to use the phrase, but a full stack platform, so beyond an architectural modeling platform as a database type. Then Neo4j is already pretty good at allowing analysts, data scientists, engineers, architects, and so on, to basically mess with the data natively within the platform. So it's not that we have a database and then we just have an engine that's wrapped around it or plopped on top of it to say, "Oh, we're doing quote-unquote graph data science now." Everything is contained. The data are in the graph and any manipulations, so any data munging and any of that stuff, would be done within that native environment.

Pete Tunkis (14:29):

Almost as if you had RStudio or PyCharm or Spyder or something, but you're in a graph the whole time. I think in some ways Neo4j is... And to be fair, I think a lot of the graph database industry is slowly moving in that direction. I find that pretty exciting because if we can contain as much of the, for me as a data scientist, as much of that end-to-end data sciencey stuff from the engineering and feature engineering, feature selection, along with any of the unsupervised or supervised algorithms, and those are continually being brought up and developed outwards and all that sort of thing. That's where I see it going and certainly where I would like to see it going.

Pete Tunkis (15:12):

It sure beats having to import data sets from Python or R, and then export properties and nodes back into R or Python to do something, and then re-import back. Which is certainly possible because there is a lot of good support out there from the community, as well as the officially supported peripheral languages by Neo4j. But yeah, as things keep moving towards a unified platform, that would be at least very convenient, for sure.

William Lyon (15:38):

Right, sort of the evolution of the developer tooling, so moving up the stack a bit, yeah.

Pete Tunkis (15:43):

Yeah, definitely.

William Lyon (15:43):

Makes total sense.

Pete Tunkis (15:45):

Definitely. Just really quick, because a lot of the math and a lot of the Graph Data Science itself isn't necessarily new in the same way that, like, "Oh, we developed this code or this package library or something within the last year." A lot of this stuff has been around for decades and decades. So just being in a place at an age where we can see this all coming together is good. Again, it suggests that's more or less where things seem to be going, at least from my point of view.

Mike Morley (16:15):

Yeah, and it's one of the reasons that the Graph Data Science Library has been very exciting, because it's taking those capabilities and bringing them together in such a way that it makes them easier to use, easier to validate. Something that's a huge issue, regardless of the modeling domain you're in, is having the processes and the patterns around making these sorts of analyses repeatable, stable, and robust. It's huge because a lot of the work is done by pushing and pulling CSV files around, which is so often the case, particularly in the world of numerical modeling. There's a lot of management, a lot of time that's required just to try and keep track of versions of models. I've seen that when we were working at Matrix and some of these other places, where you're just trying to construct a model, run the model, take the output, visualize it, make adjustments, run it again. That ends up consuming massive amounts of time and resources.

Mike Morley (17:09):

So having an engine like Pete's describing, where you can basically have all of that stuff taken care of, and the way that GDS has evolved, particularly recently with the latest edition, to being able to store model runs and all this kind of thing, that's exactly the right direction. The other piece is the visualization piece, so being able to project those data into various analysis tools, whether you're talking Tableau or Power BI or whatever, or even going past that, with the stuff that GraphXR in particular are doing. There are some really interesting things going on on that side of the equation too.

William Lyon (17:40):

We've been talking about different kinds of technologies, the intersection of geosciences, GIS, and software development. One thing that I'm wondering is, for folks that are interested in getting into a career in this space, like what you're working on, what suggestions would you have? Is it something where you need to have this geosciences domain background? Is the technology a good introductory point? What do you think about that?

Pete Tunkis (18:10):

To be fair, really quick, I generally defer to Mike on these kinds of questions, but for what it's worth, I do not have a geosciences background. I have a PhD in political science. That has very little to nothing to do with learning about lithologic sequences or stratigraphy, or geoformations, or any of that sort of thing. I've learned a lot about those, but I didn't come into it with any sort of concrete background. It's more about knowing the kind of questions to ask, and how and when to ask them. Then thinking about, "Okay, if I have this given situation, how would I apply the skills that I already have?" Which are translatable, essentially, for technique, methodology, and process. And how would I apply that in a new domain with a little bit of extra context and substantive help from colleagues like Mike and others that I know who are in the field? That's my defense, in a nutshell.

Mike Morley (19:04):

The interesting thing about that, though, is there's tremendous power in coming in from an adjacent domain. There's a really great book by Arthur Koestler called The Act of Creation that talks about this. You see this happen. Everybody, particularly in the world of the geoscience domain or other domains, when you're talking engineering and whatnot, is taught going through that program that data are square. You work in a spreadsheet, you work in a database, and that whole relational model is just pounded into people's heads. It has certainly been a challenge to get across the point that there's a lot of work that goes into trying to stick data into a box. And the thing that's really nice about Graph is you don't have to do that. It's much more amenable to modeling natural systems because you can shape the data in a way that reflects the actual structure of the natural system that you model.

Mike Morley (19:51):

So the interesting thing is that we realized, when we were starting to link borehole lists together and whatnot, "Hey, boreholes look a lot like sentences." So that's where the inspiration for the other part of the talk came from, which Pete's going to go into: taking a thing called sequence to sequence, and Pete can describe the whole thing around that better than I can. I was looking at the boreholes going, "Geez, these things look like sentences. Pete, do you know of an algorithm that can actually take a sentence prediction thing that preserves the order?" Because of course the term sets in the context of borehole lithology, when you're talking of [inaudible 00:20:22] borehole, it's the same kind of pattern, where the sequence of things is important and the way people log things is important. You can bring machine learning into the equation: if you could use the boreholes that are linked up in Neo4j in the linked list structure, you could use that to then train a recommender.

Mike Morley (20:40):

That would be smart and have some intelligence to it, using the historical data logged in that area. So that's one of the other things we're going to have a look at in this presentation, but it's about being able to come from an adjacent domain, looking at the algorithms and looking at the databases in a bit of a different way, and then talking to subject matter experts, and then just trying some of these things out. The way we typically do it is: pick out a problem. Right now abandoned wells are a big problem. If we can get access to those data through the AER, then let's try some of these things out. So when I'm talking to people who are looking to get into this space, whether they're coming in from the geosciences side or from the data science side, that's what I would suggest doing. There are lots of really interesting problems in this space. Take one that hasn't necessarily been addressed, or where there are challenges around it with respect to the traditional approaches, and try it out.
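
Editor's note: to make the "boreholes look like sentences" analogy concrete, here is a deliberately small Python sketch. It is not the sequence-to-sequence model mentioned in the episode. As a much simpler stand-in, it counts which lithology tends to follow which in historical logs and suggests the most common successor. The function names and the `<top>` and `<bottom>` markers are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<top>", "<bottom>"  # hypothetical markers for the ends of a log


def train_transitions(boreholes):
    """Count how often each lithology is directly followed by another.

    `boreholes` is an iterable of lithology lists, each ordered from the
    surface downward, like the words of a sentence.
    """
    counts = defaultdict(Counter)
    for lithologies in boreholes:
        sequence = [START, *lithologies, END]
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts, current):
    """Return the most frequently observed successor, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]
```

A first-order count like this only looks one interval back. The appeal of a sequence model is that it can take the whole ordered log above a point into account, which is why the order-preserving linked list matters as training input.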

Pete Tunkis (21:31):

Yeah. And really, and again, ask questions. Ultimately, a lot of what we talk about with Graph and Graph Data Science is relationships. So indeed as you're asking questions, if you're not a subject matter expert in this particular area, so geosciences, geology, that kind of thing, then ask the questions that would help you determine how this context relates to a context that you might already be familiar with, or translate it into your own terms, regardless of what the subject matter is. If you have a modeling task in front of you as a data scientist or an analyst, what is your outcome variable? Is it continuous, just a set of counts or floats or whatever, or is it discrete categories? And then without knowing what those data are about, you can already start to get an idea: "Okay, generally speaking, I probably would want to take a discrete modeling approach or a classifier to categories," or "I would want to take a continuous, like a linear type model, like a regression or whatever, to these continuous data, given that's my target."
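
Editor's note: the rule of thumb described here can be written down as a tiny Python helper. It is a sketch of the reasoning only, with a hypothetical function name, and it makes a simplifying assumption: numeric targets (counts or floats) point to regression and everything else points to classification. Categories that happen to be coded as integers would need to be handled separately.

```python
from numbers import Real


def suggest_model_family(target_values):
    """Suggest 'regression' for numeric targets and 'classification' otherwise.

    Booleans are treated as categories, not numbers. This looks only at the
    type of the outcome variable, not at what the data mean.
    """
    values = list(target_values)
    if not values:
        raise ValueError("need at least one target value")
    numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    return "regression" if numeric else "classification"
```

The helper never looks at what the values represent, which mirrors the point being made: the first modeling decision can be taken before any domain context comes into play.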

Pete Tunkis (22:34):

At no point in thinking about that by itself, did context come into play. Of course, that doesn't mean that it's not important. I don't want to make it sound like I'm downplaying it in any way. But nonetheless, it's all about relating and applying your knowledge and skills and translating them into new contexts. And that's half the fun because you end up learning something new about... Like myself, I had no idea about anything geology. And I have a friend who works up in Northern Alberta at one of the mining sites up there, and all of a sudden I can actually understand some of the things that come out of his mouth when he talks about work. A year ago, he would tell me these things and I would be, "Oh yeah, cool story, bro." But now it's actually, "Oh yeah, this is actually relating to the project that Mike and I are working on. Blah, blah, blah." So yeah, it's fun.

William Lyon (23:20):

One thing that really jumps out to me is the transferability of Graph algorithms and Graph Data Science to different domains. For example, I was talking to a bioinformatics professor and we were looking at protein gene interaction data. And he starts telling me that the algorithms that he's interested in using for this specific problem that we were working on actually come from social network analysis. Some of those are available in Neo4j out of the box with GDS. I know absolutely nothing about bioinformatics and biology in general, so I didn't initially think I'd have much to contribute. Then I realized, "Oh yeah, I actually know a little bit about some of the social network stuff." So we're looking at collaborating now. So certainly a lot of interesting ways to transfer your knowledge to different domains there.

Pete Tunkis (24:07):

Absolutely.

Mike Morley (24:08):

Yeah, definitely. That's totally the point of these little experiments we've been running. This one, you can say, is around a very simple starting point for site characterization. If it looks like it proves out, then what we'll do is start to dig into that further and deepen the model and whatnot, but it is very much exploring some of the algorithms that are in GDS to see how they might apply to this.

William Lyon (24:31):

Great. Thanks for taking the time to chat with us today. We're looking forward to your talk at NODES coming up in a few days. We'll put a link to that in the show notes, and I guess we will see you all at NODES.

Mike Morley (24:44):

Looking forward to it.

Pete Tunkis (24:44):

Looking forward to it. Yep.