GraphStuff.FM: The Neo4j Graph Database Developer Podcast

The Path To NODES 2021 With Tomaž Bratanič - From Text to a Knowledge Graph: The Information Extraction Pipeline

Episode Summary

In the run-up to NODES 2021 (the Neo4j Online Developer Expo & Summit) we're interviewing a few of the speakers to tell us how they first got excited about graphs, what they're working on, and give us a preview of what they'll be sharing at NODES 2021. In this episode, we speak with Tomaž Bratanič about knowledge graphs and graph data science.

Episode Notes

Register for NODES 2021 here: https://dev.neo4j.com/nodes
Follow Tomaz on Twitter: https://twitter.com/tb_tomaz

Episode Transcription

Lju Lazarevic (00:00):

Welcome to GraphStuff.FM, the place where developers use the shortest path to solve their graph data challenges. We're your hosts, Will Lyon and Lju Lazarevic.

Lju Lazarevic (00:11):

In the run-up to NODES, our online developer conference running on the 17th June, we're taking the opportunity to interview some of the speakers.

William Lyon (00:21):

In these sessions, we dig in to learn how they first found out about graphs, what was their light bulb moment, what are they going to be covering in their sessions at NODES, and what are they looking forward to in the future around graphs.

William Lyon (00:34):

In this session, we'll be speaking to Tomaz Bratanic. Don't forget to register for your place at NODES at neo4j.com/nodes, where you'll be able to check out Tomaz's talk, From Text to a Knowledge Graph: The Information Extraction Pipeline, in the data science track.

Lju Lazarevic (00:54):

Today, I am joined by Tomaz Bratanic and he's going to be talking to us about his experiences with graph and give us a bit of a warmup as to what's coming in his NODES 2021 talk.

Lju Lazarevic (01:07):

So, Tomaz, welcome.

Lju Lazarevic (01:10):

Thank you for joining me today. Tomaz, can you tell us a little bit about yourself?

Tomaz Bratanic (01:18):

I live in a small country, Slovenia, and I live in a smaller city that's called [inaudible 00:01:26]. We have around, I think, 15,000 people.

Tomaz Bratanic (01:29):

So I'm kind of enjoying my time here working remotely through the day, and then I get to enjoy nature in my free time because there's a five minute walk and I go to the nearest forest and I can turn off and enjoy the nature.

Tomaz Bratanic (01:47):

And that's my free time, but as for my profession, I simply became... I've always been enthusiastic about graphs, but in the last month or so I've been expanding my business a little bit. So I took on an intern, and I'm also preparing a friend of mine to be able to write interesting content.

Tomaz Bratanic (02:14):

So basically it's a really interesting time because I'm expanding from a team of one to a team of two or three.

Lju Lazarevic (02:24):

That's really exciting. So obviously you're very enthusiastic about graphs and you're spreading the graph love. So let's dig into that a little bit. How did you come across graph databases in the first place?

Tomaz Bratanic (02:39):

That's an interesting subject. It was, let's say, five years ago. I didn't have any developer experience; all I knew was a little bit of HTML that I did in summer school, maybe 15 years ago, and that was it. It was more by chance that my brother got me a job working at a startup incubator. There was one long hallway and, I don't know, 15 or 20 start-ups in that hallway, and my boss was basically the boss of the whole incubator, and he was really excited about graph databases.

Tomaz Bratanic (03:26):

So basically I was kindly, or gently, pushed into graph databases by him, which, as it turns out, was actually a very good idea. So basically my first developer experience was with Neo4j [inaudible 00:03:46], and that makes me a bit unique, because usually people start with scripting languages, a little bit of SQL, but not me.

Tomaz Bratanic (03:57):

I immediately started with Neo4j, and then I started to learn a little bit about the scripting languages and SQL. So my introduction to graph databases was basically that my boss was so excited, and I was like a blank page because I didn't know any development. I didn't have any development skills, so I guess I owe him a favor that he pushed me into graph databases and specifically Neo4j.

Lju Lazarevic (04:32):

That is really cool, that is super amazing, and like you say, that's quite a unique thing because normally the people we come across when they learn about graph databases, they're coming from that developer background, so maybe they've done computer science at college or at university, maybe they've done some coding and they bump into a database or maybe they've got a specific problem they're looking to solve, and they're having a look at what's available out there.

Lju Lazarevic (05:05):

Yours is a truly unique position in that you came from no developer background at all, and that literally your first contact point is Neo4j.

Lju Lazarevic (05:17):

How was that, how did you find that whole experience out of interest? It's one of those things where your boss has obviously said, "hey, Neo4j, this is amazing," but what was the moment for you where you thought, "no, this is amazing, this is great."

Tomaz Bratanic (05:42):

At the start, it was both [inaudible 00:05:42] Neo4j, and I had to look at some blog posts. I actually think I remember it: there's a blog post about television on Neo4j from five years ago, and I distinctly remember how I copied the code from there and imported my first graph.

Tomaz Bratanic (06:04):

That was really awesome because everybody around me was excited, but none of them knew any Cypher, and it was more like the starting phases where I like to say that we were allowing [inaudible 00:06:18] sent links between the [inaudible 00:06:22], and I had to figure out how this whole thing works together.

Tomaz Bratanic (06:27):

But the unique moment for me was this: in Neo4j, what do you do when you interact with the database? Most of the time what you're doing is matching some patterns. You use Cypher to match some patterns, you retrieve them, and maybe you do some data manipulation, like post-processing, after that.

Tomaz Bratanic (06:51):

But the essence of everything is pattern matching. I felt that was really awesome, because we as humans are also pattern-matching machines. So it was like me as a human, a pattern-matching machine, interacting with a pattern-matching database. So that was, for me, really exciting.
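
To make the pattern-matching idea above concrete, here is a minimal sketch in plain Python. It is not how Neo4j executes a query; it only imitates, over a made-up list of triples, what a Cypher pattern such as `MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c) RETURN a, b, c` asks for. The data, the relationship names, and the helper `match_two_hop` are all hypothetical.

```python
# Toy in-memory graph stored as (source, relationship type, target) triples.
# Every name below is invented for illustration only.
EDGES = [
    ("Alice", "KNOWS", "Bob"),
    ("Bob", "KNOWS", "Carol"),
    ("Alice", "WORKS_AT", "Acme"),
    ("Carol", "WORKS_AT", "Acme"),
]


def match_two_hop(edges, first_rel, second_rel):
    """Return every (a, b, c) where a -[first_rel]-> b -[second_rel]-> c."""
    results = []
    for a, rel1, b in edges:
        if rel1 != first_rel:
            continue
        # Look for a second edge that starts where the first one ended.
        for b2, rel2, c in edges:
            if b2 == b and rel2 == second_rel:
                results.append((a, b, c))
    return results


# "Friends of friends": the two-hop KNOWS pattern.
friends_of_friends = match_two_hop(EDGES, "KNOWS", "KNOWS")
print(friends_of_friends)
```

The nested loop is deliberately naive. The point is only that a query is a shape, and the answer is every place in the data where that shape occurs.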

Lju Lazarevic (07:16):

That's really cool.

Lju Lazarevic (07:20):

So we've heard about how you first came about graph databases and the bit that got you excited about them. What was the thing that you were like, "oh, this is so cool."

Lju Lazarevic (07:33):

So how are you using graph databases today? What's your typical interaction and application of them?

Tomaz Bratanic (07:40):

I've heard that they're using Neo4j in business environments, but lately almost everything I do in Neo4j is more like academia or just plain research for myself.

Tomaz Bratanic (07:55):

And I like it because obviously there's no business type things. There's no deadlines, there's no pressure. There's just finding cool data trying to come up with some cool insights. There's no boss on your back pressuring you, and if you don't find any insights, then onto the next step. You learn something and now you can go and try something else just because for example just a one [inaudible 00:08:33] I've had before.

Tomaz Bratanic (08:35):

I'm an associate of, how do you say, the Slovenian Institute for Biostatistics and Medical Informatics. I think that's the full title of the institution. What they have, basically, is a Neo4j database in which they store concepts, and the concepts can be, for example, genes, [inaudible 00:09:01], diseases, plants or food, and the relationships between them indicate how they interact with each other.

Tomaz Bratanic (09:11):

For example, let's say aspirin treats headaches. So basically you have a knowledge graph with the medical entities and the relationships between them. And let's say, for example, we were to search for new drugs that can treat a headache. What you can do is basically locate which existing drugs we know treat the headache best, and then calculate which drugs are the most similar to the drugs currently known to treat a headache.

Tomaz Bratanic (09:53):

Then you can see, based on the results, does it make sense or does it not make sense? I'm not an MD, a medical doctor, but I have two of them on my team, so they can tell me if the results make sense or not. So it's interesting. But again, I say it's more like academia, just trying to find cool stuff without the pressure of deadlines or business requirements.
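
The similarity search described in this answer can be sketched roughly as follows. The conversation does not say which similarity measure is used, so this example assumes Jaccard similarity over shared neighbours; the drug and target names other than aspirin and headache are placeholders, and none of the data is real medical knowledge.

```python
# Hypothetical toy data: which entities each drug interacts with.
INTERACTS = {
    "aspirin": {"target_1", "target_2", "target_3"},
    "drug_x": {"target_1", "target_2", "target_4"},
    "drug_y": {"target_5"},
}
# Drugs already known to treat a condition (from the example in the interview).
TREATS = {"aspirin": {"headache"}}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(condition):
    """Rank drugs not yet linked to `condition` by similarity to known treatments."""
    known = [drug for drug, conds in TREATS.items() if condition in conds]
    scores = {}
    for candidate, neighbours in INTERACTS.items():
        if candidate in known:
            continue
        # Score a candidate by its best match against any known treatment.
        scores[candidate] = max(
            (jaccard(neighbours, INTERACTS[k]) for k in known), default=0.0
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates("headache")
print(ranking)
```

A ranking like this is only a list of candidates to review. As the answer above notes, deciding whether a result makes sense still needs someone with medical expertise.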

Lju Lazarevic (10:25):

Yeah, so it's interesting as well, touching a little bit on this idea of putting data in and seeing if there are cool insights. Sometimes you do have an insight and something comes out of that, and sometimes you don't.

Lju Lazarevic (10:38):

I guess the interesting thing there, and something we sometimes forget about, is that even if you don't get something out of it, that's still a lesson. You're just learning that it's maybe an approach that doesn't quite work, or maybe there isn't something to be found in there. So that's still an outcome, just maybe not the one we were expecting.

Tomaz Bratanic (10:55):

Yeah, because now I remember, with my first boss we were also doing a lot of experiments. He liked experiments, and for example, he gave me something to graph and it didn't turn out the way he intended. I invalidated his assumptions and he was like, "the experiment failed," and I was like, "no, the experiment succeeded, and we learned that your assumptions are wrong."

Tomaz Bratanic (11:27):

Now imagine me telling my boss that his assumptions are wrong. It was fun times. But I like this kind of approach, like a regular scientific research approach. So basically you come up with some assumptions, you try to design an experiment that can either validate or invalidate your assumptions, and then you get the results and basically you learn something new. You can then design a new experiment and just iterate, and all the time you will learn a lot of new things. Maybe your input data may change, or you may get bad-quality input data, because even with graphs, most of the time when you get the data and you input it, you don't really know what's in the graph: did you model it correctly, are there some exceptions that you don't really know about?

Tomaz Bratanic (12:38):

So you do your first iteration of the graph analysis and then you go, "okay, so maybe there will always be a single relationship here." If there's more than one, it's basically an error in the data collection process. And the more knowledge you gain about it, the more precise your knowledge graph can be, and consequently, the more precise your insights.

Tomaz Bratanic (13:06):

So I feel like everything is an iterative process, nothing is set in stone. Basically, just like in life, you iterate, and hopefully you're getting better day by day. I should maybe do a motivational speech.

Lju Lazarevic (13:24):

That can be the next podcast.

Lju Lazarevic (13:28):

So on June 17th, the date of NODES, you are going to be doing a talk, and the title of the talk is From Text to a Knowledge Graph: The Information Extraction Pipeline.

Lju Lazarevic (13:42):

So can you tell us a little bit about the talk? What kind of things are you going to cover, and what will attendees learn from this talk?

Tomaz Bratanic (13:52):

The talk will focus on how to construct a knowledge graph when your input data is unstructured, in the form of text. Basically, you need to somehow extract structured data or information from the text as a first step, and then input that information into a knowledge graph.

Tomaz Bratanic (14:18):

I feel like this is a really exciting area because there's a lot of information available out there. Even for the knowledge graph that I was talking about before, the medical one, they basically use a very similar procedure to the one I'm going to talk about in the talk. And basically they looked at all the... like in medicine, there are a couple of journals online that publish all the new articles about various subjects, and obviously no human is capable of reading them all and extracting valuable insights.

Tomaz Bratanic (14:59):

So some scientists have devised NLP processes or NLP techniques to first identify the medical concepts mentioned in the research papers. And then [inaudible 00:15:15] was also trying to infer relationships between those concepts in the research paper, because obviously there are so many papers that no single person can read all of them and do it manually. So you have to kind of automate it a little bit.

Tomaz Bratanic (15:36):

It's basically a really hard problem, but once you construct the knowledge graph, the possibilities of finding valuable insights are basically endless.
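
The two steps described above, identifying mentioned concepts and then inferring relationships between them, can be illustrated with a deliberately tiny sketch. Real pipelines use trained NLP models; this example assumes a hand-written dictionary and a single verb rule instead, and the names `CONCEPTS`, `RELATION_VERBS`, and `extract_triples` are hypothetical.

```python
import re

# Hypothetical lookup: surface form -> (canonical name, node label).
CONCEPTS = {
    "aspirin": ("aspirin", "Drug"),
    "headache": ("headache", "Condition"),
    "headaches": ("headache", "Condition"),
}
# Hypothetical lookup: verb -> relationship type.
RELATION_VERBS = {"treats": "TREATS"}


def extract_triples(text):
    """Return (subject, relationship, object) triples found in `text`."""
    triples = []
    # Step 0: split into sentences on end-of-sentence punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Step 1: find positions of known concepts (entity recognition).
        mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in CONCEPTS]
        # Step 2: for each pair of neighbouring mentions, look for a known
        # relation verb between them (relationship extraction).
        for (i, subj), (j, obj) in zip(mentions, mentions[1:]):
            for word in tokens[i + 1 : j]:
                if word in RELATION_VERBS:
                    triples.append(
                        (CONCEPTS[subj][0], RELATION_VERBS[word], CONCEPTS[obj][0])
                    )
    return triples


triples = extract_triples("Aspirin treats headaches. The weather is nice.")
print(triples)
```

Each resulting triple maps naturally onto two nodes and one relationship, which is what makes the output easy to load into a knowledge graph afterwards.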

Lju Lazarevic (15:51):

Absolutely. And I guess if nothing else, bringing that structure into a graph, allows you to spot things that weren't necessarily obvious to begin with that were linked, so quite an exciting space.

Tomaz Bratanic (16:05):

In the last year or two, graphs are also being used more and more for predictive analytics, and I find this really interesting. So you start with a text, you run some NLP, natural language processing, model, which is a machine learning model, and you use it to extract valuable information from text, PDFs or whatever, and you import that extracted data into a knowledge graph. And then once you've got that, you feed that knowledge graph into another machine learning model that can help you predict some type of new links between entities.

Tomaz Bratanic (16:58):

So I'm really excited about this intersection of machine learning and knowledge graphs. In this specific example that I'm talking about, the first step is using natural language processing techniques to extract valuable information and turn it into a knowledge graph, and then again you use machine learning models to predict cool new stuff.

Tomaz Bratanic (17:23):

So basically a knowledge graph is becoming an intermediate step in the machine learning data flow. I'm really excited about it, and I'm also excited because everything can be automated, and me being a lazy person, I like things being automated. So using this technique you could scrape new research papers from wherever you want. It can be medical journals, it can even be official state papers, and even news, for example, extracting information from a news source.

Tomaz Bratanic (18:06):

So it could be also really helpful and you can do this automated. So while sipping your coffee, your machine scrapes the internet, extracts valuable information and throws it into your knowledge graph. So I find that really awesome.
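
As a rough illustration of the last step in that flow, predicting new links from an existing graph, here is a small sketch. The interview talks about feeding the graph into a machine learning model; this example swaps in a much simpler heuristic, the common-neighbours score, purely to show the shape of the task. The edge list and function names are made up.

```python
from itertools import combinations

# Hypothetical undirected graph, for example built from extracted triples.
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]


def build_adjacency(edges):
    """Map every node to the set of its neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def predict_links(edges):
    """Score each unconnected pair by how many neighbours the two nodes share."""
    adjacency = build_adjacency(edges)
    scored = []
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = len(adjacency[u] & adjacency[v])
        if shared:
            scored.append((u, v, shared))
    # Highest score first; ties keep alphabetical order because sort is stable.
    return sorted(scored, key=lambda item: item[2], reverse=True)


predictions = predict_links(EDGES)
print(predictions)
```

A trained model would replace the scoring line with learned features or embeddings, but the input (a graph) and the output (ranked candidate links) stay the same.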

Lju Lazarevic (18:26):

That does sound really exciting. I guess some of this is touching on some of the things that you see coming in the future, and what's really interesting to you. What else do you see in the more extended future? What are you looking forward to in the world of graphs coming up?

Tomaz Bratanic (18:44):

Yeah, graph data science is going to be more and more popular. I've now looked at some lectures from a fellow compatriot of mine, [inaudible 00:18:57], who is also Slovenian and also very enthusiastic about graphs, and I found them really interesting. What she said is that currently machine learning models take two types of data structures.

Tomaz Bratanic (19:17):

So one is, let's say, if you look at text, it's basically just a linked list that goes on. Every word can be a node, with just a "next" relationship between words, and that's how it builds up sentences or whatever. So it's basically just a single-line type of structure. And then the other one is like images. When you're looking at images, basically what you're dealing with is grid data structures. So you've got an X and a Y axis, and every pixel, or whatever, can be a node in the graph.

Tomaz Bratanic (19:59):

So that's kind of the current focus of machine learning, but more and more, what we are observing is that data structures are like... how can you describe nature? It's not so simple. And here is where networks come into play. With networks, basically, you can describe everything and anything you want.

Tomaz Bratanic (20:24):

You need enough imagination, but basically they don't have the limitations that those two approaches have. So you can use a network to describe any type of real-life scenario that you want, and then feed that network into, or use that network for, predictive analytics. So I feel the future is basically predictive analytics.
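
The point that text and images are just two restricted kinds of graph can be shown with a short sketch. It assumes a simple whitespace tokenizer and a four-neighbour pixel grid; the function names are hypothetical.

```python
def text_to_chain(sentence):
    """Treat each word as a node, linked to the following word by a NEXT edge."""
    words = sentence.split()
    edges = [(i, i + 1) for i in range(len(words) - 1)]
    return words, edges


def image_to_grid(width, height):
    """Treat each pixel (x, y) as a node, linked to its right and lower neighbour."""
    edges = []
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                edges.append(((x, y), (x + 1, y)))
            if y + 1 < height:
                edges.append(((x, y), (x, y + 1)))
    return edges


words, chain = text_to_chain("graphs are everywhere")
grid = image_to_grid(3, 2)
print(chain)      # every node has at most one successor
print(len(grid))  # number of neighbour links in a 3 x 2 pixel grid
```

In the chain every node has at most two neighbours, and in the grid at most four. A general network drops that restriction, which is why it can describe structures that neither a sequence nor a grid can.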

Lju Lazarevic (20:53):

Yeah. A very exciting space, indeed. So just a quick reminder to everybody, you can check out Tomaz's talk at NODES. He will be in the data science track, and we will have the schedule for NODES out shortly.

Lju Lazarevic (21:09):

We'll have the link to that in the show notes. First of all, thank you very much Tomaz for joining me today. And thank you so much for telling me about your history with graphs and what you're looking forward to in the future.

Tomaz Bratanic (21:24):

Yeah. Awesome. Happy to talk with you and we can do it again sometime.

Lju Lazarevic (21:30):

Oh, absolutely. I think we need to dig a little bit more into what were your first experiences with graph.

Tomaz Bratanic (21:38):

Awesome.

Lju Lazarevic (21:39):

Wonderful. Thank you very much. And we'll speak to you later.