Our guest is Paco Nathan, a Principal Developer Relations Engineer at Senzing.com (and listed as an “evil mad scientist” on LinkedIn). Paco works extensively within developer communities and tech events, sharing pointers to learning materials and industry analysis and connecting people across the field. He presents talks, workshops, and hands-on tutorials about entity-resolved knowledge graph practices and the related AI use cases downstream.
Speaker Resources:
Tools of the Month:
Announcements / News:
Jennifer Reif: Welcome back, Graph Enthusiasts to GraphStuff.FM, a podcast all about graphs and graph-related technologies. I'm your host, Jennifer Reif, and I am joined today by fellow advocate, Jason Koo.
Jason Koo: Hello, everyone
Jennifer Reif: And our special guest today, Paco Nathan, who is a principal developer relations engineer at Senzing.com, and listed as an evil mad scientist on LinkedIn.
Paco Nathan: Thank you very much. So glad to join you, Jennifer and Jason.
Jennifer Reif: So just a little bit of background on Paco. He works a lot within developer communities and tech events and tries to provide many pointers to learning materials, industry analysis and so on, and connect with many people. He presents talks, workshops, and hands-on tutorials about entity-resolved knowledge graph practices, and the related AI use cases which are downstream. So thank you so much for joining us, Paco. Do you want to start just a little bit with how you got into Neo4j?
Paco Nathan: Yeah, great. Well, just to back up and give history back to the dark ages: I started grad school in the early '80s in AI. While I was there, on a project at IBM Research, I had a lab partner who would just talk, talk, talk about neural networks, and she really got me interested. So in the mid-'80s I got involved in doing work on neural networks, but I was also already doing work in machine learning and natural language. For a long time nobody really cared about machine learning, so I went off and did network engineering and other kinds of data management, but eventually I got into graphs because of natural language, for natural language understanding.
And I've been involved in some open source projects that have been used a lot in that area. And earlier this year, I had friends at two companies, so Philip, CTO of Neo4j, and also Jeff Jonas, who's the founder of Senzing. And Philip and Jeff wanted to do something together and I was a person with a foot in both camps. And so, I signed up to do a tutorial and it was really fun. I got to work with ABK on the Neo side. We took data, mostly open data that was about businesses in Las Vegas. During the pandemic, there was the PPP loans that the government did, and then there was the forgiveness program for the loans. And there were some businesses, I'm not going to point any fingers or mention names, but there were some businesses who registered multiple times to get the loans.
And so, we were looking at how to use graphs to identify fraud, PPP fraud, really focusing on that one geo area because it made loading the data quicker and things like that. But even just within that area, I think we had about an 80,000-node graph, and by using Entity Resolution with Senzing, then organizing, visualizing, and drilling down in Neo, you could show right away: wow, there's a really big cluster over there, let's go find out what's going on. Oh my goodness, there's a business registered at a residential address or a mailbox, et cetera, and there are 50 businesses all registered right there. In some cases it was really egregious, where you'd see, "Yeah, yeah, we do veterinary clinic, but we also do massage therapy." And so, graphs are really useful for catching bad guys. That's a lot of where I'm at.
Jennifer Reif: Okay, cool.
Jason Koo: That's super exciting. Did you put that project... Sorry, I have three thoughts going at the same time. So is that project available publicly for folks to jump in and explore as well?
Paco Nathan: Yeah, you bet. I'll put it in the show notes. It actually ran on the Neo4j blog, and so we've got the code all up on GitHub. It takes a little over a half hour to run through start to finish, and we're showing how to do all of the steps. I was using the Graph Data Science library a whole lot, and doing the visualization inside Neo. So basically you get all your data prepared, then into the graph database, and then do your visualizations. And then, inside of Neo you can start to constrain what you're visualizing and start to slice the data and just really zero in.
The thing that I think is really startling with this, though, is when you look at loading up all these records of, here's businesses, here's PPP loans, here's how many employees are in each business and how many complaints they've had for employment, blah, blah, blah. It's just a big scattered mess. But once you start to connect it up and build a knowledge graph out of it, once you start to bring in sort of the connective tissue, then the real things you need to look at start to pop out right away. You can start to see those clusters in the graph.
Jason Koo: Nice. Was it difficult to get the initial data and compile it together?
Paco Nathan: That's a really interesting question. I'm just looking in the show notes here. What we did was we took data from SafeGraph. They do a commercial places directory of businesses all over the world, and they opened up a slice that's open to the public, just about Las Vegas, I guess. And then, we also had data from the federal government, so some from the Department of Labor and then some from the US Chamber of Commerce. Getting that in, I basically just put it into JSON and then loaded it in many batches into Neo4j. Getting that into shape, no problem. But you really do want to analyze your data on the way in for this.
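A minimal sketch of the kind of batched JSON load Paco describes, assuming the official neo4j Python driver; the connection details, file name, label, and property names are illustrative, not from the actual tutorial:

```python
import json
from neo4j import GraphDatabase  # official Neo4j Python driver

# Illustrative connection details; substitute your own.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_BATCH = """
UNWIND $rows AS row
MERGE (b:Business {record_id: row.record_id})
SET b.name = row.name,
    b.address = row.address,
    b.source = row.source
"""

def write_batch(tx, rows):
    # Parameterized UNWIND keeps each transaction to one round trip per batch.
    tx.run(LOAD_BATCH, rows=rows).consume()

def load_in_batches(records, batch_size=1000):
    """Write records to Neo4j in fixed-size batches."""
    with driver.session() as session:
        for i in range(0, len(records), batch_size):
            session.execute_write(write_batch, records[i:i + batch_size])

# Hypothetical input file: one JSON object per line (PPP loans, registrations, etc.)
with open("las_vegas_businesses.jsonl") as f:
    records = [json.loads(line) for line in f]

load_in_batches(records)
driver.close()
```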
When you're going to be building a graph, it's probably best to do a little bit of descriptive stats about all the different columns: how much do certain items repeat? Are there any data quality problems, like something that's supposed to be a unique identifier but is showing up many times? So you go through and you do some work there. But then with Senzing, what's going on is we take PII features, especially connecting features. We need to have two or more features across two or more data sets, and then we can make determinations on which records can be merged into consistent entities.
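The kind of quick profiling pass described here might look like the following sketch; pandas is assumed, and the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical source file and column names.
df = pd.read_json("las_vegas_businesses.jsonl", lines=True)

# Basic descriptive stats for every column.
print(df.describe(include="all"))

# How often do certain values repeat?
print(df["business_name"].value_counts().head(20))

# Data quality check: a column that should be a unique identifier but isn't.
dupes = df[df.duplicated(subset=["registration_id"], keep=False)]
print(f"{len(dupes)} rows share a registration_id that should be unique")
```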
And the entities may be people, they may be businesses, they may be maritime vessels, but these end up being nodes, and moreover, they're an overlay. So you take the evidence you have and you put it into Neo, and those are your data records, and then you peel off the PII features and run them through Senzing, and you come back with this kind of overlay that organizes it. And it says, "Here's an entity, and under this entity here are connections to five different data records in the graph. Here's Bob R. Smith, Jr. at 101 Main in Las Vegas. But then, there's another record that's like Bob Smith in Las Vegas. We don't know a whole bunch, but we think it might be the same person."
And so, if you're doing an investigation, those probabilistic links are really important, like, "Hey, we don't have enough evidence put together to make a decision right now, but here's one you might want to check out." And so, what you do is you end up getting this overlay of nodes, relations, and properties that you build out on top of your graph to organize it, and you've still got the evidence, the raw data, underneath. And then, you can start doing analytics. Looking at centrality is a really good thing because it figures out which nodes are more connected than others. Other things too, I mean, if you're looking at other use cases, maybe you're doing rapid movement of funds or other types of anti-money laundering, and there are definitely graph patterns you want to look for. It's a nice blend of using Entity Resolution and knowledge graphs.
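As one concrete example of that centrality step, here is a sketch using the Graph Data Science library's Cypher procedures through the Python driver; the projection name, node labels, and relationship types are assumptions about how the overlay might be modeled, not the tutorial's actual schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project the resolved-entity overlay into an in-memory GDS graph.
    # Labels and relationship types here are illustrative.
    session.run("""
        CALL gds.graph.project(
          'resolved',
          ['Entity', 'Record'],
          ['RESOLVES', 'RELATED_TO']
        )
    """).consume()

    # Degree centrality: which entities are unusually well connected?
    result = session.run("""
        CALL gds.degree.stream('resolved')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, score
        ORDER BY score DESC
        LIMIT 10
    """)
    for row in result:
        print(row["name"], row["score"])

driver.close()
```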
Jason Koo: Right. And so, if other people want to replicate the work maybe in different cities and stuff, how do you suggest they get started with this? Do they start with the source data? Should they work with the taxonomy first, the ontology?
Paco Nathan: Yeah, that's a great point. So in what I do in DevRel at Senzing in the knowledge graph practice, I am working with other consultants, other experts in this field. We're developing open source tutorials. A couple of people I'm working with right now: Clair Sullivan, who used to be at Neo, and also Louis Guitton, who's also done work with Neo4j, a lot of open source work. And we are building out tutorials showing different kinds of data, so maybe different cities or other kinds of use cases. Clair took the one that I had done in Las Vegas, and then she built on top of it using a LangChain integration with Neo. And so, she built out a chat bot so you can explore fraud patterns in this case.
But I think the larger thing there is there really are a lot of open data providers. So there's OpenSanctions that has sanctions watch lists of known people who've done some egregious things in the past and we're going to watch them in the future, or maybe businesses that have been caught engaging in illegal practices, or people who are just likely suspects like they're in a position where they could be bribed and a lot of people in their position are getting bribed, so let's just keep track. So OpenSanctions is really cool. It's out of Berlin, Friedrich Lindenberg.
But then, there's also the connective data like Open Ownership, Stephen Abbott Pugh out of Open Ownership in London, and that has a lot more links. So I can say, "Here's a person, they're a director of the following companies. These companies have owners elsewhere, on and on." That's really how you build up the relationships in the graph. And then, sometimes if you're lucky, you can get ahold of event data. So for instance, there was a PPP loan. I got that from the federal government. Or in the case of, have you ever seen a Netflix movie called The Laundromat?
Jennifer Reif: No, I think I've seen it advertised, but-
Paco Nathan: It's pretty cool. I mean, it's funny. It's humorous. It's actually a really serious topic. Have you ever heard of The Panama Papers? Actually, yeah, a really good use case. So the idea is that there's this case, we still don't know all the facts, it happened about 12 years ago, but the deal was that about $3 billion is known to have been moved out by illegal means through money laundering. And there was a whole bunch of, well, there was illegal Russian weapons trade. There was a whole bunch of people in the country who ended up buying condos in Miami or buying luxury items and stuff. There were a whole bunch of EU officials who were bribed, and probably nobody would've known a whole lot except for somebody at the bank. Somebody at Danske Bank in Estonia leaked 17,000 records of money transfers.
And so, this is actually showing the crime, because you're seeing what's getting transferred from whom to whom. So you have this idea that you have lists of data like OpenSanctions that provide risk, that talk about who has the potential to be bribed or who's known to have done something wrong. And then, you have the linked data, which gives you the relations, and then maybe you have event data that says, "Oh, actually here was a money transfer." And if you can take those three, you can build a graph and show, even though the bad guys are trying to hide everything behind a network, like the money goes through maybe five levels of indirection with offshore accounts and different corporations, and then somebody buys a condo in Manhattan for cash, you can still trace back. And so, leveraging a graph to be able to find those connections and identify that maybe in the EU you're only allowed to transfer so much money from one entity to another within a certain time period before you file certain paperwork, or else you're probably doing money laundering.
In this case, what they were doing was splitting up the transactions into smaller pieces, funneling them through a lot of offshore companies, and then eventually they hit the target. Long story short, there's been a lot of investigative journalism about it. There've been books written about it, there's a movie about it, et cetera, and 12 years later, absolutely nobody's been arrested. Even one of the diplomats, one of the politicos in Italy, was like, "Oh yeah, I did the bribes. I took the bribes. I'm never going to go to jail." So it's really interesting.
But in this area, you can get a lot of data from open data sources. Sometimes you can get leaked data from a bank or a law firm where people know where the bodies are buried and they're just trying to do the right thing. There are a couple of organizations like OCCRP and ICIJ which take leaked data and then put it into a format so that we can build graphs, so that investigative journalists and officials, government agencies and whatnot, can work with it. So it's a long answer, but I hope it gives a little bit of flavor of a lot of what's going on there. That's a lot of our customer base too.
Jason Koo: Nice. No, thank you. That was a very comprehensive soup to nuts overview, especially talking about the Panama Papers, which was my lead into Neo4j. It's a great fascinating story, right?
Paco Nathan: Oh, yeah.
Jason Koo: It's so much data. Yeah, there was a great book, I think written by two of the journalists or two of the data center folks. I forget the name of it, but it was a great audiobook as well too.
Paco Nathan: Yeah, no, it's really interesting. I've been doing a series of talks called Catching Bad Guys using open data and open models, talking about how to really build graphs and then use what's emerging in terms of AI tools downstream from knowledge graph, whether you're building up GraphRAG or you're working building up agents or you're doing other types of analytics and graph machine learning, you really got to get the graph data first and got to get it right.
And in that talk, I'm showing, yeah, there are like five books that we're tracking. There's one called Moneyland from Oliver Bullough that I highly recommend. He was one of the journalists tracking all this stuff. He had gotten out of, I think, Cambridge right around the time that the Soviet Union collapsed, and so he immediately went to Eastern Europe and started traveling around and stayed embedded there as a reporter for many years. And it's just really fascinating to see what ICIJ has done. Panama Papers, but they've also done Paradise Papers, Bahama Papers. There's one about Cyprus, it's very recent. There's a lot of these different studies, and again, it comes back to someone who decided to do the right thing and leaked the records.
Jason Koo: You mentioned the movie Laundromat earlier. Is that a similar story to Panama Papers?
Paco Nathan: Yeah, it is. I mean, what they did was they took a lot of different scenarios of what's going on. So if you read Autocracy or you read Moneyland or any of these books that go into a lot of detail, The Laundromat is sort of the condensed Cliff's Notes, where they follow a few people and everything bad happens to these people and they go off and investigate it personally, flying around the world. And it's like that has never happened to any individual except for maybe Oliver Bullough or Bill Browder. But the fact is that it gives you vignettes of what's really happening. And I mentioned the Azerbaijani Laundromat, nobody's ever gone to jail. We have 17,000 leaked records, and we know that almost $3 billion went through money laundering. That may not be all of it. I mean, there was probably a lot more.
And so, when you look at dark money moving around the world and how much it's adjacent to illegal weapons, human trafficking, even illegal fishing and illegal lumbering is very closely adjacent. It's often very much entangled, but certainly a lot of bribery. And frankly, why is it so expensive to live in San Francisco, New York, Miami, London? Well, a lot of the illegal money is flooding in and paying cash for prime property. So I mean, there's just a lot of things going on in the world. Also, illegal political influence in various countries where there are attempts to overthrow the government, that kind of thing. You look at the impact of what's happening with dark money, and it's really incredible.
On the other side, there's been a lot of movement toward making the data more open. So for instance, after the 2009 global financial crisis, there was legislation that went out worldwide across governments: if you're a corporate entity and you're engaging in certain types of financial trades, you have to have a registration for it.
You have to have a unique identifier. There's an organization called GLEIF, which manages global LEIs, legal entity identifiers, and you absolutely must have one if you're doing those kinds of derivatives that led to the 2009 crisis. Now, there are other types of illegal fin crime that you could be engaged in and, no problem, go right ahead, we don't need to know. But that's one case at least where some parts of this are tied down in open data. And there's something else that's rolling out more recently called UBO, which is ultimate beneficial owner. So if there is a company, you have to disclose who the owners are. In California, actually, this just went in. I have a company, the consulting firm I was running before I joined Senzing, and it's now the law: if you don't disclose its ownership as of this tax year, it's a felony.
And so, understanding the chain of ownership is really important. That hasn't been transparent before, but now it's becoming transparent, and UBO data collection protocols are going out across the world. Different countries are doing it, some more or less. The EU is pretty good about it, but then Malta is like, "Oh no, we're not going to disclose that. We'll crack down on auditing, but we're not going to disclose who owns what." So it's still kind of a haphazard landscape, but that's one thing that the people behind Open Ownership have really been pushing: UBO standards across the world, having open standards for the data, but also pushing the policy and really going out and doing DevRel for government policy, if you will.
Jason Koo: Which is great. I think we need more of that, right?
Paco Nathan: Yeah, absolutely. I mean, there's really bad stuff. Have you heard of something called speculative libel?
Jason Koo: No. What is that?
Paco Nathan: Oliver Bullough talks about it in Moneyland, and I apologize, I'm not an attorney, so I hope I don't mangle this. But the idea is that in the UK there are laws where, if you are investigating some criminal behavior and you have data and you're going after the bad guys, if they catch wind of it, they can take you to court beforehand and you have to defend yourself. And those kinds of defenses can be expensive. Take the case of Bill Browder, a US entrepreneur who was in Russia and became quite a critic of the Kremlin. Browder is a billionaire. He could afford to drop, whatever, half a million dollars to defend a case. But the idea is that oligarchs who are moving out into London have recognized that these laws are on the books. So if anybody's going to try to investigate them, they'll just put up lawsuits against them.
And in some cases, they've also gone out to small island nations and said, "Here's $5 million for your cancer research institute." "Oh, thank you very much. You've made me an honorary diplomat." So now, with diplomatic status, if anybody takes you to court in the UK, the Crown will go up to the judges and say, "No, we can't do this. They've got diplomatic status. If we charge them criminally, then we're going to get killed on tariffs from Belgium. So just reverse the decision." So between speculative libel and some of the diplomatic community, it gets really dicey, because people are trying to do the right thing. You might even end up going to jail just because you're trying to do the right thing.
Jason Koo: Is that a recent law, or is that just a recent reinterpretation of existing law?
Paco Nathan: No, I think it's a recent reinterpretation. I believe there are laws on the books that date way back. Back in the day, you couldn't say something bad about the king; that would be dangerous.
Jason Koo: If it's okay to transition. Now, you've got a talk coming up for NODES next week, and you talked about a lot of entity stuff, but can you give the audience more of a preview of what you'll be covering in that talk?
Paco Nathan: You bet, you bet. So we're going to talk about Entity Resolution. It's a hard problem and it's so important in graphs, especially in these regulated environments, whether you're talking about fin crime or sanctions or any kind of case where there's some behavior that you have to spot. And usually, there are people who are trying to hide behind a network. They're trying to hide because they've got a lot of offshore corps. So the problem is, if you just load up all your data and try to connect it, oh, that's great, if you're working in one language and all the data is clean and you have unique identifiers for every node. Awesome, that's great, knock yourselves out. If you have a marketing email list and all the records are identified by email addresses, doing some kind of deduplication on that is easy; go for it.
But the thing is, if you have a name of a person and it's in Arabic, and in Arabic, there are cultural conventions like the name changes because you may mention your parentage, your father, the name changes because you may have made a pilgrimage to Mecca. So then the word haj comes in. And there's a lot of ways to abbreviate, but then there's a lot of ways to transliterate. So you might have a data set that has somebody's name in Arabic, but then it's transliterated into Italian, and there might be five different ways to transliterate into Italian. Trying to do string comparisons isn't going to work.
And when you look across the world at these kinds of passport control and voter registration and a lot of things, you look at a world where there's a lot of dirty data and there's a lot of bad actors trying to hide behind networks and you're trying to work with it, Entity Resolution becomes very difficult. There's a lot of edge cases, and the edge cases are actually where the real big problems are.
So for instance, we do work where we can handle data that's in European languages, but then also handle data that is in Korean or Mandarin or Russian, Cyrillic or Arabic or these days even like Burmese and Khmer. And so, if there's an agency in Singapore and they're having to look at maritime traffic where there might be weapons traffic coming through right offshore and the manifests are written in all these different languages and all these different transliterations, how do you make sense of it? Or if you're familiar, have y'all ever been to Singapore, by the way? It's a really cool area because there's this confluence of all these cultures, but you try to find addresses and it's nerve-racking.
And so, you look at addresses in Singapore and the name will be partly a business name, something that has more of an Indian background, and then the street name that has more of a Chinese background. And you look at the abbreviations, and of course Singapore has super high density. So you go into a mall and there might be hundreds of businesses at that street address, but they all have a different number. And so, you try and use things like string distance, string edit distance to compare records, and it just falls apart. There are so many cases where there'll be hundreds of businesses that are basically one or two letters off, and then there'll be a whole bunch of addresses that are exactly the same, but they're completely different representations.
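To make that point concrete, here is a small sketch using only Python's standard library; the addresses are invented, and the scores simply illustrate how near-identical strings can be different businesses while very differently written strings can be the same place:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], normalized for case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two different businesses, one character apart in the unit number.
a1 = "Blk 123 Ang Mo Kio Ave 3, #02-15"
a2 = "Blk 123 Ang Mo Kio Ave 3, #02-16"

# The same place, written with different conventions and abbreviations.
b1 = "101 Upper Cross St, #05-32 People's Park Centre"
b2 = "People's Park Centre, 101 Upper Cross Street 05-32"

print(similarity(a1, a2))  # very high score, yet different businesses
print(similarity(b1, b2))  # much lower score, yet the same address
```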
So trying to show these kinds of use cases, why Entity Resolution is hard: you have to understand the cultural parts of it. You have to understand the edge cases. I work on a team where most of the core engineering people have been together for 20 years or more, and our technology gets used for the majority of voter registration in the US. It gets used for passport control and counterterrorism and a lot of fin crime work. And these people have been handling these kinds of edge cases for a long time. By the way, another movie, did you ever see the one about math grad students at MIT who went to Vegas and were doing this scam where they were card counting at the tables?
Jason Koo: Yes.
Paco Nathan: The team I worked with, they were actually part of the "bad guys" in it; they were part of the technology that caught them.
Jason Koo: Oh, wow.
Paco Nathan: And so, that's taking it back a few years. So we're going to talk about what we can do with data to leverage Entity Resolution, and this idea of how you generate the graph. And a lot of it is in contrast to the larger dialogue that's going on right now. There are a lot of cases that say, "Hey, just put all of your data through an LLM, the LLM will generate your graph, and then you'll do GraphRAG downstream from that."
And I mean, that's cool. You can do really cool demos with that. You can do really good applications with that. But if you're in a regulated environment where you're developing probable-cause graphs that have to go in front of a judge to be able to get a warrant, or maybe a SWAT team is being deployed, or maybe you're trying to get an indictment or other things, these are really big problems and they require a lot of accountability, because if you don't have the evidence, nothing's going to happen. In these cases, you can't just throw all the data into an LLM and say, "Okay, we're going to take whatever we get. OpenAI, knock yourself out."
Instead, you really have to be mindful about how you're handling the evidence. And so, the idea that we present is: you start out with maybe some ontology or taxonomy that you're required to use, like there's the Follow the Money ontology for a lot of fin crime, and there are various NIST vocabularies that get used in some of this area. So really understand what kind of schema you have to use because of the use case. And then, take your structured data and run entity... Well, load up your structured data and then peel off your PII features. Run Entity Resolution to figure out how to merge, and then identify the entities and the relations and the properties about them, and use that as a backbone to build your graph.
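A rough sketch of that overlay/backbone idea: take Entity Resolution output, shown here as hypothetical JSON rather than Senzing's actual response format, and write entity nodes on top of existing data-record nodes, keeping both the resolved links and the lower-confidence "possible match" links:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical ER output: each resolved entity lists the data records it merges,
# plus lower-confidence "possible match" links worth a human look.
resolved = [
    {
        "entity_id": "E-1001",
        "name": "Bob R. Smith, Jr.",
        "records": ["dol:483", "ppp:9921", "chamber:77"],
        "possible_matches": ["ppp:10442"],
    },
]

MERGE_ENTITY = """
MERGE (e:Entity {entity_id: $entity_id})
SET e.name = $name
WITH e
UNWIND $records AS rid
MATCH (r:Record {record_id: rid})
MERGE (e)-[:RESOLVES]->(r)
WITH DISTINCT e
UNWIND $possible AS pid
MATCH (p:Record {record_id: pid})
MERGE (e)-[:POSSIBLE_MATCH]->(p)
"""

def write_entity(tx, ent):
    tx.run(
        MERGE_ENTITY,
        entity_id=ent["entity_id"],
        name=ent["name"],
        records=ent["records"],
        possible=ent["possible_matches"],
    ).consume()

with driver.session() as session:
    for ent in resolved:
        session.execute_write(write_entity, ent)
driver.close()
```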
And then secondly, what we find is a lot of these use cases, there's a Pareto rule. Maybe an organization has 20% structured data, but 80% unstructured data. That's pretty typical. So you're building these graphs. You might have a lot of data about sanctions and ownership and whatnot, but you probably want to bring in maybe some news stories or some log files or something unstructured. Maybe the shipping manifests. And this is a common problem, whether we're talking about investigative graphs or even trying to untangle supply networks. You might know that your shipping container is going from point A to point B, but there might be something happening in that part of the world where the ship is currently that's going to cause it to be derailed for two months. That's a thing that happens a lot like in supply networks.
So how do you build up the graph like we've mentioned from schema and then structured data and Entity Resolution. Now, the next thing is you want to bring in unstructured data. What we'll show is how to build up a really high quality pipeline where you're using state-of-the-art models for the NLP work, whether you're doing your parse, doing your Named Entity Recognition, doing your relation extraction, but most importantly, how do you do the entity linking? Because that's where you take the nodes and relations that are extracted from your unstructured data. You want to bring them into your graph and be very aware of the context of your domain.
If I see the abbreviation NLP, that might mean natural language processing if I'm talking about machine learning. It might mean neuro-linguistic programming if I'm talking about psychology. Given my domain, what are the words? Even in a corporate sense, if you use the word consideration, that means something very different in HR versus contract law. And if you can't really distinguish the meanings from your unstructured data, then when you go to link stuff into your graph, you get in really big trouble. What we're doing is taking the results out of Entity Resolution after you've run on your structured data with your schema, you take the results from Entity Resolution, you use that to customize an entity linker.
And so, now, you can do domain-aware entity linking and be using state-of-the-art models like GLiNER and others to do your NER, Named Entity Recognition, and you can start to pull the pieces from the structured part and the unstructured part together. Once you've got that graph, then your next step is to go into GraphRAG or building up systems of agents or maybe doing some type of graph analytics to really identify common patterns. How can we do the AI apps downstream? So that's the gist of the talk. Sorry, long-winded but I hope that gives some flavor for it.
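As a rough sketch of the extraction side of such a pipeline, here is zero-shot NER with GLiNER, assuming the published gliner package and checkpoint name; relation extraction and entity linking would follow as further steps, and the text and labels are invented for illustration:

```python
from gliner import GLiNER  # zero-shot NER; assumes the `gliner` package is installed

# Model name is an assumption; substitute whichever GLiNER checkpoint you use.
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = (
    "Acme Holdings Ltd transferred funds to Vega Imports LLC, "
    "a company registered at 101 Main St, Las Vegas."
)

# Labels come from your domain schema, not from a fixed tag set.
labels = ["company", "person", "address", "financial transaction"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"], round(ent["score"], 2))
```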
Jason Koo: No, that's perfect. The comments you made about starting with structured data as the backbone is a topic you went in-depth on in your Graph Power Hour, I think from last month.
Paco Nathan: Yeah.
Jason Koo: Which is a great episode.
Paco Nathan: Awesome, thanks.
Jason Koo: Yeah, totally right, so I'll put that in the notes as well. I definitely recommend anyone check that out because you also covered the various GraphRAG techniques that are available now, different design patterns. So actually, just real quickly from that, is there a best or most promising GraphRAG design pattern that interests you the most?
Paco Nathan: So one of the things we were saying is that GraphRAG is a pretty big word. The "graph" in GraphRAG gets used in, I think we've counted, eight different interpretations. So you might be talking about building a graph out of the embeddings of the text chunks, which is one thing you can do. You might talk about having built a graph and using the prompt to generate a Cypher query. You might be talking about doing a lexical graph, on and on and on. We've identified, like I said, multiple meanings for the word graph if you look at GraphRAG literature. And out of all those, I really use a lot of the approach where we'll take the text chunks from your documents and put them through an embedding model, and you put them into your vector store. Great. We'll also parse those. And then, from the parse, we build up a text graph, which is a nice structured way to leverage state-of-the-art models for NER, relation extraction, entity linking.
And then, once you've got this staging area of your text graph, now you can run maybe some graph analytics like centrality measures to see what's most connected, what's probably the thing that's really being talked about in the text chunk. And then, from there, use entity linking to link back into a graph where you're collecting everything across all of your documents. That's the approach that I'm using most, which is like lexical graph, but lexical graph plus-plus. But I will give a shout-out: Tomaz Bratanic did some posts a couple of weeks ago about something called GraphReader, I think, which is really interesting, where you're using the graph to come up with basically a notebook of not just facts, but statements, like logical statements about things that have been extracted from your documents.
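A minimal sketch of the "lexical graph plus-plus" shape described above: text chunks stored with their embeddings, plus MENTIONS edges to the entities extracted and linked from each chunk. The embedding model, labels, and sample data are assumptions, not the actual pipeline:

```python
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer  # assumed embedding model

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical chunks, already tagged with the entity ids the linker resolved.
chunks = [
    {
        "id": "doc1-0",
        "text": "Acme Holdings Ltd transferred funds to Vega Imports LLC.",
        "entities": ["E-1001", "E-2042"],
    },
]

WRITE_CHUNK = """
MERGE (c:Chunk {chunk_id: $id})
SET c.text = $text, c.embedding = $embedding
WITH c
UNWIND $entities AS eid
MATCH (e:Entity {entity_id: eid})
MERGE (c)-[:MENTIONS]->(e)
"""

def write_chunk(tx, chunk, vector):
    tx.run(
        WRITE_CHUNK,
        id=chunk["id"],
        text=chunk["text"],
        embedding=vector,
        entities=chunk["entities"],
    ).consume()

with driver.session() as session:
    for ch in chunks:
        vector = embedder.encode(ch["text"]).tolist()
        session.execute_write(write_chunk, ch, vector)
driver.close()
```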
And once you've got this notebook, then you can use an AI application, some type of agent, to treat the notebook of these statements as hypotheses and start chaining them together in ways where you can actually say something about it. So I'll give a shout-out about GraphReader and the integration with Neo4j, and I think it was LangChain. Because I'm old, I have white hair. I studied AI at Stanford in the early '80s. And back then, there was all this work on expert systems.
You may have heard some of that back there. There was a project from 1972 to 1979 called Hearsay-II. It was an expert system that did work on discovery with blackboard architectures. It's really fascinating to me, because when you look at what Tomaz was showing with GraphReader today to use with LLMs, if you just change out a couple of phrases, it's very similar to these blackboard architectures that we had back in the '70s and '80s. And there were a lot of articles back then by Barbara Hayes-Roth and Penny Nii and some of the other AI experts in the '80s who were really showing what you could do.
My graduate advisor was Doug Lenat at Stanford, and he did a lot with blackboard architectures with the Eurisko project and others. And arguably, what we were working on back then were generative approaches using blackboard architectures. And now, here we're 40 years later doing generative approaches with language models also, but much better language models, I'll say that, much better compute. So I won't say that it's all reduplicated, but it's like we're revisiting parts of the past and I'm really glad to see that.
Jason Koo: That's great, that we can take things that started before and come back to them and go, "Oh, wait, we can now combine these in new ways, or now we have the technology to really leverage some of the initial ideas they came up with way back early on."
Jennifer Reif: And I think it's a good reminder too that this stuff, most of it, at least the foundations of it, is much older than the AI wave of the last year or two that we give credit to. All of this study and research actually happened potentially decades before, so this is not necessarily net new. It's just that we're applying new technologies to existing problems that now we can help solve in new ways.
Paco Nathan: Yeah, so very well said. When you look at reinforcement learning, there's incredible things going on there, but it's really based off of optimal control theory, which is from the 1950s, building airframes that are resilient. And the difference is now we have deep learning so we can learn and build policies that are enormous, that have billions or trillions of points in them. Whereas before, optimal control theory was pretty constrained by what you can fit on the hardware that was in the aircraft.
But to that point, I'll shout out, I think in the graph space, I see something really interesting happening that it's not necessarily all about LLMs. I mean, language models are very important, don't get me wrong. And I come from a language background, I love this. But I'll shout out a friend of mine, Urbashi Mitra, she's at USC, University of Southern California, and she leads work in the optimization lab there. Really interesting work. I saw her present at a conference at Stanford a couple months ago where they take graphs and then they build out causal relationships in the graph and study causality. And they look at sub graphs and they're using reinforcement learning to optimize the causality. And then, you can start to explore like I want to throw counter-factuals or I want to change some sort of intervention. So I start to play with the graph to figure out what's my policy really doing? What is my plan? How could I shoot this big corporate plan for a merger and acquisition? How could I shoot it apart by using some AI technology?
And it's not all about LLMs, it's about bringing in stuff like reinforcement learning and causal graphs and other things. And the takeaways there are, number one, graphs are at the center of this, and number two, we're reusing a lot of math that comes from decades ago. But then, number three, the cool thing about it, what Urbashi's team is showing, is they're using it to understand the interactions of systems of multiple agents, like AI agents. So can we build a graph to describe, in a multi-agent environment, what are the outcomes? Where are the biases? How can we try to steer this? I'm really hopeful about seeing a lot of these things come out from the wings and really having graphs be the centerpiece for integration.
Jennifer Reif: Graphs connecting decades of technology together.
Paco Nathan: Well said. Well said.
Jason Koo: It's going to be a big graph. Well, Jen, I feel like our show notes for this episode's going to be incredibly long. We're going to have lots and lots of links. Riffing off of that, we'll add a few more links. If we could talk about what type of technologies everyone's been playing with this last month and would like to recommend to the community. Paco, would you like to share first?
Paco Nathan: You bet. There's a new tutorial that we're just finishing up now. The code is up on GitHub. In fact, there is a new library, and the first release is on PyPI. This is about using the results of Entity Resolution to customize entity linking so that you can be using the context of your domain. So like we said, start with some structured data and do your Entity Resolution. You can understand the definitions. You can almost think of it as building a thesaurus.
And then, once you do that, now you can go out and use tools like spaCy to go out and parse, here's your 100 million documents for unstructured data, but we're using the results of Entity Resolution to steer the entity linker to make the right choices in context. And there's some interesting ways that this is using embedding models and community summaries. So we're really leveraging the graph technology to make NLP do the right thing. Louis Guitton has been building this out and we'll have a link here for spaCy for the entity linker, and we'll have an article about that coming up pretty shortly.
Jason Koo: Nice. For that package, what's the input? Do people just put in the unstructured data, or do they have to run some pre-processing function and then that output gets passed in?
Paco Nathan: Yeah, no, we try to make it simple. The idea is with Senzing, with Entity Resolution, you bring your PII features in and you make a call with some JSON. And then, the results come back as JSON. So you end up with this JSON description for your entities and the links among them and different probabilities and whatnot. We take that as input and then build up inside of spaCy, there's what spaCy calls a knowledge base and their entity linker. It builds off of that.
Basically, you import this file, and you can also bring in other definitions that you have in your graph. We've made a way so that, if you have a pre-existing graph and you want to leverage the schema and the links that are in that graph, great, you bring it down into spaCy and now you've got a spaCy pipeline component. When you run spaCy to parse your text chunks, you get your parses out of that, and you look at the named entities. And so, now when you're picking up different noun phrases that are supposed to be entities, if they make sense in context, they'll be linked to your graph. You'll have the unique identifier directly into your graph. So what this means is we can do parsing that's much smarter, so that we can do the right kind of integration in a knowledge graph.
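A rough sketch of seeding a spaCy knowledge base from Entity Resolution output, which is the general mechanism described here rather than the actual library's code; the entity ids, aliases, vectors, and prior probabilities are invented, and this skips training the linker itself, showing only candidate lookup:

```python
import spacy
from spacy.kb import InMemoryLookupKB  # available in spaCy 3.5+

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=4)

# Hypothetical ER output: resolved entities and the name variants seen in the records.
er_entities = [
    {
        "id": "E-1001",
        "aliases": ["Bob R. Smith, Jr.", "Bob Smith"],
        "vector": [0.1, 0.2, 0.3, 0.4],
    },
]

for ent in er_entities:
    kb.add_entity(entity=ent["id"], freq=1, entity_vector=ent["vector"])
    for alias in ent["aliases"]:
        # Prior probability that this surface form refers to this entity.
        kb.add_alias(alias=alias, entities=[ent["id"]], probabilities=[0.9])

# Domain-aware candidate lookup for a mention found in a parsed text chunk.
for cand in kb.get_alias_candidates("Bob Smith"):
    print(cand.entity_)  # -> E-1001
```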
Jason Koo: Nice. Jen, did you want to go next?
Jennifer Reif: Sure. I've been working on a graph academy course, hopefully that will be coming soon. But on Spring Data Neo4j, and of course, as with anything, the more you expand your learning a little bit or try to teach it to someone else, the more you learn yourself. So I've come across some things, especially in the delete and update data inside the database using Spring Data Neo4j that I bumped against some rails and things like that, trying to figure out how it works and what you do with it. So just some cool things there.
I hope to have some content a little bit more detailed about what I've learned and where I bumped into walls, but always really great project there, doing some really great things and the people working on that are really fantastic. There's joint work going on the Neo4j side as well as on the Spring side, and it's just a really great partnership there. So if you're in the Java ecosystem, Spring ecosystem, definitely check that project out. And again, I hope to have some more detail coming soon about some of the things that I've learned lately on it.
Jason Koo: Nice. My favorite tool is also a package, like Paco's. It's the neo4j-graphrag Python package that came out earlier in the month. And specifically, I've been looking at the vector database retrievers. These external retrievers allow you to combine Neo4j with other vector databases. The first three that are available are Weaviate, Pinecone, and Qdrant. If you're already using one of those vector databases, you can basically put in your collection and your credentials, and you can have the vector database store the embeddings and Neo4j handle all the knowledge graph and graph data.
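A sketch of what that pairing can look like with the Weaviate retriever; the class and parameter names here follow the neo4j-graphrag documentation as best recalled and should be treated as assumptions to verify against the package docs for your version, and the collection and property names are invented:

```python
import neo4j
import weaviate
from neo4j_graphrag.retrievers import WeaviateNeo4jRetriever

# Connection details are illustrative.
neo4j_driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
weaviate_client = weaviate.connect_to_local()

# Assumes the Weaviate collection is configured with its own vectorizer.
retriever = WeaviateNeo4jRetriever(
    driver=neo4j_driver,
    client=weaviate_client,
    collection="Chunks",               # Weaviate collection holding the embeddings
    id_property_external="neo4j_id",   # property in Weaviate that points back to Neo4j
    id_property_neo4j="chunk_id",      # matching property on the Neo4j nodes
)

# Vector search happens in Weaviate; the matching graph records come back from Neo4j.
results = retriever.search(query_text="suspicious PPP loan clusters", top_k=5)
for item in results.items:
    print(item.content)
```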
Paco Nathan: It looks really good.
Jason Koo: Yeah, it's a package with quite a lot of features, so it's taken a while for me to unwrap, but for anyone wanting to get in who's working in a Python environment, it's a good starting position. So, November. It seems like we're still in prime conference and event season. Paco, your calendar looks like, I don't know if you have a day where you can sleep. Are there some coming up that you'd like to tell the audience about?
Paco Nathan: Thank you. Yeah, I'd love to. Yeah, I got off the plane from Spain. I was in A Coruña at a conference. I got off the plane last night, so I'm back in California. Yeah, really cool stuff. So next week's going to be busy. We've got NODES. I'm really, really looking forward to that and have so many friends also speaking. At the same time, I'm going to be at MLOps World Summit in Austin, Texas, and also a lot of friends in this space who are presenting there. So keeps me busy.
Jennifer Reif: I was going to say, I think we have colleagues from Neo4j there as well.
Paco Nathan: Oh, yeah, absolutely. Yeah. And just this whole confluence of data engineering and MLOps and cool applications that are built off of LLMs, et cetera, it's all connected. And then, after next week, it gets busy. Also, what am I doing? Oh, we have ISWC, which is a conference about knowledge graphs. It's more academic, but it'll be in Baltimore on the 12th, and we're doing a workshop about software engineering and knowledge graphs. And then Senzing, we have our user conference in Western Virginia, near there, on the 14th. And then, on November 18th, I'm going to be at the Cleveland Mega Meetup. There's a lot of finance and healthcare in Cleveland, and so we're going to bring people together from a lot of big companies and talk about using knowledge graphs and using Entity Resolution.
And then, London. After Thanksgiving, I'm going to be in UK for half of December, but it gets really busy because Linkurious, which partners with Neo4j, Linkurious is coming over from Paris and they're holding Linkurious days, which I think is going to be held at Lloyds Bank, and that will be on December 3rd. I'm really looking forward to a full day of talks there about like we're saying, entity resolved knowledge graphs used in finance. Then, I think on December 10th, my colleague and friend James Phare at Neural Alpha is doing a meetup about ESG and using graphs for ESG, ethics, et cetera in terms of business practices. That'll be on December 10th.
And then, on December 11th through 13th, we're doing the Connected Data World Conference, Connected Data London, and I believe that Neo4j is a sponsor there. Senzing is a sponsor. I'm one of the co-chairs, so I'll be teaching a masterclass on the 11th about entity-resolved knowledge graphs, catching bad guys using open data. And I'll also be a host for some of the technology sessions. And then, on the third day, we're going to have more of an unconference format, so I'll be hosting part of that as well, around domain knowledge and semantic overlay ideas. But I'm super thrilled about all the people coming together in London. I'll also be able to catch up with, shout-out to ABK, who I think is based in London, pretty sure. Maybe not every day, but at least I'll be around the area to catch up with friends.
Jason Koo: Nice. That's fantastic. Jen, I don't know if you want to go next.
Jennifer Reif: Yeah, sure. So I got a couple of virtual things next week. One being NODES, of course, on November 7th. But just before that I'm participating in XtremeJ Conference. It's a virtual event as well. So I'll be doing the Java portion, which is the J at the end. They have a JavaScript and a couple others that they do as well. But this will be the Java day, I guess, next week. So I've got that. And then, a couple of days later, I've got the NODES event. And then towards, I guess just before the US Thanksgiving holiday, I've got the Chicago Java User Group or the CJUG, and I'm going to be presenting there. The event's not up on the site just yet, but it should be here in the next week or so. If you're in the area in the Midwest, I'll be there just before the holiday season starts and then things get a little bit quiet for me. But how about you, Jason?
Jason Koo: My schedule is definitely the lightest of the three of us. So next week, NODES, of course. And then, the week after that, I'll be going up to Seattle and doing a talk with the Puget Sound Programming Python group, PuPPy. And also, we're doing a joint event with Union, Boundary ML, and Pinecone. So looking forward to that. And I'll probably be going to San Francisco at the end of the month, although nothing has been set in stone, but right after that will be re:Invent in Las Vegas. So those are my confirmed events.
Jennifer Reif: November is super busy for all of us.
Paco Nathan: We're covering so many communities and so many geographic areas and all. So I look forward to catching up and finding out your takeaways from all this. I think that's the fun part.
Jason Koo: Yeah, definitely.
Jennifer Reif: Definitely.
Jason Koo: Yeah, definitely envious of you, Paco. You get to connect with so many old friends and really toss around some ideas. I think it's just going to be amazing, the number of connections and new ideas that are going to come out of that. It's just going to be bonkers.
Paco Nathan: I really love that about DevRel. I mean, I've been doing this kind of work. I did DevRel back at, there's a little company called Databricks back when we all fit in one room. So that was really fun. I got to lead community work there and training there a decade ago, and I've just really loved DevRel for exactly the reasons you both were talking about. I mean, you get to meet so many amazing people and you get this bird's eye view of new things coming down the pipeline and a lot of cool projects around the world. I really enjoy that.
Jason Koo: Yeah, same.
Jennifer Reif: Fantastic group of people always. We will have, as always, links to everything else going on. So speaker resources, a bunch of the things that we've talked about with Paco today, as well as just things going on in Neo4j community like blog posts, videos and so on. I'll link all of that. And then, of course, the Neo4j events, which a lot of the things that we've mentioned where we're going to be in the next few weeks will pop up there as well. But definitely reach out to us if you have questions or need something else. Thank you so much, Paco for joining us today and talking a lot about Entity Resolution and the work you're doing at Senzing and the work you have been doing over the last several years in catching bad guys with graphs. We look forward to hearing your talk at NODES too.
Paco Nathan: Looking forward. Thank you so much.
Jennifer Reif: All right. Bye, everyone.
Jason Koo: Thank you, everyone.