When Google crawls the Web, it extracts details from the content on the pages it finds, in addition to the links on those pages. How much information does it extract about what it finds? Microsoft showed off an object-based approach to search about 10 years ago, in the paper Object-Level Ranking: Bringing Order to Web Objects.
The team from Microsoft Research Asia informs us in that paper:
Existing web search engines generally treat a whole web page as the unit for retrieval and consumption. However, there are various kinds of objects embedded in static Web pages or Web databases. Typical objects are products, people, papers, organizations, etc. We can imagine that if these objects can be extracted and integrated from the Web, powerful object-level search engines can be built to meet users' information needs more precisely, especially for some specific domains.
This patent from Google focuses on extracting factual information about entities on the Web. It's an approach that goes beyond building the Web index we know Google for, because it collects pieces of information that are connected to one another. The patent tells us:
Information extraction systems automatically extract structured data from unstructured or semi-structured documents. For example, some existing information extraction systems extract facts from collections of electronic documents, with each fact identifying a subject entity, an attribute possessed by the entity, and the value of that attribute for the entity.
I'm reminded of an early Google provisional patent that Sergey Brin came up with in the 1990s. I wrote about that patent in my article, Google's First Semantic Search Invention was Patented in 1999. The patent was titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (skip ahead to the second page, where it becomes a lot more readable), and it was also published as a paper on the Stanford site. It describes Sergey Brin starting with some details about a handful of books and searching for those books across the Web; once they are found, patterns are collected from the places where those books appear, and those patterns are used to gather information about other books as well. That approach sounds much like the one in this patent, granted the first week of this month:
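The bootstrapping idea in Brin's paper can be sketched in a few lines. This is a toy illustration with invented book data and a deliberately simple pattern representation (the text between a title and its author), not the paper's actual algorithm:

```python
import re

# Toy sketch of seed-driven pattern bootstrapping: start from seed
# (author, title) pairs, learn how they appear in text, then apply the
# learned patterns to discover new pairs. All data here is made up.

seed_books = {("Isaac Asimov", "The Robots of Dawn")}

corpus = [
    "The Robots of Dawn, by Isaac Asimov, was published in 1983.",
    "Foundation, by Isaac Asimov, remains a classic.",
    "Dune, by Frank Herbert, won the first Nebula Award.",
]

def learn_patterns(seeds, texts):
    """For each seed occurrence, keep the text between title and author as a pattern."""
    patterns = set()
    for author, title in seeds:
        for text in texts:
            if title in text and author in text:
                start = text.index(title) + len(title)
                end = text.index(author)
                if start <= end:
                    patterns.add(text[start:end])  # e.g. ", by "
    return patterns

def apply_patterns(patterns, texts):
    """Match '<Title><pattern><Author>' to extract new (author, title) pairs."""
    found = set()
    for middle in patterns:
        regex = re.compile(
            r"([A-Z][\w ]+?)" + re.escape(middle) + r"([A-Z][a-z]+ [A-Z][a-z]+)"
        )
        for text in texts:
            for title, author in regex.findall(text):
                found.add((author, title))
    return found

patterns = learn_patterns(seed_books, corpus)
new_books = apply_patterns(patterns, corpus)
```

Starting from one seed book, the learned ", by " pattern recovers the other two books in the corpus, which is the core of the approach: each round of extraction can feed new seeds back into pattern learning.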
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining a plurality of seed facts, wherein each seed fact identifies a subject entity, an attribute possessed by the subject entity, and an object, and wherein the object is a value of the attribute possessed by the subject entity; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse, wherein a dependency parse of a portion of text corresponds to a directed graph of vertices and edges, wherein each vertex represents a token in the portion of text and each edge represents a syntactic relationship between tokens represented by the vertices connected by the edge, wherein each vertex is associated with the token represented by the vertex and a part of speech tag, and wherein a dependency pattern corresponds to a sub-graph of a dependency parse with one or more of the vertices in the sub-graph having the token associated with the vertex replaced by a variable; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
The patent breaks the process it describes into a set of "Benefits" that are worth keeping in mind, because they sound a lot like how people discussing the Semantic Web characterize the Web as a web of data. These are the benefits the patent describes:
(1) A fact extraction system can accurately extract facts, i.e., (subject, attribute, object) triples, from a collection of electronic documents to identify values of attributes, i.e., "objects" in the extracted triples, that are not already known to the fact extraction system.
(2) In particular, values of long-tail attributes that appear infrequently in the collection of documents relative to other, more frequently occurring attributes can be accurately extracted from the collection. For example, given a set of attributes for which values are to be extracted from the collection, the attributes in the set can be ranked by the number of occurrences of each attribute in the collection, and the fact extraction system can accurately extract attribute values for the long-tail attributes in the set, with the long-tail attributes being the attributes that are ranked below N in the ranking, where N is selected such that the total number of occurrences of attributes ranked N and above in the ranking equals the total number of occurrences of attributes ranked below N in the ranking.
(3) Moreover, the fact extraction system can accurately extract facts to identify values of nominal attributes, i.e., attributes that are expressed as nouns.
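Benefit (2)'s definition of "long tail" is concrete enough to compute. A small sketch with made-up attribute counts, finding the cut N where the occurrences of the head (attributes ranked N and above) first reach those of the tail:

```python
# Hypothetical occurrence counts for attributes in a document collection.
attribute_counts = {
    "population": 900, "capital": 600, "area": 300,
    "chief economist": 40, "state bird": 25, "official motto": 10,
}

def long_tail_attributes(counts):
    """Rank attributes by frequency; return those below the head/tail balance point."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    total = sum(counts.values())
    head = 0
    for n, attr in enumerate(ranked, start=1):
        head += counts[attr]
        if head >= total - head:  # head occurrences now match or exceed the tail's
            return ranked[n:]     # everything ranked below n is "long tail"
    return []

tail = long_tail_attributes(attribute_counts)
```

With these counts, "population" and "capital" together account for more occurrences than everything else combined, so the remaining four attributes fall into the long tail, which is exactly where the patent claims its extraction remains accurate.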
The patent is:
Extracting facts from documents
Inventors: Steven Euijong Whang, Rahul Gupta, Alon Yitzchak Halevy, and Mohamed Yahya
Assignee: Google Inc.
US Patent: 9,672,251
Granted: June 6, 2017
Filed: September 29, 2014
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for extracting facts from a collection of documents. One of the methods includes obtaining a plurality of seed facts; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
The patent contains a list of "additional references" cited by the applicants. These are worth spending some time with, because they contain many hints about the direction Google seems to be moving in.
Finkel et al., Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling, In Proceedings of the 43rd Annual Meeting of the ACL, Ann Arbor, Michigan, USA, June 2005, pp. 363-370.
Gupta et al., Biperpedia: An Ontology for Search Applications, In Proceedings of the VLDB Endowment, 2014, pp. 505-516.
Haghighi and Klein, Simple Coreference Resolution with Rich Syntactic and Semantic Features, In Proceedings of Empirical Methods in Natural Language Processing, Singapore, August 6-7, 2009, pp. 1152-1161.
Madnani and Dorr, Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods, In Computational Linguistics, 2010, 36(3):341-387.
de Marneffe et al., Generating Typed Dependency Parses from Phrase Structure Parses, In Proceedings of Language Resources and Evaluation, 2006, pp. 449-454.
Mausam et al., Open Language Learning for Information Extraction, In Proceedings of Empirical Methods in Natural Language Processing, 2012, 12 pages.
Mikolov et al., Efficient Estimation of Word Representations in Vector Space, International Conference on Learning Representations (ICLR), Scottsdale, Arizona, USA, 2013, 12 pages.
Mintz et al., Distant Supervision for Relation Extraction Without Labeled Data, In Proceedings of the Association for Computational Linguistics, 2009, 9 pages.
The patent tells us that entities identified by this extraction process may be kept in an entity database, and it points to the old Freebase website (which used to be run by Google).
It gives us some insight into how the information extracted from the Web might be used by Google in a fact repository (the term Google used for an early version of its Knowledge Graph):
Once extracted, the fact extraction system can store the extracted facts in a fact repository or provide the facts for use for some other purpose. In some implementations, the extracted facts can be used by an Internet search engine in providing formatted answers in response to search queries that are classified as seeking to determine the value of an attribute possessed by a particular entity. For example, a received search query "who is the chief economist of example organization?" may be classified by the search engine as seeking to determine the value of the "Chief Economist" attribute for the entity "Example Organization." By accessing the fact repository, the search engine can determine that the fact repository includes a (Example Organization, Chief Economist, Example Economist) triple and, in response to the search query, can provide a formatted presentation that identifies "Example Economist" as the "Chief Economist" of the entity "Example Organization."
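The answer flow in that passage reduces to a lookup keyed on (subject, attribute). A minimal sketch, using the patent's own running example names rather than real data:

```python
# A fact repository as a mapping from (subject, attribute) to object,
# i.e., storing (subject, attribute, object) triples. Example data only.
fact_repository = {
    ("Example Organization", "Chief Economist"): "Example Economist",
    ("Example Organization", "Headquarters"): "Example City",
}

def answer(subject, attribute):
    """Return the attribute value for the entity, if the repository has it."""
    return fact_repository.get((subject, attribute))

# The query "who is the chief economist of example organization?" is classified
# as asking for the "Chief Economist" attribute of "Example Organization":
value = answer("Example Organization", "Chief Economist")
```

The hard work, of course, is upstream: classifying the query into that (subject, attribute) pair and populating the repository with accurate triples in the first place.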
The patent tells us how it uses patterns to identify additional facts:
The system selects additional facts from the candidate additional facts based on the scores (step 212). For example, the system can select each candidate additional fact having a score above a threshold value as an additional fact. As another example, the system can select a predetermined number of highest-scoring candidate additional facts as additional facts. The system can store the selected additional facts in a fact repository, e.g., the fact repository of FIG. 1, or provide the selected additional facts to an external system for some immediate purpose.
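The two selection strategies described in that passage, thresholding and top-k, can be sketched directly. The candidate triples and scores below are made up for illustration:

```python
# Scored candidate facts: ((subject, attribute, object), score).
scored_candidates = [
    (("Example Organization", "Chief Economist", "Example Economist"), 0.92),
    (("Example Organization", "Chief Economist", "Another Person"), 0.31),
    (("Example State", "State Bird", "Example Bird"), 0.75),
]

def select_above(candidates, threshold):
    """Keep every candidate whose score exceeds the threshold."""
    return [fact for fact, score in candidates if score > threshold]

def select_top_k(candidates, k):
    """Keep the k highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [fact for fact, _ in ranked[:k]]

kept = select_above(scored_candidates, threshold=0.5)
top = select_top_k(scored_candidates, k=1)
```

Thresholding controls precision directly, while top-k caps the volume of new facts admitted per run; a system could also combine the two.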
The patent also describes the process that may be followed to score candidate additional facts.
This fact extraction process does seem to be aimed at building a repository capable of answering a lot of questions, using a machine learning approach and the kind of semantic vectors that the Google Brain team may have used to develop Google's RankBrain approach.