How can one parse HTML and extract information from it? What libraries exist for that purpose? What are their strengths and drawbacks?
This is a General Reference question for the php tag
NOTE: This question was originally posted at StackOverflow.com by RobertPitt
- Rhonda asked 14 years ago
- last edited 13 years ago
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C’s Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it, IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you ever need to change your programming language, chances are you will already know how to use that language's DOM API.
A basic usage example can be found in Grabbing the href attribute of an A element, and a general conceptual overview can be found at Noob question about DOMDocument in php.
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
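To illustrate the typical pattern, here is a minimal sketch (the markup is a placeholder): suppress libxml's warnings about broken markup, let the HTML parser repair the tree, then query it with XPath.
// hypothetical broken markup; note the unclosed <li> tags
$html = '<ul><li class="item">First<li class="item">Second</ul>';

libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$dom = new DOMDocument;
$dom->loadHTML($html);            // libxml's HTML parser repairs the tree
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//li[@class="item"]') as $li) {
    echo $li->textContent, "\n";  // prints "First", then "Second"
}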
XMLReader
The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are that using XMLReader to parse broken HTML will be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example can be found at getting all values from h1 tags using php
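A minimal sketch of the pull-parsing style, assuming well-formed (X)HTML input since, as noted above, the HTML Parser Module may not be available here:
$reader = new XMLReader;
$reader->XML('<html><body><h1>First</h1><h1>Second</h1></body></html>');

while ($reader->read()) {                 // the cursor moves forward node by node
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        echo $reader->readString(), "\n"; // text content of the current element
    }
}
$reader->close();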
SimpleXml
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.
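To illustrate, a minimal sketch, assuming the input really is valid XHTML (placeholder markup):
$xhtml = '<html><body><p class="greeting">Hello <b>World</b></p></body></html>';
$xml = simplexml_load_string($xhtml);   // returns false on broken markup

echo $xml->body->p, "\n";               // property selectors; prints the element's direct text "Hello "
foreach ($xml->xpath('//p[@class="greeting"]') as $p) {
    echo $p['class'], "\n";             // attributes via array syntax; prints "greeting"
}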
3rd Party Libraries (libxml based)
If you prefer to use a 3rd party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
phpQuery
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript Library. It is written in PHP5 and provides an additional Command Line Interface (CLI).
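A minimal sketch of the jQuery-style chaining, with placeholder markup (the include path depends on your installation):
require 'phpQuery/phpQuery.php';        // adjust to where you installed the lib

phpQuery::newDocumentHTML('<div><a href="/one">One</a><a href="/two">Two</a></div>');
foreach (pq('div a') as $link) {        // CSS3 selectors, just like jQuery
    echo pq($link)->attr('href'), "\n"; // prints "/one", then "/two"
}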
Zend_Dom
Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
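A minimal sketch with placeholder markup, assuming Zend Framework 1 is on your include path:
require_once 'Zend/Dom/Query.php';

$dom = new Zend_Dom_Query('<ul><li class="item">First</li><li class="item">Second</li></ul>');
foreach ($dom->query('li.item') as $li) { // CSS selector; queryXpath() also exists
    echo $li->nodeValue, "\n";            // results are plain DOMElement nodes
}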
QueryPath
QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface, but it is heavily tuned for server-side use.
FluentDom
FluentDOM is a jQuery-like fluent XML interface for the DOMDocument in PHP.
fDOMDocument
fDOMDocument extends the standard DOM to throw exceptions on all occasions of errors, instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.
3rd Party (not libxml based)
The benefit of building upon DOM/libxml is that you get good performance out of the box, because you are based on a native extension. However, not all 3rd party libs go down this route. Some of them are listed below:
SimpleHtmlDom
- A HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
- Requires PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml based libraries should outperform this easily.
Ganon
- A universal tokenizer and HTML/XML/RSS DOM Parser
- Ability to manipulate elements and their attributes
- Supports invalid HTML and UTF8
- Can perform advanced CSS3-like queries on elements (like jQuery — namespaces supported)
- A HTML beautifier (like HTML Tidy)
- Minify CSS and Javascript
- Sort attributes, change character case, correct indentation, etc.
- Extensible
- Parsing documents using callbacks based on current character/token
- Operations separated in smaller functions for easy overriding
- Fast and Easy
Never used it. Can’t tell if it’s any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like
html5lib: Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.
We might see more dedicated parsers once HTML5 is finalized.
WebServices
If you don’t feel like programming PHP, you can also utilizes Web Services. In general, I found very little utility for these, but that’s just me and my Use Cases.
YQL
The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.
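A minimal sketch of calling YQL's html table from PHP; the endpoint shown was Yahoo's public one at the time of writing, and the URL and XPath are placeholders:
$yql = 'select * from html where url="http://example.com" and xpath="//h1"';
$url = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';
$data = json_decode(file_get_contents($url)); // decode the JSON envelope YQL returns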
ScraperWiki
ScraperWiki’s external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.
Regular Expressions
Last and least recommended, you can extract data from HTML with Regular Expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding a space somewhere, can make the Regex fail when it's not properly written. You should know what you are doing before using Regex on HTML.
HTML parsers already know the syntactical rules of HTML. Regular Expressions have to be taught them with each new Regex you write. Regex are fine in some cases, but it really depends on your use case.
You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
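To make the trade-off concrete, here is a deliberately naive sketch: it works on the placeholder markup below, but breaks on single-quoted or unquoted attributes, which is exactly the brittleness described above.
$html = '<p><img src="/a.png"> and <img src="/b.png"></p>';
preg_match_all('/<img[^>]+src="([^"]+)"/i', $html, $matches);
print_r($matches[1]); // ["/a.png", "/b.png"], until the markup changes shape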
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at PHP Architect's Guide to Web Scraping with PHP.
I am not affiliated with PHP Architects or the authors.
NOTE: This answer was originally posted at StackOverflow.com by Gordon
- Pamela answered 14 years ago
- last edited 12 years ago
As of Mar 29, 2012, DOM does not support HTML5, XMLReader does not support HTML, and the last commit on html5lib for PHP was in Sep 2009. What should one use to parse HTML5, HTML4 and XHTML?
NOTE: This comment was originally posted at StackOverflow.com by shiplu.mokadd.im
@Shiplu the answer above lists all the options I know of. DOM can parse anything that has a Schema or a DTD. HTML5 doesn't (officially).
NOTE: This comment was originally posted at StackOverflow.com by Gordon
I recommend PHP Simple HTML DOM Parser
it really has nice features, like:
// assuming simple_html_dom.php is included and $html was created via file_get_html()
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
NOTE: This answer was originally posted at StackOverflow.com by user1090298
- Brenda answered 12 years ago
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.
Examples for QueryPath
Basically you first create a queryable DOM tree from a HTML string:
$qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
$qp->find("div.classname")->children()->...;
foreach ($qp->find("p img") as $img) {
print qp($img)->attr("src");
}
Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also, typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)
$qp->xpath("//div/p[1]"); // get first paragraph in a div
QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).
$qp->find("a[target=_blank]")->toggleClass("usability-blunder");
phpQuery or QueryPath?
Generally, QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).
For further information on the differences, see this comparison:
http://web.archive.org/web/20101230230134/http://www.tagbytag.org/articles/phpquery-vs-querypath (Original source went missing, so here’s an internet archive link. Yes, you can still locate missing pages, people.)
And here’s a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
Advantages
- Simplicity and Reliability
- Simple to use alternatives ->find("a img, a object, div a")
- Proper data unescaping (in comparison to regular expression grepping)
NOTE: This answer was originally posted at StackOverflow.com by mario
- Dorothy answered 14 years ago
- last edited 12 years ago
There is also Goutte (PHP Web Scraper), which is now available:
https://github.com/fabpot/Goutte/
NOTE: This answer was originally posted at StackOverflow.com by Shal
- Jeffery answered 12 years ago
Try the Simple HTML Dom Parser:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
NOTE: This answer was originally posted at StackOverflow.com by NAVEED
- Larry answered 14 years ago
- last edited 13 years ago
What I did was run my HTML through Tidy before sending it to SimpleDOM.
NOTE: This comment was originally posted at StackOverflow.com by MB34
Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too vulnerable. It does the basic job, but I won't recommend it anyway.
I have never used cURL for the purpose, but what I have learned is that cURL can do the job much more efficiently and is much more solid.
Kindly check out this link: http://spyderwebtech.wordpress.com/2008/08/07/scraping-websites-with-curl/
NOTE: This answer was originally posted at StackOverflow.com by Spoilt
- Margaret answered 13 years ago
We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the thing best. While the libraries listed above are good for the reason they were created, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle non-valid HTML/XHTML structures, which would fail if loaded via most of the parsers.
NOTE: This answer was originally posted at StackOverflow.com by jancha
- Renee answered 13 years ago
+1 The key point is "If you know what you are doing". And I think any good developer knows what he is doing.
NOTE: This comment was originally posted at StackOverflow.com by shiplu.mokadd.im
With PHP I would advise you to use the Simple HTML Dom Parser, the best way to learn more about it is to look for samples on the ScraperWiki website.
NOTE: This answer was originally posted at StackOverflow.com by mnml
- Jacqueline answered 13 years ago
One general approach I haven’t seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
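A minimal sketch of that approach using PHP's tidy extension (the broken markup is a placeholder):
$broken = '<ul><li>First<li>Second</ul>';  // hypothetical broken input

$tidy = new tidy;
$tidy->parseString($broken, array('output-xhtml' => true), 'utf8');
$tidy->cleanRepair();                      // now guaranteed-valid XHTML

$xml = simplexml_load_string(tidy_get_output($tidy)); // any old XML library from here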
But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ — it’s a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
NOTE: This answer was originally posted at StackOverflow.com by Eli
- Joyce answered 13 years ago
This sounds like a good task description for W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file, you should be able to use a command line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
NOTE: This answer was originally posted at StackOverflow.com by Jens
- Allen answered 14 years ago
QueryPath is good, but be careful of "tracking state", because if you don't realise what it means, it can mean you waste a lot of debugging time trying to find out what happened and why the code doesn't work.
What it means is that each call on the result set modifies the result set in the object; it's not chainable like in jQuery, where each link is a new set. Instead, you have a single set, which holds the results from your query, and each function call modifies that single set.
In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way, it'll mirror what happens in jQuery much more closely.
$results = qp("div p");
$forename = $results->find("input[name='forename']");
"$results" now contains the result set for "input[name='forename']", NOT the original query "div p". This tripped me up a lot: what I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:
$forename = $results->branch()->find("input[name='forename']");
Then $results won't be modified and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.
NOTE: This answer was originally posted at StackOverflow.com by Christopher Thomas
- Martin answered 13 years ago
That said, I love QueryPath. I just wish it would "branch by default", because then it would automatically be similar to jQuery without any extra work.
NOTE: This comment was originally posted at StackOverflow.com by Christopher Thomas
A few months ago I wrote a library that can help you work with parsing HTML5 code in PHP. It extends the native DOMDocument library, fixes some bugs and adds some new features (innerHTML, querySelector(), …)
It’s available at https://github.com/ivopetkov/html5-dom-document-php
I hope it will be useful for you too.
- Ivo Petkov answered 8 years ago