
How to parse and process HTML with PHP?


How can one parse HTML and extract information from it? What libraries exist for that purpose? What are their strengths and drawbacks?

This is a General Reference question for the php tag

NOTE: This question was originally posted at StackOverflow.com by RobertPitt

Good Answer

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C’s Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it, IMO. Since DOM is a language-agnostic interface, you’ll find implementations in many languages, so if you ever need to change your programming language, chances are you will already know how to use that language’s DOM API.

A basic usage example can be found in Grabbing the href attribute of an A element and a general conceptual overview can be found at Noob question about DOMDocument in php

How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
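
For illustration, here is a minimal sketch of the typical loadHTML/XPath workflow (the markup is a placeholder; libxml_use_internal_errors() suppresses the warnings that broken HTML would otherwise raise):

    // load (possibly broken) HTML and query it with XPath
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // silence warnings about broken markup
    $doc->loadHTML('<html><body><a href="/foo">Foo</body></html>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a') as $link) {
        echo $link->getAttribute('href'), "\n"; // prints: /foo
    }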

XMLReader

The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are that using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml’s HTML Parser Module.

A basic usage example can be found at getting all values from h1 tags using php
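
As a minimal sketch of the pull-parsing style (this assumes well-formed markup, per the caveat above):

    // walk the document stream and print the content of each <h1>
    $reader = new XMLReader();
    $reader->XML('<html><body><h1>Title</h1></body></html>');

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
            echo $reader->readString(), "\n";   // prints: Title
        }
    }
    $reader->close();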

SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don’t even consider SimpleXml because it will choke.

A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.
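
A minimal sketch (this only works because the snippet is valid XHTML-style markup):

    // valid X(HT)ML becomes a traversable object
    $xml = simplexml_load_string('<html><body><p class="intro">Hello</p></body></html>');

    echo $xml->body->p, "\n";                   // element access: Hello
    echo $xml->body->p['class'], "\n";          // attribute access: intro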


3rd Party Libraries (libxml based)

If you prefer to use a 3rd party lib, I’d suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector-driven Document Object Model (DOM) API based on the jQuery JavaScript Library, written in PHP5, and it provides an additional Command Line Interface (CLI).
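
A hedged sketch of the basic usage (the include path and markup are assumptions; pq() is the project’s documented entry point):

    require 'phpQuery/phpQuery.php';            // path is an assumption
    phpQuery::newDocumentHTML('<div><a href="/foo">Foo</a></div>');

    foreach (pq('a') as $a) {                   // iteration yields plain DOMElements
        echo pq($a)->attr('href'), "\n";        // re-wrap to use the jQuery-like API
    }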

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
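
A hedged sketch of Zend_Dom_Query usage (ZF1-era API; the selector and the $html variable are placeholders):

    require_once 'Zend/Dom/Query.php';          // assumes ZF1 on the include path

    $dom     = new Zend_Dom_Query($html);       // $html holds the markup to search
    $results = $dom->query('div.content a');    // CSS selector; queryXpath() also exists

    foreach ($results as $node) {               // each $node is a DOMElement
        echo $node->getAttribute('href'), "\n";
    }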

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface, but it is heavily tuned for server-side use.

FluentDom

FluentDOM is a jQuery-like fluent XML interface for the DOMDocument in PHP.

fDOMDocument

fDOMDocument extends the standard DOM to throw exceptions on all errors instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.


3rd Party (not libxml based)

The benefit of building upon DOM/libxml is that you get good performance out of the box, because you are building on a native extension. However, not all 3rd party libs go down this route; some of them are listed below.

SimpleHtmlDom

  • An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml based libraries should outperform this easily.

Ganon

  • A universal tokenizer and HTML/XML/RSS DOM Parser
    • Ability to manipulate elements and their attributes
    • Supports invalid HTML and UTF8
  • Can perform advanced CSS3-like queries on elements (like jQuery — namespaces supported)
  • An HTML beautifier (like HTML Tidy)
    • Minify CSS and JavaScript
    • Sort attributes, change character case, correct indentation, etc.
  • Extensible
    • Parsing documents using callbacks based on current character/token
    • Operations separated in smaller functions for easy overriding
  • Fast and Easy

Never used it. Can’t tell if it’s any good.


HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized.


WebServices

If you don’t feel like programming PHP, you can also utilize web services. In general, I have found very little utility for these, but that’s just me and my use cases.

YQL

The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki

ScraperWiki’s external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.


Regular Expressions

Last and least recommended, you can extract data from HTML with Regular Expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding a space somewhere, can make the Regex fail when it’s not properly written. You should know what you are doing before using Regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules with each new Regex you write. Regex are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
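
To make the brittleness concrete, here is a hedged sketch of the kind of narrow, one-off extraction Regex gets used for; note it already fails if the href is single-quoted or preceded by another attribute:

    $html = '<a href="/foo">Foo</a> <a href="/bar">Bar</a>';

    // only matches a double-quoted href as the first attribute
    preg_match_all('/<a\s+href="([^"]*)"/i', $html, $matches);

    print_r($matches[1]);                       // ["/foo", "/bar"]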

Also see Parsing Html The Cthulhu Way


Books

If you want to spend some money, have a look at the books PHP Architect has published on the subject.

I am not affiliated with PHP Architects or the authors.

NOTE: This answer was originally posted at StackOverflow.com by Gordon

  • Henry
    As of Mar 29, 2012, DOM does not support HTML5, XMLReader does not support HTML, and the last commit on html5lib for PHP was in Sep 2009. What should one use to parse HTML5, HTML4 and XHTML?

    NOTE: This comment was originally posted at StackOverflow.com by shiplu.mokadd.im

  • Pamela
    @Shiplu the answer above lists all the options I know. DOM can parse anything that has a Schema or a DTD. HTML5 doesn’t (officially).

    NOTE: This comment was originally posted at StackOverflow.com by Gordon


I recommend PHP Simple HTML DOM Parser;
it really has nice features, like:

include 'simple_html_dom.php';              // provides file_get_html()
$html = file_get_html('http://www.example.com/');
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

NOTE: This answer was originally posted at StackOverflow.com by user1090298


phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That’s also why they’re among the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from a HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which are sometimes faster. Also, typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append) and later outputting and prettifying an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and can even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");


phpQuery or QueryPath?

Generally, QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).

For further information on the differences, see this comparison:
http://web.archive.org/web/20101230230134/http://www.tagbytag.org/articles/phpquery-vs-querypath (Original source went missing, so here’s an internet archive link. Yes, you can still locate missing pages, people.)

And here’s a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP

Advantages

  • Simplicity and Reliability
  • Simple-to-use alternatives: ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression grepping)

NOTE: This answer was originally posted at StackOverflow.com by mario


There is also Goutte (PHP Web Scraper), which is now available:
https://github.com/fabpot/Goutte/
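
As a hedged sketch of Goutte’s client/crawler usage (the autoload path and URL are placeholders; Goutte wraps Symfony’s DomCrawler, whose iteration yields plain DOMElements):

    require_once 'vendor/autoload.php';         // assumes installation via Composer

    $client  = new Goutte\Client();
    $crawler = $client->request('GET', 'http://www.example.com/');

    // filter() takes CSS selectors
    foreach ($crawler->filter('a') as $node) {  // $node is a DOMElement
        echo $node->getAttribute('href'), "\n";
    }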

NOTE: This answer was originally posted at StackOverflow.com by Shal


Try the Simple HTML DOM Parser:

// requires simple_html_dom.php from the project
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images 
foreach($html->find('img') as $element) 
       echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href . '<br>';

NOTE: This answer was originally posted at StackOverflow.com by NAVEED

  • Beverly
    What I did was run my HTML through Tidy before sending it to SimpleDOM.

    NOTE: This comment was originally posted at StackOverflow.com by MB34


Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I won’t recommend it anyway.

I have never used curl for the purpose, but what I have learned is that curl can do the job much more efficiently and is much more solid.

Kindly check out this link: http://spyderwebtech.wordpress.com/2008/08/07/scraping-websites-with-curl/

NOTE: This answer was originally posted at StackOverflow.com by Spoilt


We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for the reason they were created, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures that would fail when loaded via most of the parsers.

NOTE: This answer was originally posted at StackOverflow.com by jancha

  • Henry
    +1 The key point is "if you know what you are doing". And I think any good developer knows what he is doing.

    NOTE: This comment was originally posted at StackOverflow.com by shiplu.mokadd.im


With PHP, I would advise you to use the Simple HTML DOM Parser; the best way to learn more about it is to look for samples on the ScraperWiki website.

NOTE: This answer was originally posted at StackOverflow.com by mnml


One general approach I haven’t seen mentioned here is to run the HTML through Tidy, which can be set to emit guaranteed-valid XHTML. Then you can use any old XML library on it.
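
A hedged sketch of that pipeline using the tidy extension (the broken markup is a placeholder):

    $broken = '<html><body><p>unclosed paragraph</body></html>';

    // repair the markup and force XHTML output
    $tidy = tidy_parse_string($broken, array('output-xhtml' => true), 'utf8');
    $tidy->cleanRepair();

    // the repaired document is now safe for any XML parser
    $doc = new DOMDocument();
    $doc->loadXML(tidy_get_output($tidy));
    echo $doc->getElementsByTagName('p')->item(0)->textContent;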

But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ — it’s a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

NOTE: This answer was originally posted at StackOverflow.com by Eli


This sounds like a good task description for the W3C XPath technology. It’s easy to express queries like “return all src attributes of img tags that are nested in <foo><bar><baz> elements.” Not being a PHP buff, I can’t tell you in what form XPath may be available, but if you can call an external program to process the HTML file, you should be able to use a command-line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
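
In PHP, XPath is in fact available out of the box via the DOM extension’s DOMXPath class; a minimal sketch of the query described above (the tag names and markup are placeholders):

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // the sample uses non-HTML tag names
    $doc->loadHTML('<foo><bar><baz><img src="/pic.png"></baz></bar></foo>');

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//foo//bar//baz//img/@src') as $attr) {
        echo $attr->value, "\n";                // prints: /pic.png
    }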

NOTE: This answer was originally posted at StackOverflow.com by Jens


QueryPath is good, but be careful of “tracking state”, because if you don’t realise what it means, it can mean you waste a lot of debugging time trying to find out what happened and why the code doesn’t work.

What it means is that each call on the result set modifies the result set in the object; it’s not chainable like in jQuery, where each link is a new set. You have a single set, which is the results from your query, and each function call modifies that single set.

In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it’ll mirror what happens in jQuery much more closely.

$results = qp("div p");
$forename = $results->find("input[name='forename']");

$results now contains the result set for input[name='forename'], NOT the original query div p. This tripped me up a lot. What I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:

$forename = $results->branch()->find("input[name='forename']");

Then $results won’t be modified, and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it’s basically like this from what I’ve found.

NOTE: This answer was originally posted at StackOverflow.com by Christopher Thomas

  • Martin
    That said, I love QueryPath. I just wish it would "branch by default", because then it would automatically be similar to jQuery without any extra work.

    NOTE: This comment was originally posted at StackOverflow.com by Christopher Thomas


A few months ago I wrote a library that can help you parse HTML5 code in PHP. It extends the native DOMDocument library, fixes some bugs and adds some new features (innerHTML, querySelector(), …).
It’s available at https://github.com/ivopetkov/html5-dom-document-php
I hope it will be useful for you too.
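
A hedged sketch based on the features named above (the class name and usage are assumed from the project’s README):

    require 'vendor/autoload.php';              // assumes installation via Composer

    $dom = new IvoPetkov\HTML5DOMDocument();
    $dom->loadHTML('<article><h1>Title</h1></article>');

    echo $dom->querySelector('h1')->innerHTML;  // prints: Title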
