PHP Classes

HTML SQL: Parse and extract information from HTML using SQL

Ratings: 74% (4 stars)
Unique User Downloads: Total: 5,526; This week: 1
Download Rankings: All time: 443; This week: 560 (down)
Version: htmlsql 1.0.0
License: BSD License
Categories: HTML, Text processing
Description 

Author

This class can be used to parse and extract information from HTML documents using a query language similar to SQL to define the information to be extracted.

The class can open HTML documents stored as local files or as remote pages using the Snoopy class.

The class can execute a query with a syntax similar to SQL SELECT statements to search for and find tags in the opened document whose attributes match the query condition.

The occurrences that it finds are returned as result set rows that may contain a given list of attributes of the matched tags.
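
To make the idea concrete, the following is a minimal sketch of what such a query describes: pick a tag, filter by a condition on its attributes, and return the requested attributes as rows. It is not htmlSQL's own implementation. It uses PHP's DOM extension instead, and the function name `select_attributes` and the sample markup are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, NOT part of htmlSQL: returns one row per matching tag,
// each row holding the requested attributes.
function select_attributes(string $html, string $tag, array $attributes, callable $where): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is often not well formed.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        // The callable plays the role of the WHERE condition.
        if (!$where($element)) {
            continue;
        }
        $row = [];
        foreach ($attributes as $name) {
            // getAttribute() returns an empty string for missing attributes.
            $row[$name] = $element->getAttribute($name);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Sample markup (made up for this sketch).
$html = '<ul>'
      . '<li><a class="list" href="/a" title="First">A</a></li>'
      . '<li><a class="nav" href="/b" title="Second">B</a></li>'
      . '<li><a class="list" href="/c" title="Third">C</a></li>'
      . '</ul>';

// Roughly equivalent in spirit to:
//   SELECT href,title FROM a WHERE $class == "list"
$rows = select_attributes(
    $html,
    'a',
    ['href', 'title'],
    function (DOMElement $el): bool {
        return $el->getAttribute('class') === 'list';
    }
);

print_r($rows);
```

Expressing the condition as a PHP callable keeps the sketch free of any query parsing; the class described here accepts the condition as part of a query string instead.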

Innovation Award
PHP Programming Innovation award nominee
May 2006
Number 2


Prize: One subscription to the PHP Magazine
Certain types of applications need to retrieve HTML pages and extract information from them to be processed for specific purposes.

Often, parsing HTML pages to extract only the relevant information is not an easy task. On the other hand, most Web developers are very familiar with SQL and can use it to define what information they want from their database tables.

This class provides a means to extract data from HTML pages using a query language very similar to SQL. It greatly simplifies the implementation of scripts that need to process data from HTML pages.

Manuel Lemos
Name: J. <contact>
Classes: 1 package by
Country: Germany
Age: ???
All time rank: 53834 in Germany
Week rank: 411 Up 15 in Germany Up
Innovation award: Nominee 1x

Details

htmlSQL - Version 0.5 - README
---------------------------------------------------------------------

AUTHOR: Jonas John (http://www.jonasjohn.de/)

DESCRIPTION:
---------------------------------------------------------------------

htmlSQL is an experimental PHP class which allows you to access HTML
values by an SQL-like syntax. This means that you don't have to write
complex functions (regular expressions) to extract specific values.

The htmlSQL queries look like this:

    SELECT href,title FROM a WHERE $class == "list"
           ^               ^       ^
           |               |       search query (can be empty)
           |               HTML tag to search in; "*" is possible = all tags
           attributes to return

This query returns an array with all links that contain the attribute
class="list".

All web transfers in htmlSQL use the awesome Snoopy class
(package version 1.2.3 - URL: http://snoopy.sourceforge.net/).
But for file or string queries Snoopy is not required.

You will find all Snoopy related documents (copyright, readme, etc.)
in the snoopy_data/ folder.

HOW TO USE:
---------------------------------------------------------------------

Just include the "snoopy.class.php" and the "htmlsql.class.php" files
in your PHP scripts and look at the examples (examples/) to get an
idea of how to use the htmlSQL class. It should be very simple :-)

BACKGROUND / IDEA:
---------------------------------------------------------------------

I had this idea while extracting some data from a website. As I
realized that the algorithms and functions to extract links and other
tags are often the same, I had the idea to combine all functions into
a universally usable class. While drinking a coffee and thinking
about that problem, I thought it would be cool to access HTML elements
by using SQL. So I started creating this class...

WARNING:
---------------------------------------------------------------------

The eval() function is used for the WHERE statement. Make sure that
all user data is checked and filtered against malicious PHP code.
Never trust user input!

TODO:
---------------------------------------------------------------------

- enhance the HTML parser
- test htmlSQL with invalid and bad HTML files
- replace the ugly eval() method for the WHERE statement with a
  custom method
- more error checks
- include the LIMIT function/method like in SQL

LICENSE:
---------------------------------------------------------------------

htmlSQL uses a modified BSD license; you will find the full license
text in "htmlsql.class.php".
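
Because the WHERE part is passed to eval(), one way to act on the warning above is to validate any user-supplied value against a strict allow-list before it is placed into a query string. The following is a minimal sketch under that assumption. The helper names `safe_class_name` and `build_query` are hypothetical and not part of htmlSQL, and the allowed character set is only an example policy.

```php
<?php
// Hypothetical helper, NOT part of htmlSQL: accept only a conservative
// character set, so quotes, semicolons, parentheses and other PHP syntax
// can never reach the evaluated WHERE clause.
function safe_class_name(string $input): ?string
{
    return preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $input) === 1 ? $input : null;
}

// Hypothetical helper: builds a query string only from validated input.
// Returns null when the input is rejected.
function build_query(string $userClass): ?string
{
    $class = safe_class_name($userClass);
    if ($class === null) {
        return null;
    }
    // Single quotes keep "$class" literal, as the query syntax expects.
    return 'SELECT href,title FROM a WHERE $class == "' . $class . '"';
}

var_dump(build_query('list'));                   // a query string
var_dump(build_query('x" || system("id") || "')); // NULL, input rejected
```

Rejecting input outright is deliberately stricter than trying to escape it: with eval() involved, an allow-list leaves less room for mistakes than an attempt to neutralize dangerous characters.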

Screenshots  
  • htmlsql_syntax_example.png
Files

File               Role   Description
examples           (15 files)
htmlsql.class.php  Class  Contains the main htmlSQL class
snoopy.class.php   Class  The famous Snoopy class by Monte Ohrt - v1.01
readme.txt         Doc.   English readme with description and todo list
readme_german.txt  Doc.   The same as readme.txt, but in German

Files / examples

File                Role     Description
demo_01.php         Example  Example 1 - Shows a simple query
demo_02.php         Example  Example 2 - Shows a simple query and the "href as url" usage
demo_03.php         Example  Example 3 - Shows how to connect to a file and a simple query
demo_04.php         Example  Example 4 - Shows an advanced query with preg_match
demo_05.php         Example  Example 5 - Shows an advanced query (with substr)
demo_06.php         Example  Example 6 - Shows how to connect to a string
demo_07.php         Example  Example 7 - Shows a complex query
demo_08.php         Example  Example 8 - Shows how to parse an RSS/XML file with htmlSQL
demo_09.php         Example  Example 9 - Shows how to use the "select" function
demo_10.php         Example  Example 10 - Shows how to use the "isolate_content" function
demo_11.php         Example  Example 11 - Shows how to query a simple XML file
demo_12.php         Example  Example 12 - Shows how to replace the user agent and the referer with custom values
demo_data.htm       Example  Demo HTML data (used for parsing examples)
demo_xml.xml        Example  Example XML file (to test parsing)
query_examples.txt  Doc.     Some query examples for copy and paste

User Ratings (all time)

Utility: 95%
Consistency: 91%
Documentation: 86%
Examples: 87%
Tests: -
Videos: -
Overall: 74%
Rank: 113

User Comments (4)

Really useful!!!
10 years ago (Massimiliano Chichi), 77%

Excellent idea, very neat coded and great examples.
14 years ago (Matt), 80%

Really helpful and efficient
14 years ago (LiliwoL), 75%

This is a brilliant class.
15 years ago (Wayne Zeller), 77%