Journal
Wednesday, Jun 18 2008, 02:07:20 AM
Knoelwsys Web Data Extraction
Web Data Extraction
The unabated growth of the Web has resulted in a situation
in which more information is available to more people than ever in human
history. Along with this unprecedented growth has come the inevitable problem
of information overload. To counteract this information overload, users
typically rely on search engines (like Google and AllTheWeb) or on
manually-created categorization hierarchies (like Yahoo! and the Open Directory
Project). Though excellent for accessing Web pages on the so-called
"crawlable" web, these approaches overlook a much more massive and
high-quality resource: the Deep Web.
The Deep Web (or Hidden Web) comprises all information that
resides in autonomous databases behind portals and information providers' web
front-ends. Web pages in the Deep Web are dynamically-generated in response to
a query through a web site's search form and often contain rich content. A
recent study has estimated the size of the Deep Web to be more than 500 billion
pages, whereas the size of the "crawlable" web is only 1% of the Deep
Web (i.e., less than 5 billion pages). Even
those web sites with some static links that are "crawlable" by a
search engine often have much more information available only through a query
interface. Unlocking this vast deep web content presents a major research
challenge.
In analogy to search engines over the "crawlable"
web, we argue that one way to unlock the Deep Web is to employ a fully
automated approach to extracting, indexing, and searching the query-related
information-rich regions from dynamic web pages. For this miniproject, we focus
on the first of these: extracting data from the Deep Web.
Extracting the interesting information from a Deep Web site
requires many things, including scalable and robust methods for analyzing
dynamic web pages of a given web site, discovering and locating the
query-related information-rich content regions, and extracting itemized objects
within each region.
By full automation, we mean that the extraction
algorithms should be designed independently of the presentation features or
specific content of the web pages, such as the specific ways in which the
query-related information is laid out or the specific locations where the
navigational links and advertisement information are placed in the web pages.
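One layout-independent way to find query-related content is to compare pages that the same site returns for different queries: text that recurs on every page is probably template material (navigation, advertisements), and what remains is probably query-related. The sketch below illustrates only that idea. It is a heuristic chosen for illustration, not a method described above, and the names TextCollector, text_fragments, and query_related_fragments, as well as the two sample pages, are invented here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text fragments, skipping script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def text_fragments(html):
    """Return the visible text fragments of an HTML string, in page order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


def query_related_fragments(page, other_pages):
    """Return fragments of `page` that appear in none of `other_pages`."""
    template = set()
    for other in other_pages:
        template.update(text_fragments(other))
    return [f for f in text_fragments(page) if f not in template]


# Made-up result pages for two different queries on the same site.
PAGE_A = """<html><body><div>Home | Search | Help</div>
<ul><li>Elvis Presley</li><li>Elvis Costello</li></ul>
<div>Advertisement</div></body></html>"""
PAGE_B = """<html><body><div>Home | Search | Help</div>
<ul><li>Miles Davis</li></ul>
<div>Advertisement</div></body></html>"""

print(query_related_fragments(PAGE_A, [PAGE_B]))
# prints ['Elvis Presley', 'Elvis Costello']
```

Nothing in the code refers to where the navigation bar or the advertisement sits, which is the point of the exercise. A real system would also need to handle near-duplicate text and results that happen to be shared between queries.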
There are many possible 7001-miniprojects. Feel free to
talk to either of us for more details. Here are a few possibilities to
consider:
1. Develop a Web-based demo for clustering pages of a
similar type from a single Deep Web source. For
example, AllMusic produces three types of pages in response to a user query: a
direct match page (e.g. for Elvis Presley), a list of links to match pages
(e.g. a list of all artists named
2. Design a system for extracting interesting data from a
collection of pages from a Deep Web source. You might define a set of regular
expressions that can identify dates, prices, or names. Develop a small program
that converts a page into a type structure. For example, given a DOM model of a
web page, identify all of the types that you have defined, and replace the
string tokens with XML tags identifying the types. Replace all non-type tokens
with a generic type, and return the tree as a full type structure.
Alternatively, you may suggest your own approach for extracting data.
3. Develop a system to recognize names in a page. Given a
list of names and a web page, identify possible matches in the page. Based on
the structure of the page and the distribution of recognized names, identify
strings that may also be names based on their location in the DOM tree
hierarchy representing the page.
4. Write a survey paper about current approaches for
understanding and analyzing the Deep Web. Be sure to include many of your own
comments on the viability of the approaches you review.
5. Or, feel free to suggest a
miniproject of your own.
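As a starting point for possibility 2, the type-structure conversion might look roughly like the sketch below. It assumes the page is well-formed enough to parse with xml.etree.ElementTree, and the type names (DATE, PRICE, TEXT) and their patterns are hypothetical placeholders. Recognizing names would need more than a regular expression and is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical type definitions; the patterns are deliberately simple.
TYPE_PATTERNS = [
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("PRICE", re.compile(r"\$\d+(?:\.\d{2})?")),
]


def token_type(token):
    """Return the first type whose pattern matches the whole token, else TEXT."""
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "TEXT"  # generic type for all non-type tokens


def type_structure(element):
    """Copy `element`, replacing each whitespace-separated text token
    with an empty tag named after its type. Attributes are not copied."""
    typed = ET.Element(element.tag)
    for token in (element.text or "").split():
        ET.SubElement(typed, token_type(token))
    for child in element:
        typed.append(type_structure(child))
        # Text that follows a child element lives in its `tail`.
        for token in (child.tail or "").split():
            ET.SubElement(typed, token_type(token))
    return typed


# Made-up page fragment.
page = ET.fromstring(
    "<div><b>Concert</b> on 2008-06-18 <span>tickets $25.00</span></div>"
)
print(ET.tostring(type_structure(page), encoding="unicode"))
```

The printed tree keeps the div, b, and span elements and replaces each word with a TEXT, DATE, or PRICE tag, so pages that share a layout but differ in content map to the same type structure.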
Extracting information from semistructured Web documents is
an important task for many information agents. Over the past few years,
researchers have developed an
extensive family of generic information extraction techniques based on
supervised approaches that learn extraction rules from user-labeled training
examples.
However, annotating training data can be expensive when
thousands of data sources must be wrapped. Web Data Miner, a semisupervised IE
system, produces extraction rules without detailed annotation of the training
documents. Instead, it takes as an example a rough segment that contains
everything to be extracted from one record.
Web Data Miner is designed with visualization support such
that it displays the discovered records in a spreadsheet-like table for
schema assignment. Experiments show that Web Data Miner performs well for
program-generated Web pages with very few training pages and little user
intervention.
Index Terms: semistructured data, Web data extraction, multiple string
alignment, rule generalization
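The index terms mention string alignment and rule generalization without giving details. As a rough illustration of the general idea only, the sketch below aligns token sequences pairwise with Python's difflib and turns every mismatching span into a wildcard slot. This is a simplification (progressive pairwise alignment rather than true multiple string alignment) and should not be read as Web Data Miner's actual algorithm. The function names and sample records are invented here.

```python
from difflib import SequenceMatcher


def generalize(tokens_a, tokens_b, slot="*"):
    """Align two token lists and replace every mismatching span with `slot`."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rule = []
    for op, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if op == "equal":
            rule.extend(tokens_a[a_start:a_end])
        elif not rule or rule[-1] != slot:
            # replace / insert / delete: collapse into a single slot
            rule.append(slot)
    return rule


def generalize_all(records):
    """Fold `generalize` over all records, starting from the first one."""
    rule = records[0]
    for record in records[1:]:
        rule = generalize(rule, record)
    return rule


# Made-up tokenized records from a program-generated list page.
records = [
    ["<li>", "<b>", "Elvis", "Presley", "</b>", "1935", "</li>"],
    ["<li>", "<b>", "Miles", "Davis", "</b>", "1926", "</li>"],
    ["<li>", "<b>", "Madonna", "</b>", "1958", "</li>"],
]
print(generalize_all(records))
# prints ['<li>', '<b>', '*', '</b>', '*', '</li>']
```

The tags shared by all records survive as literals, and the two slots mark the fields that vary from record to record, which is roughly what a spreadsheet-like view would show as columns.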

