We’re moving right along and liveblogging my Intro to CS class.
What makes the web different from a library?
Anyone can publish to the web! If you have an Internet connection that’s 24/7 you can set up a web server that can serve pages!
You need to be found, however.
Searching the web without a search engine is like searching a library blindfolded. It doesn’t work.
The information on the web isn’t really *validated* information, though.
You could come up with some pretty crazy results with the web as your source. Google and other web search engines are fairly reliable when searching for credible sites. How do they do it? What are their limitations?
The web has an adversarial component to it: web spamming.
Tons of people out there are spamming search engines to misdirect your searches. How do they do it?
Movie clip: Computers and Automation Scare of the 1950s.
It took 45 minutes to type everything in, but it could be recalled instantly. Giant computer that took up an entire room!
The computer’s display was all fake, though; its only purpose was to make the machine look like it was worth the millions of dollars it cost.
Prior to the appearance of the electronic computer, knowledge was stored largely in books. Libraries were the bastion of power and knowledge. Books were expensive and difficult to acquire back in the day.
With the emergence of electronic computers came Database Management Systems: DBMS.
A database is an organized collection of related information; the DBMS is the software that manages it.
It’s easier to find information in a database because it has headers, abstracts, etc. A structure of information.
Comparison to a library card catalog: the whole physical collection has been organized. Computer databases are organized the same way.
Unstructured information: a page from a book, perhaps. You need to read the whole text to get the information, scanning line by line, word by word, until you find what you need.
Scaling factor: how much information can you store?
Libraries: low. Electronic database: moderate. Web: high.
Electronic databases: lots of information stored in one location. There could be a lot of information there, but it’s nothing compared to the web!
Lifetime of content:
Database: mostly static (only going to change if the database manager makes a change)
Web: most of the web is dynamic and very fluid.
Most websites are actually very short-lived. GeoCities got shut down, and your first website probably was too.
Library: index searching, reference, assistance
Database: index searching, database assistance
Web: content, searching mostly
There’s no master list of all web pages.
Organization:
Libraries and databases: planned.
Web: totally dynamic.
Content quality, accuracy:
Libraries: mostly reviewed
Databases: very accurate. Reviewed.
Web: mostly not reviewed, extremely variable. Ex: Wikipedia — anyone can edit it, and most articles do get reviewed eventually, but not to the standard of a traditional reference.
Scope of the material:
Web: highly variable. It’s everything.
Linking:
Libraries: you can go from book to book, but it’s not automatic.
Databases: structured links as part of a design. Some linking.
Web: highly linked. There’s a voting process out there: PageRank algorithm, etc. What is a quality link?
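The link-voting idea can be sketched in a few lines. Below is a toy PageRank iteration over a hypothetical four-page link graph (my own illustration, not Google’s actual code): each page splits its score among the pages it links to, and well-linked pages accumulate more.

```python
# Toy PageRank sketch over a hypothetical 4-page link graph.
# Real search engines do this at the scale of billions of pages.

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal scores
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # split score among outlinks
            for target in outlinks:
                new[target] += damping * share  # each link is a "vote"
        rank = new
    return rank

ranks = pagerank(links)
# Page C receives the most inbound links, so it ends up ranked highest.
```

Here a “quality link” is simply a link from a page that is itself highly ranked, which is the circular definition PageRank resolves by iterating.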
Two people who are pioneers in this field:
Hans Peter Luhn (IBM): KWIC (KeyWord In Context)
Gerard Salton (Cornell): the vector space model
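Luhn’s KWIC idea can be sketched like this: every significant word of a title becomes an index entry, shown with its surrounding context. The titles and stopword list below are hypothetical examples, not Luhn’s original IBM implementation.

```python
# Minimal KWIC (KeyWord In Context) sketch: index each significant word
# of a title together with the words around it.

STOPWORDS = {"a", "an", "and", "of", "in", "on", "the", "to"}

def kwic(titles):
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOPWORDS:
                continue                      # skip insignificant words
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((left, word, right))
    # Sort alphabetically by keyword, as in a printed KWIC index.
    return sorted(entries, key=lambda entry: entry[1].lower())

lines = kwic(["The Art of Computer Programming",
              "Structure and Interpretation of Computer Programs"])
```

Each title thus appears once per significant word, so a reader can find it under “Art”, “Computer”, or “Programming” alike.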
Information retrieval: IR
Corpora of unstructured information -> IR systems with Query tools
The basic idea of an IR system is that you want to have an ad-hoc query.
Ad-hoc: off the cuff. The system should be able to answer anything at any time.
Keyword searching. Very important. When you put in a keyword, you scan a whole bunch of unstructured text, and whenever you find the word, you save that document.
When you’re done, just look at the list of documents.
Problem: suppose you’re searching for Labrador because you want to take a trip there. You might just get a whole bunch of information about dogs instead.
The idea is to filter your search using boolean logic.
Labrador NOT retriever. Labrador NOT dogs. Labrador AND travel. Yep, boolean logic.
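A minimal sketch of the scan-and-filter idea, using hypothetical toy documents (set operations stand in for the boolean operators):

```python
# Keyword scanning with boolean filtering over toy documents.
# The document texts are made-up examples.

docs = {
    1: "labrador retriever puppies for sale friendly dogs",
    2: "travel guide to labrador and newfoundland canada",
    3: "best hiking trails for you and your dog",
}

def matching(term):
    """Scan every document; keep the ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

# "Labrador NOT retriever": set difference removes the dog pages.
hits = matching("labrador") - matching("retriever")

# "Labrador AND travel": set intersection keeps pages with both words.
trip = matching("labrador") & matching("travel")
```

Both queries end up pointing at the travel page only, which is exactly the filtering the boolean operators are for.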
Because the web has no central authority or control, it’s going to be complex and spammy at times. The web is also extremely large. So how does a search engine work?
Search interface: the front end that lets you access the search system.
User interface: a model of how you interact with the system. Gives you tools, commands, etc.
The search interface organizes the interactions between the user and the system.
A good user interface will usually give you tips on how to best interact with it.
Behind the interface, you have the query engine.
The query engine needs to recognize that most queries aren’t going to be high quality. There might be misspellings, vagueness, etc.
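One simple way a query engine might tolerate misspellings is to snap each query word to the closest known index term. This is just a sketch using Python’s standard-library `difflib`; real query engines use far more sophisticated spelling models.

```python
# Hypothetical query-normalization sketch: repair misspelled query words
# by matching them against the vocabulary of indexed terms.

import difflib

index_terms = ["labrador", "retriever", "travel", "newfoundland"]

def normalize(query):
    fixed = []
    for word in query.lower().split():
        # cutoff=0.8 requires a close match; otherwise keep the word as-is
        close = difflib.get_close_matches(word, index_terms, n=1, cutoff=0.8)
        fixed.append(close[0] if close else word)
    return " ".join(fixed)

result = normalize("labrodor travle")  # misspelled query
```

A word with no close index term passes through unchanged, so vague or novel queries still reach the engine.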
Where does the system get its information? With indices.
The search index is an inverted file index: a list of terms, each paired with a postings list of the documents that contain it.
An index needs to be updated fairly often.
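The inverted-file idea can be sketched over toy documents (hypothetical texts, with no stemming or stopword handling):

```python
# Minimal inverted-file index: map each term to a postings list of the
# documents that contain it.

from collections import defaultdict

docs = {
    1: "visit labrador this summer",
    2: "labrador retriever training tips",
}

index = defaultdict(list)              # term -> postings (sorted doc ids)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):     # each term counted once per doc
        index[term].append(doc_id)

# A query now looks up a postings list instead of scanning every document.
postings = index["labrador"]           # documents 1 and 2
```

The whole point of the inverted structure is that query time no longer depends on scanning the full corpus, only on the length of the postings lists involved.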
To do this, you need to crawl the web. You start with seed pages that point to other pages; visit each one, fetch the URL, and parse the page for more links.
This creates a giant list of new URLs that still need to be visited. It takes a while.
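The crawl loop described above can be sketched as a breadth-first walk over a frontier of URLs. The `fetch_links` stub and the tiny in-memory “web” are stand-ins of my own; a real crawler would download pages over HTTP and parse the HTML for links.

```python
# Crawler frontier sketch: breadth-first walk starting from seed pages.

from collections import deque

# A toy in-memory web: each URL maps to the URLs it links to.
toy_web = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": [],
    "c.example": ["seed.example"],
}

def fetch_links(url):
    """Stub for: download the page at `url` and parse out its links."""
    return toy_web.get(url, [])

def crawl(seeds):
    seen = set(seeds)                  # URLs already discovered
    frontier = deque(seeds)            # URLs waiting to be visited
    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):  # parse the page for new URLs
            if link not in seen:       # skip pages we've already found
                seen.add(link)
                frontier.append(link)
    return seen

pages = crawl(["seed.example"])
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other, like `c.example` linking back to the seed.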
That wraps it up! Next time: more on queries and methods of searching.