More liveblogging: CSC-105 Intro to Computer Science

We’re moving right along and liveblogging my Intro to CS class.

zach January 30, 2013 11:35 am

What makes the web different from a library?

zach January 30, 2013 11:36 am

Anyone can publish to the web! If you have an Internet connection that’s 24/7 you can set up a web server that can serve pages!

zach January 30, 2013 11:36 am

You need to be found, however.

zach January 30, 2013 11:37 am

Looking on the web without a search engine is like looking in the library with a blindfold on. Doesn’t work.

zach January 30, 2013 11:37 am

The information on the web isn’t really *validated* information, though.

zach January 30, 2013 11:38 am

You could come up with some pretty crazy results with the web as your source. Google and other web search engines are fairly reliable when searching for credible sites. How do they do it? What are their limitations?

zach January 30, 2013 11:39 am

The web has an adversarial component to it: web spamming.

zach January 30, 2013 11:39 am

Tons of people out there are spamming search engines to misdirect your searches. How do they do it?

zach January 30, 2013 11:40 am

Movie clip: Computers and Automation Scare of the 1950s.

zach January 30, 2013 11:44 am

Took 45 minutes to type in, but people can recall it instantly. Giant computer that took up the entire room!

zach January 30, 2013 11:45 am

The computer’s display was all fake, though. The only reason for it was to make the machine look like it cost millions of dollars, to justify the price.

zach January 30, 2013 11:46 am

Prior to the appearance of the electronic computer, knowledge was stored largely in books. Libraries were the bastion of power and knowledge. Books were expensive and difficult to acquire back in the day.

zach January 30, 2013 11:46 am

With the emergence of electronic computers, we found ourselves with Database Management Systems: DBMS.

zach January 30, 2013 11:47 am

A database is an organized collection of related information; the DBMS is the software that manages it.
(Structured information)

zach January 30, 2013 11:48 am

It’s easier to find information in a database because it has headers, abstracts, etc. A structure of information.

zach January 30, 2013 11:48 am

Comparison to library card catalog. The whole physical database has been organized. Computer databases are organized as well.

zach January 30, 2013 11:49 am

Unstructured information: a page from a book, perhaps. You need to read the whole text to get the information.

You have to scan the text line by line, word by word, to get the information you need.

zach January 30, 2013 11:50 am

Scaling factor: how much information can you store?
Libraries: low. Electronic database: moderate. Web: high.

zach January 30, 2013 11:51 am

Electronic databases: lots of information stored in one location. There could be a lot of information there, but it’s nothing compared to the web!

zach January 30, 2013 11:52 am

Lifetime of content:
Libraries: static
Database: mostly static (only going to change if the database manager makes a change)
Web: most of the web is dynamic and very fluid.

zach January 30, 2013 11:53 am

Most websites are actually very short-lived. GeoCities got shut down, and your first website probably did too.

zach January 30, 2013 11:54 am

Retrieving information:
Library: index searching, reference, assistance
Database: index searching, database assistance
Web: content, searching mostly

zach January 30, 2013 11:54 am

There’s no master list of all web pages.

zach January 30, 2013 11:55 am

Informational organization:
Libraries and databases: planned.
Web: totally dynamic.

zach January 30, 2013 11:56 am

Content quality, accuracy:
Libraries: mostly reviewed
Databases: very accurate. Reviewed.
Web: mostly not reviewed, extremely variable. Ex: Wikipedia. Anyone can edit it, and it’s mostly reviewed, but not as thoroughly.

zach January 30, 2013 11:58 am

Scope of the material:
Libraries: variable.
Database: limited.
Web: highly variable. It’s everything.

zach January 30, 2013 11:59 am

In libraries, you can go from book to book but it’s not automatic.
Databases: structured links as part of a design. Some linking.
Web: highly linked. There’s a voting process out there: PageRank algorithm, etc. What is a quality link?
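
A quick sketch of the "voting" idea, so it's less abstract. This is a simplified power-iteration version of PageRank, not what any real search engine runs; the four-page link graph, the 0.85 damping factor, and the function name `pagerank` are all assumptions made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its score among the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "home".
links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "home" ends up on top because the most pages "vote" for it with a link, and pages nobody links to ("blog", "contact") only keep the baseline score.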

zach January 30, 2013 12:01 pm

Two people who are pioneers in this field:
Hans Peter Luhn (IBM): KWIC (KeyWord In Context)
Gerard Salton (Cornell): vector space model
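
To make KWIC concrete, here is a minimal sketch of a keyword-in-context listing: every occurrence of a keyword, shown with a few words on either side. The function name `kwic`, the three-word window, and the sample sentence are assumptions for illustration, not Luhn's original design.

```python
def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with up to `window` words of context per side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare without trailing punctuation and without regard to case.
        if word.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


sample = ("Libraries store books. The web stores pages, "
          "and search engines index the web for readers.")
for line in kwic(sample, "web"):
    print(line)
```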

zach January 30, 2013 12:02 pm

Information retrieval: IR
Corpora of unstructured information -> IR systems with Query tools

zach January 30, 2013 12:03 pm

The basic idea of an IR system is that you want to have an ad-hoc query.

Ad-hoc: off the cuff. The system should be able to answer anything at any time.

zach January 30, 2013 12:07 pm

Keyword searching. Very important. When you put in a keyword, you’re going to scan a whole bunch of unstructured text, and when you find the word, you’re going to save the document.
When you’re done, just look at the list of documents.
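
Roughly what that scan looks like in code. This is only a sketch: the three tiny "documents", their file names, and the function name `keyword_search` are made up for the example.

```python
def keyword_search(documents, keyword):
    """Scan every document word by word; keep the names of those containing the keyword."""
    keyword = keyword.lower()
    matches = []
    for name, text in documents.items():
        # Strip surrounding punctuation so "Labrador," still counts as "labrador".
        words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
        if keyword in words:
            matches.append(name)
    return matches


# Hypothetical mini-corpus of unstructured text.
documents = {
    "dogs.txt": "The Labrador retriever is a friendly dog.",
    "travel.txt": "Plan your travel to Labrador and Newfoundland.",
    "cats.txt": "Cats mostly ignore search engines.",
}
print(keyword_search(documents, "labrador"))
```

Note that this reads every word of every document on every search, which is fine for three files and hopeless for the web.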

zach January 30, 2013 12:07 pm

Problem: suppose you’re searching for Labrador because you want to take a trip there. You might just get a whole bunch of information about dogs instead.

zach January 30, 2013 12:08 pm

The idea is to filter your search using boolean logic.

zach January 30, 2013 12:08 pm

Labrador NOT retriever. Labrador NOT dogs. Labrador AND travel. Yep, boolean logic.
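
Those three queries can be sketched with plain set membership. The documents and the helper names (`words_of`, `search`, `must_have`, `must_not_have`) are assumptions invented for the example.

```python
def words_of(text):
    """Lowercased set of words in the text, with surrounding punctuation removed."""
    return {w.strip(".,;:!?").lower() for w in text.split()}


def search(documents, must_have=(), must_not_have=()):
    """AND together the must_have terms and NOT out the must_not_have terms."""
    results = []
    for name, text in documents.items():
        words = words_of(text)
        has_all = all(term in words for term in must_have)
        has_banned = any(term in words for term in must_not_have)
        if has_all and not has_banned:
            results.append(name)
    return results


# Hypothetical mini-corpus.
documents = {
    "dogs.txt": "The Labrador retriever is a friendly dog.",
    "travel.txt": "Plan your travel to Labrador and Newfoundland.",
    "history.txt": "Labrador has a long coastline.",
}

# Labrador NOT retriever
print(search(documents, must_have=["labrador"], must_not_have=["retriever"]))
# Labrador AND travel
print(search(documents, must_have=["labrador", "travel"]))
```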

zach January 30, 2013 12:11 pm

Because the web has no central authority or control, it’s going to be complex and spammy at times. The web is also extremely large. So how does a search engine work?

zach January 30, 2013 12:12 pm

Search interface: A special interface that allows you to access the search system.

zach January 30, 2013 12:12 pm

User interface: a motif of how to interact with the system. Gives you tools, commands, etc.

zach January 30, 2013 12:13 pm

The search interface organizes the interactions between the user and the system.

zach January 30, 2013 12:13 pm

A good user interface will usually give you tips on how to best interact with it.

Behind the interface, you have the query engine.

zach January 30, 2013 12:14 pm

The query engine needs to recognize that most queries aren’t going to be high quality. There might be misspellings, vagueness, etc.
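
One way a query engine might cope with misspellings is to snap each unknown word to the closest word it already knows. The sketch below uses `difflib.get_close_matches` from Python's standard library; the vocabulary list, the 0.75 cutoff, and the function name `correct_query` are assumptions, and real engines use far more elaborate methods.

```python
import difflib


def correct_query(query, vocabulary):
    """Replace each unknown query word with its closest vocabulary word, if one is close."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.75)
        # Keep the original word when nothing in the vocabulary is similar enough.
        corrected.append(close[0] if close else word)
    return " ".join(corrected)


# Hypothetical vocabulary; in practice this would come from the search index.
vocabulary = ["labrador", "retriever", "travel", "library", "database"]
print(correct_query("Labradour travle", vocabulary))
```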

zach January 30, 2013 12:15 pm

Where does the system get its information? With indices.

zach January 30, 2013 12:16 pm

The search index is an inverted file index: a lot of terms + postings.
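
A minimal sketch of what "terms + postings" means: each term maps to the list of documents it appears in (its postings), so a lookup no longer has to scan the text. The document IDs, the sample texts, and the function name `build_inverted_index` are assumptions for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to its postings: the IDs of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        seen = set()
        for word in text.split():
            term = word.strip(".,;:!?").lower()
            # Record each document only once per term.
            if term and term not in seen:
                seen.add(term)
                index[term].append(doc_id)
    return dict(index)


# Hypothetical documents keyed by ID.
documents = {
    1: "Labrador retriever puppies",
    2: "Travel to Labrador",
    3: "Travel by train",
}
index = build_inverted_index(documents)
print(index["labrador"])
print(index["travel"])
```

The scanning work happens once, when the index is built; after that a query is just a dictionary lookup.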

zach January 30, 2013 12:17 pm

An index needs to be updated fairly often.

zach January 30, 2013 12:18 pm

To do this, you need to crawl over the web. You start with seed pages that will point to other pages. Go to them, fetch each URL, and parse the page for more links.

zach January 30, 2013 12:19 pm

This will create a giant list of new URLs that need to be parsed. It takes a while.
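
The crawl loop can be sketched like this. To keep it runnable without touching the network, the "web" here is a small in-memory dictionary (`FAKE_WEB`) with made-up example.com pages; the `crawl` function, `LinkParser` class, and `max_pages` limit are likewise assumptions for illustration. Link extraction uses `html.parser.HTMLParser` from the standard library.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> page HTML.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a> '
                            '<a href="http://example.com/missing">?</a>',
}


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, parse its links, queue the unseen ones."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, so nothing is fetched twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be fetched; skip it
            continue
        crawled.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


order = crawl(["http://example.com/"], FAKE_WEB.get)
print(order)
```

A real crawler would swap `FAKE_WEB.get` for an actual HTTP fetch and would also need politeness rules, which this sketch leaves out.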

zach January 30, 2013 12:20 pm

That wraps it up! Next time: more on queries and methods of searching.
