2005-02-08, 21:22 #4
grazzy
Survived the millennium bug

Registered: Mar 2004
Posts: 3,471
These are my notes from the lecture. They are not particularly complete, but probably better than nothing if you did not have the privilege of being there in person. If anything is unclear, just ask. I set the headings "myself". In other words, this is not a manuscript, just my notes. Did I mention that these are just my notes?

If you have questions you can email me, since I don't read WN that often anymore.

Google lecture, Magnus Sandberg, Linköpings Universitet/Lysator
Subject: Building scalable systems for web search and beyond.

Mission: To organize the world's information
Today: 65.9 [million] unique visitors in the USA (by cookie).
50% of searches come from outside the USA. 50% of all searches (75% in Sweden).
3000 employees. 20 offices.

Most common question: what do employees at Google do?
There are several problems at Google.
- Growing amounts of data (already huge).
- Search traffic is growing.
- Maintaining search quality.

Scale of problem
There are more than 8 billion webpages.
10 kB per page on average, plus images and non-web data.
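
To get a feel for the scale, here is a rough back-of-the-envelope calculation using only the two figures above. It assumes that "10 kb" means 10 kilobytes (decimal); images and non-web data are not included.

```python
# Rough size of the raw page text, using the figures from the notes.
PAGES = 8_000_000_000        # "more than 8 billion webpages"
BYTES_PER_PAGE = 10 * 1000   # "10 kb / page", assumed to mean kilobytes

total_bytes = PAGES * BYTES_PER_PAGE
total_terabytes = total_bytes / 1000**4

print(f"{total_terabytes:.0f} TB of page text, before images and non-web data")
```

So the page text alone is on the order of 80 TB, which is already far more than fits on a single machine.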

Dealing with scale
- Hardware/networking. Buy basic/cheap PCs instead of servers.
- Distributed system - many PCs.
- Algorithms/data structures - solving problems in new ways.
- Machine learning, data analysis.
- User interface/intuitiveness.
... more.

PCs are generally cheaper than servers.
Example: 88 rack-mounted PCs (2-CPU Xeon) compared to an IBM eServer.
1/3 of the price, with 22x CPU, 3x RAM, 1x HDD.

Dealing with failures
If a computer lasts 3 years, then with 1000 computers you will need to replace about one every day.
- Replication & Redundancy: Fault tolerant software makes cheap hardware possible.
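
The "one per day" figure can be checked with simple arithmetic. The sketch below assumes every machine lasts exactly three years and that machine ages are spread evenly, which is a simplification of my own and not something stated in the lecture.

```python
LIFETIME_YEARS = 3
FLEET_SIZE = 1000

def expected_failures_per_day(fleet_size, lifetime_years):
    """Average replacements per day, assuming each machine lasts exactly
    lifetime_years and machine ages are evenly spread (a simplification)."""
    lifetime_days = lifetime_years * 365
    return fleet_size / lifetime_days

rate = expected_failures_per_day(FLEET_SIZE, LIFETIME_YEARS)
print(f"{rate:.2f} replacements per day")
```

With 1000 machines this gives roughly 0.9 replacements per day, and the number grows linearly with the size of the fleet.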

Google's index
Structure of index
Looks like a hash-table.
Word1 -> page1, page2, page3 ...
Word2 -> page4,...
Word3 -> ...
1: Use pagerank as a total order.
2: Separate data in "shards". (Similar to LVM on Linux.)
3: Replicate the shards.
4: ..
5: ..
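
A minimal sketch of steps 1 to 3, to make the structure concrete. Everything here is a toy assumption: the pages, the rank scores, the choice to partition by page, and the helper names (`shard_of`, `build_shard`, `search`) are made up for illustration and are not how Google actually does it.

```python
from collections import defaultdict

# Hypothetical toy corpus: page id -> (rank score, text). Scores are made up.
DOCS = {
    "page1": (0.9, "google builds scalable search"),
    "page2": (0.5, "scalable systems need cheap pcs"),
    "page3": (0.7, "search quality matters"),
    "page4": (0.2, "cheap pcs fail often"),
}
NUM_SHARDS = 2
REPLICAS = 2

def shard_of(doc_id):
    # Assumption: partition by page, with a stable (non-randomized) hash.
    return sum(doc_id.encode()) % NUM_SHARDS

def build_shard(docs):
    """Build word -> [page ids], each list sorted by descending rank."""
    index = defaultdict(list)
    for doc_id, (_rank, text) in docs.items():
        for word in set(text.split()):
            index[word].append(doc_id)
    for postings in index.values():
        postings.sort(key=lambda d: -docs[d][0])
    return dict(index)

# Step 2 and 3: split the pages into shards, then keep several copies of each.
shards = []
for s in range(NUM_SHARDS):
    part = {d: v for d, v in DOCS.items() if shard_of(d) == s}
    shards.append([build_shard(part) for _ in range(REPLICAS)])

def search(word):
    """Ask one replica of every shard, then merge by rank (step 1)."""
    hits = []
    for replicas in shards:
        replica = replicas[0]  # any healthy replica would do
        hits.extend(replica.get(word, []))
    return sorted(hits, key=lambda d: -DOCS[d][0])

print(search("scalable"))
```

Because the rank gives a total order, the partial results from different shards can be merged without the shards knowing about each other.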

A search is handled by several clusters of machines. The web server sends the query to the "shard" cluster, where the relevant part of the index is located. The query is then sent to the "doc" cluster, where the snippets of text (shown below each result) are gathered. At the same time, information from spell servers and ad servers is gathered.
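
The query flow could be sketched roughly like this. The backends are plain in-process functions with made-up data, and all names (`query_index_shard`, `fetch_snippet`, `spell_check`, `fetch_ads`, `handle_query`) are hypothetical; the real system would make network calls to separate clusters.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the backend clusters; all data is made up.
INDEX_SHARDS = [
    {"google": [("page1", 0.9)]},
    {"google": [("page7", 0.6)], "search": [("page4", 0.4)]},
]
DOC_STORE = {
    "page1": "Google builds scalable systems.",
    "page4": "Search quality matters.",
    "page7": "Notes from a Google lecture.",
}

def query_index_shard(shard, word):
    return shard.get(word, [])          # "shard" cluster: (page id, rank) hits

def fetch_snippet(doc_id):
    return DOC_STORE[doc_id]            # "doc" cluster: text shown per result

def spell_check(word):
    return None                         # spell server: no suggestion in this toy

def fetch_ads(word):
    return ["ad for " + word]           # ad server

def handle_query(word):
    with ThreadPoolExecutor() as pool:
        # Spell check and ads run at the same time as the index lookup.
        spell_future = pool.submit(spell_check, word)
        ads_future = pool.submit(fetch_ads, word)
        per_shard = pool.map(query_index_shard, INDEX_SHARDS,
                             [word] * len(INDEX_SHARDS))
        hits = sorted((h for part in per_shard for h in part),
                      key=lambda h: -h[1])
        doc_ids = [doc_id for doc_id, _rank in hits]
        snippets = list(pool.map(fetch_snippet, doc_ids))
        return {
            "results": list(zip(doc_ids, snippets)),
            "suggestion": spell_future.result(),
            "ads": ads_future.result(),
        }

print(handle_query("google"))
```

The point of the design is that no single machine needs the whole index or all the page text; the web server only fans the query out and merges the answers.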

Building the index takes several days on hundreds of machines.

Some Google technologies
- GFS - Google file system.
A master manages metadata. Chunks are replicated on at least 3 machines in different locations.
Distributed filesystem. 30 clusters. 2000+ chunk/shard servers. Petabyte filesystem. 2000+ MB/s sustained read/write.
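
As an illustration of the master/chunk idea, here is a small sketch of a master that only keeps metadata and places each chunk on 3 servers in different locations. The server names, the locations and the "least loaded first" rule are my own assumptions, not details from the lecture.

```python
# Hypothetical chunk servers: name -> location. All names are made up.
CHUNK_SERVERS = {
    "cs1": "rack-a", "cs2": "rack-a",
    "cs3": "rack-b", "cs4": "rack-b",
    "cs5": "rack-c", "cs6": "rack-d",
}
REPLICATION = 3

class Master:
    """Holds metadata only: which servers store each chunk."""

    def __init__(self, servers):
        self.servers = servers
        self.chunk_locations = {}                  # chunk id -> server names
        self.load = {name: 0 for name in servers}  # chunks held per server

    def place_chunk(self, chunk_id):
        chosen, used_locations = [], set()
        # Assumed policy: least-loaded servers first, one per location.
        for name in sorted(self.servers, key=lambda n: (self.load[n], n)):
            location = self.servers[name]
            if location not in used_locations:
                chosen.append(name)
                used_locations.add(location)
            if len(chosen) == REPLICATION:
                break
        if len(chosen) < REPLICATION:
            raise RuntimeError("not enough distinct locations")
        for name in chosen:
            self.load[name] += 1
        self.chunk_locations[chunk_id] = chosen
        return chosen

master = Master(CHUNK_SERVERS)
print(master.place_chunk("chunk-0"))
print(master.place_chunk("chunk-1"))
```

If one location goes down, every chunk still has two copies elsewhere, which is what makes the cheap hardware acceptable.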

- GWQ - Google work queue.
Master manages slaves. Allocates CPU/disk/memory to tasks. Servers double as chunk/CPU servers (GFS).

- MapReduce.
Automatic & efficient parallel/distributed/fault-tolerant framework for tasks.
Map - Emit a pair of data (key and value, as in a hash).
Reduce - Reduce all pairs to unique keys.

This allows for efficient programming of large tasks in co-operation with GFS/chunks.
Fault-tolerant system: 1800 of 2000 machines crashed and the job still finished fine.
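
To show the map/reduce idea itself, here is a word count written in that style. This is a single-process sketch: the `map_reduce` driver is my own stand-in and has none of the distribution, GFS integration or fault tolerance of the real framework.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit one (word, 1) pair per word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: collapse all pairs that share a key into one result.
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy driver: run map, group the pairs by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [("page1", "cheap pcs fail"), ("page2", "cheap pcs scale")]
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts)
```

The programmer only writes `map_fn` and `reduce_fn`; since each map call and each reduce call is independent, a framework can spread them over many machines and simply rerun the pieces that were lost when a machine crashed.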

Uses: quality experiments, logfile analysis, machine translation, data processing.
Paper about MapReduce: "OSDI '04".

About Google in general
Who does it?
-Talented people
-Small teams of 3-5 people.
-[Solving] Problems that matter
-Free to explore ideas.
[Google has] Experts in many areas (almost all CS-related areas).

20%-rule - 20% of your time can be spent on alternative research/projects.
25 people in the Zurich office.

Questions after session
Number of machines in clusters: "A lot".
How many datacenters? About 12, on the west and east coasts. Close to users to reduce latency.
Google has saved all snapshots of websites (like archive.org).
No connection to browser development (the Firefox/gbrowser.com rumours). Google has previously hired IE developers, which caused similar rumours.

Bandwidth: No numbers currently. But earlier transfers between datacenters had to be done at night due to high costs.

Incentives at Google: individual bonuses with stock. Group bonuses for outstanding performance by groups or the company.