Till KTH:s startsida Till KTH:s startsida

Recitation #2

1. 

It is supposed that through time, 107 billion individual human beings
have lived. Let us assume that each of these people have an 8 byte id,
a 3 byte encoding of their date of birth, a 3 byte encoding of their
date of death, an 8 byte foreign key to their mother and an 8 byte
foreign key to their father. Then we have 10 bytes to encode the
characteristics of the persons life (their outlook, tastes, cause of
death, basic life trajectory, etc.). Assume that disk blocks are 8K
and with 64-bit addresses. Further assume that B+tree pointers are
64-bits and main memory pages/blocks are also 8K.

a. For person data, what is the blocking factor for a B+ tree leaf
node with forward and backward pointer to sequential blocks.

b. How many disk blocks do we need to represent every person who ever
lived?

c. What is the order of an internal B+tree node?

d. Assuming that we cluster on person ids with completely filled leaf
blocks. What is the maximum height B+tree for an index built over id
as the search key.

e.. For d. above, what is the minimum height tree?

2. 

The system part will require you to install postgresql on your laptops and then tune
brainz to make query answering as fast as possible. You may work with one other partner on this system exercise.

First get the data here: brainz.zip
NOTE: Don't use the admin interface. Instead \i brainz.sql from the command line.

  1. Give the name of all artists or groups who have released an album called "The Wall".
  2. Give the name of all artists or groups who have released an album containing the word "love"
  3. Give the name of all the artists or groups who have not released an album containing the word "Babylon".
  4. Give all the names that are shared among several (more than one) different artists. For example I suspect that the there is more than one artist named 'John Smith'.
  5. Give the names all all tracks that have been released by 'Pink Floyd'
  6. Give a list of the names of all tracks by artists who released albums in the '90's
  7. Names of artists that died at age 27 between 1965 and 1995
  8. Count of artists that died at age 27 between 1965 and 1995?
  9. Count of artists that died at age 28 between 1965 and 1995?


If you are unable to have super-user access to a lap-top, let me know
and I will try to arrange something. Whatever you do, DO NOT run this
system exercise on the school's database server. You will certainly
cause a lot of unintended consequences if you (even try to) do so.

What to hand in:

Give a papers that include you answers to problem 1. For problem 2, print out all SQL used (including index declarations and similar), and a copy of the explain analyze parse tree for each answered query. 

Lärare Michael Minock skapade sidan 27 mars 2015

Lärare Michael Minock ändrade rättigheterna 8 april 2015

Kan därmed läsas av alla och ändras av lärare.
kommenterade 21 april 2015

Will there be a sheet to put our claims, or do we have to hand in written solutions? Thank you.

Lärare kommenterade 21 april 2015

I will bring along a claim sheet.

kommenterade 21 april 2015

Is there a list of the book problems for this recitation? 

Lärare kommenterade 21 april 2015

No it is just what is here. I wanted to add more, but I figure you guys will probably be busy enough with the systems part + projects.

kommenterade 11 april 2016

Hey,

when is the deadline for recitation #2?

kommenterade 11 april 2016

Hi Michael, 

Is this the complete list of problems for recitation 2 or do you still plan on adding more problems?

Lärare kommenterade 12 april 2016

I think we will add a little bit. Hopefully by tomorrow. 

Still, these problems are the bulk of rec #2.

kommenterade 12 april 2016

Great! However, how are we supposed to know when we have optimized brainz to a sufficient level? 

Lärare kommenterade 12 april 2016

We are working on that question. I think you should aim for sub-second performance on all these queries.

Lärare kommenterade 13 april 2016

BTW are others having problem with the build? 

Some have reported that it does not build properly? 

kommenterade 13 april 2016

On my machine (Windows 7, PostgreSQL 9.5.2 compiled by Visual C++ build 1800, 64-bit), importing the database prompted a lot of errors that I was barely able to notice because it prompts tons of stuff at the same time and the errors quickly move out of scrolling reach but it appears that some of the several empty tables that I end up with are also empty for people running Ubuntu.

A screenshot at a random time during the prompt showed that all the errors (at that phase of the import) were of the following form:

CREATE [FUNCTION|TABLE|AGGREGATE]

psql:<full path to brainz.sql>:<line number*>: ERREUR: le rôle « johang_2011_musicbrainz » n'existe pas

Which, when unscrambling the weird encoding (that may or may not be the source of the problem but I followed psql's instructions) and the fact that it's all in French, probably gives something in the likes of:

CREATE [FUNCTION|TABLE|AGGREGATE]

psql:<full path to brainz.sql>:<line number*>: ERROR: the role "johang_2011_musicbrainz" does not exist

*In my screenshot, what I assume to be line numbers were the integers in {734, 771, 791, 814, 832, 854, 877, 896, 927, 939, 957, 970} but that's only a small snapshot of all the prompts I had.

Lärare kommenterade 13 april 2016

OK I think you need to use \i brainz.sql to load.

kommenterade 13 april 2016

That's the command I used that generated the errors shown above. In the end, I end up with the following numbers of records in the following tables :

  • 896347 album, albummeta
  • 876388 album_amazon_asin
  • 436007 album_cdtoc
  • 10563902 albumjoin
  • 3590702 albumwords
  • 187107 annotation
  • 608131 artist, artist_meta
  • 97825 artist_tag
  • 96813 artistalias
  • 1492214 artistwords
  • 440145 cdtoc
  • 218 clientversion
  • 254 country
  • 1413 cp_genres
  • 113 currenstat
  • 226355 gid_redirect
  • 270895 historicalstat
  • 207800 isrc

And less than 10 (mostly 0) in Pending, PendingData, artist_alias, artist_annotation, artist_credit, artist_credit_name, artist_gid_redirect, artist_name, artist_relation, artist_type, automod_election, automod_election_vote, editor, editor_collection, editor_collection_release, editor_preference, editor_subscribe_artist, editor_subscribe_editor, editor_subscribe_label, editor_watch_artist, editor_watch_preferences, editor_watch_release_group_type, editor_watch_release_status, gender

Since there are 180 tables in the database, I only considered those whose first letter is before "k".

kommenterade 13 april 2016

Hi Michael,

Why is it that you changed the questions for the systems part of this recitation? I asked you about the questions of this recitation yesterday, where you said that you probably would add some more questions. Now the questions have completely changed? 

We have used \i brainz.sql to build the database, however as mentioned above it results in multiple errors when building the database using Windows 8 and PostgreSQL 9.5.2.

Furthermore, we are wondering exactly with what you mean with "tune brainz"? Do you mean optimizing the queries or the actual database (if the latter, do you have any recommended reading on this subject since the lecture slides aren't really of much help)? 

Lärare kommenterade 13 april 2016

Sorry about the change. I was just going to add questions, but then as the list got longer and longer, I decided to take some away. Sorry about that. Still I didn't want to give you guys too many questions. Better to have better mix of questions, which we now feel we have. 

Tuning means building indexes mostly. Clustering and system configuration (e.g. memory buffers) might also help.

And using explain analyse. 

I would suggest an empirical approach were you try stuff and see what works best.

En användare har tagit bort sin kommentar
kommenterade 13 april 2016

Are you going to keep changing the questions? It's rather difficult to plan your work when it's rendered useless the next day.

Lärare kommenterade 13 april 2016

Yeah, I know how that feels... Trust me I know. Alas...  

There might be changes that correct or help clarify. But otherwise this is stable. The form of answers we want at recitation is still not decided. Hopefully soon. Unfortunately I am racing against another deadline (not DD2471) and will not have much time to devote to this until the weekend. 

En användare har tagit bort sin kommentar
kommenterade 13 april 2016

I am also having errors when building on Windows 8 with PostgreSQL 9.5.2 - does anyone have any tips? 

kommenterade 13 april 2016

It appears that there is an issue with years for the songs. I realized that "track" only contained 0's (and possibly NULL's) and Maximilian Kaul went further and saw that out of the 10 tables contained a column year (track, ascii_cp_track, cp_not_original_version, cp_original_version, cp_categorized_version, cp_uncategorized_version, cp_every_version, mcp_song, medium_format, nv_track), only medium_format had more than 1 distinct value in that column: instead, it has 14 (spanning from 1877 to 2004).

Thus, question 6 becomes a bit tricky.

Lärare kommenterade 14 april 2016

Apologies to all. There was a problem with the earlier query 6 on recitation #2. This has now been changed to a query that is answerable over the version of music brainz provided. 

kommenterade 14 april 2016

When creating these queries are we allowed to use the set of views that are incorporated into the file brainz.sql? 

Lärare kommenterade 14 april 2016

Sure. 

En användare har tagit bort sin kommentar
kommenterade 15 april 2016

Did anyone figure out #6? Is it doable? Because all I could find are 0's.

kommenterade 15 april 2016

For 1a, I just assumed at first that each person's record had the additional 2 forward and back pointers, adding 16 bytes to the record size, but reading it over again, I'm thinking that each leaf node, each block, has 2 additional pointers, not each person. So whatever the block can hold in terms of a person's record, it needs to make room for 16 bytes for the 2 pointers. Is this correct? Or are there 2 additional forward/back pointers, 16 bytes, per person, not 16 bytes per block?

Lärare kommenterade 15 april 2016

> So whatever the block can hold in terms of a person's record, it needs to make room for 16 bytes for > the 2 pointers.

correct.

kommenterade 15 april 2016

Just to be 200% sure, this is emphatically NOT 2 pointers per person record, just 2 pointers per block?

Lärare kommenterade 15 april 2016

If I could fit 50 person records in a block, why would I waste 100 pointers? I just need a pointer to the next/earlier block.

kommenterade 15 april 2016

Hi Michael,
Are you allowed to create new tables in order to restructure the data in order to tune brainz? 

Lärare kommenterade 15 april 2016

You can set up non-materialized views if you like. 

But letting you have materialized views of the query answers would defeat the point of the exercise. So use common sense and be prepared to defend you choices. 

kommenterade 15 april 2016

Did everyone else get something of this magnitude for the # of rows per table?

table

kommenterade 15 april 2016

Yes, see 20 comments above the one I posted two days ago.

En användare har tagit bort sin kommentar
En användare har tagit bort sin kommentar
Lärare kommenterade 18 april 2016

Thanks for pointing this out. I have no idea how that error made it in. Fixed now.

kommenterade 18 april 2016

Oops, removed the comment to amend it, I'll paste it back: 

For question 4, there seems to be a mix of 2 questions:

"Give all the names that are shared among several (more than one) different artists. For example I suspect" AND

"Give a list of the names of all tracks by artists who released albums in the '90's" AND

"that the there is more than one artist named 'John Smith'."

I guess it is supposed to be something like "Give all the names that are shared among several (more than one) different artists. For example I suspect that the there is more than one artist named 'John Smith'."?

The amendment was about question 6. It's still asking about years, but all my tracks years are 0 like people have commented above. Am I not supposed to look in the years for the tracks? Because albums don't have any column related to year of release as far as I can see.

kommenterade 18 april 2016

Does someone understand which tables the age or birth/death attributes in question 7-9 is supposed to be in? 

kommenterade 18 april 2016

Its my understanding those are the impersonal "begindate" and "enddate" in artist

kommenterade 18 april 2016

so then we need to convert the characters to date type and check that the current begin/end date string does not include "00" somewhere in it because the to_date() function does not like months or days that are 00... hmm...?

kommenterade 18 april 2016

For the age query you can look at the helpful view cp_artist. Otherwise you have to use the to_date() function with a formatting string, e.g. 'YYYY-MM-DD'. I would also look up how the function date_part() and age() behave because using them you can solve this quite easily.

kommenterade 28 april 2016

When will we get results for this recitation?

kommenterade 28 april 2016

I know at least of two people that can already see them on Rapp.

Lärare kommenterade 4 maj 2016

Yeah, the results for recitation #2 have been posted for quite some time. If you have a problem, then take contact with your recitation leader.