How-To: Create Your Own Database File (CSV) of “Famous Quotes” Using Linux Bash Scripts

Posted: 2 January, 2013 in Computers, Linux, Web

I am currently developing an Android application that displays a quote every day. I need a lot of quotes, so I spent quite a while searching for a database of famous or historical quotes before deciding to create my own. Here I will explain how you can build your own database file of quotes on Linux, using a Bash script combined with wget and sed.

First of all, you need a source for your quotes. I just typed “famous quotes database” into Google and picked the first result. That website has a nice feature called “Quotes to your site”, which lets you embed its quotes in your own site. We will use this feature to download the quotes and build a CSV file with all the data.

Downloading the quotes.

We will create a simple loop script to download the quotes and store them in a final file.


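#!/bin/bash
# Request a random quote 1000 times and append every response to final.txt
for i in $(seq 1 1000); do
    # The URL is quoted so the shell does not interpret "?" or "="
    wget -q -O quotes "http://www.quotedb.com/quote/quote.php?action=random_quote"
    cat quotes >> final.txt
done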
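
One commenter below reports that this URL can return the same quote on every request, and that appending an extra “&=” to the URL on each pass works around it. A minimal sketch of that idea follows. The variable names are illustrative, the loop runs only 3 times, and the wget line is commented out so the generated URLs can be inspected without downloading anything:

```shell
#!/bin/bash
# Sketch of the "&=" workaround described in the comments (not the commenter's actual script).
base_url="http://www.quotedb.com/quote/quote.php?action=random_quote"
suffix=""

for i in $(seq 1 3); do
    url="${base_url}${suffix}"
    echo "$url"
    # Uncomment to really download:
    # wget -q -O quotes "$url" && cat quotes >> final.txt
    suffix="${suffix}&="   # one more "&=" for the next request
done
```

The same commenter mentions that after roughly 800 requests the URL becomes too long, so for larger runs the suffix would have to be reset at some point.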

Depending on the number of quotes, this can take a while. When it finishes, every response the site returned is stored in a file named “final.txt”. The content is HTML wrapped in JavaScript, with some text that we don't want.

It will look like this:

document.write('Show me a thoroughly satisfied man and I will show you a failure.<br>');
document.write('<i>More quotes from <a href="http://www.quotedb.com/authors/thomas-edison">Thomas Edison</a></i>');

Now we need to format it.

Parsing the Result.

A parser is something that analyzes a text and extracts the important content from it. What we do here is not a real parser, only text processing, but the result is the same: we extract the information we want from the file.

We will use the sed command to find the patterns we want and simply delete them. We need to get rid of:

  • document.write(
  • </a></i>');
  • <br>');
  • <i>More quotes from <a href="http://www.quotedb.com/authors/thomas-edison">

For that we will use these commands:

sed 's/document.write(//g' final.txt > output.tmp
sed 's/<\/a><\/i>'"'"');//g' -i output.tmp
sed 's/<br>'"'"');//g' -i output.tmp
sed 's/'"'"'<i>More quotes from <a href=\"http:\/\/www.quotedb.com\/authors\/.*\">//g' -i output.tmp

NOTE: You need to escape the special characters, either with the escape character “\” or with a quoting trick. Here I used:

  • From: /                 Used: \/
  • From: '                 Used: '"'"'   (single, double, single, double, single quote)
  • From: “random-author”   Used: .* (matches any character, zero or more times)
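
If the escaping above feels hard to read, here is an alternative sketch of the same four substitutions, run on the two sample lines instead of a real download. It assumes two quoting choices that differ from the commands above: the sed script is wrapped in double quotes, so a single quote needs no trick, and “|” is used as the delimiter, so “/” needs no backslash. The temporary directory is only there to keep the example self-contained:

```shell
#!/bin/bash
# Sketch: clean the two sample lines with one sed call.
workdir=$(mktemp -d)

# A quoted heredoc delimiter ('EOF') keeps the sample text literal.
cat > "$workdir/final.txt" <<'EOF'
document.write('Show me a thoroughly satisfied man and I will show you a failure.<br>');
document.write('<i>More quotes from <a href="http://www.quotedb.com/authors/thomas-edison">Thomas Edison</a></i>');
EOF

# Same four deletions, chained with -e.
# Only the double quote inside the pattern has to be escaped for the shell.
sed -e "s|document\.write(||g" \
    -e "s|</a></i>');||g" \
    -e "s|<br>');||g" \
    -e "s|'<i>More quotes from <a href=\"http://www\.quotedb\.com/authors/.*\">||g" \
    "$workdir/final.txt" > "$workdir/output.tmp"

cat "$workdir/output.tmp"
```

The result is two lines: the quote, still carrying its leading single quote, and the author name on the line after it.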

Then we need to join each pair of result lines and format them into a nice output like “'quote.' - Author”. For that we find the \n between every two lines and substitute it with “' - ”:

sed '$!N;s/\n/'"'"' - /' -i output.tmp
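
To make the join step easier to follow, this small sketch applies the same sed expression to a four-line sample and prints the result instead of editing in place. The second quote and author are placeholders invented for the example:

```shell
#!/bin/bash
# Sketch: join quote/author line pairs on a small sample file.
workdir=$(mktemp -d)

cat > "$workdir/output.tmp" <<'EOF'
'Show me a thoroughly satisfied man and I will show you a failure.
Thomas Edison
'Placeholder quote text.
Placeholder Author
EOF

# $!N        : unless this is the last line, append the next line to the current one
# s/\n/' - / : replace the newline between the two lines with "' - "
joined=$(sed '$!N;s/\n/'"'"' - /' "$workdir/output.tmp")
echo "$joined"
```

Each pair of lines becomes one line of the form 'quote.' - Author. If the file had an odd number of lines, the last one would be printed unchanged.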

Now our file is formatted properly and can be opened with any CSV editor such as Excel or LibreOffice Calc, or processed with cut or whatever tool you prefer.

But there is a problem: since I used an online random generator, some quotes may be duplicated. We need to remove the duplicated lines while keeping the order random:

sort -u --random-sort output.tmp > quotes.txt
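
As a quick check of what this command does, the sketch below runs it on three placeholder lines, one of them a duplicate, and compares the line counts. It assumes GNU sort, since --random-sort is a GNU option:

```shell
#!/bin/bash
# Sketch: deduplicate and shuffle a tiny sample (placeholder quotes).
workdir=$(mktemp -d)

printf '%s\n' \
    "'Quote A.' - Author One" \
    "'Quote B.' - Author Two" \
    "'Quote A.' - Author One" > "$workdir/output.tmp"

# -u drops repeated lines, --random-sort orders the remaining ones randomly
sort -u --random-sort "$workdir/output.tmp" > "$workdir/quotes.txt"

# grep -c '' counts lines without the padding some wc versions add
before=$(grep -c '' "$workdir/output.tmp")
after=$(grep -c '' "$workdir/quotes.txt")
echo "lines before: $before, after: $after"
```

The order of the two remaining lines can change from run to run, so only the counts are printed here.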

Voilà! Here we have our database file.
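
The lines in quotes.txt use the “'quote.' - Author” layout rather than comma-separated fields. If your tool expects a strict CSV, one possible extra step is sketched below. It assumes every line follows that exact layout and treats the text after the last “' - ” as the author. The to_csv function name is made up for this example:

```shell
#!/bin/bash
# Sketch: turn  'quote.' - Author  into  "quote.","Author"
to_csv() {
    # 1st expression: double any " already in the text (the usual CSV escaping)
    # 2nd expression: drop the leading ', split at the last "' - ", wrap both parts in "
    sed -E -e 's/"/""/g' -e "s/^'(.*)' - (.*)\$/\"\\1\",\"\\2\"/"
}

line="'Show me a thoroughly satisfied man and I will show you a failure.' - Thomas Edison"
csv=$(printf '%s\n' "$line" | to_csv)
echo "$csv"
```

On the whole file this would be used as: to_csv < quotes.txt > quotes.csv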

I hope this is useful to you. If you have any questions, leave a comment 🙂

Comments
  1. Uwais A says:

    Thanks for the tutorial. I have a couple of points though.

    1. Do you have a copy of the quotes database for download? If you do, is there a link I could get it from?

    2. This does not work for me because using the URL – http://www.quotedb.com/quote/quote.php?action=random_quote makes the same quote every time. A workaround I found is to add ‘&=’ (without quotes) after the website every time but I don’t know how to do this because I have no experience with bash. So could you please update this so that each time the for loop is carried out an extra &= is added. So in the first download it uses the URL above then the second download has &= after it, the third has &=&= after it, etc.

    Thank you very much I can’t find a good free quotes database either so this has helped me a lot.

    • Greg says:

      Uwais A
      Were you able to find a solution here or a source for a good free quotes database?

      • Uwais A says:

        I did find a solution. I stored the URL http…….random_quote as a string in the bash program then each time it passes through the for loop it adds another stored string, ‘&=’, to the end and then downloads the resulting URL’s webpage. This works about 800 times or so, before you get the error – URL too long.
        I then had to get rid of repeats.

        Here is a link to the database that I came up with in the end:

        https://docs.google.com/file/d/0B4Vk0VOUi2N_b2FPTmR5Y08zTjA/edit?usp=sharing

        This is in SQLite format with 1359 quotes and 300+ authors.
        Sorry if you don’t need it, but there is also an android metadata table included in that file as I have used this database in my app Quote Unquote Cryptogram:

        https://play.google.com/store/apps/details?id=com.apps4life.quoteunquote2

        It’s free – so check it out if you get time. 🙂

        Hope this helps.

  2. CaPPsiE says:

    Hi, I ran the script but set the initial variable to 100,000. It took some time to get through them all and there are undoubtedly duplicates, plus there are some names with non-ASCII characters which caused a problem. My biggest issue, though, was that the quotes were mixed up: some lines had the author and then the quote, others the quote and then the author. I prefer to have the quote first and the author second.

    I’m currently working on the enormous list and adding it to an MSSQL database. From there I can make it available as xls or xlsx if anyone requires...?

    A.
