![Photo by John Michael Thompson](/assets/images/john-michael-thomson-555276-unsplash_1280x720-0a6a521aeb39e538f8d02b20e9a9ef5c.jpg)
It's not unusual for an established principal investigator (PI) to have desks and shelves full of books, journal articles, grant applications, graduate student theses etc. Often, they'd also have a mountain of such material in electronic form and stored on desktop hard drives that sometimes get forgotten. It takes an extremely good memory to remember which articles are relevant to a particular keyword you might be interested in.
Thankfully, there's Oracle Text (and Oracle Application Express (APEX)) to keep you sane!
Setting the Stage
I'm not a researcher myself, so naturally, I don't horde a lot of these materials. For the purpose of this blog post, I will use open data made available, courtesy of the Canadian Government.
The National Sciences and Engineering Council of Canada (NSERC) is one of Canada's federal agencies that is responsible for promoting and providing funding for research in the natural sciences and engineering fields. They publish annual lists of grants that were successfully awarded. You can find them here. I used only one year's (2017) worth of data. There's a lot of research activity going on in the north!
Start off by first creating the necessary data structure to hold the data. I used the Quick SQL tool to quickly generate the necessary DDL script and then create the database.
Quick SQL Shorthand
# settings = { genPK: false, db: 11g }
award
id number /pk
name varchar2(200)
department varchar2(200)
organization_id number
institution varchar2(200)
province varchar2(100)
country varchar2(100)
fiscal_year number
competition_year number
award_amount number
program_id varchar2(50)
program_name varchar2(200)
program_group varchar2(100)
committee_code number
committee_name varchar2(200)
area_of_application_code number
area_of_application_group varchar2(200)
area_of_application varchar2(200)
research_subject_code number
research_subject_group varchar2(200)
research_subject varchar2(200)
application_title varchar2(500)
source varchar2(50)
application_summary varchar2(4000)
Oracle SQL Output
create table award (
id number not null constraint award_id_pk primary key,
name varchar2(200),
department varchar2(200),
organization_id number,
institution varchar2(200),
province varchar2(100),
country varchar2(100),
fiscal_year number,
competition_year number,
award_amount number,
program_id varchar2(50),
program_name varchar2(200),
program_group varchar2(100),
committee_code number,
committee_name varchar2(200),
area_of_application_code number,
area_of_application_group varchar2(200),
area_of_application varchar2(200),
research_subject_code number,
research_subject_group varchar2(200),
research_subject varchar2(200),
application_title varchar2(500),
source varchar2(50),
application_summary varchar2(4000)
)
;
NOTE
Change the DB version as required. The default is 12c, but since I am running the database Express Edition (XE), I used 11g.
Filling Up the Repository
The data from NSERC is fairly clean and formatted. The 2017 data that I downloaded however, was a little too large to import using the built-in SQL Workshop tool in APEX. Fortunately, SQL Developer has a rather sophisticated functionality for doing the same, though it wasn't without challenges. I'll leave data importing methods for another day's discussion. For the purpose of this exercise, I did managed to import at least 5,000 records.
Search-Enabling the Table of Grants Awarded
Oracle Text is a very powerful, no-cost feature of the Oracle Database that doesn't get enough attention as it should. There are many interesting concepts and customizations that can be applied and tune the Oracle Text Index. However, creating a basic index is rather straightforward and should not be feared.
My goal:
- List grant awards that are about fossil fuels and the effects on the environment.
- The corpus contains applications written in both English and French. Users should be able to search with either language.
- Users enter search terms using only one text field, but search must include the following columns:
application_title
application_summary
research_subject
area_of_application
Required Database Privileges
To work with Oracle Text, the database user requires a few additional privileges that have already been granted to the role CTXAPP
. Assigning the user to the role should be all that's needed:
Create the Oracle Text Index.
begin
begin
ctx_ddl.drop_preference('SEARCH_DATASTORE');
exception
when others then null;
end;
ctx_ddl.create_preference('SEARCH_DATASTORE', 'MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('SEARCH_DATASTORE', 'COLUMNS', 'application_title, application_summary, research_subject, area_of_application');
begin
execute immediate 'drop index award_idx';
exception
when others then null;
end;
execute immediate q'[
create index award_idx on award(application_title)
indextype is ctxsys.context
parameters (q'{
DATASTORE SEARCH_DATASTORE
}')
]';
end;
/
The comments in the procedure shown above explains the basic commands needed to create the index. For convenience, they are listed below:
- Remove any previously created preference with the same name.
- Create the DATASTORE preference.
- Drop any existing index with the same name.
- Create the index.
As mentioned, there are additional indexing elements that could be used to further optimise and customize the search index, but they are out of scope at this time. This simple setup will already allow some powerful search abilities with minimal query design.
Query the Data
There are different operators to use in the WHERE
clause, depending on the index type created. In the example, an index of type CONTEXT
was created, thus the query will have to be constructed using the CONTAINS
operator.
select
id
, application_title
, application_summary
from award
where 1 = 1
and contains(application_title, '"fossil fuels" and environment', 1) > 0
order by search_score desc;
Oracle provides a long list of CONTAINS
query operators that allows users to formulate powerful queries. The example above is a simple boolean search (AND
operator) requiring both the phrase "fossil fuels" and word "environment" for a match. A simple misspelling of "environment" would likely yield no results. That's where other operators like fuzzy
provide a margin error, allowing returning results even when there are no exact matches. The same operator might even forgive simple nuances like "colour" versus "color". There are also wildcard (%
) and other operators that are denoted using special characters that could sometimes spell trouble even for a seasoned Oracle Text user.
To protect the untrained user, it is recommended to pre-process the user input and restructure keywords and options into proper query operator constructs. This isn't always easy, and so, we are fortunate that Roger Ford, Oracle Text's Product Manager, posted a blog and useful utility package for parsing user inputs from a simple text field. The syntax closely resembles Google's search operators and should be familiar with most users. The code is available to download at the end of Ford's article.
After compiling the script, functions such as parser.simpleSearch
and parser.andSearch
will be available to use. Check the comments in the code for details about each function.
SQL> select parser.andsearch('"fossil fuels" environment') from dual;
Querying with APEX