Custom Extractors
Solr, and by extension Riak Search, has default extractors for a wide variety of data types, including JSON, XML, and plaintext. Riak Search ships with the following extractors:
Content Type | Erlang Module
---|---
`application/json` | `yz_json_extractor`
`application/xml` | `yz_xml_extractor`
`text/plain` | `yz_text_extractor`
`text/xml` | `yz_xml_extractor`
No specified type | `yz_noop_extractor`
There are also built-in extractors for Riak Data Types.
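If you’d like to see what one of these extractors produces for a given value, you can run it directly from a Riak node’s Erlang shell using `yz_extractor:run/2` (the same helper used in the verification step later on this page). The session below is a minimal sketch with the plaintext extractor; the exact output shape shown is illustrative:

```erlang
%% Run the built-in plaintext extractor against a value; it indexes the
%% entire value under a single `text` field (output shown is illustrative):
> yz_extractor:run(<<"hello">>, yz_text_extractor).
[{text, <<"hello">>}]
```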
If you’re working with a data format that does not have a default Solr extractor, you can create your own and register it with Riak Search. We’ll show you how to do so by way of example.
The Extractor Interface
Creating a custom extractor involves creating an Erlang module that implements two functions:

* `extract/1` — Takes the contents of the object and calls `extract/2` with the same contents and an empty list of options
* `extract/2` — Takes the contents of the object and returns an Erlang proplist with a single field name and a single value associated with that name
The following extractor shows how a pure text extractor implements those two functions:
```erlang
-module(search_test_extractor).
-include("yokozuna.hrl").
-compile(export_all).

extract(Value) ->
    extract(Value, []).

extract(Value, Opts) ->
    FieldName = field_name(Opts),
    [{FieldName, Value}].

-spec field_name(proplist()) -> any().
field_name(Opts) ->
    proplists:get_value(field_name, Opts, text).
```
This extractor takes the contents of a `Value` and returns a proplist with a single field name (in this case `text`) and the single value. This function can be run in the Erlang shell. Let’s run it, providing the text `hello`:
```erlang
> c(search_test_extractor).
%% {ok, search_test_extractor}

> search_test_extractor:extract("hello").
%% Console output:
[{text, "hello"}]
```
Upon running this command, the value `hello` would be indexed in Solr under the field name `text`. If you wanted to find all objects with a `text` field that begins with `Fourscore`, you could use the Solr query `text:Fourscore*`, to give just one example.
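To make that concrete, running the extractor shown above on a sample value that begins with `Fourscore` yields the field/value pair that such a query would match:

```erlang
> search_test_extractor:extract("Fourscore and seven years ago").
%% Console output:
[{text, "Fourscore and seven years ago"}]
```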
An Example Custom Extractor
Let’s say that we’re storing HTTP header packet data in Riak. Here’s an example of such a packet:
```
GET http://www.google.com HTTP/1.1
```
We want to register the following information in Solr:
Field name | Value | Extracted value in this example
---|---|---
`method` | The HTTP method | `GET`
`host` | The URL’s host | `www.google.com`
`uri` | The URI, i.e. what comes after the host | `/`
The example extractor below would provide the three desired fields/values. It relies on the `decode_packet` function from Erlang’s standard library.
```erlang
-module(yz_httpheader_extractor).
-compile(export_all).

extract(Value) ->
    extract(Value, []).

%% In this example, we can ignore the Opts variable from the example
%% above, hence the underscore:
extract(Value, _Opts) ->
    {ok,
        {http_request,
         Method,
         {absoluteURI, http, Host, undefined, Uri},
         _Version},
        _Rest} = erlang:decode_packet(http, Value, []),
    [{method, Method},
     {host, list_to_binary(Host)},
     {uri, list_to_binary(Uri)}].
```
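To see why the pattern match in `extract/2` works, it helps to look at what `erlang:decode_packet/3` returns for the sample request line. The shell session below is a sketch; the return value follows the shape documented for `decode_packet` and supplies the `Method`, `Host`, and `Uri` values the extractor binds:

```erlang
%% Sketch of decode_packet/3 output for the sample request line; the
%% absoluteURI tuple carries the Method, Host, and Uri matched above:
> erlang:decode_packet(http, <<"GET http://www.google.com HTTP/1.1\n">>, []).
{ok,{http_request,'GET',
                  {absoluteURI,http,"www.google.com",undefined,"/"},
                  {1,1}},
    <<>>}
```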
This code will be stored in a `yz_httpheader_extractor.erl` file (as Erlang filenames must match the module name). Now that our extractor has been written, it must be compiled and registered in Riak before it can be used.
Registering Custom Extractors
In order to use a custom extractor, you must create a compiled `.beam` file out of your `.erl` extractor file and then tell Riak where that file is located. Let’s say that we have created a `search_test_extractor.erl` file in the directory `/opt/beams`. First, we need to compile that file:
```bash
erlc search_test_extractor.erl
```
To instruct Riak where to find the resulting `search_test_extractor.beam` file, we’ll need to add a line to an `advanced.config` file in the node’s `/etc` directory (more information can be found in our documentation on advanced configuration). Here’s an example:
```erlang
[
  %% Other configs
  {vm_args, [
    {"-pa /opt/beams", ""}
  ]},
  %% Other configs
]
```
This will instruct the Erlang VM on which Riak runs to look for compiled `.beam` files in the proper directory. You should restart the node at this point. Once the node has been restarted, you can use the node’s Erlang shell to register the `yz_httpheader_extractor`. First, attach to the shell:
```bash
riak attach
```
At this point, we need to choose a MIME type for our extractor. Let’s call it `application/httpheader`. Once you’re in the shell:
```erlang
> yz_extractor:register("application/httpheader", yz_httpheader_extractor).
```
If successful, this command will return a list of currently registered extractors. It should look like this:
```erlang
[{default,yz_noop_extractor},
 {"application/httpheader",yz_httpheader_extractor},
 {"application/json",yz_json_extractor},
 {"application/riak_counter",yz_dt_extractor},
 {"application/riak_map",yz_dt_extractor},
 {"application/riak_set",yz_dt_extractor},
 {"application/xml",yz_xml_extractor},
 {"text/plain",yz_text_extractor},
 {"text/xml",yz_xml_extractor}]
```
If the `application/httpheader` extractor is part of that list, then the extractor has been successfully registered.
Verifying Our Custom Extractor
Now that Riak Search knows how to decode and extract HTTP header packet data, let’s store some in Riak and then query it. We’ll put the example packet data from above in a `google_packet.bin` file. Then, we’ll `PUT` that binary to Riak’s `/search/extract` endpoint:
```bash
# Note that we use our custom MIME type in the Content-Type header
curl -XPUT $RIAK_HOST/search/extract \
  -H 'Content-Type: application/httpheader' \
  --data-binary @google_packet.bin
```
That should return the following JSON:
```json
{
  "method": "GET",
  "host": "www.google.com",
  "uri": "/"
}
```
We can also verify this in the Erlang shell (whether in a Riak node’s Erlang shell or otherwise):
```erlang
yz_extractor:run(<<"GET http://www.google.com HTTP/1.1\n">>, yz_httpheader_extractor).
%% Console output:
[{method,'GET'},{host,<<"www.google.com">>},{uri,<<"/">>}]
```
Indexing and Searching HTTP Header Packet Data
Now that Solr knows how to extract HTTP header packet data, we need to create a schema that extends the default schema. The following fields should be added to `<fields>` in the schema, which we’ll name `http_header_schema` and store in a `http_header_schema.xml` file:
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="http_header_schema" version="1.5">
<fields>
  <!-- other required fields here -->

  <field name="method" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="host" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="uri" type="string" indexed="true" stored="true" multiValued="false"/>
</fields>
</schema>
```
Now, we can store the schema:
```java
import org.apache.commons.io.FileUtils;

File xml = new File("http_header_schema.xml");
String xmlString = FileUtils.readFileToString(xml);
YokozunaSchema schema = new YokozunaSchema("http_header_schema", xmlString);
StoreSchema storeSchemaOp = new StoreSchema.Builder(schema).build();
client.execute(storeSchemaOp);
```

```ruby
schema_xml = File.read('http_header_schema.xml')
client.create_search_schema('http_header_schema', schema_xml)
```

```php
$schema_string = file_get_contents('http_header_schema.xml');

(new \Riak\Riak\Command\Builder\StoreSchema($riak))
  ->withName('http_header_schema')
  ->withSchemaString($schema_string)
  ->build()
  ->execute();
```

```python
import io

schema_xml = open('http_header_schema.xml').read()
client.create_search_schema('http_header_schema', schema_xml)
```

```bash
curl -XPUT $RIAK_HOST/search/schema/http_header_schema \
  -H 'Content-Type: application/xml' \
  --data-binary @http_header_schema.xml
```
Riak now has our schema stored and ready for use. Let’s create a search index called `header_data` that’s associated with our new schema:
```java
YokozunaIndex headerDataIndex = new YokozunaIndex("header_data", "http_header_schema");
StoreSearchIndex storeIndex = new StoreSearchIndex.Builder(headerDataIndex)
        .build();
client.execute(storeIndex);
```

```ruby
client.create_search_index('header_data', 'http_header_schema')
```

```php
(new \Riak\Riak\Command\Builder\StoreIndex($riak))
  ->withName('header_data')
  ->usingSchema('http_header_schema')
  ->build()
  ->execute();
```

```python
client.create_search_index('header_data', 'http_header_schema')
```

```bash
curl -XPUT $RIAK_HOST/search/index/header_data \
  -H 'Content-Type: application/json' \
  -d '{"schema":"http_header_schema"}'
```
Now, we can create and activate a bucket type for all of the HTTP header data that we plan to store. Any bucket that bears this type will be associated with our `header_data` search index. We’ll call our bucket type `http_data_store`.
```bash
riak-admin bucket-type create http_data_store '{"props":{"search_index":"header_data"}}'
riak-admin bucket-type activate http_data_store
```
Let’s use the same `google_packet.bin` file that we used previously and store it in a bucket with the `http_data_store` bucket type, making sure to use our custom `application/httpheader` MIME type:
```java
Location key = new Location(new Namespace("http_data_store", "packets"), "google");
File packetData = new File("google_packet.bin");
byte[] packetBinary = FileUtils.readFileToByteArray(packetData);

RiakObject packetObject = new RiakObject()
        .setContentType("application/httpheader")
        .setValue(BinaryValue.create(packetBinary));

StoreValue storeOp = new StoreValue.Builder(packetObject)
        .setLocation(key)
        .build();
client.execute(storeOp);
```

```ruby
packet_data = File.read('google_packet.bin')
bucket = client.bucket_type('http_data_store').bucket('packets')
obj = Riak::RObject.new(bucket, 'google')
obj.content_type = 'application/httpheader'
obj.raw_data = packet_data
obj.store
```

```php
$object = new Object(file_get_contents("google_packet.bin"), ['Content-Type' => 'application/httpheader']);

(new \Riak\Riak\Command\Builder\StoreObject($riak))
  ->buildLocation('google', 'packets', 'http_data_store')
  ->withObject($object)
  ->build()
  ->execute();
```

```python
packet_data = open('google_packet.bin').read()
bucket = client.bucket_type('http_data_store').bucket('packets')
obj = RiakObject(client, bucket, 'google')
obj.content_type = 'application/httpheader'
obj.data = packet_data
obj.store()
```

```bash
curl -XPUT $RIAK_HOST/types/http_data_store/buckets/packets/keys/google \
  -H 'Content-Type: application/httpheader' \
  --data-binary @google_packet.bin
```
Now that we have some header packet data stored, we can query our `header_data` index on whatever basis we’d like. First, let’s verify that we’ll get one result if we query for objects that have the HTTP method `GET`:
```java
// Using the same method from above:
String query = "method:GET";

// Again using the same method from above:
int numberFound = results.numResults(); // 1
```

```ruby
results = client.search('header_data', 'method:GET')
results['num_found'] # 1
```

```php
$response = (new \Riak\Riak\Command\Search\FetchObjects($riak))
  ->withQuery('method:GET')
  ->withIndexName('header_data')
  ->build()
  ->execute();

$response->getNumFound();
```

```python
results = client.fulltext_search('header_data', 'method:GET')
results['num_found'] # 1
```

```bash
curl "$RIAK_HOST/search/query/header_data?wt=json&q=method:GET"

# This should return a fairly large JSON object with a "num_found" field.
# The value of that field should be 1.
```