Some languages (like Romanian) have special characters (diacritics, often called accent marks). It’s generally useful to remove diacritic marks from characters, for example when you create an index with Solr. You don’t want to index text with these characters, because a search should find both spellings of a word, for example “proprietăți” and “proprietati”. If you are using Solr to index your text, you have to create a Solr filter.
First of all, you have to put the filter in the schema.xml configuration file:


<fieldtype name="text_st" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!-- ... some other filters, for example the lower case filter -->
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="ro.tremend.solr.diacritics.DiacriticsFilterFactory"/>
            </analyzer>
</fieldtype>

Then create three small classes and a properties file. First, the filter factory for Solr, DiacriticsFilterFactory:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

/**
 * Create a Solr Filter Factory for diacritics
 * 
 * @author Sebastian
 * 
 */
public class DiacriticsFilterFactory extends BaseTokenFilterFactory {
	public TokenStream create(TokenStream input) {
		return new DiacriticsFilter(input);
	}
}

Now create the filter class, DiacriticsFilter:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.*;
import java.io.IOException;

/**
 * Create the diacritics filter
 * 
 * @author Sebastian
 * 
 */
public final class DiacriticsFilter extends TokenFilter {
	public DiacriticsFilter(TokenStream in) {
		super(in);
	}

	public final Token next() throws IOException {
		Token t = input.next();

		if (t == null)
			return null;

		t.setTermText(DiacriticsUtils.replaceDiacritics(t.termText()));
		return t;
	}
}

And finally the class that does the work, DiacriticsUtils:

package ro.tremend.solr.diacritics;

import java.util.HashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Set;

/**
 * Replace Romanian characters
 * 
 * @author Sebastian
 * 
 */
public class DiacriticsUtils {
	private static Map<String, String> diacritics = new HashMap<String, String>();

	static {
		// Get diacritics from diacritics.properties
		try {
			ResourceBundle resource = ResourceBundle.getBundle("diacritics");
			Set<String> keySet = resource.keySet();
			for (String key : keySet) {
				diacritics.put(key, resource.getString(key));
			}
		} catch (MissingResourceException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Replace all diacritics in a string
	 * 
	 * @param s the string
	 * @return the string without diacritics
	 */
	public static String replaceDiacritics(String s) {
		for (String key : diacritics.keySet()) {
			// replaceAll interprets each key as a regular expression
			s = s.replaceAll(key, diacritics.get(key));
		}
		return s;
	}

	public static Map<String, String> getDiacritics() {
		return diacritics;
	}
}
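
To try the replacement loop without Solr or a properties file, a minimal standalone sketch could look like the one below. The class name DiacriticsDemo and the hard-coded map are assumptions made for illustration (they stand in for diacritics.properties), and the sketch uses String.replace, which matches literally, instead of replaceAll.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone sketch of the replacement loop; the hard-coded map is an assumption. */
final class DiacriticsDemo {
    // Hypothetical subset of Romanian mappings, standing in for diacritics.properties.
    private static final Map<String, String> DIACRITICS = new LinkedHashMap<String, String>();

    static {
        DIACRITICS.put("\u0102", "A"); // A with breve
        DIACRITICS.put("\u0103", "a"); // a with breve
        DIACRITICS.put("\u00C2", "A"); // A with circumflex
        DIACRITICS.put("\u00E2", "a"); // a with circumflex
        DIACRITICS.put("\u00CE", "I"); // I with circumflex
        DIACRITICS.put("\u00EE", "i"); // i with circumflex
        DIACRITICS.put("\u0218", "S"); // S with comma below
        DIACRITICS.put("\u0219", "s"); // s with comma below
        DIACRITICS.put("\u021A", "T"); // T with comma below
        DIACRITICS.put("\u021B", "t"); // t with comma below
    }

    /** Replaces every mapped character with its plain counterpart. */
    static String replaceDiacritics(String s) {
        for (Map.Entry<String, String> e : DIACRITICS.entrySet()) {
            // String.replace matches literally, so no regex escaping is involved.
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    public static void main(String[] args) {
        // The Romanian word from the introduction, written with escapes.
        System.out.println(replaceDiacritics("propriet\u0103\u021Bi")); // prints "proprietati"
    }
}
```

Characters that are not in the map pass through unchanged, so the map has to list every character you expect in your text.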

This class needs a properties file, diacritics.properties, on the classpath, listing the diacritics you want to replace:

\\u0102=A
\\u0103=a
... define all your language specific characters
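
The number of backslashes in these keys matters, so here is a small sketch of how java.util.Properties (the format behind a properties ResourceBundle) parses the two possible spellings. The class and method names are made up for this illustration. With one backslash the loader decodes the escape into the character itself; with two, the key stays a literal six-character string, which still matches the character when it is later used as a regular expression by replaceAll.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch: how java.util.Properties parses one vs. two backslashes before u0102. */
final class PropertiesEscapeDemo {
    /** Parses a single "key=value" line and returns its only key. */
    static String soleKey(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return p.stringPropertyNames().iterator().next();
    }

    public static void main(String[] args) {
        // In the file: one backslash. The loader decodes the escape to the character itself.
        String decoded = soleKey("\\u0102=A");
        // In the file: two backslashes. The key stays as six literal characters.
        String literal = soleKey("\\\\u0102=A");

        System.out.println(decoded.length()); // 1
        System.out.println(literal.length()); // 6

        // Both keys match the character when used as a regex with replaceAll.
        String aBreve = String.valueOf((char) 0x0102);
        System.out.println(aBreve.replaceAll(decoded, "A")); // A
        System.out.println(aBreve.replaceAll(literal, "A")); // A
    }
}
```

Only the decoded, single-character form would also work with a literal String.replace.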

 


Now the index will not contain diacritics, but you have to remove them from the query text too. To do that in your own code, just write this:

textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
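
If the code that builds the query cannot reach DiacriticsUtils or its properties file, one possible alternative is java.text.Normalizer. This is a different technique from the mapping table above, and QueryNormalizer is a hypothetical name used only for this sketch. It decomposes each character, drops the combining marks, and lower-cases the result to mirror the LowerCaseFilterFactory in the analyzer chain. It only helps if it yields the same terms as the index-side filter for your characters.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Sketch of query-side normalization with java.text.Normalizer (an alternative technique). */
final class QueryNormalizer {
    /** Strips combining marks and lower-cases the text before it is sent as a query. */
    static String normalize(String text) {
        // NFD splits an accented character into base character + combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base characters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String textToFind = "Propriet\u0103\u021Bi";
        System.out.println(normalize(textToFind)); // prints "proprietati"
    }
}
```

Letters without a canonical decomposition (for example the Polish "ł") are left as they are, so a mapping table is still needed for those.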

 


I hope this will help.

8 responses to “Create a Solr filter that replaces diacritics”

  1. Unless you have a very specific need to define a custom list of diacritics to remove, the ISOLatin1AccentFilter that comes with Lucene (and the ISOLatin1AccentFilterFactory that comes with Solr) should solve this problem for you without any custom code.

    (ISOLatin1AccentFilter has had a lot of speed improvements in the trunk, which should make it into the next version of Solr as well; see LUCENE-871 in the Apache Jira for more info.)

  2. Hello,

    Great example.
    I am new to Solr and am trying to understand where I should apply the last step mentioned:
    —————————————————————
    Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:

    textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
    ————————————————————–

    I am building an application where I need to replace an incoming string with another one specific to our taxonomy.
    Please advise.

  3. You can also replace diacritics with their base characters using java.text.Normalizer.
    This is useful when you don’t know which diacritics can appear: the code extracts the base character of any letter that has a canonical decomposition.

    This is the code:
    public String toBase(String sText) {
        // Requires: import java.text.Normalizer;
        StringBuilder sAux = new StringBuilder();

        for (int i = 0; i < sText.length(); i++) {
            // NFD splits an accented character into its base character plus combining marks
            String sLetter = Normalizer.normalize(String.valueOf(sText.charAt(i)), Normalizer.Form.NFD);

            // Keep only the first character of the decomposition, which is the base character
            sAux.append(sLetter.charAt(0));
        }
        return sAux.toString();
    }

  4. I’m trying to add a factory in Solr for tokenizing Arabic text, but I
    receive an error (shown at the end of this message). Can you help me solve this problem, please?

    Here is my code:

    package org.apache.solr.analysis;

    import gpl.pierrick.brihaye.aramorph.lucene.ArabicTokenizer;
    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class ArabicTokenizerFactory extends BaseTokenizerFactory {
        public TokenStream create(Reader input) {
            return new ArabicTokenizer(input);
        }
    }

    Thanks in advance

    HTTP Status 500 – Severe errors in solr configuration. Check your log
    files for more detailed information on what may be wrong. If you want
    solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError> in
    solrconfig.xml
    ————————————————————-
    java.lang.VerifyError: (class: org/apache/solr/analysis/ArabicTokenizerFactory, method: create signature: (Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;) Wrong return type in function
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.apache.solr.core.Config.findClass(Config.java:308)
    at org.apache.solr.core.Config.newInstance(Config.java:319)
    at org.apache.solr.schema.IndexSchema.readTokenizerFactory(IndexSchema.java:631)
    at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:605)
    at org.apache.solr.schema.IndexSchema.access$000(IndexSchema.java:57)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:330)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:353)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:362)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:73)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:275)
    at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:244)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:68)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:78)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
    at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
    at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
    at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1138)
    at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
    at org.apache.catalina.core.StandardHost.start(StandardHost.java:736)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
    at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
    at org.apache.catalina.core.StandardService.start(StandardService.java:448)
    at org.apache.catalina.core.StandardServer.start(StandardServer.java:700)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:552)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433)
