Given 15 seconds, what should I know?
- MarkMail lets you search millions of emails across thousands of mailing lists
- Search using keywords as well as fielded constraints such as list:tomcat and ext:ppt
- The GUI doesn't yet expose it, but you can negate any search item, like -type:checkins
- Subdomains constrain the search to matching lists, so tomcat.markmail.org searches tomcat lists
- Use "n" and "p" keyboard shortcuts to navigate the search results
- You can stay current with the MarkMail blog
What is MarkMail?
MarkMail is a community-focused searchable message archive, accessible at http://markmail.org, developed and hosted by MarkLogic Corporation. It provides end users with powerful search and discovery tools for finding answers and understanding activity in popular mailing lists, such as those used by open source projects.
What powers MarkMail?
MarkMail is powered by MarkLogic Server, an Enterprise NoSQL database, built to load, query, manipulate, and render large amounts of data. In MarkMail every email is represented and held as an XML document. All the text searches, faceted navigation, analytic queries and HTML page renderings you see are performed by a small MarkLogic Server cluster against millions of XML documents. You can download a free copy of MarkLogic Server, if you're interested in checking it out.
Why did you create MarkMail?
One reason we built the site is to show what MarkLogic Server can do. Many of our customers are content publishers who build really interesting sites on our platform but host them behind passwords, for paying customers only. Other customers of ours work in the defense and intelligence sectors, so you're even less likely to see their work. MarkMail gives us a public, open site to demonstrate MarkLogic Server capabilities.
We've chosen to focus on email because we believe there is tremendous value in email archives, and they are so underutilized today. So much information is locked up in email, read once and never to be found again. We're hoping to change that. There's lots of potential. We're starting with public email lists.
Will you load more lists?
Absolutely. We're constantly loading new list archives. We prioritize which list archives to load based on feedback, so let us know if you'd like us to host the archives for your community.
Why would I use MarkMail instead of Google or some other search engine?
1. Scope. Many of the emails we host aren't in Google's index. Or at least they weren't in Google's index before we went online. But even now with Google and the rest spidering us you'll still want to use MarkMail because...
2. Speciality. Search engines only index public email if it appears on an HTML page somewhere. Google doesn't know why the words are on the page: it doesn't know the sender, the date, the subject, what's in the body of the attachments, and so on. We do. So if you want to search for an issue with Apache James involving respooling, with MarkMail you type list:james respool. And hey, if you remember the email you want came from your friend James, you type from:james respool and you're there. You can be very specific because MarkMail is a site built specifically for email.
3. Structure. We know about the structure of email. This lets us exclude those annoying "copyright notice" footers at the bottom of emails from searches, add relevance weight to the important message headers, reduce the importance of quoted message text, or let you exclude quoted message text if you'd prefer (see the opt:noquote feature). We even understand the structure inside attachment files.
4. Analytics. Where else have you seen a chart showing the historical activity corresponding to any arbitrary query you type in? It's a ton of fun to watch activity trends for lists, people, and keywords, or any mix of the three. Don't forget that you can use a minus sign to negate a query.
5. Attachments. Do a search for emails with PowerPoint attachments (hint: the query is ext:ppt), click on the attachment link, and watch how you can view the attachment without leaving the search results! Same goes for Word files and PDFs. If you include a search term, we'll even show you which slides include the term.
6. Convenience. We've worked to build the MarkMail site as an immersive experience. We don't like how regular search engines make you click, read, hit the Back button, only to click, read, and hit the Back button again. With MarkMail all the important information stays in front of you all the time, results right next to hits.
7. Shortcuts. Part of convenience is keyboard shortcuts. Try hitting "n" and "p" to move to the next and previous emails in the results. You can hit "s" to jump to the search box, and "x" to close the attachment popup. Fans of the vi editor will find comfort in using "j" and "k" to move up and down the messages listed in the thread view. Want more?
8. Security. From previous experience running email archive systems, we know one of users' biggest complaints is having their email addresses shown out in the open, where spam harvesters have a field day collecting them. We obfuscate every email address in the system, even those in the bodies of messages, something most sites overlook. You can still search on email addresses (because we know what they are), and you can view the email addresses in a particular message if you solve a CAPTCHA (one of those squiggly-line words that prove to us you're a human).
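As a rough illustration of the obfuscation idea in item 8, here is a minimal Python sketch. It is not MarkMail's code: the regular expression, the masking style (keep two characters of the local part), and the function name `obfuscate` are all assumptions made for the example.

```python
import re

# Deliberately simple pattern; real-world address matching is messier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def obfuscate(text):
    """Mask the local part of every email address found in the text."""
    def mask(match):
        local, domain = match.group(0).split("@", 1)
        # Hypothetical masking style: keep two characters, hide the rest.
        return local[:2] + "...@" + domain

    return EMAIL.sub(mask, text)


print(obfuscate("Ask pat@example.com or the list admin."))
```

Because only the displayed text is masked, an archive built this way could still index the original addresses for search, which is the behavior the item describes.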
How do I post a reply to a message I've found?
MarkMail hosts list archives in read-only mode and doesn't yet provide a mechanism for you to participate in mailing list discussions directly. To post an email to one of the lists, you need to use your normal mail program (Outlook, Thunderbird, etc.) to send an email to the list.
You can derive the listname from our archive name. If the archive name is com.domain.project.list then the list's public email address is email@example.com.
Important Note: most of these mailing lists require you to subscribe before posting, as a way to reduce spam. You can find subscribe instructions on the project web sites. If you're interested in having us expose a direct-reply capability, write in.
Is there a discussion forum?
Absolutely. We've set up an email list for discussion. You have to join to post, to reduce spam. We anticipate low traffic. If you'd like to kibitz with the creators of the site, sign up. And of course, there's a searchable archive.
I found a bug, how do I file it?
What's next for the site?
How will I know when you add new features or launch new stuff?
Ah, that would be our blog, "The Making of MarkMail". You can subscribe to stay current or just read what's new. If you prefer email notification, the blog is mirrored on the announcements mailing list. It's archived too.
How many emails and lists do you manage?
The archive includes many millions of emails across thousands of lists. Current counts are always displayed on the home page. New messages to those lists are added continuously throughout the day, and are immediately available for search. We also run a version of MarkMail inside MarkLogic for our internal mailing lists (but it's a lot smaller).
How many senders does that make?
In the original Apache archives of 4,000,000 emails we counted a little over 150,000 unique poster names. In 2007 we saw almost 20,000 posters not seen in years past (and 12,000 repeat posters).
How many emails have attachments?
We find roughly 1% of emails have attachments.
Which browsers do you support?
We strive to work with Firefox 1.5+, Internet Explorer 6+, Opera 9+, and Safari 1.2+. More recent browsers sometimes have more features. The site works best if you leave your font size normal.
How can I help?
Write a blog post, tell your friends, link to us, and subscribe to our blog, The Making of MarkMail.
Can I buy one?
At this point MarkMail is only offered as a free service hosting public emails. If you're interested in getting MarkMail for your own lists, company, or personal inbox, let us know using the feedback form.
What search syntax are you using, and what can it do?
If you're a beginner, just type your word or phrase into the search box. We'll show you relevant results, and you can always use the dynamic faceted navigation links in the left side analytics pane to get more specific based on list, sender, attachment types, and message types.
If you're using the MarkMail gadget on iGoogle.com, the search box can be found in the Edit Settings dialog. Clicking on any message or thread will bring you through to markmail.org. The tabs in the gadget let you see some of the same analytics you can see in the faceted navigation at MarkMail.
Over time you may want to learn our search syntax, explained below. We support searching for...
|Search Capability||Example Query|
|Terms in the sender's name or email:|
|Terms in the subject:|
|Terms in the list name:||list:tomcat|
|An attachment file extension:||ext:ppt|
|Any part of the attachment name:|
|The classification of the message:||type:checkins|
|Messages by date:|
|Or by date range:|
|Or by one-sided range:|
Search terms are case insensitive. They are also stemmed, meaning a search for "issue" will additionally match "issues" and vice-versa.
Constraints are ANDed together, except that multiple fielded constraints of the same type are OR'd together. So, for example, the query list:tomcat list:struts includes both the tomcat and struts lists.
Any of the above constraints may be negated with a minus sign indicating that
matching messages should be excluded from the results. For example, the
following query finds Tomcat traffic excluding CVS or SVN commits:
list:tomcat -type:checkins. Negations are not OR'd; more
negations always exclude more messages.
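The combination rules above (different constraints ANDed, same-type fielded constraints OR'd, every negation excluding more) can be sketched in a few lines of Python. This is only an illustration of the logic, not how MarkMail evaluates queries: the `Message` class, the "general" message type, and the whole-word keyword matching without stemming are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    list_name: str   # e.g. "tomcat"
    msg_type: str    # e.g. "checkins"; "general" is a made-up placeholder
    text: str


def parse(query):
    """Split a query into OR-groups per field, negations, and bare keywords."""
    include, exclude, keywords = {}, [], []
    for token in query.lower().split():
        negated = token.startswith("-")
        key, sep, value = token.lstrip("-").partition(":")
        if not sep:                       # no "field:" prefix, so a bare keyword
            key, value = "text", key
        if negated:
            exclude.append((key, value))
        elif key == "text":
            keywords.append(value)
        else:
            include.setdefault(key, set()).add(value)
    return include, exclude, keywords


def matches(msg, query):
    include, exclude, keywords = parse(query)
    fields = {"list": msg.list_name.lower(), "type": msg.msg_type.lower()}
    words = set(msg.text.lower().split())

    def hit(key, value):
        return value in words if key == "text" else fields.get(key) == value

    if any(hit(key, value) for key, value in exclude):
        return False                      # every negation excludes more
    if not all(word in words for word in keywords):
        return False                      # keywords are ANDed
    # Different fields are ANDed; values within one field are OR'd.
    return all(any(hit(key, v) for v in values) for key, values in include.items())


MESSAGES = [
    Message("tomcat", "general", "connector respool issue"),
    Message("tomcat", "checkins", "svn commit for connector"),
    Message("struts", "general", "action mapping question"),
]

for query in ("list:tomcat list:struts", "list:tomcat -type:checkins"):
    print(query, "->", [m.text for m in MESSAGES if matches(m, query)])
```

With the sample data, the first query keeps all three messages because the two list: constraints are OR'd, while the second keeps only the non-checkin Tomcat message.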
By default search results are ordered by relevance, most relevant first.
If the search string contains
order:date-forward the results
will be sorted chronologically, oldest to newest; if the string contains
order:date-backward it's the reverse. If that's too much
typing you can try
Searches that don't have a meaningful relevance score, like
list:tomcat, are by default sorted date-backward.
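The ordering rules can be summarized in a short Python sketch. It assumes, purely for illustration, that each result is a (score, sent_date, subject) tuple and that a query has a meaningful relevance score only when it contains at least one bare keyword; neither detail comes from MarkMail itself.

```python
from datetime import date


def sort_results(results, query):
    """Order (score, sent_date, subject) tuples the way the text describes."""
    tokens = query.lower().split()
    # Assumption: only bare keywords (tokens without "field:") give a
    # query a meaningful relevance score.
    has_keywords = any(":" not in token for token in tokens)
    if "order:date-forward" in tokens:
        return sorted(results, key=lambda r: r[1])                # oldest first
    if "order:date-backward" in tokens or not has_keywords:
        return sorted(results, key=lambda r: r[1], reverse=True)  # newest first
    return sorted(results, key=lambda r: r[0], reverse=True)      # most relevant first


RESULTS = [
    (0.9, date(2006, 5, 1), "old but relevant"),
    (0.2, date(2008, 2, 1), "new but marginal"),
    (0.5, date(2007, 7, 1), "in between"),
]

for query in ("respool", "list:tomcat", "respool order:date-forward"):
    print(query, "->", [subject for _, _, subject in sort_results(RESULTS, query)])
```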
If the search string contains the special
opt:noquote argument the results
will avoid searching text that appears in quoted messages. This can be used
to find messages where a particular person said something.
An opt:nostem argument in a search indicates you
want the search without "stemming", meaning a search for run won't
match runs or ran as would normally happen. This lets you be
a bit more precise in what you're looking for.
Are there any useful keyboard shortcuts?
We're geeks, so we love keyboard shortcuts. Here's a list:
|n/p||Next and previous in search results|
|j/k||Up and down in the thread view|
|s||Jump to the search box|
|arrows||Move left and right in the attachment popup|
|v||Toggle text/image view in the attachment popup|
|z||Toggle zoom in the attachment popup|
|x||Close the attachment popup|
Can I put a MarkMail search box on my site?
Sure, and that can be convenient if you want to give people the opportunity to search messages without visiting the MarkMail site first. Just copy and paste this HTML into your site:
Search MarkMail: <input type="text" name="q" size="50"/>
<input type="submit" value="Search"/>
You should replace "tomcat" with the name of your list or project. You can adjust the default search constraint too. For example, the following searches messages sent by a particular person, in this case Sam Ruby:
Search MarkMail: <input type="text" name="q" size="50"/>
<input type="submit" value="Search"/>
Replace "rubys" with the name of your desired poster. You can even combine terms, even negations, into the default search constraint.
How do I add MarkMail to my browser's search box?
If you are running a browser that supports OpenSearch search plugins, such as Firefox 2 and above or Internet Explorer 7 and above, you can add a MarkMail search plugin to your browser by clicking here.
Then, to switch your browser's search box to use the MarkMail browser search plugin:
- Click on your browser's search box drop down menu
- Click on MarkMail
Note that you can use all of MarkMail's search syntax and modifiers in the browser search box.
How do I link to MarkMail?
We love incoming links. You'll probably want to link to one of our project-based homepages:
|What to Search||Sample URL|
|All of MarkMail||http://markmail.org|
(You get the idea.)
How do I link to an email or thread?
Each email in MarkMail has a special canonical identifier, something people frequently call a "permalink". It looks like http://markmail.org/message/fbdkpdqfgutyp47h. Next to the message headers for each email you'll see a link offered that's that email's permalink, something you can bookmark or email. There's another link near the search results that's a permalink for the full browser state. If bookmarked or emailed this link will recreate the whole view including the search performed and selected message.
How can I avoid having my emails archived?
MarkMail supports the
x-no-archive header. If your email
includes this header with a value of
yes, then in most situations
your email will not be shown publicly on the MarkMail site. Note that because
in Apache's history this header has often been added accidentally by list
management software, we may in some cases choose to display emails with this
header set. There may be other situations in the future where we judge it
best to ignore the header. The best and only 100% reliable way to make sure
your email doesn't appear in MarkMail or an archive system like it is to avoid
posting to a public mailing list.
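For archive operators curious what honoring this header looks like, here is a minimal Python sketch using the standard library's email parser. It covers only the basic rule (skip messages whose x-no-archive value is yes) and none of the judgment calls described above; the function name `is_archivable` is made up for the example.

```python
from email import message_from_string


def is_archivable(raw_email):
    """Return False when the message carries an x-no-archive header set to yes."""
    msg = message_from_string(raw_email)
    # Header names are matched case-insensitively by the email package.
    value = msg.get("X-No-Archive", "")
    return value.strip().lower() != "yes"


PUBLIC = "From: pat@example.com\nSubject: hello\n\nFine to archive.\n"
PRIVATE = "From: pat@example.com\nX-No-Archive: yes\nSubject: hello\n\nPlease skip.\n"

print(is_archivable(PUBLIC), is_archivable(PRIVATE))
```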
Can you load my private email?
At this point, no. But if you have specific feature ideas, let us know. And if you'd like to see MarkMail running on your private e-mail, send us a note using the feedback form, because we're curious to hear from you. No promises, mind you.
Can I just browse the lists and messages?
Yes, we have a rudimentary browse interface. It's primarily intended for web crawlers and debugging, but you can use it too if you'd like. No need to bookmark, there's a "Browse" link in the footer of every page.
Tip for geeks: You can manually add a
constraint to the query string of the browse page to restrict your view to messages matching
the constraint. For example manually typing
gets you a browsing look at emails from Craig McClanahan.
Any known issues?
There are a few known issues:
- Use of large fonts or larger DPI settings can cause text to exceed the allotted bounds
- Firefox can report "Transferring data from markmail.org..." in the browser's status bar even after all data has been transferred
How do I request that content be removed?
Please see our removal policy.
What's hard about searching email?
Email doesn't work well in a relational model because there's too much free text (all those words in the message). It doesn't work well in a search engine either because there's too much ad hoc structure (headers, footers, paragraphs) and hierarchy (attachment files containing pages containing paragraphs). And if you try to marry relational with search you only get into trouble with slow joins across the systems and indexes being out of sync and slow to update.
We've found email works naturally as XML, with the structure represented by elements and hierarchy by the nesting of the elements. We're hoping that on this solid base we'll be able to push the envelope for email management.
How do you store the emails?
Each email is stored as an XML document inside MarkLogic Server. Every document representing a message has a <message> root element with attributes to hold the message's unique id, thread id, list name, date, and so on.
Underneath the root, there's a <headers> element containing more
elements, one for each of the email's headers. For special headers like
<from> we examine the content and extract it in a normalized format, so
for example <from> has
attributes holding the sender's email and real name. We use these attributes
for search and display instead of the raw
From: header value.
Following <headers> there resides an optional <attachments> element that holds information about the email's attachments - pointers to their original versions, any image representations, and (for attachments we know about like DOC, PPT, and PDF) the content on each page. The binary files reside in MarkLogic Server also.
Lastly, under the root there's a <body> element that holds the message content. The body isn't held as simple text but rather as <para> elements and <quotepara> elements with attributes indicating the quote level. We also have <footer> elements. Each footer has a @type attribute to let us know if it's a person's signature block, a confidentiality statement, a footer added by a free hosting provider, a list subscription management line, or one of several other classifications. Signature footers we display in italics, confidentiality footers we display in gray, and some footers aren't displayed at all. Inline we've added markup for recognized emails and URLs. This enables targeted searching against these items as well as custom display (e.g. email obfuscation). A fun query we can do with all this markup is: given a person's name, find the email address embedded within their most recent footer. (Note that we haven't exposed this query to the public yet. Let us know if you want us to.)
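To make the document shape concrete, here is a hypothetical message and a few lookups against it, written in Python with ElementTree rather than the XQuery the site uses. The element names follow the description above; the attribute names (id, thread-id, list, date, email, name, level), the footer type value "signature", and all sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the element names come from the description.
SAMPLE = """
<message id="m1" thread-id="t1" list="org.example.project.dev" date="2008-01-15">
  <headers>
    <from email="pat@example.com" name="Pat Example">Pat Example &lt;pat@example.com&gt;</from>
    <subject>Re: respool question</subject>
  </headers>
  <body>
    <quotepara level="1">Does respool work after a restart?</quotepara>
    <para>Yes, it worked for me once I cleared the queue.</para>
    <footer type="signature">Pat Example, Example Corp</footer>
  </body>
</message>
"""

root = ET.fromstring(SAMPLE.strip())


def sender(message):
    """Read the normalized name and address from the <from> attributes."""
    from_element = message.find("headers/from")
    return from_element.get("name"), from_element.get("email")


def unquoted_text(message):
    """Collect only <para> text, skipping quoted paragraphs and footers."""
    return " ".join(para.text for para in message.findall("body/para"))


def footers(message, kind):
    """Return the text of footers with the given @type."""
    return [f.text for f in message.findall(f"body/footer[@type='{kind}']")]


print(sender(root))
print(unquoted_text(root))
print(footers(root, "signature"))
```

Skipping <quotepara> elements, as `unquoted_text` does, is the kind of structural distinction that a feature like opt:noquote depends on.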
What generates the charts?
The raw data for the charts comes from a MarkLogic Server product feature called lexicons. Lexicons enable fast calculation of the distinct values for a specific element or attribute along with the occurrence frequency of each value. To generate the historical activity chart, we ask for the distinct month values and frequencies across the emails satisfying the user's query.
We pass that raw data as XML to a commercial Flash charting library whose code we have modified to better suit our purposes.
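The shape of the data a lexicon returns (distinct values plus their frequencies) is easy to picture with a small Python sketch. Note the difference in mechanism: a lexicon is served from an index inside MarkLogic Server, whereas this illustration simply scans a list of dates. The function name and sample dates are made up.

```python
from collections import Counter
from datetime import date


def monthly_activity(sent_dates):
    """Return (month, message count) pairs in chronological order."""
    counts = Counter(d.strftime("%Y-%m") for d in sent_dates)
    return sorted(counts.items())


# Pretend these are the dates of the emails matching a user's query.
MATCHING_DATES = [date(2007, 1, 3), date(2007, 1, 20), date(2007, 3, 1)]

print(monthly_activity(MATCHING_DATES))
```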
How do you load the emails?
A Java program converts emails to XML and loads them into MarkLogic Server. The program can load emails from an mbox archive or catch new emails as they come across the wire. We subscribe a user to each archived list in order to receive the new messages.
The delay between message receipt and searchability is less than a second. You're always searching the very latest.
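The real loader is a Java program we don't show here, but the conversion step can be sketched in Python under some simplifying assumptions: single-part plain-text messages, paragraphs separated by blank lines, and quote levels detected by counting leading ">" markers. The attribute names and the function `email_to_xml` are hypothetical, and the sketch stops short of loading anything into a database.

```python
import xml.etree.ElementTree as ET
from email import message_from_string
from email.utils import parseaddr


def quote_level(line):
    """Count the '>' markers at the start of a line."""
    prefix = line[: len(line) - len(line.lstrip("> "))]
    return prefix.count(">")


def email_to_xml(raw, list_name):
    """Convert a plain-text email into a <message> element tree."""
    msg = message_from_string(raw)
    root = ET.Element("message", {"list": list_name})

    headers = ET.SubElement(root, "headers")
    raw_from = msg.get("From", "")
    name, address = parseaddr(raw_from)   # normalized sender details
    sender = ET.SubElement(headers, "from", {"name": name, "email": address})
    sender.text = raw_from
    ET.SubElement(headers, "subject").text = msg.get("Subject", "")

    body = ET.SubElement(root, "body")
    for block in msg.get_payload().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        level = quote_level(lines[0])
        if level:
            para = ET.SubElement(body, "quotepara", {"level": str(level)})
        else:
            para = ET.SubElement(body, "para")
        para.text = " ".join(line.lstrip("> ").strip() for line in lines)
    return root


RAW = (
    "From: Pat Example <pat@example.com>\n"
    "Subject: Re: respool\n"
    "\n"
    "> Does it work?\n"
    "\n"
    "Yes it does.\n"
)

print(ET.tostring(email_to_xml(RAW, "org.example.dev"), encoding="unicode"))
```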
How do you do the search?
We use the XML-aware full text search capabilities built into MarkLogic Server. With it we're able to use all the markup in each message as part of the query - even though the markup is hierarchical, ordered, and irregular. See How do you store the emails?
What runs the front-end?
We actually use MarkLogic Server to generate the HTML pages. Well, technically they're XHTML. Our back-end data model holds emails as XML, each visitor's browser requires (X)HTML, and in between we have the XML-centric scripting language XQuery. It seems wasteful and unnecessarily difficult to marshal XML into objects or strings and then marshal it back out as XML. Instead we query XML, process XML in an XML-aware scripting language, and output XML as XHTML. It's the only time in our web lives we haven't suffered an impedance mismatch at some level of a web architecture. And as a bonus, the XQuery language always outputs well-formed XHTML.
How will this scale?
At this point we're running on a small MarkLogic Server cluster, utilizing the inbuilt capabilities of MarkLogic Server to scale as necessary for user load and email traffic. The largest clusters deployed at some of our customer projects exceed 100 machines. We have loads of runway left.
This sounds like fun, are you hiring?
Actually, yes, we are hiring. If you're the kind of person who reads to the bottom of a technical FAQ (ahem), you reside in or can relocate to the Bay Area, and you'd like to work as part of a small team focused on making MarkMail even better, let us know.